\documentclass{article}
\usepackage{url}
\usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
\usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
\usepackage{style}
\usepackage{langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{stackengine}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{12mm}


%% \usepackage{accents}
\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5  0.00298
10  0.00418
15  0.00996
16  0.01710
17  0.03492
18  0.03303
19  0.05084
20  0.10177
21  0.19960
22  0.41159
23  0.82234
24  1.70251
25  3.36112
26  6.63998
27  13.35120
28  29.81185
\end{filecontents}

\begin{filecontents}{re-java9.data}
1000  0.01410
2000  0.04882
3000  0.10609
4000  0.17456
5000  0.27530
6000  0.41116
7000  0.53741
8000  0.70261
9000  0.93981
10000 0.97419
11000 1.28697
12000 1.51387
14000 2.07079
16000 2.69846
20000 4.41823
24000 6.46077
26000 7.64373
30000 9.99446
34000 12.966885
38000 16.281621
42000 19.180228
46000 21.984721
50000 26.950203
60000 43.0327746
\end{filecontents}


\begin{document}
       
\section*{Proposal:\\
  Fast Regular Expression Matching\\
  with Brzozowski Derivatives}

\subsubsection*{Summary}

If you want to connect a computer directly to the Internet, it must be
immediately \mbox{hardened} against outside attacks. The current
technology for this is to use regular expressions in order to
automatically scan all incoming network traffic for signs that a
computer is under attack and, if such signs are found, to take
appropriate counter-actions. One possible action could be to slow down
the traffic from sites where repeated password-guesses originate.
Well-known network intrusion prevention systems that use regular
expressions for traffic analysis are Snort and Bro.
       
Given the large volume of Internet traffic that even the smallest
servers have to handle nowadays, the regular expressions used for
traffic analysis have become a real bottleneck. This is not just a
nuisance, but actually a security vulnerability in itself: it can lead
to denial-of-service attacks. The proposed project aims to remove this
bottleneck by using a little-known technique of building derivatives
of regular expressions. These derivatives have not yet been used in
the area of network traffic analysis, but have the potential to solve
some tenacious problems with existing approaches. The project will
require theoretical work, but also a practical implementation (a
proof-of-concept at least). The work will build on earlier work by
Ausaf and Urban from King's College London
\cite{AusafDyckhoffUrban2016}.
       
   119 
       
   120 \subsubsection*{Background}
       
   121 
       
   122 If every 10 minutes a thief turned up at your car and rattled
       
   123 violently at the doorhandles to see if he could get in, you would move
       
   124 your car to a safer neighbourhood. Computers, however, need to stay
       
   125 connected to the Internet all the time and there they have to withstand such
       
   126 attacks. A rather typical instance of an attack can be seen in the
       
   127 following log entries from a server at King's:
       
   128 
       
   129 {\small
       
   130 \begin{verbatim}
       
   131  Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]
       
   132  Jan 2 00:58:31 talisker sshd: Received disconnect from 110.53.183.252: [preauth]
       
   133  Jan 2 01:01:28 talisker sshd: Received disconnect from 221.194.47.236: [preauth]
       
   134  Jan 2 01:03:59 talisker sshd: Received disconnect from 110.53.183.228: [preauth]
       
   135  Jan 2 01:06:53 talisker sshd: Received disconnect from 221.194.47.245: [preauth]
       
   136  ...
       
   137 \end{verbatim}}
       
   138   
       
   139 
       
   140 \noindent
       
   141 This is a record of the network activities from the server
       
   142 \href{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi}{talisker.inf.kcl.ac.uk},
       
   143 which hosts lecture material for students. The log indicates several
       
   144 unsuccessful login attempts via the \texttt{ssh} program from
       
   145 computers with the addresses \texttt{110.53.183.227} and so on. This
       
   146 is a clear sign of a \emph{brute-force attack} where hackers
       
   147 systematically try out password candidates.  Such attacks are blunt,
       
   148 but unfortunately very powerful.  They can originate from anywhere in
       
   149 the World; they are automated and often conducted from computers that
       
   150 have been previously hijacked.  Once the attacker ``hits'' on the
       
   151 correct password, then he or she gets access to the server. The only
       
   152 way to prevent this methodical password-guessing is to spot the
       
   153 corresponding traces in the log and then slow down the traffic from
       
   154 such sites in a way that perhaps only one password per hour can be
       
   155 tried out. This does not affect ``normal'' operations of the server,
       
   156 but drastically decreases the chances of hitting the correct password
       
   157 by a brute-force attack.
       
Server administrators use \emph{regular expressions} for scanning log
files. The purpose of these expressions is to specify patterns of
interest (for example \texttt{Received disconnect from
110.53.183.227}, where the address needs to be variable). A program
will then continuously scan for such patterns and trigger an action if
a pattern is found. Popular examples of such programs are Snort and
Bro \cite{snort,bro}. Clearly, there are also other kinds of
vulnerabilities hackers try to take advantage of---mainly programming
mistakes that can be abused to gain access to a computer. This means
server administrators have a suite of sometimes thousands of regular
expressions, each prescribing some suspicious pattern that indicates a
computer is under attack.
       
   173 
       
   174 The underlying algorithmic problem is to decide whether a string in
       
   175 the logs matches the patterns determined by the regular
       
   176 expressions. Such decisions need to be done as fast as possible,
       
   177 otherwise the scanning programs would not be useful as a hardening
       
   178 technique for servers. The quest for speed with these decisions is
       
   179 presently a rather big challenge for the amount of traffic typical
       
   180 serves have to cope with and the large number of patterns that might
       
   181 be of interest.  The problem is that when thousands of regular
       
   182 expressions are involved, the process of detecting whether a string
       
   183 matches a regular expression can be excruciating slow. This might not
       
   184 happen in most instances, but in some small number of
       
   185 instances. However in these small number of instances the algorithm
       
   186 for regular expression matching can grind to a halt.  This phenomenon
       
   187 is called \emph{catastrophic backtracking} \cite{catast}.
       
   188 Unfortunately, catastrophic backtracking is not just a nuisance, but a
       
   189 security vulnerability in itself. The reason is that attackers look
       
   190 for these instances where the scan is very slow and then trigger a
       
   191 \emph{denial-of-service attack} against the server\ldots meaning the
       
   192 server will not be reachable anymore to normal users.
       
Existing approaches try to mitigate the speed problem with regular
expressions by implementing the matching algorithms in hardware,
rather than in software \cite{hardware}. Although this solution offers
speed, it is often unappealing because attack patterns can change or
need to be augmented. A software solution is clearly more desirable in
these circumstances. Also, given that regular expressions were
introduced in 1951 by Kleene, one might think they have since been
studied and implemented to death. But this would definitely be a
mistake: in fact they are still an active research area and
off-the-shelf implementations are still extremely prone to the problem
of catastrophic backtracking. This can be seen in the following two
graphs:
       
\begin{center}
\begin{tabular}{@{}cc@{}}
%\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings 
%           $\underbrace{a\ldots a}_{n}$}\smallskip\\  
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Python, Java 8},  
    legend pos=north west]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
  & 
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    %enlargelimits=false,
    %xtick={0,5000,...,30000},
    xmax=65000,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm, 
    legend entries={Java 9},  
    legend pos=north west]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}
\end{tabular}  
\end{center}
       
\noindent
These graphs show the running times for matching strings of the form
$\underbrace{a\ldots a}_{n}$ against the rather simple regular
expression $(a^*)^* \,b$, using the built-in regular expression
matchers of the popular programming languages Python and Java (Version
8 and Version 9). There are many more regular expressions that show
the same behaviour. On the left-hand side the graphs show that for a
string consisting of just 28 $a\,$'s, Python and the older Java (which
was the latest version of Java until September 2017) already need 30
seconds to decide whether this string is matched or not. This is an
abysmal result given that Python and Java are very popular programming
languages. We already know that this problem can be decided much, much
faster. If these slow regular expression matchers were employed in
network traffic analysis, then this example is already an opportunity
for mounting an easy denial-of-service attack: one just has to send to
the server a large enough string of $a$'s. The newer version of Java
is better---it can match strings of around 50,000 $a\,$'s in 30
seconds---however this would still not be good enough for being a
useful tool for network traffic analysis. What is interesting is that
even a relatively simple-minded regular expression matcher based on
Brzozowski derivatives can out-compete the regular expression matchers
in Java and Python on such catastrophic backtracking examples. This
gain in speed is the motivation and starting point for the proposed
project. \smallskip
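
The Python curve in the left-hand graph can be reproduced with a few
lines of code (a minimal sketch of mine; absolute timings naturally
vary between machines):

```python
import re
import time

# The pathological pattern from the graphs: (a*)* b
pattern = re.compile(r'(a*)*b')

# Keep n small: the running time roughly doubles with every further
# 'a', so values much beyond 25 already take minutes.
for n in [10, 15, 18, 20]:
    start = time.time()
    result = pattern.match('a' * n)   # fails: the string contains no 'b'
    print(n, result, round(time.time() - start, 3))
```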
       
\subsection*{Research Plan}

Regular expressions are already an old and well-studied topic in
Computer Science. They were originally introduced by Kleene in 1951
and are utilised in text search algorithms for ``find'' or ``find and
replace'' operations \cite{Brian,Kleene51}. Because of their wide
range of applications, many programming languages provide capabilities
for regular expression matching via libraries. The classic theory of
regular expressions involves just six different kinds of regular
expressions, often defined in terms of the following grammar:

\begin{center}\small
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$     & cannot match anything\\
      &   $|$ & $\ONE$      & can only match the empty string\\
      &   $|$ & $c$         & can match a single character (in this case $c$)\\
      &   $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
  &   $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
          &  & & then the second part with $r_2$\\
      &   $|$ & $r^*$       & can match $r$ zero or more times\\
\end{tabular}
\end{center}
       
\noindent
For practical purposes, however, regular expressions have been
extended in various ways in order to make them even more powerful and
even more useful for applications. Some of the problems to do with
catastrophic backtracking stem, unfortunately, from these extensions.

The crux of this project is not to use automata for regular expression
matching (the traditional technique), but to use derivatives of
regular expressions instead. These derivatives were introduced by
Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
``in the sands of time'' \cite{Owens2009}. However, they have recently
become a ``hot'' research topic again, with numerous research papers
appearing in the last five years. One reason for this interest is that
Brzozowski derivatives straightforwardly extend to more general
regular expression constructors, which cannot be easily treated with
standard techniques. They can also be implemented with ease in any
functional programming language: the definition of derivatives
consists of just the two simple recursive functions shown in
Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
proof) that the resulting regular expression matcher always produces a
correct answer. This is an area where Urban can provide a lot of
expertise for this project: he is one of the main developers of the
Isabelle theorem prover, a program designed for such proofs
\cite{Isabelle}.

       
\begin{figure}[t]
\begin{center}\small
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$  & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\

$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}
\caption{The complete definition of derivatives of regular expressions
  \cite{Brzozowski1964}. This definition can be implemented with ease
  in functional programming languages and can be easily extended to
  regular expressions used in practice. Derivatives are more flexible
  for applications, and easier to manipulate in mathematical proofs,
  than traditional techniques for regular expression
  matching.\medskip\label{f}}
\end{figure}
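
To illustrate how little code such a matcher needs, the two functions
of Figure~\ref{f} can be transcribed almost line by line into Python
(a proof-of-concept sketch only: the tuple encoding of regular
expressions is mine, and a serious implementation would additionally
simplify the intermediate regular expressions, which this sketch
omits):

```python
# Regular expressions encoded as tagged tuples.
ZERO = ('zero',)                            # matches nothing
ONE  = ('one',)                             # matches only the empty string

def char(c):     return ('char', c)         # matches the character c
def alt(r1, r2): return ('alt', r1, r2)     # r1 + r2
def seq(r1, r2): return ('seq', r1, r2)     # r1 . r2
def star(r):     return ('star', r)         # r*

def nullable(r):
    """Can r match the empty string?"""
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'char': return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    """The Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag == 'zero': return ZERO
    if tag == 'one':  return ZERO
    if tag == 'char': return ONE if r[1] == c else ZERO
    if tag == 'alt':  return alt(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        if nullable(r[1]):
            return alt(seq(der(c, r[1]), r[2]), der(c, r[2]))
        return seq(der(c, r[1]), r[2])
    if tag == 'star': return seq(der(c, r[1]), r)

def matches(r, s):
    """Build derivatives character by character; the final regular
    expression matches s iff it is nullable."""
    for c in s:
        r = der(c, r)
    return nullable(r)

# (a*)* . b, the pathological example from the graphs above
r = seq(star(star(char('a'))), char('b'))
print(matches(r, 'aaab'))   # True
print(matches(r, 'aaaa'))   # False
```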
       
There are a number of avenues for research on Brzozowski derivatives.
One problem I would like to tackle immediately in this project is how
to handle \emph{back-references} in regular expressions. It is known
that the inclusion of back-references causes the underlying
algorithmic problem to become NP-hard \cite{Aho}; back-references are
also the main reason why regular expression matchers suffer from the
catastrophic backtracking problem. However, the full generality of
back-references is not needed in practical applications. The goal
therefore is to restrict the problem sufficiently so that an efficient
algorithm can be designed. There seem to be good indications that
Brzozowski derivatives are useful for this.
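
For illustration (an example of mine in the syntax of Python's
built-in \texttt{re} library): a back-reference such as \verb|\1|
matches a second copy of whatever an earlier bracketed group matched,
something the six classic regular expression constructors cannot
express:

```python
import re

# \1 refers back to the exact substring matched by the group ((?:a|b)+),
# so the pattern recognises "doubled" words of the form w w.
pattern = re.compile(r'((?:a|b)+)\1')

print(bool(pattern.fullmatch('abab')))   # True:  'ab' followed by 'ab' again
print(bool(pattern.fullmatch('abba')))   # False: no prefix is repeated
```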
       
Another problem is \emph{how} a regular expression matches a string
\cite{Sulzmann2014}. In this case the algorithm does not just give a
yes/no answer about whether the regular expression matches the string,
but also generates a value that encodes which part of the string is
matched by which part of the regular expression. This is important in
applications like network analysis, where a specific computer address
needs to be extracted from a matching log entry. Also compilers make
extensive use of such an extension when they lex their source
programs, i.e.~break up code into word-like components. Again, Ausaf
and Urban from King's made some initial progress towards proving the
correctness of such an extension, but their algorithm is not yet fast
enough for practical purposes \cite{AusafDyckhoffUrban2016}. It would
be desirable to study optimisations that make the algorithm faster,
but preserve the correctness guarantees obtained by Ausaf and Urban.
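
How such a value is used in practice can be glimpsed with capture
groups, which provide a limited form of this information (a small
sketch of mine using Python's \texttt{re} library on a log line of the
kind shown earlier):

```python
import re

line = ('Jan 2 00:53:19 talisker sshd: '
        'Received disconnect from 110.53.183.227: [preauth]')

# The bracketed group marks which part of the pattern -- and hence
# which part of the string -- we want back: the variable address field.
m = re.search(r'Received disconnect from (\d+\.\d+\.\d+\.\d+):', line)
if m:
    print(m.group(1))   # 110.53.183.227
```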
       
\mbox{}\\[-11mm]

\subsubsection*{Conclusion}

Much research has already gone into regular expressions, and one might
think they have been studied and implemented to death. But this is far
from the truth. In fact regular expressions have become a ``hot''
research topic in recent years, because of problems with the
traditional methods (in applications such as network traffic
analysis), but also because the technique of derivatives of regular
expressions has been re-discovered. These derivatives provide an
alternative approach to regular expression matching. Their potential
lies in the fact that they are more flexible in implementations and
much easier to manipulate in mathematical proofs. Therefore I would
like to research them in this project.\medskip

\noindent {\bf Impact:} Regular expression matching is a core
technology in Computer Science with numerous applications. I will
focus on the areas of network traffic analysis and lexical analysis.
In these areas this project can have quite a large impact: for example
the program Bro has been downloaded at least 10,000 times and Snort
even has 5 million downloads and over 600,000 registered users
\cite{snort,bro}. Lexical analysis is a core component in every
compiler, and compilers are the bedrock on which nearly all
programming is built.
       
% {\small
% \begin{itemize}
% \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
% \item[$\bullet$] \url{https://vimeo.com/112065252}
% \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}  
% \end{itemize}}
\mbox{}\\[-11mm]

\small
\bibliographystyle{plain}
\renewcommand{\refname}{References\\[-7mm]\mbox{}}
\bibliography{proposal}

\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: