text/proposal.tex
changeset 359 fedc16924b76
parent 305 6e2cef17a9b3
equal deleted inserted replaced
358:06aa99b54423 359:fedc16924b76
       
     1 % !TEX program = pfdlatex
     1 \documentclass{article}
     2 \documentclass{article}
     2 \usepackage{url}
     3 \usepackage{url}
     3 \usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
     4 \usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
     4 \usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
     5 \usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
     5 \usepackage{style}
     6 \usepackage{style}
     6 \usepackage{langs}
     7 %\usepackage{langs}
     7 \usepackage{tikz}
     8 \usepackage{tikz}
     8 \usepackage{pgf}
     9 \usepackage{pgf}
     9 \usepackage{pgfplots}
    10 \usepackage{pgfplots}
    10 \usepackage{stackengine}
    11 \usepackage{stackengine}
    11 
    12 
    12 \addtolength{\oddsidemargin}{-6mm}
    13 \addtolength{\oddsidemargin}{-6mm}
    13 \addtolength{\evensidemargin}{-6mm}
    14 \addtolength{\evensidemargin}{-6mm}
    14 \addtolength{\textwidth}{12mm}
    15 \addtolength{\textwidth}{12mm} 
    15 
    16 
    16 
    17 
    17 
    18 
    18 %% \usepackage{accents}
    19 %% \usepackage{accents}
    19 \newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}
    20 \newcommand\barbelow[1]{\stackunbder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}
    20 
    21 
    21 \begin{filecontents}{re-python2.data}
    22 \begin{filecontents}{re-python2.data}
    22 1 0.033
    23 1 0.033
    23 5 0.036
    24 5 0.036
    24 10 0.034
    25 10 0.034
    86 \begin{document}
    87 \begin{document}
    87 
    88 
    88   
    89   
    89 \section*{PPProposal: \\
    90 \section*{PPProposal: \\
    90   Fast Regular Expression Matching\\
    91   Fast Regular Expression Matching\\
    91   with Brzozowski Derivatives}
    92   with Brzozowski bderivatives}
    92 
    93 
    93 \subsubsection*{Summary}
    94 \subsubsection*{Summary}
    94 
    95 
    95 If you want to connect a computer directly to the Internet, it must be
    96 If you want to connect a computer directly to the Internet, it must be
    96 immediately \mbox{hardened} against outside attacks. The current
    97 immediately \mbox{hardened} against outside attacks. The current
   100 counter-actions. One possible action could be to slow down the traffic
   101 counter-actions. One possible action could be to slow down the traffic
   101 from sites where repeated password-guesses originate. Well-known
   102 from sites where repeated password-guesses originate. Well-known
   102 network intrusion prevention systems that use regular expressions for
   103 network intrusion prevention systems that use regular expressions for
   103 traffic analysis are Snort and Bro.
   104 traffic analysis are Snort and Bro.
   104 
   105 
   105 Given the large
   106 Given the large volume of Internet traffic even the smallest servers
   106 volume of Internet traffic even the smallest servers can handle
   107 can handle nowadays, the regular expressions for traffic analysis have
   107 nowadays, the regular expressions for traffic analysis have become a
   108 become a real bottleneck.  This is not just a nuisance, but actually a
   108 real bottleneck.  This is not just a nuisance, but actually a security
   109 security vulnerability in itself: it can lead to denial-of-service
   109 vulnerability in itself: it can lead to denial-of-service attacks.
   110 attacks.  The proposed project aims to remove this bottleneck by using
   110 The proposed project aims to remove this bottleneck by using a little-known
   111 a little-known technique of building bderivatives of regular
   111 technique of building derivatives of regular expressions. These
   112 expressions. These derivatives have not yet been used in the area of
   112 derivatives have not yet been used in the area of network traffic
   113 network traffic analysis, but have the potential to solve some
   113 analysis, but have the potential to solve some tenacious problems with
   114 tenacious problems with existing approaches. The project will require
   114 existing approaches. The project will require theoretical work, but
   115 theoretical work, but also a practical implementation (a
   115 also a practical implementation (a proof-of-concept at least).  The
   116 proof-of-concept at least).  The work will build on earlier work by
   116 work will build on earlier work by Ausaf and Urban from King's College
   117 Ausaf and Urban from King's College London
   117 London \cite{AusafDyckhoffUrban2016}.
   118 \cite{AusafDyckhoffUrban2016}.
   118 
   119 
   119 
   120 
   120 \subsubsection*{Background}
   121 \subsubsection*{Background}
   121 
   122 
   122 If every 10 minutes a thief turned up at your car and rattled
   123 If every 10 minutes a thief turned up at your car and rattled
   166 \cite{snort,bro}.  Clearly, there are also other kinds of
   167 \cite{snort,bro}.  Clearly, there are also other kinds of
   167 vulnerabilities hackers try to take advantage of---mainly programming
   168 vulnerabilities hackers try to take advantage of---mainly programming
   168 mistakes that can be abused to gain access to a computer. This means
   169 mistakes that can be abused to gain access to a computer. This means
   169 server administrators have a suite of sometimes thousands of regular
   170 server administrators have a suite of sometimes thousands of regular
   170 expressions, all prescribing some suspicious pattern for when a
   171 expressions, all prescribing some suspicious pattern for when a
   171 computer is under attack.
   172 computer is unbder attack.
   172 
   173 
   173 
   174 
   174 The underlying algorithmic problem is to decide whether a string in
   175 The unbderlying algorithmic problem is to decide whether a string in
   175 the logs matches the patterns determined by the regular
   176 the logs matches the patterns determined by the regular
   176 expressions. Such decisions need to be done as fast as possible,
   177 expressions. Such decisions need to be done as fast as possible,
   177 otherwise the scanning programs would not be useful as a hardening
   178 otherwise the scanning programs would not be useful as a hardening
   178 technique for servers. The quest for speed with these decisions is
   179 technique for servers. The quest for speed with these decisions is
   179 presently a rather big challenge for the amount of traffic typical
   180 presently a rather big challenge for the amount of traffic typical
   205 following two graphs:
   206 following two graphs:
   206 
   207 
   207 \begin{center}
   208 \begin{center}
   208 \begin{tabular}{@{}cc@{}}
   209 \begin{tabular}{@{}cc@{}}
   209 %\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings 
   210 %\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings 
   210 %           $\underbrace{a\ldots a}_{n}$}\smallskip\\  
   211 %           $\unbderbrace{a\ldots a}_{n}$}\smallskip\\  
   211 \begin{tikzpicture}
   212 \begin{tikzpicture}
   212 \begin{axis}[
   213 \begin{axis}[
   213     xlabel={$n$},
   214     xlabel={$n$},
   214     x label style={at={(1.05,0.0)}},
   215     x label style={at={(1.05,0.0)}},
   215     ylabel={time in secs},
   216     ylabel={time in secs},
   257 These graphs show how long strings can be when using the rather simple
   258 These graphs show how long strings can be when using the rather simple
   258 regular expression $(a^*)^* \,b$ and the built-in regular expression
   259 regular expression $(a^*)^* \,b$ and the built-in regular expression
   259 matchers in the popular programming languages Python and Java (Version
   260 matchers in the popular programming languages Python and Java (Version
   260 8 and Version 9). There are many more regular expressions that show
   261 8 and Version 9). There are many more regular expressions that show
   261 the same behaviour.  On the left-hand side the graphs show that for a
   262 the same behaviour.  On the left-hand side the graphs show that for a
   262 string consisting of just 28 $a\,$'s, Python and the older Java (which
   263 string consisting of just 28 $a\,$'s, Python and the olbder Java (which
   263 was the latest version of Java until September 2017) already need 30
   264 was the latest version of Java until September 2017) already need 30
   264 seconds to decide whether this string is matched or not. This is an
   265 seconds to decide whether this string is matched or not. This is an
   265 abysmal result given that Python and Java are very popular programming
   266 abysmal result given that Python and Java are very popular programming
   266 languages. We already know that this problem can be decided much, much
   267 languages. We already know that this problem can be decided much, much
   267 faster.  If these slow regular expression matchers would be employed
   268 faster.  If these slow regular expression matchers would be employed
   270 has to send to the server a large enough string of a's. The newer
   271 has to send to the server a large enough string of a's. The newer
   271 version of Java is better---it can match strings of around 50,000
   272 version of Java is better---it can match strings of around 50,000
   272 $a\,$'s in 30 seconds---however this would still not be good enough
   273 $a\,$'s in 30 seconds---however this would still not be good enough
   273 for being a useful tool for network traffic analysis. What is
   274 for being a useful tool for network traffic analysis. What is
   274 interesting is that even a relatively simple-minded regular expression
   275 interesting is that even a relatively simple-minded regular expression
   275 matcher based on Brzozowski derivatives can out-compete the regular
   276 matcher based on Brzozowski bderivatives can out-compete the regular
   276 expressions matchers in Java and Python on such catastrophic
   277 expressions matchers in Java and Python on such catastrophic
   277 backtracking examples. This gain in speed is the motivation and
   278 backtracking examples. This gain in speed is the motivation and
   278 starting point for the proposed project. \smallskip
   279 starting point for the proposed project. \smallskip
   279 
   280 
   280 \subsection*{Research Plan}
   281 \subsection*{Research Plan}
   300 \end{tabular}
   301 \end{tabular}
   301 \end{center}
   302 \end{center}
   302 
   303 
   303 \noindent
   304 \noindent
   304 For practical purposes, however, regular expressions have been
   305 For practical purposes, however, regular expressions have been
   305 extended in various ways in order to make them even more powerful and
   306 extended in various ways in orbder to make them even more powerful and
   306 even more useful for applications. Some of the problems to do with
   307 even more useful for applications. Some of the problems to do with
   307 catastrophic backtracking stem, unfortunately, from these extensions.
   308 catastrophic backtracking stem, unfortunately, from these extensions.
   308 
   309 
   309 The crux in this project is to not use automata for regular expression
   310 The crux in this project is to not use automata for regular expression
   310 matching (the traditional technique), but to use derivatives of
   311 matching (the traditional technique), but to use bderivatives of
   311 regular expressions instead.  These derivatives were introduced by
   312 regular expressions instead.  These bderivatives were introduced by
   312 Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
   313 Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
   313 ``in the sands of time'' \cite{Owens2009}. However, they recently have
   314 ``in the sands of time'' \cite{Owens2009}. However, they recently have
   314 become again a ``hot'' research topic with numerous research papers
   315 become again a ``hot'' research topic with numerous research papers
   315 appearing in the last five years. One reason for this interest is that
   316 appearing in the last five years. One reason for this interest is that
   316 Brzozowski derivatives straightforwardly extend to more general
   317 Brzozowski bderivatives straightforwardly extend to more general
   317 regular expression constructors, which cannot be easily treated with
   318 regular expression constructors, which cannot be easily treated with
   318 standard techniques. They can also be implemented with ease in any
   319 standard techniques. They can also be implemented with ease in any
   319 functional programming language: the definition for derivatives
   320 functional programming language: the definition for bderivatives
   320 consists of just two simple recursive functions shown in
   321 consists of just two simple recursive functions shown in
   321 Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
   322 Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
   322 proof) that the resulting regular matcher is in fact always producing
   323 proof) that the resulting regular matcher is in fact always producing
   323 a correct answer. This is an area where Urban can provide a lot of
   324 a correct answer. This is an area where Urban can provide a lot of
   324 expertise for this project: he is one of the main developers of the
   325 expertise for this project: he is one of the main developers of the
   334 $\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
   335 $\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
   335 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
   336 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
   336 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
   337 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
   337 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\
   338 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\
   338 
   339 
   339 $\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
   340 $\textit{bder}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
   340 $\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
   341 $\textit{bder}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
   341 $\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
   342 $\textit{bder}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
   342 $\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
   343 $\textit{bder}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{bder}\;c\;r_1) + (\textit{bder}\;c\;r_2)$\\
   343 $\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
   344 $\textit{bder}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
   344       & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
   345       & & $\textit{then}\;((\textit{bder}\;c\;r_1)\cdot r_2) + (\textit{bder}\;c\;r_2)$\\
   345       & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
   346       & & $\textit{else}\;(\textit{bder}\;c\;r_1)\cdot r_2$\\
   346 $\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
   347 $\textit{bder}\;c\;(r^*)$ & $\dn$ & $(\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm]
   347 \end{tabular}
   348 \end{tabular}
   348 \end{center}
   349 \end{center}
   349 \caption{The complete definition of derivatives of regular expressions
   350 \caption{The complete definition of bderivatives of regular expressions
   350   \cite{Brzozowski1964}.
   351   \cite{Brzozowski1964}.
   351   This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice.
   352   This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice.
   352   The are more flexible for applications and easier to manipulate in
   353   The are more flexible for applications and easier to manipulate in
   353   mathematical proofs, than traditional techniques for regular expression
   354   mathematical proofs, than traditional techniques for regular expression
   354   matching.\medskip\label{f}}
   355   matching.\medskip\label{f}}
   355 \end{figure}
   356 \end{figure}
   356 
   357 
   357 There are a number of avenues for research on Brzozowski derivatives.
   358 There are a number of avenues for research on Brzozowski bderivatives.
   358 One problem I like to immediately tackle in this project is how to
   359 One problem I like to immediately tackle in this project is how to
   359 handle \emph{back-references} in regular expressions. It is known that
   360 handle \emph{back-references} in regular expressions. It is known that
   360 the inclusion of back-references causes the underlying algorithmic
   361 the inclusion of back-references causes the unbderlying algorithmic
   361 problem to become NP-hard \cite{Aho}, and is the main reason why
   362 problem to become NP-hard \cite{Aho}, and is the main reason why
   362 regular expression matchers suffer from the catastrophic backtracking
   363 regular expression matchers suffer from the catastrophic backtracking
   363 problem. However, the full generality of back-references are not
   364 problem. However, the full generality of back-references are not
   364 needed in practical applications. The goal therefore is to sufficiently
   365 needed in practical applications. The goal therefore is to sufficiently
   365 restrict the problem so that an efficient algorithm can be
   366 restrict the problem so that an efficient algorithm can be
   366 designed. There seem to be good indications that Brzozowski
   367 designed. There seem to be good indications that Brzozowski
   367 derivatives are useful for that.
   368 bderivatives are useful for that.
   368 
   369 
   369 Another problem is \emph{how} regular expressions match a string
   370 Another problem is \emph{how} regular expressions match a string
   370 \cite{Sulzmann2014}. In this case the algorithm does not just give a
   371 \cite{Sulzmann2014}. In this case the algorithm does not just give a
   371 yes/no answer about whether the regular expression matches the string,
   372 yes/no answer about whether the regular expression matches the string,
   372 but also generates a value that encodes which part of the string is
   373 but also generates a value that encodes which part of the string is
   388 Much research has already gone into regular expressions, and one might
   389 Much research has already gone into regular expressions, and one might
   389 think they have been studied and implemented to death. But this is far
   390 think they have been studied and implemented to death. But this is far
   390 from the truth. In fact regular expressions have become a ``hot''
   391 from the truth. In fact regular expressions have become a ``hot''
   391 research topic in recent years because of problems with the
   392 research topic in recent years because of problems with the
   392 traditional methods (in applications such as network traffic
   393 traditional methods (in applications such as network traffic
   393 analysis), but also because the technique of derivatives of regular
   394 analysis), but also because the technique of bderivatives of regular
   394 expression has been re-discovered. These derivatives provide an
   395 expression has been re-discovered. These bderivatives provide an
   395 alternative approach to regular expression matching. Their potential
   396 alternative approach to regular expression matching. Their potential
   396 lies in the fact that they are more flexible in implementations and
   397 lies in the fact that they are more flexible in implementations and
   397 much easier to manipulate in mathematical proofs. Therefore I like to
   398 much easier to manipulate in mathematical proofs. Therefore I like to
   398 research them in this project.\medskip
   399 research them in this project.\medskip
   399 
   400 
   418 \small
   419 \small
   419 \bibliographystyle{plain}
   420 \bibliographystyle{plain}
   420 \renewcommand{\refname}{References\\[-7mm]\mbox{}}
   421 \renewcommand{\refname}{References\\[-7mm]\mbox{}}
   421 \bibliography{proposal}
   422 \bibliography{proposal}
   422 
   423 
       
   424 
       
   425 \newpage
       
   426 \begin{center}\small 
       
   427 \begin{tabular}{lcl}
       
   428 $\textit{bder}\;c\;_{[]}(\ZERO)$ & $\dn$ & $_{[]}\ZERO$\\
       
   429 $\textit{bder}\;c\;_{bs}(\ONE)$  & $\dn$ & $_{[]}\ZERO$\\
       
   430 $\textit{bder}\;c\;_{bs}(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;_{bs}\ONE \; \textit{else} \;_{[]}\ZERO$\\
       
   431 $\textit{bder}\;c\;_{bs}(r_1 + r_2)$ & $\dn$ & $_{bs}(\textit{bder}\;c\;r_1 + \textit{bder}\;c\;r_2)$\\
       
   432 $\textit{bder}\;c\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
       
   433       & & $\textit{then}\;_{bs}(_{[]}(\textit{bder}\;c\;r_1)\cdot r_2) + (_{\textit{bmkeps}\,r_1\rightarrow}\textit{bder}\;c\;r_2)$\\
       
   434       & & $\textit{else}\;_{bs}(\textit{bder}\;c\;r_1)\cdot r_2$\\
       
   435 $\textit{bder}\;c\;_{bs}(r^*)$ & $\dn$ & $_{bs}(_{0\rightarrow}\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm]
       
   436 \end{tabular}
       
   437 \end{center}
       
   438 
       
   439 
       
   440 
       
   441 
   423 \end{document}
   442 \end{document}
   424 
   443 
   425 
   444 
   426 %%% Local Variables: 
   445 %%% Local Variables: 
   427 %%% mode: latex
   446 %%% mode: latex