diff -r 06aa99b54423 -r fedc16924b76 text/proposal.tex --- a/text/proposal.tex Wed Sep 18 16:35:57 2019 +0100 +++ b/text/proposal.tex Sat Oct 24 12:13:39 2020 +0100 @@ -1,9 +1,10 @@ +% !TEX program = pfdlatex \documentclass{article} \usepackage{url} \usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5} \usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref} \usepackage{style} -\usepackage{langs} +%\usepackage{langs} \usepackage{tikz} \usepackage{pgf} \usepackage{pgfplots} @@ -11,12 +12,12 @@ \addtolength{\oddsidemargin}{-6mm} \addtolength{\evensidemargin}{-6mm} -\addtolength{\textwidth}{12mm} +\addtolength{\textwidth}{12mm} %% \usepackage{accents} -\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}} +\newcommand\barbelow[1]{\stackunbder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}} \begin{filecontents}{re-python2.data} 1 0.033 @@ -88,7 +89,7 @@ \section*{PPProposal: \\ Fast Regular Expression Matching\\ - with Brzozowski Derivatives} + with Brzozowski bderivatives} \subsubsection*{Summary} @@ -102,19 +103,19 @@ network intrusion prevention systems that use regular expressions for traffic analysis are Snort and Bro. -Given the large -volume of Internet traffic even the smallest servers can handle -nowadays, the regular expressions for traffic analysis have become a -real bottleneck. This is not just a nuisance, but actually a security -vulnerability in itself: it can lead to denial-of-service attacks. -The proposed project aims to remove this bottleneck by using a little-known -technique of building derivatives of regular expressions. These -derivatives have not yet been used in the area of network traffic -analysis, but have the potential to solve some tenacious problems with -existing approaches. The project will require theoretical work, but -also a practical implementation (a proof-of-concept at least). The -work will build on earlier work by Ausaf and Urban from King's College -London \cite{AusafDyckhoffUrban2016}. +Given the large volume of Internet traffic even the smallest servers +can handle nowadays, the regular expressions for traffic analysis have +become a real bottleneck. This is not just a nuisance, but actually a +security vulnerability in itself: it can lead to denial-of-service +attacks. The proposed project aims to remove this bottleneck by using +a little-known technique of building bderivatives of regular +expressions. These derivatives have not yet been used in the area of +network traffic analysis, but have the potential to solve some +tenacious problems with existing approaches. The project will require +theoretical work, but also a practical implementation (a +proof-of-concept at least). The work will build on earlier work by +Ausaf and Urban from King's College London +\cite{AusafDyckhoffUrban2016}. \subsubsection*{Background} @@ -168,10 +169,10 @@ mistakes that can be abused to gain access to a computer. This means server administrators have a suite of sometimes thousands of regular expressions, all prescribing some suspicious pattern for when a -computer is under attack. +computer is unbder attack. -The underlying algorithmic problem is to decide whether a string in +The unbderlying algorithmic problem is to decide whether a string in the logs matches the patterns determined by the regular expressions. Such decisions need to be done as fast as possible, otherwise the scanning programs would not be useful as a hardening @@ -207,7 +208,7 @@ \begin{center} \begin{tabular}{@{}cc@{}} %\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings -% $\underbrace{a\ldots a}_{n}$}\smallskip\\ +% $\unbderbrace{a\ldots a}_{n}$}\smallskip\\ \begin{tikzpicture} \begin{axis}[ xlabel={$n$}, @@ -259,7 +260,7 @@ matchers in the popular programming languages Python and Java (Version 8 and Version 9). There are many more regular expressions that show the same behaviour. On the left-hand side the graphs show that for a -string consisting of just 28 $a\,$'s, Python and the older Java (which +string consisting of just 28 $a\,$'s, Python and the olbder Java (which was the latest version of Java until September 2017) already need 30 seconds to decide whether this string is matched or not. This is an abysmal result given that Python and Java are very popular programming @@ -272,7 +273,7 @@ $a\,$'s in 30 seconds---however this would still not be good enough for being a useful tool for network traffic analysis. What is interesting is that even a relatively simple-minded regular expression -matcher based on Brzozowski derivatives can out-compete the regular +matcher based on Brzozowski bderivatives can out-compete the regular expressions matchers in Java and Python on such catastrophic backtracking examples. This gain in speed is the motivation and starting point for the proposed project. \smallskip @@ -302,21 +303,21 @@ \noindent For practical purposes, however, regular expressions have been -extended in various ways in order to make them even more powerful and +extended in various ways in orbder to make them even more powerful and even more useful for applications. Some of the problems to do with catastrophic backtracking stem, unfortunately, from these extensions. The crux in this project is to not use automata for regular expression -matching (the traditional technique), but to use derivatives of -regular expressions instead. These derivatives were introduced by +matching (the traditional technique), but to use bderivatives of +regular expressions instead. These bderivatives were introduced by Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost ``in the sands of time'' \cite{Owens2009}. However, they recently have become again a ``hot'' research topic with numerous research papers appearing in the last five years. One reason for this interest is that -Brzozowski derivatives straightforwardly extend to more general +Brzozowski bderivatives straightforwardly extend to more general regular expression constructors, which cannot be easily treated with standard techniques. They can also be implemented with ease in any -functional programming language: the definition for derivatives +functional programming language: the definition for bderivatives consists of just two simple recursive functions shown in Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical proof) that the resulting regular matcher is in fact always producing @@ -336,17 +337,17 @@ $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\ -$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ -$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ -$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ -$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\ -$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ - & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ - & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ -$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm] +$\textit{bder}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ +$\textit{bder}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ +$\textit{bder}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ +$\textit{bder}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{bder}\;c\;r_1) + (\textit{bder}\;c\;r_2)$\\ +$\textit{bder}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ + & & $\textit{then}\;((\textit{bder}\;c\;r_1)\cdot r_2) + (\textit{bder}\;c\;r_2)$\\ + & & $\textit{else}\;(\textit{bder}\;c\;r_1)\cdot r_2$\\ +$\textit{bder}\;c\;(r^*)$ & $\dn$ & $(\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm] \end{tabular} \end{center} -\caption{The complete definition of derivatives of regular expressions +\caption{The complete definition of bderivatives of regular expressions \cite{Brzozowski1964}. This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice. The are more flexible for applications and easier to manipulate in @@ -354,17 +355,17 @@ matching.\medskip\label{f}} \end{figure} -There are a number of avenues for research on Brzozowski derivatives. +There are a number of avenues for research on Brzozowski bderivatives. One problem I like to immediately tackle in this project is how to handle \emph{back-references} in regular expressions. It is known that -the inclusion of back-references causes the underlying algorithmic +the inclusion of back-references causes the unbderlying algorithmic problem to become NP-hard \cite{Aho}, and is the main reason why regular expression matchers suffer from the catastrophic backtracking problem. However, the full generality of back-references are not needed in practical applications. The goal therefore is to sufficiently restrict the problem so that an efficient algorithm can be designed. There seem to be good indications that Brzozowski -derivatives are useful for that. +bderivatives are useful for that. Another problem is \emph{how} regular expressions match a string \cite{Sulzmann2014}. In this case the algorithm does not just give a @@ -390,8 +391,8 @@ from the truth. In fact regular expressions have become a ``hot'' research topic in recent years because of problems with the traditional methods (in applications such as network traffic -analysis), but also because the technique of derivatives of regular -expression has been re-discovered. These derivatives provide an +analysis), but also because the technique of bderivatives of regular +expression has been re-discovered. These bderivatives provide an alternative approach to regular expression matching. Their potential lies in the fact that they are more flexible in implementations and much easier to manipulate in mathematical proofs. Therefore I like to @@ -420,6 +421,24 @@ \renewcommand{\refname}{References\\[-7mm]\mbox{}} \bibliography{proposal} + +\newpage +\begin{center}\small +\begin{tabular}{lcl} +$\textit{bder}\;c\;_{[]}(\ZERO)$ & $\dn$ & $_{[]}\ZERO$\\ +$\textit{bder}\;c\;_{bs}(\ONE)$ & $\dn$ & $_{[]}\ZERO$\\ +$\textit{bder}\;c\;_{bs}(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;_{bs}\ONE \; \textit{else} \;_{[]}\ZERO$\\ +$\textit{bder}\;c\;_{bs}(r_1 + r_2)$ & $\dn$ & $_{bs}(\textit{bder}\;c\;r_1 + \textit{bder}\;c\;r_2)$\\ +$\textit{bder}\;c\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ + & & $\textit{then}\;_{bs}(_{[]}(\textit{bder}\;c\;r_1)\cdot r_2) + (_{\textit{bmkeps}\,r_1\rightarrow}\textit{bder}\;c\;r_2)$\\ + & & $\textit{else}\;_{bs}(\textit{bder}\;c\;r_1)\cdot r_2$\\ +$\textit{bder}\;c\;_{bs}(r^*)$ & $\dn$ & $_{bs}(_{0\rightarrow}\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm] +\end{tabular} +\end{center} + + + + \end{document}