% !TEX program = pdflatex
\documentclass{article}
\usepackage{url}
\usepackage{color}
\definecolor{darkblue}{rgb}{0,0,0.5}
\usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
\usepackage{style}
%\usepackage{langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{stackengine}
\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{12mm}
%% \usepackage{accents}
\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5 0.00298
10 0.00418
15 0.00996
16 0.01710
17 0.03492
18 0.03303
19 0.05084
20 0.10177
21 0.19960
22 0.41159
23 0.82234
24 1.70251
25 3.36112
26 6.63998
27 13.35120
28 29.81185
\end{filecontents}

\begin{filecontents}{re-java9.data}
1000 0.01410
2000 0.04882
3000 0.10609
4000 0.17456
5000 0.27530
6000 0.41116
7000 0.53741
8000 0.70261
9000 0.93981
10000 0.97419
11000 1.28697
12000 1.51387
14000 2.07079
16000 2.69846
20000 4.41823
24000 6.46077
26000 7.64373
30000 9.99446
34000 12.966885
38000 16.281621
42000 19.180228
46000 21.984721
50000 26.950203
60000 43.0327746
\end{filecontents}

\begin{document}

\section*{Proposal:\\ Fast Regular Expression Matching\\
  with Brzozowski Derivatives}

\subsubsection*{Summary}

If you want to connect a computer directly to the Internet, it must be
immediately \mbox{hardened} against outside attacks. The current
technology for this is to use regular expressions in order to
automatically scan all incoming network traffic for signs that a
computer is under attack and, if found, to take appropriate
counter-actions. One possible action could be to slow down the traffic
from sites where repeated password-guesses originate.
Well-known network intrusion prevention systems that use regular
expressions for traffic analysis are Snort and Bro. Given the large
volume of Internet traffic even the smallest servers can handle
nowadays, the regular expressions for traffic analysis have become a
real bottleneck. This is not just a nuisance, but actually a security
vulnerability in itself: it can lead to denial-of-service attacks. The
proposed project aims to remove this bottleneck by using a
little-known technique of building derivatives of regular
expressions. These derivatives have not yet been used in the area of
network traffic analysis, but have the potential to solve some
tenacious problems with existing approaches. The project will require
theoretical work, but also a practical implementation (a
proof-of-concept at least). The work will build on earlier work by
Ausaf and Urban from King's College London
\cite{AusafDyckhoffUrban2016}.

\subsubsection*{Background}

If every 10 minutes a thief turned up at your car and rattled
violently at the door handles to see if he could get in, you would
move your car to a safer neighbourhood. Computers, however, need to
stay connected to the Internet all the time and there they have to
withstand such attacks. A rather typical instance of an attack can be
seen in the following log entries from a server at King's:

{\small\begin{verbatim}
Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]
Jan 2 00:58:31 talisker sshd: Received disconnect from 110.53.183.252: [preauth]
Jan 2 01:01:28 talisker sshd: Received disconnect from 221.194.47.236: [preauth]
Jan 2 01:03:59 talisker sshd: Received disconnect from 110.53.183.228: [preauth]
Jan 2 01:06:53 talisker sshd: Received disconnect from 221.194.47.245: [preauth]
...
\end{verbatim}}

\noindent
This is a record of the network activities from the server
\href{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi}{talisker.inf.kcl.ac.uk},
which hosts lecture material for students.
The log indicates several unsuccessful login attempts via the
\texttt{ssh} program from computers with the addresses
\texttt{110.53.183.227} and so on. This is a clear sign of a
\emph{brute-force attack} where hackers systematically try out
password candidates. Such attacks are blunt, but unfortunately very
powerful. They can originate from anywhere in the world; they are
automated and often conducted from computers that have been previously
hijacked. Once the attacker ``hits'' on the correct password, he or
she gets access to the server. The only way to prevent this methodical
password-guessing is to spot the corresponding traces in the log and
then slow down the traffic from such sites so that perhaps only one
password per hour can be tried out. This does not affect ``normal''
operations of the server, but drastically decreases the chances that a
brute-force attack hits the correct password.

Server administrators use \emph{regular expressions} for scanning log
files. The purpose of these expressions is to specify patterns of
interest (for example \texttt{Received disconnect from 110.53.183.227}
where the address needs to be variable). A program will then
continuously scan for such patterns and trigger an action if a pattern
is found. Popular examples of such programs are Snort and Bro
\cite{snort,bro}. Clearly, there are also other kinds of
vulnerabilities hackers try to take advantage of---mainly programming
mistakes that can be abused to gain access to a computer. This means
server administrators maintain a suite of sometimes thousands of
regular expressions, each prescribing some suspicious pattern for when
a computer is under attack.

The underlying algorithmic problem is to decide whether a string in
the logs matches the patterns determined by the regular
expressions. Such decisions need to be made as fast as possible,
otherwise the scanning programs would not be useful as a hardening
technique for servers.
The quest for speed with these decisions is presently a rather big
challenge, given the amount of traffic typical servers have to cope
with and the large number of patterns that might be of interest. The
problem is that when thousands of regular expressions are involved,
the process of detecting whether a string matches a regular expression
can be excruciatingly slow. This does not happen in most instances,
but in a small number of cases the algorithm for regular expression
matching can grind to a halt. This phenomenon is called
\emph{catastrophic backtracking} \cite{catast}. Unfortunately,
catastrophic backtracking is not just a nuisance, but a security
vulnerability in itself. The reason is that attackers look for these
instances where the scan is very slow and then trigger a
\emph{denial-of-service attack} against the server\ldots meaning the
server will no longer be reachable to normal users.

Existing approaches try to mitigate the speed problem with regular
expressions by implementing the matching algorithms in hardware,
rather than in software \cite{hardware}. Although this solution offers
speed, it is often unappealing because attack patterns can change or
need to be augmented. A software solution is clearly more desirable in
these circumstances. Also, given that regular expressions were
introduced in 1951 by Kleene, one might think regular expressions have
since been studied and implemented to death. But this would definitely
be a mistake: in fact they are still an active research area and
off-the-shelf implementations are still extremely prone to the
problem of catastrophic backtracking.
This can be seen in the following two graphs:

\begin{center}
\begin{tabular}{@{}cc@{}}
%\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings
%  $\underbrace{a\ldots a}_{n}$}\smallskip\\
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm,
    legend entries={Python, Java 8},
    legend pos=north west]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    y label style={at={(0.06,0.5)}},
    %enlargelimits=false,
    %xtick={0,5000,...,30000},
    xmax=65000,
    ymax=45,
    ytick={0,10,...,40},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=4.5cm,
    legend entries={Java 9},
    legend pos=north west]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}

\noindent
These graphs show the running time of the built-in regular expression
matchers in the popular programming languages Python and Java (Version
8 and Version 9) for the rather simple regular expression $(a^*)^*\,b$
and strings consisting of $n$ $a\,$'s. There are many more regular
expressions that show the same behaviour. The graph on the left shows
that for a string consisting of just 28 $a\,$'s, Python and the older
Java (which was the latest version of Java until September 2017)
already need 30 seconds to decide whether this string is matched or
not. This is an abysmal result given that Python and Java are very
popular programming languages. We already know that this problem can
be decided much, much faster.
If these slow regular expression matchers were employed in network
traffic analysis, then this example already provides an opportunity
for mounting an easy denial-of-service attack: one just has to send
the server a large enough string of $a$'s. The newer version of Java
is better---it can match strings of around 50,000 $a\,$'s in 30
seconds---however this would still not be good enough for a useful
tool for network traffic analysis. What is interesting is that even a
relatively simple-minded regular expression matcher based on
Brzozowski derivatives can out-compete the regular expression matchers
in Java and Python on such catastrophic backtracking examples. This
gain in speed is the motivation and starting point for the proposed
project. \smallskip

\subsection*{Research Plan}

Regular expressions are already an old and well-studied topic in
Computer Science. They were originally introduced by Kleene in 1951
and are utilised in text search algorithms for ``find'' or ``find and
replace'' operations \cite{Brian,Kleene51}. Because of their wide
range of applications, many programming languages provide capabilities
for regular expression matching via libraries. The classic theory of
regular expressions involves just six different kinds of regular
expressions, often defined in terms of the following grammar:

\begin{center}\small
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$        & cannot match anything\\
      & $|$   & $\ONE$         & can only match the empty string\\
      & $|$   & $c$            & can match a single character (in this case $c$)\\
      & $|$   & $r_1 + r_2$    & can match a string either with $r_1$ or with $r_2$\\
      & $|$   & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
      &       &                & then the second part with $r_2$\\
      & $|$   & $r^*$          & can match $r$ zero or more times\\
\end{tabular}
\end{center}

\noindent
For practical purposes, however, regular expressions have been
extended in various ways in order to make them even more powerful and
even more useful for applications.
Some of the problems to do with catastrophic backtracking stem,
unfortunately, from these extensions.

The crux of this project is not to use automata for regular expression
matching (the traditional technique), but to use derivatives of
regular expressions instead. These derivatives were introduced by
Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
``in the sands of time'' \cite{Owens2009}. However, they have recently
become a ``hot'' research topic again, with numerous research papers
appearing in the last five years. One reason for this interest is that
Brzozowski derivatives straightforwardly extend to more general
regular expression constructors, which cannot be easily treated with
standard techniques. They can also be implemented with ease in any
functional programming language: the definition of derivatives
consists of just two simple recursive functions shown in
Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
proof) that the resulting regular expression matcher always produces a
correct answer.
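To give an impression of how simple the two functions are, here is a
sketch of such a matcher in Python (the class names, the top-level
function \textit{matches} and the simplification function
\textit{simp} are our own additions for this illustration;
\textit{simp} performs the standard simplifications that keep the
derivatives small):

```python
from dataclasses import dataclass

# Regular expressions as a datatype, following the grammar above
# (the concrete class names are our own choice):
class Rexp:
    pass

@dataclass(frozen=True)
class ZERO(Rexp): pass            # matches nothing

@dataclass(frozen=True)
class ONE(Rexp): pass             # matches only the empty string

@dataclass(frozen=True)
class CHAR(Rexp):                 # matches the single character c
    c: str

@dataclass(frozen=True)
class ALT(Rexp):                  # r1 + r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class SEQ(Rexp):                  # r1 . r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class STAR(Rexp):                 # r*
    r: Rexp

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, ZERO): return False
    if isinstance(r, ONE):  return True
    if isinstance(r, CHAR): return False
    if isinstance(r, ALT):  return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, SEQ):  return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, STAR): return True

def der(c, r):
    """The Brzozowski derivative of r with respect to the character c."""
    if isinstance(r, (ZERO, ONE)):
        return ZERO()
    if isinstance(r, CHAR):
        return ONE() if r.c == c else ZERO()
    if isinstance(r, ALT):
        return ALT(der(c, r.r1), der(c, r.r2))
    if isinstance(r, SEQ):
        if nullable(r.r1):
            return ALT(SEQ(der(c, r.r1), r.r2), der(c, r.r2))
        return SEQ(der(c, r.r1), r.r2)
    if isinstance(r, STAR):
        return SEQ(der(c, r.r), STAR(r.r))

def simp(r):
    """Standard simplifications (0 + r = r, r + r = r, 1 . r = r,
    0 . r = 0); not part of Brzozowski's definition, but they keep
    the derivatives small, which is what makes the matcher fast."""
    if isinstance(r, ALT):
        r1, r2 = simp(r.r1), simp(r.r2)
        if isinstance(r1, ZERO): return r2
        if isinstance(r2, ZERO): return r1
        if r1 == r2:             return r1
        return ALT(r1, r2)
    if isinstance(r, SEQ):
        r1, r2 = simp(r.r1), simp(r.r2)
        if isinstance(r1, ZERO) or isinstance(r2, ZERO): return ZERO()
        if isinstance(r1, ONE): return r2
        if isinstance(r2, ONE): return r1
        return SEQ(r1, r2)
    return r

def matches(r, s):
    """r matches s iff the derivative of r w.r.t. all of s is nullable."""
    for c in s:
        r = simp(der(c, r))
    return nullable(r)

# The "evil" regular expression (a*)* . b from the graphs above:
evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
print(matches(evil, 'a' * 10000 + 'b'))   # True
print(matches(evil, 'a' * 10000))         # False
```

On the expression $(a^*)^*\,b$ the simplified derivatives quickly
reach a fixed point, so this sketch matches strings of 10,000 $a\,$'s
in a fraction of a second---in contrast to the built-in matchers of
Python and Java 8 shown in the graphs above.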
This is an area where Urban can provide a lot of expertise for this
project: he is one of the main developers of the Isabelle theorem
prover, which is a program designed for such proofs \cite{Isabelle}.

\begin{figure}[t]
\begin{center}\small
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$         & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$          & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$             & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$     & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$           & $\dn$ & $\textit{true}$\medskip\\
$\textit{der}\;c\;(\ZERO)$         & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$          & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$             & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$     & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$           & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}
\caption{The complete definition of derivatives of regular expressions
  \cite{Brzozowski1964}. This definition can be implemented with ease
  in functional programming languages and can be easily extended to
  the regular expressions used in practice. Derivatives are more
  flexible for applications and easier to manipulate in mathematical
  proofs than traditional techniques for regular expression
  matching.\medskip\label{f}}
\end{figure}

There are a number of avenues for research on Brzozowski derivatives.
One problem I would like to tackle immediately in this project is how
to handle \emph{back-references} in regular expressions.
It is known that the inclusion of back-references causes the
underlying algorithmic problem to become NP-hard \cite{Aho}, and they
are the main reason why regular expression matchers suffer from the
catastrophic backtracking problem. However, the full generality of
back-references is not needed in practical applications. The goal
therefore is to restrict the problem sufficiently so that an efficient
algorithm can be designed. There seem to be good indications that
Brzozowski derivatives are useful for that.

Another problem is \emph{how} regular expressions match a string
\cite{Sulzmann2014}. In this case the algorithm does not just give a
yes/no answer about whether the regular expression matches the string,
but also generates a value that encodes which part of the string is
matched by which part of the regular expression. This is important in
applications like network analysis, where a specific computer address
needs to be extracted from a matching log entry. Compilers also make
extensive use of such an extension when they lex their source
programs, i.e.~break up code into word-like components. Again, Ausaf
and Urban from King's made some initial progress on proving the
correctness of such an extension, but their algorithm is not yet fast
enough for practical purposes \cite{AusafDyckhoffUrban2016}. It would
be desirable to study optimisations that make the algorithm faster,
but preserve the correctness guarantees obtained by Ausaf and
Urban.\mbox{}\\[-11mm]

\subsubsection*{Conclusion}

Much research has already gone into regular expressions, and one might
think they have been studied and implemented to death. But this is far
from the truth. In fact regular expressions have become a ``hot''
research topic in recent years because of problems with the
traditional methods (in applications such as network traffic
analysis), but also because the technique of derivatives of regular
expressions has been re-discovered. These derivatives provide an
alternative approach to regular expression matching.
Their potential lies in the fact that they are more flexible in
implementations and much easier to manipulate in mathematical
proofs. Therefore I would like to research them in this
project.\medskip

\noindent {\bf Impact:} Regular expression matching is a core
technology in Computer Science with numerous applications. I will
focus on the areas of network traffic analysis and lexical
analysis. In these areas this project can have quite a large impact:
for example, the program Bro has been downloaded at least 10,000
times, and Snort even has 5 million downloads and over 600,000
registered users \cite{snort,bro}. Lexical analysis is a core
component of every compiler, and compilers are the bedrock on which
nearly all programming is built.

% {\small
% \begin{itemize}
% \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
% \item[$\bullet$] \url{https://vimeo.com/112065252}
% \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}
% \end{itemize}}

\mbox{}\\[-11mm]

\small
\bibliographystyle{plain}
\renewcommand{\refname}{References\\[-7mm]\mbox{}}
\bibliography{proposal}

\newpage
\begin{center}\small
\begin{tabular}{lcl}
$\textit{bder}\;c\;_{[]}(\ZERO)$ & $\dn$ & $_{[]}\ZERO$\\
$\textit{bder}\;c\;_{bs}(\ONE)$ & $\dn$ & $_{[]}\ZERO$\\
$\textit{bder}\;c\;_{bs}(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;_{bs}\ONE \; \textit{else} \;_{[]}\ZERO$\\
$\textit{bder}\;c\;_{bs}(r_1 + r_2)$ & $\dn$ & $_{bs}(\textit{bder}\;c\;r_1 + \textit{bder}\;c\;r_2)$\\
$\textit{bder}\;c\;_{bs}(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
 & & $\textit{then}\;_{bs}(_{[]}(\textit{bder}\;c\;r_1)\cdot r_2) + (_{\textit{bmkeps}\,r_1\rightarrow}\textit{bder}\;c\;r_2)$\\
 & & $\textit{else}\;_{bs}(\textit{bder}\;c\;r_1)\cdot r_2$\\
$\textit{bder}\;c\;_{bs}(r^*)$ & $\dn$ & $_{bs}(_{0\rightarrow}\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: