|
1 % !TEX program = pfdlatex |
1 \documentclass{article} |
2 \documentclass{article} |
2 \usepackage{url} |
3 \usepackage{url} |
3 \usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5} |
4 \usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5} |
4 \usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref} |
5 \usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref} |
5 \usepackage{style} |
6 \usepackage{style} |
6 \usepackage{langs} |
7 %\usepackage{langs} |
7 \usepackage{tikz} |
8 \usepackage{tikz} |
8 \usepackage{pgf} |
9 \usepackage{pgf} |
9 \usepackage{pgfplots} |
10 \usepackage{pgfplots} |
10 \usepackage{stackengine} |
11 \usepackage{stackengine} |
11 |
12 |
12 \addtolength{\oddsidemargin}{-6mm} |
13 \addtolength{\oddsidemargin}{-6mm} |
13 \addtolength{\evensidemargin}{-6mm} |
14 \addtolength{\evensidemargin}{-6mm} |
14 \addtolength{\textwidth}{12mm} |
15 \addtolength{\textwidth}{12mm} |
15 |
16 |
16 |
17 |
17 |
18 |
18 %% \usepackage{accents} |
19 %% \usepackage{accents} |
19 \newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}} |
20 \newcommand\barbelow[1]{\stackunbder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}} |
20 |
21 |
21 \begin{filecontents}{re-python2.data} |
22 \begin{filecontents}{re-python2.data} |
22 1 0.033 |
23 1 0.033 |
23 5 0.036 |
24 5 0.036 |
24 10 0.034 |
25 10 0.034 |
100 counter-actions. One possible action could be to slow down the traffic |
101 counter-actions. One possible action could be to slow down the traffic |
101 from sites where repeated password-guesses originate. Well-known |
102 from sites where repeated password-guesses originate. Well-known |
102 network intrusion prevention systems that use regular expressions for |
103 network intrusion prevention systems that use regular expressions for |
103 traffic analysis are Snort and Bro. |
104 traffic analysis are Snort and Bro. |
104 |
105 |
105 Given the large |
106 Given the large volume of Internet traffic even the smallest servers |
106 volume of Internet traffic even the smallest servers can handle |
107 can handle nowadays, the regular expressions for traffic analysis have |
107 nowadays, the regular expressions for traffic analysis have become a |
108 become a real bottleneck. This is not just a nuisance, but actually a |
108 real bottleneck. This is not just a nuisance, but actually a security |
109 security vulnerability in itself: it can lead to denial-of-service |
109 vulnerability in itself: it can lead to denial-of-service attacks. |
110 attacks. The proposed project aims to remove this bottleneck by using |
110 The proposed project aims to remove this bottleneck by using a little-known |
111 a little-known technique of building bderivatives of regular |
111 technique of building derivatives of regular expressions. These |
112 expressions. These derivatives have not yet been used in the area of |
112 derivatives have not yet been used in the area of network traffic |
113 network traffic analysis, but have the potential to solve some |
113 analysis, but have the potential to solve some tenacious problems with |
114 tenacious problems with existing approaches. The project will require |
114 existing approaches. The project will require theoretical work, but |
115 theoretical work, but also a practical implementation (a |
115 also a practical implementation (a proof-of-concept at least). The |
116 proof-of-concept at least). The work will build on earlier work by |
116 work will build on earlier work by Ausaf and Urban from King's College |
117 Ausaf and Urban from King's College London |
117 London \cite{AusafDyckhoffUrban2016}. |
118 \cite{AusafDyckhoffUrban2016}. |
118 |
119 |
119 |
120 |
120 \subsubsection*{Background} |
121 \subsubsection*{Background} |
121 |
122 |
122 If every 10 minutes a thief turned up at your car and rattled |
123 If every 10 minutes a thief turned up at your car and rattled |
166 \cite{snort,bro}. Clearly, there are also other kinds of |
167 \cite{snort,bro}. Clearly, there are also other kinds of |
167 vulnerabilities hackers try to take advantage of---mainly programming |
168 vulnerabilities hackers try to take advantage of---mainly programming |
168 mistakes that can be abused to gain access to a computer. This means |
169 mistakes that can be abused to gain access to a computer. This means |
169 server administrators have a suite of sometimes thousands of regular |
170 server administrators have a suite of sometimes thousands of regular |
170 expressions, all prescribing some suspicious pattern for when a |
171 expressions, all prescribing some suspicious pattern for when a |
171 computer is under attack. |
172 computer is unbder attack. |
172 |
173 |
173 |
174 |
174 The underlying algorithmic problem is to decide whether a string in |
175 The unbderlying algorithmic problem is to decide whether a string in |
175 the logs matches the patterns determined by the regular |
176 the logs matches the patterns determined by the regular |
176 expressions. Such decisions need to be done as fast as possible, |
177 expressions. Such decisions need to be done as fast as possible, |
177 otherwise the scanning programs would not be useful as a hardening |
178 otherwise the scanning programs would not be useful as a hardening |
178 technique for servers. The quest for speed with these decisions is |
179 technique for servers. The quest for speed with these decisions is |
179 presently a rather big challenge for the amount of traffic typical |
180 presently a rather big challenge for the amount of traffic typical |
257 These graphs show how long strings can be when using the rather simple |
258 These graphs show how long strings can be when using the rather simple |
258 regular expression $(a^*)^* \,b$ and the built-in regular expression |
259 regular expression $(a^*)^* \,b$ and the built-in regular expression |
259 matchers in the popular programming languages Python and Java (Version |
260 matchers in the popular programming languages Python and Java (Version |
260 8 and Version 9). There are many more regular expressions that show |
261 8 and Version 9). There are many more regular expressions that show |
261 the same behaviour. On the left-hand side the graphs show that for a |
262 the same behaviour. On the left-hand side the graphs show that for a |
262 string consisting of just 28 $a\,$'s, Python and the older Java (which |
263 string consisting of just 28 $a\,$'s, Python and the olbder Java (which |
263 was the latest version of Java until September 2017) already need 30 |
264 was the latest version of Java until September 2017) already need 30 |
264 seconds to decide whether this string is matched or not. This is an |
265 seconds to decide whether this string is matched or not. This is an |
265 abysmal result given that Python and Java are very popular programming |
266 abysmal result given that Python and Java are very popular programming |
266 languages. We already know that this problem can be decided much, much |
267 languages. We already know that this problem can be decided much, much |
267 faster. If these slow regular expression matchers would be employed |
268 faster. If these slow regular expression matchers would be employed |
270 has to send to the server a large enough string of a's. The newer |
271 has to send to the server a large enough string of a's. The newer |
271 version of Java is better---it can match strings of around 50,000 |
272 version of Java is better---it can match strings of around 50,000 |
272 $a\,$'s in 30 seconds---however this would still not be good enough |
273 $a\,$'s in 30 seconds---however this would still not be good enough |
273 for being a useful tool for network traffic analysis. What is |
274 for being a useful tool for network traffic analysis. What is |
274 interesting is that even a relatively simple-minded regular expression |
275 interesting is that even a relatively simple-minded regular expression |
275 matcher based on Brzozowski derivatives can out-compete the regular |
276 matcher based on Brzozowski bderivatives can out-compete the regular |
276 expressions matchers in Java and Python on such catastrophic |
277 expressions matchers in Java and Python on such catastrophic |
277 backtracking examples. This gain in speed is the motivation and |
278 backtracking examples. This gain in speed is the motivation and |
278 starting point for the proposed project. \smallskip |
279 starting point for the proposed project. \smallskip |
279 |
280 |
280 \subsection*{Research Plan} |
281 \subsection*{Research Plan} |
300 \end{tabular} |
301 \end{tabular} |
301 \end{center} |
302 \end{center} |
302 |
303 |
303 \noindent |
304 \noindent |
304 For practical purposes, however, regular expressions have been |
305 For practical purposes, however, regular expressions have been |
305 extended in various ways in order to make them even more powerful and |
306 extended in various ways in orbder to make them even more powerful and |
306 even more useful for applications. Some of the problems to do with |
307 even more useful for applications. Some of the problems to do with |
307 catastrophic backtracking stem, unfortunately, from these extensions. |
308 catastrophic backtracking stem, unfortunately, from these extensions. |
308 |
309 |
309 The crux in this project is to not use automata for regular expression |
310 The crux in this project is to not use automata for regular expression |
310 matching (the traditional technique), but to use derivatives of |
311 matching (the traditional technique), but to use bderivatives of |
311 regular expressions instead. These derivatives were introduced by |
312 regular expressions instead. These bderivatives were introduced by |
312 Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost |
313 Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost |
313 ``in the sands of time'' \cite{Owens2009}. However, they recently have |
314 ``in the sands of time'' \cite{Owens2009}. However, they recently have |
314 become again a ``hot'' research topic with numerous research papers |
315 become again a ``hot'' research topic with numerous research papers |
315 appearing in the last five years. One reason for this interest is that |
316 appearing in the last five years. One reason for this interest is that |
316 Brzozowski derivatives straightforwardly extend to more general |
317 Brzozowski bderivatives straightforwardly extend to more general |
317 regular expression constructors, which cannot be easily treated with |
318 regular expression constructors, which cannot be easily treated with |
318 standard techniques. They can also be implemented with ease in any |
319 standard techniques. They can also be implemented with ease in any |
319 functional programming language: the definition for derivatives |
320 functional programming language: the definition for bderivatives |
320 consists of just two simple recursive functions shown in |
321 consists of just two simple recursive functions shown in |
321 Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical |
322 Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical |
322 proof) that the resulting regular matcher is in fact always producing |
323 proof) that the resulting regular matcher is in fact always producing |
323 a correct answer. This is an area where Urban can provide a lot of |
324 a correct answer. This is an area where Urban can provide a lot of |
324 expertise for this project: he is one of the main developers of the |
325 expertise for this project: he is one of the main developers of the |
334 $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
335 $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
335 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
336 $\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
336 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
337 $\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
337 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\ |
338 $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\ |
338 |
339 |
339 $\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ |
340 $\textit{bder}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ |
340 $\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ |
341 $\textit{bder}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ |
341 $\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ |
342 $\textit{bder}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ |
342 $\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\ |
343 $\textit{bder}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{bder}\;c\;r_1) + (\textit{bder}\;c\;r_2)$\\ |
343 $\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ |
344 $\textit{bder}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ |
344 & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ |
345 & & $\textit{then}\;((\textit{bder}\;c\;r_1)\cdot r_2) + (\textit{bder}\;c\;r_2)$\\ |
345 & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ |
346 & & $\textit{else}\;(\textit{bder}\;c\;r_1)\cdot r_2$\\ |
346 $\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm] |
347 $\textit{bder}\;c\;(r^*)$ & $\dn$ & $(\textit{bder}\;c\;r)\cdot (r^*)$\\[-5mm] |
347 \end{tabular} |
348 \end{tabular} |
348 \end{center} |
349 \end{center} |
349 \caption{The complete definition of derivatives of regular expressions |
350 \caption{The complete definition of bderivatives of regular expressions |
350 \cite{Brzozowski1964}. |
351 \cite{Brzozowski1964}. |
351 This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice. |
352 This definition can be implemented with ease in functional programming languages and can be easily extended to regular expressions used in practice. |
352 The are more flexible for applications and easier to manipulate in |
353 The are more flexible for applications and easier to manipulate in |
353 mathematical proofs, than traditional techniques for regular expression |
354 mathematical proofs, than traditional techniques for regular expression |
354 matching.\medskip\label{f}} |
355 matching.\medskip\label{f}} |
355 \end{figure} |
356 \end{figure} |
356 |
357 |
357 There are a number of avenues for research on Brzozowski derivatives. |
358 There are a number of avenues for research on Brzozowski bderivatives. |
358 One problem I like to immediately tackle in this project is how to |
359 One problem I like to immediately tackle in this project is how to |
359 handle \emph{back-references} in regular expressions. It is known that |
360 handle \emph{back-references} in regular expressions. It is known that |
360 the inclusion of back-references causes the underlying algorithmic |
361 the inclusion of back-references causes the unbderlying algorithmic |
361 problem to become NP-hard \cite{Aho}, and is the main reason why |
362 problem to become NP-hard \cite{Aho}, and is the main reason why |
362 regular expression matchers suffer from the catastrophic backtracking |
363 regular expression matchers suffer from the catastrophic backtracking |
363 problem. However, the full generality of back-references are not |
364 problem. However, the full generality of back-references are not |
364 needed in practical applications. The goal therefore is to sufficiently |
365 needed in practical applications. The goal therefore is to sufficiently |
365 restrict the problem so that an efficient algorithm can be |
366 restrict the problem so that an efficient algorithm can be |
366 designed. There seem to be good indications that Brzozowski |
367 designed. There seem to be good indications that Brzozowski |
367 derivatives are useful for that. |
368 bderivatives are useful for that. |
368 |
369 |
369 Another problem is \emph{how} regular expressions match a string |
370 Another problem is \emph{how} regular expressions match a string |
370 \cite{Sulzmann2014}. In this case the algorithm does not just give a |
371 \cite{Sulzmann2014}. In this case the algorithm does not just give a |
371 yes/no answer about whether the regular expression matches the string, |
372 yes/no answer about whether the regular expression matches the string, |
372 but also generates a value that encodes which part of the string is |
373 but also generates a value that encodes which part of the string is |
388 Much research has already gone into regular expressions, and one might |
389 Much research has already gone into regular expressions, and one might |
389 think they have been studied and implemented to death. But this is far |
390 think they have been studied and implemented to death. But this is far |
390 from the truth. In fact regular expressions have become a ``hot'' |
391 from the truth. In fact regular expressions have become a ``hot'' |
391 research topic in recent years because of problems with the |
392 research topic in recent years because of problems with the |
392 traditional methods (in applications such as network traffic |
393 traditional methods (in applications such as network traffic |
393 analysis), but also because the technique of derivatives of regular |
394 analysis), but also because the technique of bderivatives of regular |
394 expression has been re-discovered. These derivatives provide an |
395 expression has been re-discovered. These bderivatives provide an |
395 alternative approach to regular expression matching. Their potential |
396 alternative approach to regular expression matching. Their potential |
396 lies in the fact that they are more flexible in implementations and |
397 lies in the fact that they are more flexible in implementations and |
397 much easier to manipulate in mathematical proofs. Therefore I like to |
398 much easier to manipulate in mathematical proofs. Therefore I like to |
398 research them in this project.\medskip |
399 research them in this project.\medskip |
399 |
400 |