\documentclass{article}
\usepackage{url}
\usepackage{color}\definecolor{darkblue}{rgb}{0,0,0.5}
\usepackage[a4paper,colorlinks=true,linkcolor=darkblue,citecolor=darkblue,filecolor=darkblue,pagecolor=darkblue,urlcolor=darkblue]{hyperref}
\usepackage{style}
\usepackage{langs}
\usepackage{tikz}
\usepackage{pgf}
\usepackage{pgfplots}
\usepackage{stackengine}

\addtolength{\oddsidemargin}{-6mm}
\addtolength{\evensidemargin}{-6mm}
\addtolength{\textwidth}{12mm}



%% \usepackage{accents}
\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}

\begin{filecontents}{re-python2.data}
1 0.033
5 0.036
10 0.034
15 0.036
18 0.059
19 0.084
20 0.141
21 0.248
22 0.485
23 0.878
24 1.71
25 3.40
26 7.08
27 14.12
28 26.69
\end{filecontents}

\begin{filecontents}{re-java.data}
5 0.00298
10 0.00418
15 0.00996
16 0.01710
17 0.03492
18 0.03303
19 0.05084
20 0.10177
21 0.19960
22 0.41159
23 0.82234
24 1.70251
25 3.36112
26 6.63998
27 13.35120
28 29.81185
\end{filecontents}

\begin{filecontents}{re-java9.data}
1000 0.01410
2000 0.04882
3000 0.10609
4000 0.17456
5000 0.27530
6000 0.41116
7000 0.53741
8000 0.70261
9000 0.93981
10000 0.97419
11000 1.28697
12000 1.51387
14000 2.07079
16000 2.69846
20000 4.41823
24000 6.46077
26000 7.64373
30000 9.99446
34000 12.966885
38000 16.281621
42000 19.180228
46000 21.984721
50000 26.950203
60000 43.0327746
\end{filecontents}


\begin{document}


\section*{Proposal:\\
Fast Regular Expression Matching\\
with Brzozowski Derivatives}

\subsubsection*{Summary}

|
If you want to connect a computer directly to the Internet, it must be
immediately \mbox{hardened} against outside attacks. The current
technology for this is to use regular expressions in order to
automatically scan all incoming network traffic for signs that a
computer is under attack and, if so, to take appropriate
counter-actions. One possible action could be to slow down the traffic
from sites where repeated password-guesses originate. Well-known
network intrusion prevention systems that use regular expressions for
traffic analysis are Snort and Bro.

|
Given the large
volume of Internet traffic even the smallest servers can handle
nowadays, the regular expressions for traffic analysis have become a
real bottleneck. This is not just a nuisance, but actually a security
vulnerability in itself: it can lead to denial-of-service attacks.
The proposed project aims to remove this bottleneck by using a little-known
technique of building derivatives of regular expressions. These
derivatives have not yet been used in the area of network traffic
analysis, but have the potential to solve some tenacious problems with
existing approaches. The project will require theoretical work, but
also a practical implementation (a proof-of-concept at least). The
work will build on earlier work by Ausaf and Urban from King's College
London \cite{AusafDyckhoffUrban2016}.


\subsubsection*{Background}

|
If every 10 minutes a thief turned up at your car and rattled
violently at the doorhandles to see if he could get in, you would move
your car to a safer neighbourhood. Computers, however, need to stay
connected to the Internet all the time and there they have to withstand such
attacks. A rather typical instance of an attack can be seen in the
following log entries from a server at King's:

{\small
\begin{verbatim}
Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]
Jan 2 00:58:31 talisker sshd: Received disconnect from 110.53.183.252: [preauth]
Jan 2 01:01:28 talisker sshd: Received disconnect from 221.194.47.236: [preauth]
Jan 2 01:03:59 talisker sshd: Received disconnect from 110.53.183.228: [preauth]
Jan 2 01:06:53 talisker sshd: Received disconnect from 221.194.47.245: [preauth]
...
\end{verbatim}}


|
\noindent
This is a record of the network activities from the server
\href{http://talisker.inf.kcl.ac.uk/cgi-bin/repos.cgi}{talisker.inf.kcl.ac.uk},
which hosts lecture material for students. The log indicates several
unsuccessful login attempts via the \texttt{ssh} program from
computers with the addresses \texttt{110.53.183.227} and so on. This
is a clear sign of a \emph{brute-force attack} where hackers
systematically try out password candidates. Such attacks are blunt,
but unfortunately very powerful. They can originate from anywhere in
the world; they are automated and often conducted from computers that
have been previously hijacked. Once the attacker ``hits'' on the
correct password, he or she gets access to the server. The only
way to prevent this methodical password-guessing is to spot the
corresponding traces in the log and then slow down the traffic from
such sites in a way that perhaps only one password per hour can be
tried out. This does not affect ``normal'' operations of the server,
but drastically decreases the chances of hitting the correct password
by a brute-force attack.


|
Server administrators use \emph{regular expressions} for scanning log
files. The purpose of these expressions is to specify patterns of
interest (for example \texttt{Received disconnect from 110.53.183.227}
where the address needs to be variable). A program will
then continuously scan for such patterns and trigger an action if a
pattern is found. Popular examples of such programs are Snort and Bro
\cite{snort,bro}. Clearly, there are also other kinds of
vulnerabilities hackers try to take advantage of---mainly programming
mistakes that can be abused to gain access to a computer. This means
server administrators have a suite of sometimes thousands of regular
expressions, all prescribing some suspicious pattern for when a
computer is under attack.


|
The underlying algorithmic problem is to decide whether a string in
the logs matches the patterns determined by the regular
expressions. Such decisions need to be made as fast as possible,
otherwise the scanning programs would not be useful as a hardening
technique for servers. The quest for speed with these decisions is
presently a rather big challenge, given the amount of traffic typical
servers have to cope with and the large number of patterns that might
be of interest. The problem is that when thousands of regular
expressions are involved, the process of detecting whether a string
matches a regular expression can be excruciatingly slow. This does not
happen in most instances, but in a small number of cases it does, and
in those cases the algorithm for regular expression matching can grind
to a halt. This phenomenon
is called \emph{catastrophic backtracking} \cite{catast}.
Unfortunately, catastrophic backtracking is not just a nuisance, but a
security vulnerability in itself. The reason is that attackers look
for these instances where the scan is very slow and then trigger a
\emph{denial-of-service attack} against the server\ldots meaning the
server will not be reachable anymore to normal users.

|
Existing approaches try to mitigate the speed problem with regular
expressions by implementing the matching algorithms in hardware,
rather than in software \cite{hardware}. Although this solution offers
speed, it is often unappealing because attack patterns can change or
need to be augmented. A software solution is clearly more desirable in
these circumstances. Also, given that regular expressions were
introduced in 1951 by Kleene, one might think regular expressions have
since been studied and implemented to death. But this would definitely
be a mistake: in fact they are still an active research area and
off-the-shelf implementations are still extremely prone to the
problem of catastrophic backtracking. This can be seen in the
following two graphs:

|
\begin{center}
\begin{tabular}{@{}cc@{}}
%\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings
% $\underbrace{a\ldots a}_{n}$}\smallskip\\
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.05,0.0)}},
  ylabel={time in secs},
  y label style={at={(0.06,0.5)}},
  enlargelimits=false,
  xtick={0,5,...,30},
  xmax=33,
  ymax=45,
  ytick={0,10,...,40},
  scaled ticks=false,
  axis lines=left,
  width=6cm,
  height=4.5cm,
  legend entries={Python, Java 8},
  legend pos=north west]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.05,0.0)}},
  ylabel={time in secs},
  y label style={at={(0.06,0.5)}},
  %enlargelimits=false,
  %xtick={0,5000,...,30000},
  xmax=65000,
  ymax=45,
  ytick={0,10,...,40},
  scaled ticks=false,
  axis lines=left,
  width=6cm,
  height=4.5cm,
  legend entries={Java 9},
  legend pos=north west]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}

|
\noindent
These graphs show how long strings can be when using the rather simple
regular expression $(a^*)^* \,b$ and the built-in regular expression
matchers in the popular programming languages Python and Java (Version
8 and Version 9). There are many more regular expressions that show
the same behaviour. On the left-hand side the graphs show that for a
string consisting of just 28 $a\,$'s, Python and the older Java (which
was the latest version of Java until September 2017) already need 30
seconds to decide whether this string is matched or not. This is an
abysmal result given that Python and Java are very popular programming
languages. We already know that this problem can be decided much, much
faster. If these slow regular expression matchers were employed
in network traffic analysis, then this example already provides an
opportunity for mounting an easy denial-of-service attack: one just
has to send to the server a large enough string of $a\,$'s. The newer
version of Java is better---it can match strings of around 50,000
$a\,$'s in 30 seconds---however this would still not be good enough
to be a useful tool for network traffic analysis. What is
interesting is that even a relatively simple-minded regular expression
matcher based on Brzozowski derivatives can out-compete the regular
expression matchers in Java and Python on such catastrophic
backtracking examples. This gain in speed is the motivation and
starting point for the proposed project. \smallskip
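\noindent
The measurements on the left-hand side can be reproduced with just a
few lines of Python using the built-in \texttt{re} module (a sketch;
absolute timings will of course vary from machine to machine):

```python
import re
import time

# The regular expression (a*)*b from the graphs above.  The input
# strings consist only of a's, so no match is possible and the
# backtracking matcher explores exponentially many ways of splitting
# the a's between the two starred subexpressions.
evil_re = re.compile(r'(a*)*b')

for n in [10, 15, 20, 22]:
    start = time.time()
    result = evil_re.match('a' * n)
    print(f'n = {n:2d}: matched = {result is not None}, '
          f'{time.time() - start:.3f} secs')
```

Each additional $a$ roughly doubles the running time, in line with the
measurements shown in the graph.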
|

\subsection*{Research Plan}

Regular expressions are already an old and well-studied topic in
Computer Science. They were originally introduced by Kleene in 1951
and are utilised in text search algorithms for ``find'' or ``find and
replace'' operations \cite{Brian,Kleene51}. Because of their wide
range of applications, many programming languages provide capabilities
for regular expression matching via libraries. The classic theory
of regular expressions involves just six different kinds of regular
expressions, often defined in terms of the following grammar:

\begin{center}\small
\begin{tabular}{lcll}
$r$ & $::=$ & $\ZERO$ & cannot match anything\\
    & $|$ & $\ONE$ & can only match the empty string\\
    & $|$ & $c$ & can match a single character (in this case $c$)\\
    & $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
    & $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
    & & & then the second part with $r_2$\\
    & $|$ & $r^*$ & can match $r$ zero or more times\\
\end{tabular}
\end{center}
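\noindent
This grammar translates directly into a small datatype. The following
Python sketch (the class names are my own choice, not taken from the
cited work) represents each of the six constructors:

```python
from dataclasses import dataclass

# One class per constructor of the grammar above.
@dataclass(frozen=True)
class ZERO:            # cannot match anything
    pass

@dataclass(frozen=True)
class ONE:             # can only match the empty string
    pass

@dataclass(frozen=True)
class CHAR:            # can match the single character c
    c: str

@dataclass(frozen=True)
class ALT:             # r1 + r2: match either with r1 or with r2
    r1: object
    r2: object

@dataclass(frozen=True)
class SEQ:             # r1 . r2: match a first part with r1, the rest with r2
    r1: object
    r2: object

@dataclass(frozen=True)
class STAR:            # r*: match r zero or more times
    r: object

# Example: the regular expression (a*)* . b from the graphs above
evil = SEQ(STAR(STAR(CHAR('a'))), CHAR('b'))
```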
|

\noindent
For practical purposes, however, regular expressions have been
extended in various ways in order to make them even more powerful and
even more useful for applications. Some of the problems to do with
catastrophic backtracking stem, unfortunately, from these extensions.

The crux of this project is to not use automata for regular expression
matching (the traditional technique), but to use derivatives of
regular expressions instead. These derivatives were introduced by
Brzozowski in 1964 \cite{Brzozowski1964}, but somehow had been lost
``in the sands of time'' \cite{Owens2009}. However, they have recently
become a ``hot'' research topic again, with numerous research papers
appearing in the last five years. One reason for this interest is that
Brzozowski derivatives straightforwardly extend to more general
regular expression constructors, which cannot be easily treated with
standard techniques. They can also be implemented with ease in any
functional programming language: the definition of derivatives
consists of just two simple recursive functions shown in
Figure~\ref{f}. Moreover, it can be easily shown (by a mathematical
proof) that the resulting regular expression matcher always produces
a correct answer. This is an area where Urban can provide a lot of
expertise for this project: he is one of the main developers of the
Isabelle theorem prover, which is a program designed for such proofs
\cite{Isabelle}.
|


\begin{figure}[t]
\begin{center}\small
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\medskip\\

$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
& & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
& & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\[-5mm]
\end{tabular}
\end{center}
\caption{The complete definition of derivatives of regular expressions
\cite{Brzozowski1964}.
This definition can be implemented with ease in functional programming
languages and can be easily extended to the regular expressions used in
practice. Derivatives are more flexible for applications, and easier to
manipulate in mathematical proofs, than traditional techniques for
regular expression matching.\medskip\label{f}}
\end{figure}
|

There are a number of avenues for research on Brzozowski derivatives.
One problem I would like to tackle immediately in this project is how to
handle \emph{back-references} in regular expressions. It is known that
the inclusion of back-references causes the underlying algorithmic
problem to become NP-hard \cite{Aho}, and they are the main reason why
regular expression matchers suffer from the catastrophic backtracking
problem. However, the full generality of back-references is not
needed in practical applications. The goal therefore is to sufficiently
restrict the problem so that an efficient algorithm can be
designed. There seem to be good indications that Brzozowski
derivatives are useful for that.
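For illustration, a back-reference (written \texttt{\textbackslash{}1}
in most matchers) refers back to whatever text an earlier group
matched, something no pure regular expression can express. The
following small example is my own and uses Python's \texttt{re}
module:

```python
import re

# The back-reference \1 matches a second copy of whatever the first
# group (\w+) matched -- for instance, the same password candidate
# tried twice in a row.
repeated = re.compile(r'\b(\w+) \1\b')

print(bool(repeated.search('guess hunter2 hunter2 guess')))  # True
print(bool(repeated.search('guess hunter2 letmein guess')))  # False
```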
|

Another problem is \emph{how} regular expressions match a string
\cite{Sulzmann2014}. In this case the algorithm does not just give a
yes/no answer about whether the regular expression matches the string,
but also generates a value that encodes which part of the string is
matched by which part of the regular expression. This is important in
applications like network traffic analysis, where a specific computer
address needs to be extracted from a matching log entry. Compilers also
make extensive use of such an extension when they lex their source
programs, i.e.~break up code into word-like components. Again, Ausaf
and Urban from King's made some initial progress towards proving the
correctness of such an extension, but their algorithm is not yet fast
enough for practical purposes \cite{AusafDyckhoffUrban2016}. It would
be desirable to study optimisations that make the algorithm faster,
but preserve the correctness guarantees obtained by Ausaf and Urban.
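Capture groups in today's matchers are a modest instance of this idea:
the matcher reports which substring was matched by which part of the
pattern. A small illustration of my own, using the log format shown
earlier:

```python
import re

# The bracketed group captures the part of the string matched by the
# address pattern, so the matcher returns not just 'yes' but also the
# address itself -- exactly what network traffic analysis needs.
log = 'Jan 2 00:53:19 talisker sshd: Received disconnect from 110.53.183.227: [preauth]'
m = re.search(r'Received disconnect from (\d+\.\d+\.\d+\.\d+)', log)
if m:
    print(m.group(1))  # 110.53.183.227
```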
|

\mbox{}\\[-11mm]

\subsubsection*{Conclusion}

Much research has already gone into regular expressions, and one might
think they have been studied and implemented to death. But this is far
from the truth. In fact regular expressions have become a ``hot''
research topic in recent years because of problems with the
traditional methods (in applications such as network traffic
analysis), but also because the technique of derivatives of regular
expressions has been re-discovered. These derivatives provide an
alternative approach to regular expression matching. Their potential
lies in the fact that they are more flexible in implementations and
much easier to manipulate in mathematical proofs. Therefore I would
like to research them in this project.\medskip
|

\noindent {\bf Impact:} Regular expression matching is a core
technology in Computer Science with numerous applications. I will
focus on the areas of network traffic analysis and lexical
analysis. In these areas this project can have a large impact:
for example the program Bro has been downloaded at least 10,000 times and Snort
even has 5 million downloads and over 600,000 registered users
\cite{snort,bro}. Lexical analysis is a core component of every
compiler, and compilers are the bedrock on which nearly all
programming is built.
|

% {\small
% \begin{itemize}
% \item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
% \item[$\bullet$] \url{https://vimeo.com/112065252}
% \item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}
% \end{itemize}}
\mbox{}\\[-11mm]

\small
\bibliographystyle{plain}
\renewcommand{\refname}{References\\[-7mm]\mbox{}}
\bibliography{proposal}

\end{document}


%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: