# HG changeset patch
# User Christian Urban
# Date 1474639625 -3600
# Node ID 3feaf8bc3e48214ec16e913ab584adb8817af7e2
# Parent 9eea20ad0caf2bae8c3b0ab7f2a535edb08a64be
updated

diff -r 9eea20ad0caf -r 3feaf8bc3e48 bsc-projects-16.html
--- a/bsc-projects-16.html Fri Sep 23 13:29:59 2016 +0100
+++ b/bsc-projects-16.html Fri Sep 23 15:07:05 2016 +0100
@@ -1,7 +1,7 @@
-2015/16 MSc Projects
+2016/17 BSc Projects
@@ -35,7 +35,7 @@

Email: christian dot urban at kcl dot ac dot uk, Office: Strand Building S1.27

If you are interested in a project, please send me an email and we can discuss details. Please include a short description about your programming skills and Computer Science background in your first email.
-I will also need your King's username in order to book the project for you. Thanks.
+Thanks.

Note that besides being a lecturer at the theoretical end of Computer Science, I am also a passionate hacker …
@@ -58,7 +58,9 @@
lexing programs, syntax highlighting and so on. Given that regular expressions were introduced in 1950 by Stephen Kleene, you might think regular expressions have since been studied and implemented to death. But you would definitely be
- mistaken: in fact they are still an active research area. For example
+ mistaken: in fact they are still an active research area. Off the top of my head, I can give
+ you several research papers that appeared in the last few years.
+ For example
this paper about regular expression matching and derivatives was presented just last summer at the international FLOPS'14 conference. The task in this project is to implement their results and use them for lexing.

@@ -72,13 +74,21 @@
innocent-looking regular expression a?{28}a{28} and match it, say, against the string aaaaaaaaaaaaaaaaaaaaaaaaaaaa (that is 28 a's), you will soon notice that your CPU usage goes to 100%. In fact, Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
- re.py (Python version) and
- re.rb
- (Ruby version). You can imagine an attacker
+ catastrophic.py (Python version) and
+ catastrophic.rb
+ (Ruby version). Here is a similar problem in Java: catastrophic.java
+
+
+ You can imagine an attacker
mounting a nice DoS attack against
- your program if it contains such an “evil” regular expression. Actually
- Scala (and also Java) are almost immune from such
- attacks as they can deal with strings of up to 4,300 a's in less than a second. But if you scale
+ your program if it contains such an “evil” regular expression. But it can also happen by accident:
+ on 20 July 2016 the website Stack Exchange
+ was knocked offline because of an evil regular expression. One of their engineers talks about it in this
+ video. A similar problem needed to be fixed in the
+ Atom editor.
+ A few implementations of regular expression matchers are almost immune from such problems.
+ For example, Scala can deal with strings of up to 4,300 a's in less than a second. But if you scale
the regular expression and string further to, say, 4,600 a's, then you get a StackOverflowError potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow) according to this report
@@ -96,7 +106,7 @@
derivatives of regular expressions. These derivatives were introduced in 1964 by Janusz Brzozowski, but according to this
- paper had been lost in the “sands of time”.
+ paper had been lost in the “sands of time”.
The advantage of derivatives is that they side-step completely the usual translations of regular expressions into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
@@ -124,7 +134,7 @@
Literature: The place to start with this project is obviously this paper
- and this one.
+ and this one.
Traditional methods for regular expression matching are explained in the Wikipedia articles here and
@@ -150,8 +160,8 @@
F#, ML, Haskell, etc. Python and other non-functional languages
- can be also used, but seem much less convenient. If you attend my Formal Languages and
- Automata module, that would obviously give you a head-start with this project.
+ can also be used, but seem much less convenient. If you attend my Compilers and Formal Languages
+ module, that would obviously give you a head-start with this project.
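As a rough illustration of the derivative idea mentioned in the project description, here is a minimal Python sketch of a Brzozowski-style matcher. It is not taken from the linked papers or from the catastrophic.py file: the names (`Zero`, `One`, `Chr`, `Alt`, `Seq`, `Star`, `alt`, `seq`, `der`, `matches`, `power`, `evil`) are assumptions made for this example, and the simplifications applied (dropping the empty-language case, flattening and de-duplicating alternatives) are just one simple choice among several. Python is used only for brevity; the project text itself recommends a functional language.

```python
from dataclasses import dataclass

# Regular expressions as immutable, hashable values (hypothetical names).
@dataclass(frozen=True)
class Zero: pass            # matches no string at all
@dataclass(frozen=True)
class One: pass             # matches only the empty string
@dataclass(frozen=True)
class Chr:
    c: str                  # matches the single character c
@dataclass(frozen=True)
class Alt:
    rs: frozenset           # set of alternatives, so duplicates collapse
@dataclass(frozen=True)
class Seq:
    r1: object
    r2: object
@dataclass(frozen=True)
class Star:
    r: object

ZERO, ONE = Zero(), One()

def alt(*rs):
    """Alternative with simplification: flatten, drop ZERO, de-duplicate."""
    flat = set()
    for r in rs:
        if isinstance(r, Alt):
            flat |= r.rs
        elif r != ZERO:
            flat.add(r)
    if not flat:
        return ZERO
    if len(flat) == 1:
        return next(iter(flat))
    return Alt(frozenset(flat))

def seq(r1, r2):
    """Sequence with simplification: ZERO annihilates, ONE is neutral."""
    if r1 == ZERO or r2 == ZERO:
        return ZERO
    if r1 == ONE:
        return r2
    if r2 == ONE:
        return r1
    return Seq(r1, r2)

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return any(nullable(x) for x in r.rs)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    return False            # Zero and Chr

def der(c, r):
    """Derivative of r with respect to c: what remains to be matched after c."""
    if isinstance(r, Chr):
        return ONE if r.c == c else ZERO
    if isinstance(r, Alt):
        return alt(*(der(c, x) for x in r.rs))
    if isinstance(r, Seq):
        left = seq(der(c, r.r1), r.r2)
        return alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return seq(der(c, r.r), r)
    return ZERO             # Zero and One

def matches(r, s):
    """Take one derivative per input character, then test for nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)

def power(r, n, tail=ONE):
    """r repeated n times, followed by tail (built right-nested)."""
    for _ in range(n):
        tail = seq(r, tail)
    return tail

def evil(n):
    """The regular expression a?{n}a{n} from the text, written out explicitly."""
    a = Chr("a")
    return power(alt(a, ONE), n, power(a, n))
```

With these definitions, `matches(evil(28), "a" * 28)` returns `True` and `matches(evil(28), "a" * 27)` returns `False`. No backtracking is involved: each character triggers exactly one derivative step, and for this particular example the de-duplication in `alt` keeps the intermediate regular expressions small.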

  • [CU2] A Compiler for a small Programming Language