# HG changeset patch
# User Christian Urban
Email: christian dot urban at kcl dot ac dot uk, Office: Strand Building S1.27
If you are interested in a project, please send me an email and we can discuss details. Please include
a short description about your programming skills and Computer Science background in your first email.
-I will also need your King's username in order to book the project for you. Thanks.
+Thanks.
Note that besides being a lecturer at the theoretical end of Computer Science, I am also a passionate
hacker …
@@ -58,7 +58,9 @@
lexing programs, syntax highlighting and so on. Given that regular expressions were
introduced in 1950 by Stephen Kleene,
you might think regular expressions have since been studied and implemented to death. But you would definitely be
- mistaken: in fact they are still an active research area. For example
+ mistaken: in fact they are still an active research area. Off the top of my head, I can point
+ you to research papers that appeared in the last few years.
+ For example
this paper
about regular expression matching and derivatives was presented just last summer at the international
FLOPS'14 conference. The task in this project is to implement their results and use them for lexing.
a?{28}a{28}
and match it, say, against the string
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
(that is 28 a's), you will soon notice that your CPU usage goes to 100%. In fact,
Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
- re.py (Python version) and
- re.rb
- (Ruby version). You can imagine an attacker
+ catastrophic.py (Python version) and
+ catastrophic.rb
+ (Ruby version). Here is a similar problem in Java: catastrophic.java
+
+
+
+ You can imagine an attacker
mounting a nice DoS attack against
- your program if it contains such an “evil” regular expression. Actually
- Scala (and also Java) are almost immune from such
- attacks as they can deal with strings of up to 4,300 a's in less than a second. But if you scale
+ your program if it contains such an “evil” regular expression. But it can also happen by accident:
+ on 20 July 2016 the website Stack Exchange
+ was knocked offline because of an evil regular expression. One of their engineers talks about it in this
+ video. A similar problem needed to be fixed in the
+ Atom editor.
+ A few implementations of regular expression matchers are almost immune from such problems.
+ For example, Scala can deal with strings of up to 4,300 a's in less than a second. But if you scale
the regular expression and string further to, say, 4,600 a's, then you get a StackOverflowError,
potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow), according to this
report
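The blow-up described above can be reproduced with a few lines of Python. This is only a minimal sketch and not the contents of catastrophic.py: the function name `evil_match_time` is made up for illustration, and the pattern is written as `(a?){n}a{n}` with parentheses, on the assumption that Python's `re` module rejects a quantifier applied directly to another quantifier.

```python
import re
import time

def evil_match_time(n):
    """Match n a's against (a?){n}a{n}; return (matched, seconds taken).

    Hypothetical helper for illustration only. The parentheses around a?
    are an assumption made so that Python's re module accepts the pattern.
    """
    pattern = re.compile("(a?){%d}a{%d}" % (n, n))
    start = time.perf_counter()
    matched = pattern.fullmatch("a" * n) is not None
    return matched, time.perf_counter() - start
```

Calling `evil_match_time(n)` for growing `n` (say 10, 12, 14, ...) should show the running time rising steeply on a backtracking matcher, even though the string always matches. The exact timings depend on the machine and the Python version.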
@@ -96,7 +106,7 @@
derivatives of regular expressions.
These derivatives were introduced in 1964 by
Janusz Brzozowski, but according to this
- paper had been lost in the “sands of time”.
+ paper had been lost in the “sands of time”.
The advantage of derivatives is that they side-step completely the usual
translations of regular expressions
into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
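To give a feel for the idea, here is a minimal sketch of a matcher based on Brzozowski derivatives. It is not the algorithm from the papers mentioned here: it performs no simplification, so the derivatives can grow quickly, and all names (`Zero`, `One`, `Char`, `Alt`, `Seq`, `Star`, `nullable`, `der`, `matches`, `evil`) are chosen for illustration only.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for regular expressions (illustrative names only)."""

@dataclass(frozen=True)
class Zero(Rexp):      # matches no string at all
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Char(Rexp):      # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 | r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # zero or more repetitions of r
    r: Rexp

def nullable(r):
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Char)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    raise TypeError("unknown regular expression")

def der(c, r):
    """Derivative of r with respect to c: what remains to be matched after c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    raise TypeError("unknown regular expression")

def matches(r, s):
    """Take the derivative for each character, then test for the empty string."""
    for c in s:
        r = der(c, r)
    return nullable(r)

def evil(n):
    """Build (a?){n} a{n} as a nested Seq (hypothetical helper)."""
    r = One()
    for _ in range(n):
        r = Seq(Char("a"), r)
    for _ in range(n):
        r = Seq(Alt(Char("a"), One()), r)
    return r
```

For example, `matches(evil(3), "aaa")` succeeds while `matches(evil(3), "aa")` fails. Note that no automaton is built and no backtracking takes place. A practical version would also simplify each derivative to keep its size under control.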
@@ -124,7 +134,7 @@
Literature:
The place to start with this project is obviously this
paper
- and this one.
+ and this one.
Traditional methods for regular expression matching are explained
in the Wikipedia articles
here and
@@ -150,8 +160,8 @@
F#,
ML,
Haskell, etc. Python and other non-functional languages
- can also be used, but seem much less convenient. If you attend my Formal Languages and
- Automata module, that would obviously give you a head-start with this project.
+ can also be used, but seem much less convenient. If you attend my Compilers and Formal Languages
+ module, that would obviously give you a head-start with this project.