diff -r 9eea20ad0caf -r 3feaf8bc3e48 bsc-projects-16.html
--- a/bsc-projects-16.html Fri Sep 23 13:29:59 2016 +0100
+++ b/bsc-projects-16.html Fri Sep 23 15:07:05 2016 +0100
@@ -1,7 +1,7 @@
-a?{28}a{28}
and match it, say, against the string
aaaaaaaaaaaaaaaaaaaaaaaaaaaa
(that is 28 a
s), you will soon notice that your CPU usage goes to 100%. In fact,
Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself:
- re.py (Python version) and
- re.rb
- (Ruby version). You can imagine an attacker
+ catastrophic.py (Python version) and
+ catastrophic.rb
+ (Ruby version). Here is a similar problem in Java: catastrophic.java
+
+
+
+ You can imagine an attacker
mounting a nice DoS attack against
- your program if it contains such an “evil” regular expression. Actually
- Scala (and also Java) are almost immune from such
- attacks as they can deal with strings of up to 4,300 a
s in less than a second. But if you scale
+ your program if it contains such an “evil” regular expression. But it can also happen by accident:
+ on 20 July 2016 the website Stack Exchange
+ was knocked offline because of an evil regular expression. One of their engineers talks about it in this
+ video. A similar problem needed to be fixed in the
+ Atom editor.
+ A few implementations of regular expression matchers are almost immune from such problems.
+ For example, Scala can deal with strings of up to 4,300 a
s in less than a second. But if you scale
the regular expression and string further to, say, 4,600 a
s, then you get a StackOverflowError
potentially crashing your program. Moreover (beside the "minor" problem of being painfully slow) according to this
report
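
Note on the hunk above: the slowdown it describes can be reproduced on a small scale. The following is only a sketch and not the contents of catastrophic.py, which is not shown in this diff. The function name time_evil_match is hypothetical, the pattern is written with explicit parentheses as (a?){n}a{n}, and n is kept well below 28 so that each run stays short.

```python
import re
import time


def time_evil_match(n):
    """Match (a?){n}a{n} against a string of n 'a's.

    Returns a pair (matched, seconds). The helper name and the
    parenthesised pattern are assumptions made for this sketch.
    """
    pattern = "(a?){%d}a{%d}" % (n, n)
    text = "a" * n
    start = time.perf_counter()
    matched = re.fullmatch(pattern, text) is not None
    elapsed = time.perf_counter() - start
    return matched, elapsed


if __name__ == "__main__":
    # Small values only: a backtracking matcher may explore a number of
    # alternatives that grows exponentially in n before it succeeds.
    for n in range(10, 19, 2):
        matched, seconds = time_evil_match(n)
        print(f"n={n:2d}  matched={matched}  time={seconds:.4f}s")
```

The string always matches (every a? can match the empty string), so only the reported time is of interest. How quickly it grows depends on the Python version and the machine.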
@@ -96,7 +106,7 @@
derivatives of regular expressions.
These derivatives were introduced in 1964 by
Janusz Brzozowski, but according to this
- paper had been lost in the “sands of time”.
+ paper had been lost in the “sands of time”.
The advantage of derivatives is that they side-step completely the usual
translations of regular expressions
into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular
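
Note on the hunk above: a matcher based on derivatives can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's reference implementation. All class and function names (Zero, One, Char, Alt, Seq, Star, NTimes, nullable, der, simp, matches, evil) are chosen for this sketch, and the NTimes constructor for bounded repetition r{n} is an addition that keeps the "evil" regular expression small. Python is used only to stay consistent with the Python example mentioned on the page.

```python
from dataclasses import dataclass


class Rexp:
    """Base class for regular expressions (names are illustrative)."""


@dataclass(frozen=True)
class Zero(Rexp):      # matches nothing
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Char(Rexp):      # matches a single character
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 | r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # r*
    r: Rexp

@dataclass(frozen=True)
class NTimes(Rexp):    # r{n}, an assumed extension for this sketch
    r: Rexp
    n: int


def nullable(r):
    """True if r can match the empty string."""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    return False  # Zero and Char


def der(c, r):
    """Derivative of r with respect to the character c."""
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = Seq(der(c, r.r1), r.r2)
        return Alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        return Zero() if r.n == 0 else Seq(der(c, r.r), NTimes(r.r, r.n - 1))
    return Zero()  # Zero and One


def simp(r):
    """Remove redundant Zero/One subterms so derivatives stay small."""
    if isinstance(r, Alt):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero():
            return b
        if b == Zero() or a == b:
            return a
        return Alt(a, b)
    if isinstance(r, Seq):
        a, b = simp(r.r1), simp(r.r2)
        if a == Zero() or b == Zero():
            return Zero()
        if a == One():
            return b
        if b == One():
            return a
        return Seq(a, b)
    return r


def matches(r, s):
    """Take the derivative for each character, then test nullability."""
    for c in s:
        r = simp(der(c, r))
    return nullable(r)


def evil(n):
    """(a?){n} a{n}, with a? encoded as (a | One)."""
    return Seq(NTimes(Alt(Char("a"), One()), n), NTimes(Char("a"), n))
```

For instance, matches(evil(28), "a" * 28) evaluates the example from the first hunk without any backtracking, because the matcher only ever rewrites the regular expression one character at a time.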
@@ -124,7 +134,7 @@
Literature:
The place to start with this project is obviously this
paper
- and this one.
+ and this one.
Traditional methods for regular expression matching are explained
in the Wikipedia articles
here and
@@ -150,8 +160,8 @@
F#,
ML,
Haskell, etc. Python and other non-functional languages
- can be also used, but seem much less convenient. If you attend my Formal Languages and
- Automata module, that would obviously give you a head-start with this project.
+ can be also used, but seem much less convenient. If you attend my Compilers and Formal Languages
+ module, that would obviously give you a head-start with this project.