55 <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">this paper</A> |
55 <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">this paper</A> |
56 about regular expression matching and partial derivatives was presented this summer at the international |
56 about regular expression matching and partial derivatives was presented this summer at the international |
57 PPDP'12 conference. The task in this project is to implement the results from this paper.</p> |
57 PPDP'12 conference. The task in this project is to implement the results from this paper.</p> |
58 |
58 |
59 <p>The background for this project is that some regular expressions are |
59 <p>The background for this project is that some regular expressions are |
60 "<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>" |
60 “<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>” |
61 and can "stab you in the back" according to |
61 and can “stab you in the back” according to |
62 this <A HREF="http://tech.blog.cueup.com/regular-expressions-will-stab-you-in-the-back">blog post</A>. |
62 this <A HREF="http://tech.blog.cueup.com/regular-expressions-will-stab-you-in-the-back">blog post</A>. |
63 For example, if you use in <A HREF="http://www.python.org">Python</A> or |
63 For example, if you use in <A HREF="http://www.python.org">Python</A> or |
64 in <A HREF="http://www.ruby-lang.org/en/">Ruby</A> (probably also in other mainstream programming languages) the |
64 in <A HREF="http://www.ruby-lang.org/en/">Ruby</A> (probably also in other mainstream programming languages) the |
65 innocent-looking regular expression <code>a?{28}a{28}</code> and match it, say, against the string |
65 innocent-looking regular expression <code>a?{28}a{28}</code> and match it, say, against the string |
66 <code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact, |
66 <code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact, |
67 Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself: |
67 Python and Ruby need approximately 30 seconds of hard work for matching this string. You can try it for yourself: |
68 <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re.py">re.py</A> (Python version) and |
68 <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re.py">re.py</A> (Python version) and |
69 <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re-internal.rb">re.rb</A> |
69 <A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/re-internal.rb">re.rb</A> |
70 (Ruby version). You can imagine an attacker |
70 (Ruby version). You can imagine an attacker |
71 mounting a nice <A HREF="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS attack</A> against |
71 mounting a nice <A HREF="http://en.wikipedia.org/wiki/Denial-of-service_attack">DoS attack</A> against |
72 your program if it contains such an "evil" regular expression. Actually |
72 your program if it contains such an “evil” regular expression. Actually |
73 <A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune from such |
73 <A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune from such |
74 attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale |
74 attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale |
75 the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code> |
75 the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code> |
76 potentially crashing your program. |
76 potentially crashing your program. |
77 </p> |
77 </p> |
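<p>A minimal sketch of how the slowdown can be observed with Python's standard <code>re</code> module. The helper names <code>evil_pattern</code> and <code>time_match</code> are made up for this illustration, and the pattern is written out by plain repetition (<code>a?a?...aa...</code>) instead of using the <code>{28}</code> syntax, on the assumption that this is the most portable spelling. The sizes are kept small so the script finishes quickly; the time should grow sharply as <code>n</code> increases.</p>

```python
import re
import time


def evil_pattern(n):
    # n optional a's followed by n mandatory a's, written out explicitly
    return "a?" * n + "a" * n


def time_match(n):
    # Match a string of n a's against evil_pattern(n);
    # return (did it match, elapsed seconds).
    regex = re.compile(evil_pattern(n))
    text = "a" * n
    start = time.perf_counter()
    matched = regex.fullmatch(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Deliberately small sizes; larger n can take very long.
    for n in (8, 12, 16):
        matched, seconds = time_match(n)
        print(n, matched, f"{seconds:.4f}s")
```

<p>The string always matches (every optional <code>a</code> can be skipped), but a backtracking matcher may try a large number of ways of assigning the <code>a</code>s before it finds that out.</p>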
85 official matcher maxes out at 4,600 <code>a</code>s). My matcher is approximately |
85 official matcher maxes out at 4,600 <code>a</code>s). My matcher is approximately |
86 85 lines of code and based on the concept of |
86 85 lines of code and based on the concept of |
87 <A HREF="http://lambda-the-ultimate.org/node/2293">derivatives of regular expressions</A>. |
87 <A HREF="http://lambda-the-ultimate.org/node/2293">derivatives of regular expressions</A>. |
88 These derivatives were introduced in 1964 by <A HREF="http://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)"> |
88 These derivatives were introduced in 1964 by <A HREF="http://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)"> |
89 Janusz Brzozowski</A>, but according to this |
89 Janusz Brzozowski</A>, but according to this |
90 <A HREF="http://www.cl.cam.ac.uk/~so294/documents/jfp09.pdf">paper</A> had been lost in the "sands of time". |
90 <A HREF="http://www.cl.cam.ac.uk/~so294/documents/jfp09.pdf">paper</A> had been lost in the “sands of time”. |
91 The advantage of derivatives is that they side-step completely the usual |
91 The advantage of derivatives is that they side-step completely the usual |
92 <A HREF="http://hackingoff.com/compilers/regular-expression-to-nfa-dfa">translations</A> of regular expressions |
92 <A HREF="http://hackingoff.com/compilers/regular-expression-to-nfa-dfa">translations</A> of regular expressions |
93 into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular |
93 into NFAs or DFAs, which can introduce the exponential behaviour exhibited by the regular |
94 expression matchers in Python and Ruby. |
94 expression matchers in Python and Ruby. |
95 </p> |
95 </p> |
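<p>To give a flavour of the idea, here is a small sketch of a derivative-based matcher in Python. It is not the 85-line matcher mentioned above; all class and function names are chosen for this illustration, and the bounded-repetition constructor <code>NTimes</code> and the light simplification in <code>simp_alt</code> and <code>simp_seq</code> are assumptions made to keep the derivatives small. The derivative of a regular expression with respect to a character <code>c</code> describes what remains to be matched after <code>c</code> has been read; a string matches if the expression left at the end can match the empty string.</p>

```python
from dataclasses import dataclass


class Rexp:
    pass

@dataclass(frozen=True)
class Zero(Rexp):      # matches nothing
    pass

@dataclass(frozen=True)
class One(Rexp):       # matches only the empty string
    pass

@dataclass(frozen=True)
class Chr(Rexp):       # matches the single character c
    c: str

@dataclass(frozen=True)
class Alt(Rexp):       # r1 | r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Seq(Rexp):       # r1 followed by r2
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class Star(Rexp):      # r*
    r: Rexp

@dataclass(frozen=True)
class NTimes(Rexp):    # r{n}, exactly n copies of r
    r: Rexp
    n: int


def nullable(r):
    # Can r match the empty string?
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Chr)):
        return False
    if isinstance(r, Alt):
        return nullable(r.r1) or nullable(r.r2)
    if isinstance(r, Seq):
        return nullable(r.r1) and nullable(r.r2)
    if isinstance(r, NTimes):
        return r.n == 0 or nullable(r.r)
    raise TypeError(r)


def simp_alt(r1, r2):
    # Build r1 | r2, dropping Zero and an identical second alternative.
    if isinstance(r1, Zero):
        return r2
    if isinstance(r2, Zero) or r1 == r2:
        return r1
    return Alt(r1, r2)


def simp_seq(r1, r2):
    # Build r1 r2, using Zero r = Zero and One r = r.
    if isinstance(r1, Zero) or isinstance(r2, Zero):
        return Zero()
    if isinstance(r1, One):
        return r2
    if isinstance(r2, One):
        return r1
    return Seq(r1, r2)


def der(c, r):
    # Derivative of r with respect to the character c.
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Chr):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return simp_alt(der(c, r.r1), der(c, r.r2))
    if isinstance(r, Seq):
        left = simp_seq(der(c, r.r1), r.r2)
        return simp_alt(left, der(c, r.r2)) if nullable(r.r1) else left
    if isinstance(r, Star):
        return simp_seq(der(c, r.r), r)
    if isinstance(r, NTimes):
        if r.n == 0:
            return Zero()
        return simp_seq(der(c, r.r), NTimes(r.r, r.n - 1))
    raise TypeError(r)


def matches(r, s):
    # Take the derivative once per character, then test for nullability.
    for c in s:
        r = der(c, r)
    return nullable(r)


def evil(n):
    # The expression a?{n}a{n} from the text, built directly as a tree.
    return Seq(NTimes(Alt(Chr("a"), One()), n), NTimes(Chr("a"), n))
```

<p>With these definitions, <code>matches(evil(28), "a" * 28)</code> returns <code>True</code> without any backtracking. Note that the functions are recursive, so very large expressions would run into Python's recursion limit in this sketch.</p>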
104 Their approach is based on the concept of partial derivatives introduced in 1994 by |
104 Their approach is based on the concept of partial derivatives introduced in 1994 by |
105 <A HREF="http://reference.kfupm.edu.sa/content/p/a/partial_derivatives_of_regular_expressio_1319383.pdf">Valentin Antimirov</A>. |
105 <A HREF="http://reference.kfupm.edu.sa/content/p/a/partial_derivatives_of_regular_expressio_1319383.pdf">Valentin Antimirov</A>. |
106 I used them once myself in a <A HREF="http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf">paper</A> |
106 I used them once myself in a <A HREF="http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf">paper</A> |
107 in order to prove the <A HREF="http://en.wikipedia.org/wiki/Myhill–Nerode_theorem">Myhill-Nerode theorem</A>. |
107 in order to prove the <A HREF="http://en.wikipedia.org/wiki/Myhill–Nerode_theorem">Myhill-Nerode theorem</A>. |
108 So I know they are worth their money. Still, it would be interesting to actually compare their results |
108 So I know they are worth their money. Still, it would be interesting to actually compare their results |
109 with my simple rainy-afternoon matcher and potentially "blow away" the regular expression matchers |
109 with my simple rainy-afternoon matcher and potentially “blow away” the regular expression matchers |
110 in Python and Ruby (and possibly in Scala too). |
110 in Python and Ruby (and possibly in Scala too). |
111 </p> |
111 </p> |
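<p>For comparison, a rough sketch of matching with partial derivatives in Python. Instead of a single regular expression, a partial derivative is a <i>set</i> of regular expressions, so duplicates disappear automatically. The tuple encoding and all names below are assumptions made for this illustration; the code only answers yes or no and makes no attempt to reproduce the results of the paper.</p>

```python
# Regular expressions as nested tuples (an encoding chosen for this sketch).
ZERO = ("zero",)          # matches nothing
ONE = ("one",)            # matches only the empty string

def char(c):
    return ("chr", c)

def alt(r1, r2):
    return ("alt", r1, r2)

def seq(r1, r2):
    return ("seq", r1, r2)

def star(r):
    return ("star", r)


def nullable(r):
    # Can r match the empty string?
    tag = r[0]
    if tag in ("one", "star"):
        return True
    if tag in ("zero", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])   # seq


def seq_all(rs, r2):
    # Append r2 to every expression in the set rs; ONE followed by r2 is r2.
    return frozenset(r2 if r == ONE else seq(r, r2) for r in rs)


def pder(c, r):
    # Partial derivative of r with respect to c: a set of expressions.
    tag = r[0]
    if tag in ("zero", "one"):
        return frozenset()
    if tag == "chr":
        return frozenset([ONE]) if r[1] == c else frozenset()
    if tag == "alt":
        return pder(c, r[1]) | pder(c, r[2])
    if tag == "seq":
        left = seq_all(pder(c, r[1]), r[2])
        return left | pder(c, r[2]) if nullable(r[1]) else left
    return seq_all(pder(c, r[1]), r)           # star


def matches(r, s):
    # Track the set of expressions reachable after each character.
    states = frozenset([r])
    for c in s:
        states = frozenset().union(*(pder(c, q) for q in states))
    return any(nullable(q) for q in states)


def evil(n):
    # n optional a's followed by n mandatory a's (n >= 1), nested to the right.
    r = char("a")
    for _ in range(n - 1):
        r = seq(char("a"), r)
    for _ in range(n):
        r = seq(alt(char("a"), ONE), r)
    return r
```

<p>Here <code>matches(evil(28), "a" * 28)</code> returns <code>True</code>; the set of expressions tracked in <code>matches</code> plays the role of the current states of an NFA, without the NFA ever being built explicitly.</p>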
112 |
112 |
113 <p> |
113 <p> |
114 <B>Literature:</B> |
114 <B>Literature:</B> |
122 on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library). |
122 on automata and regular expressions is by John Hopcroft and Jeffrey Ullman (available in the library). |
123 There is also an online course about this topic by Ullman at |
123 There is also an online course about this topic by Ullman at |
124 <A HREF="https://www.coursera.org/course/automata">Coursera</A>, though IMHO not |
124 <A HREF="https://www.coursera.org/course/automata">Coursera</A>, though IMHO not |
125 done with love. |
125 done with love. |
126 Finally, there are millions of other pointers about regular expression |
126 Finally, there are millions of other pointers about regular expression |
127 matching on the Net. Test cases for "<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>" |
127 matching on the Net. Test cases for “<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>” |
128 regular expressions can be obtained from <A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">here</A>. |
128 regular expressions can be obtained from <A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">here</A>. |
129 </p> |
129 </p> |
130 |
130 |
131 <p> |
131 <p> |
132 <B>Skills:</B> |
132 <B>Skills:</B> |
293 <p> |
293 <p> |
294 <b>Description:</b> |
294 <b>Description:</b> |
295 This project is similar to [CU3]. The emphasis here, however, is on the |
295 This project is similar to [CU3]. The emphasis here, however, is on the |
296 implementation and comparison of register spilling algorithms, also often called register allocation |
296 implementation and comparison of register spilling algorithms, also often called register allocation |
297 algorithms. They are part of any respectable compiler. As said |
297 algorithms. They are part of any respectable compiler. As said |
298 in [CU3], however, my simple compiler lacks them and assumes an infinite amount of registers instead. |
298 in [CU3] my simple compiler lacks them and assumes an infinite amount of registers instead. |
299 Real CPUs however only provide a fixed amount of registers (for example |
299 Real CPUs however only provide a fixed amount of registers (for example |
300 x86-64 has 16 general purpose registers). Whenever a program needs |
300 x86-64 has 16 general purpose registers). Whenever a program needs |
301 to hold more values than registers, the values need to be “spilled” |
301 to hold more values than registers, the values need to be “spilled” |
302 into the main memory. Register spilling algorithms try to minimise |
302 into the main memory. Register spilling algorithms try to minimise |
303 this spilling, since fetching values from main memory is a costly |
303 this spilling, since fetching values from main memory is a costly |
381 computers or smart phones. |
381 computers or smart phones. |
382 </p> |
382 </p> |
383 |
383 |
384 <p> |
384 <p> |
385 However, there is one restriction that makes this project harder than it seems |
385 However, there is one restriction that makes this project harder than it seems |
386 as first sight. The department does not allow large server applications and databases |
386 at first sight: The department does not allow large server applications and databases |
387 to be run on calcium - the central server in the department. So the problem should be solved with as few resources needed |
387 to be run on calcium, which is the central server in the department. So the problem |
388 on the "back-end" which collects the votes. |
388 should be solved with as few resources as possible |
|
389 on the “back-end” collecting the votes. |
389 </p> |
390 </p> |
390 |
391 |
391 <p> |
392 <p> |
392 <B>Literature:</B> |
393 <B>Literature:</B> |
393 The project requires fluency in a web-programming language (for example |
394 The project requires fluency in a web-programming language (for example |