<p>The background for this project is that some regular expressions are
“<A HREF="http://en.wikipedia.org/wiki/ReDoS#Examples">evil</A>”
and can “stab you in the back” according to
this <A HREF="http://peterscott.github.io/2013/01/17/regular-expressions-will-stab-you-in-the-back/">blog post</A>.
For example, if you take the innocent-looking regular expression <code>a?{28}a{28}</code> in
<A HREF="http://www.python.org">Python</A> or
<A HREF="http://www.ruby-lang.org/en/">Ruby</A> (or in a number of other mainstream programming languages)
and match it against, say, the string
<code>aaaaaaaaaaaaaaaaaaaaaaaaaaaa</code> (that is, 28 <code>a</code>s), you will soon notice that your CPU usage goes to 100%. In fact,
Python and Ruby need approximately 30 seconds of hard work to match this string. You can try it for yourself:
<A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/re.py">re.py</A> (Python version) and
<A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/re.rb">re.rb</A>
<A HREF="http://www.scala-lang.org/">Scala</A> (and also Java) are almost immune to such
attacks as they can deal with strings of up to 4,300 <code>a</code>s in less than a second. But if you scale
the regular expression and string further to, say, 4,600 <code>a</code>s, then you get a <code>StackOverflowError</code>,
potentially crashing your program. Moreover (besides the "minor" problem of being painfully slow), according to this
<A HREF="http://www.haskell.org/haskellwiki/Regex_Posix">report</A>,
nearly all regular expression matchers using the POSIX rules are actually buggy.
</p>
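<p>
To see the slowdown without waiting half a minute, the following minimal Python sketch times the match for smaller sizes. It is an illustration under stated assumptions and not the linked <code>re.py</code> script: the helper names <code>evil_regex</code> and <code>time_match</code> are made up here, and the optional <code>a</code> is written as the group <code>(a?)</code> because, as far as I know, Python's <code>re</code> module rejects <code>a?{n}</code> with a "multiple repeat" error.
</p>

```python
import re
import time


def evil_regex(n):
    # Build (a?){n}a{n}: n optional 'a's followed by n mandatory 'a's.
    # Assumption: the group around a? is added so that Python's re
    # module accepts the repeated optional.
    return re.compile("(a?){%d}a{%d}" % (n, n))


def time_match(n):
    # Match the pattern of size n against a string of n 'a's and
    # return whether it matched plus the elapsed wall-clock seconds.
    pattern = evil_regex(n)
    text = "a" * n
    start = time.perf_counter()
    matched = pattern.fullmatch(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: a backtracking matcher needs exponentially more
    # steps as n grows, so n = 28 can take a very long time.
    for n in range(10, 21, 2):
        matched, seconds = time_match(n)
        print(n, matched, round(seconds, 3))
```

<p>
With a backtracking matcher one would expect the time to roughly double each time <code>n</code> grows by one, which is the kind of growth behind the 30 seconds quoted above. The exact timings depend on the machine and the Python version.
</p>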

<p>
On a rainy afternoon, I implemented
<A HREF="http://www.dcs.kcl.ac.uk/staff/urbanc/cgi-bin/repos.cgi/afl-material/raw-file/tip/progs/re3.scala">this</A>
</p>

<p>
<B>Literature:</B>
The place to start with this project is obviously this
<A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/regex-parsing-derivatives.pdf">paper</A>
and this <A HREF="http://www.home.hs-karlsruhe.de/~suma0002/publications/ppdp12-part-deriv-sub-match.pdf">one</A>.
Traditional methods for regular expression matching are explained
in the Wikipedia articles
<A HREF="http://en.wikipedia.org/wiki/DFA_minimization">here</A> and
<A HREF="http://en.wikipedia.org/wiki/Powerset_construction">here</A>.
The authoritative <A HREF="http://infolab.stanford.edu/~ullman/ialc.html">book</A>