diff -r 51ac1ab6e1fd -r 9c52c21b5db3 ninems/ninems.tex --- a/ninems/ninems.tex Thu Jul 25 21:02:06 2019 +0100 +++ b/ninems/ninems.tex Sun Aug 18 22:19:46 2019 +0100 @@ -187,7 +187,10 @@ frequent enough to have a separate name created for them---\emph{evil regular expressions}. In empiric work, Davis et al report that they have found thousands of such evil regular expressions -in the JavaScript and Python ecosystems \cite{Davis18}. +in the JavaScript and Python ecosystems \cite{Davis18}. Static analysis +approach that is both sound and complete exists\cite{17Bir}, but the running +time on certain examples in the RegExLib and Snort regular expressions +libraries is unacceptable. Therefore the problem of efficiency still remains. This superlinear blowup in matching algorithms sometimes causes considerable grief in real life: for example on 20 July 2016 one evil @@ -1274,13 +1277,76 @@ contains all the bitcodes needed. The task here is to find a way to compute the correct $v'$. -The second task is to speed up the more aggressive simplification. -Currently it is slower than the original naive simplification by Ausaf -and Urban (the naive version as implemented by Ausaf and Urban of -course can ``explode'' in some cases). So it needs to be explored how to -make our algorithm faster on all inputs. Our possibility would be to -explore again the connection to DFAs. This is very much work in -progress. +The second task is to speed up the more aggressive simplification. Currently +it is slower than the original naive simplification by Ausaf and Urban (the +naive version as implemented by Ausaf and Urban of course can ``explode'' in +some cases). It is therefore not surprising that the speed is also much slower +than regular expression engines in popular programming languages such as Java +and Python on most inputs that are linear. For example, just by rewriting the +example regular expression in the beginning of this report $(a^*)^*\,b$ into +$a^*\,b$ would eliminate the ambiguity in the matching and make the time +for matching linear with respect to the input string size. This allows the +DFA approach to become blindingly fast, and dwarf the speed of our current +implementation. For example, here is a comparison of Java regex engine +and our implementation on this example. + +\begin{center} +\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n*1000$}, + x label style={at={(1.05,-0.05)}}, + ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=9, + scaled ticks=true, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Bitcoded Algorithm}, + legend pos=north west, + legend cell align=left] +\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data}; +\end{axis} +\end{tikzpicture} + & +\begin{tikzpicture} +\begin{axis}[ + xlabel={$n*1000$}, + x label style={at={(1.05,-0.05)}}, + %ylabel={time in secs}, + enlargelimits=false, + xtick={0,5,...,30}, + xmax=33, + ymax=9, + scaled ticks=false, + axis lines=left, + width=5cm, + height=4cm, + legend entries={Java}, + legend pos=north west, + legend cell align=left] +\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data}; +\end{axis} +\end{tikzpicture}\\ +\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings + of the form $\underbrace{aa..a}_{n}$.} +\end{tabular} +\end{center} + + +Java regex engine can match string of thousands of characters in a few milliseconds, +whereas our current algorithm gets excruciatingly slow on input of this size. +The running time in theory is linear, however it does not appear to be the +case in an actual implementation. So it needs to be explored how to +make our algorithm faster on all inputs. It could be the recursive calls that are +needed to manipulate bits that are causing the slow down. A possible solution +is to write recursive functions into tail-recusive form. +Another possibility would be to explore +again the connection to DFAs to speed up the algorithm on +subcalls that are small enough. This is very much work in progress. \section{Conclusion}