cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:51ac1ab6e1fd
+:9c52c21b5db3
 exhibit this super-linear behaviour.  But unfortunately, such regular
 expressions are not just a few outliers. They are actually
 frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
-in the JavaScript and Python ecosystems \cite{Davis18}.
+in the JavaScript and Python ecosystems \cite{Davis18}. Static analysis
+approach that is both sound and complete exists\cite{17Bir}, but the running
+time on certain examples in the RegExLib and Snort regular expressions
+libraries is unacceptable. Therefore the problem of efficiency still remains.
 This superlinear blowup in matching algorithms sometimes causes
 considerable grief in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
 The $\textit{LHS}$ of the above equation is the bitcode we want. This
 would prove that our simplified version of regular expression still
 contains all the bitcodes needed. The task here is to find a way to
 compute the correct $v'$.
-The second task is to speed up the more aggressive simplification.
+The second task is to speed up the more aggressive simplification.  Currently
-Currently it is slower than the original naive simplification by Ausaf
+it is slower than the original naive simplification by Ausaf and Urban (the
-and Urban (the naive version as implemented by Ausaf   and Urban of
+naive version as implemented by Ausaf   and Urban of course can ``explode'' in
-course can ``explode'' in some cases). So it needs to be explored how to
+some cases).  It is therefore not surprising that the speed is also much slower
-make our algorithm faster on all inputs. Our possibility would be to
+than regular expression engines in popular programming languages such as Java
-explore again the connection to DFAs. This is very much work in
+and Python on most inputs that are linear. For example, just by rewriting the
-progress.
+example regular expression in the beginning of this report  $(a^*)^*\,b$ into
+$a^*\,b$ would eliminate the ambiguity in the matching and make the time
+for matching linear with respect to the input string size. This allows the
+DFA approach to become blindingly fast, and dwarf the speed of our current
+implementation. For example, here is a comparison of Java regex engine
+and our implementation on this example.
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n*1000$},
+x label style={at={(1.05,-0.05)}},
+ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=9,
+scaled ticks=true,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={Bitcoded Algorithm},
+legend pos=north west,
+legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
+\end{axis}
+\end{tikzpicture}
+&
+\begin{tikzpicture}
+\begin{axis}[
+xlabel={$n*1000$},
+x label style={at={(1.05,-0.05)}},
+%ylabel={time in secs},
+enlargelimits=false,
+xtick={0,5,...,30},
+xmax=33,
+ymax=9,
+scaled ticks=false,
+axis lines=left,
+width=5cm,
+height=4cm,
+legend entries={Java},
+legend pos=north west,
+legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings
+of the form $\underbrace{aa..a}_{n}$.}
+\end{tabular}
+\end{center}
+Java regex engine can match string of thousands of characters in a few milliseconds,
+whereas our current algorithm gets excruciatingly slow on input of this size.
+The running time in theory is linear, however it does not appear to be the
+case in an actual implementation. So it needs to be explored how to
+make our algorithm faster on all inputs.  It could be the recursive calls that are
+needed to manipulate bits that are causing the slow down. A possible solution
+is to write recursive functions into tail-recusive form.
+Another possibility would be to explore
+again the connection to DFAs to speed up the algorithm on
+subcalls that are small enough. This is very much work in progress.
 \section{Conclusion}
 In this PhD-project we are interested in fast algorithms for regular
 expression matching. While this seems to be a ``settled'' area, in

changeset 87	9c52c21b5db3
parent 86	51ac1ab6e1fd
child 88	bb1ff936e6cf