ninems/ninems.tex
changeset 87 9c52c21b5db3
parent 86 51ac1ab6e1fd
child 88 bb1ff936e6cf
--- a/ninems/ninems.tex	Thu Jul 25 21:02:06 2019 +0100
+++ b/ninems/ninems.tex	Sun Aug 18 22:19:46 2019 +0100
@@ -187,7 +187,10 @@
 frequent enough to have a separate name created for
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
-in the JavaScript and Python ecosystems \cite{Davis18}.
+in the JavaScript and Python ecosystems \cite{Davis18}. Static analysis
+approach that is both sound and complete exists\cite{17Bir}, but the running 
+time on certain examples in the RegExLib and Snort regular expressions
+libraries is unacceptable. Therefore the problem of efficiency still remains.
 
 This superlinear blowup in matching algorithms sometimes causes
 considerable grief in real life: for example on 20 July 2016 one evil
@@ -1274,13 +1277,76 @@
 contains all the bitcodes needed. The task here is to find a way to
 compute the correct $v'$.
 
-The second task is to speed up the more aggressive simplification.
-Currently it is slower than the original naive simplification by Ausaf
-and Urban (the naive version as implemented by Ausaf   and Urban of
-course can ``explode'' in some cases). So it needs to be explored how to
-make our algorithm faster on all inputs. Our possibility would be to
-explore again the connection to DFAs. This is very much work in
-progress.
+The second task is to speed up the more aggressive simplification.  Currently
+it is slower than the original naive simplification by Ausaf and Urban (the
+naive version as implemented by Ausaf   and Urban of course can ``explode'' in
+some cases).  It is therefore not surprising that the speed is also much slower
+than regular expression engines in popular programming languages such as Java
+and Python on most inputs that are linear. For example, just by rewriting the
+example regular expression in the beginning of this report  $(a^*)^*\,b$ into
+$a^*\,b$ would eliminate the ambiguity in the matching and make the time
+for matching linear with respect to the input string size. This allows the 
+DFA approach to become blindingly fast, and dwarf the speed of our current
+implementation. For example, here is a comparison of Java regex engine 
+and our implementation on this example.
+
+\begin{center}
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n*1000$},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=9,
+    scaled ticks=true,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={Bitcoded Algorithm},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
+\end{axis}
+\end{tikzpicture}
+  &
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n*1000$},
+    x label style={at={(1.05,-0.05)}},
+    %ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=9,
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={Java},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
+           of the form $\underbrace{aa..a}_{n}$.}
+\end{tabular}    
+\end{center}  
+
+
+Java regex engine can match string of thousands of characters in a few milliseconds,
+whereas our current algorithm gets excruciatingly slow on input of this size.
+The running time in theory is linear, however it does not appear to be the 
+case in an actual implementation. So it needs to be explored how to
+make our algorithm faster on all inputs.  It could be the recursive calls that are
+needed to manipulate bits that are causing the slow down. A possible solution
+is to write recursive functions into tail-recusive form.
+Another possibility would be to explore
+again the connection to DFAs to speed up the algorithm on 
+subcalls that are small enough. This is very much work in progress.
 
 \section{Conclusion}