ninems/ninems.tex
changeset 87 9c52c21b5db3
parent 86 51ac1ab6e1fd
child 88 bb1ff936e6cf
equal deleted inserted replaced
86:51ac1ab6e1fd 87:9c52c21b5db3
   185 exhibit this super-linear behaviour.  But unfortunately, such regular
   185 exhibit this super-linear behaviour.  But unfortunately, such regular
   186 expressions are not just a few outliers. They are actually 
   186 expressions are not just a few outliers. They are actually 
   187 frequent enough to have a separate name created for
   187 frequent enough to have a separate name created for
   188 them---\emph{evil regular expressions}. In empiric work, Davis et al
   188 them---\emph{evil regular expressions}. In empiric work, Davis et al
   189 report that they have found thousands of such evil regular expressions
   189 report that they have found thousands of such evil regular expressions
   190 in the JavaScript and Python ecosystems \cite{Davis18}.
   190 in the JavaScript and Python ecosystems \cite{Davis18}. Static analysis
       
   191 approach that is both sound and complete exists\cite{17Bir}, but the running 
       
   192 time on certain examples in the RegExLib and Snort regular expressions
       
   193 libraries is unacceptable. Therefore the problem of efficiency still remains.
   191 
   194 
   192 This superlinear blowup in matching algorithms sometimes causes
   195 This superlinear blowup in matching algorithms sometimes causes
   193 considerable grief in real life: for example on 20 July 2016 one evil
   196 considerable grief in real life: for example on 20 July 2016 one evil
   194 regular expression brought the webpage
   197 regular expression brought the webpage
   195 \href{http://stackexchange.com}{Stack Exchange} to its
   198 \href{http://stackexchange.com}{Stack Exchange} to its
  1272 The $\textit{LHS}$ of the above equation is the bitcode we want. This
  1275 The $\textit{LHS}$ of the above equation is the bitcode we want. This
  1273 would prove that our simplified version of regular expression still
  1276 would prove that our simplified version of regular expression still
  1274 contains all the bitcodes needed. The task here is to find a way to
  1277 contains all the bitcodes needed. The task here is to find a way to
  1275 compute the correct $v'$.
  1278 compute the correct $v'$.
  1276 
  1279 
  1277 The second task is to speed up the more aggressive simplification.
  1280 The second task is to speed up the more aggressive simplification.  Currently
  1278 Currently it is slower than the original naive simplification by Ausaf
  1281 it is slower than the original naive simplification by Ausaf and Urban (the
  1279 and Urban (the naive version as implemented by Ausaf   and Urban of
  1282 naive version as implemented by Ausaf   and Urban of course can ``explode'' in
  1280 course can ``explode'' in some cases). So it needs to be explored how to
  1283 some cases).  It is therefore not surprising that the speed is also much slower
  1281 make our algorithm faster on all inputs. Our possibility would be to
  1284 than regular expression engines in popular programming languages such as Java
  1282 explore again the connection to DFAs. This is very much work in
  1285 and Python on most inputs that are linear. For example, just by rewriting the
  1283 progress.
  1286 example regular expression in the beginning of this report  $(a^*)^*\,b$ into
       
  1287 $a^*\,b$ would eliminate the ambiguity in the matching and make the time
       
  1288 for matching linear with respect to the input string size. This allows the 
       
  1289 DFA approach to become blindingly fast, and dwarf the speed of our current
       
  1290 implementation. For example, here is a comparison of Java regex engine 
       
  1291 and our implementation on this example.
       
  1292 
       
  1293 \begin{center}
       
  1294 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
       
  1295 \begin{tikzpicture}
       
  1296 \begin{axis}[
       
  1297     xlabel={$n*1000$},
       
  1298     x label style={at={(1.05,-0.05)}},
       
  1299     ylabel={time in secs},
       
  1300     enlargelimits=false,
       
  1301     xtick={0,5,...,30},
       
  1302     xmax=33,
       
  1303     ymax=9,
       
  1304     scaled ticks=true,
       
  1305     axis lines=left,
       
  1306     width=5cm,
       
  1307     height=4cm, 
       
  1308     legend entries={Bitcoded Algorithm},  
       
  1309     legend pos=north west,
       
  1310     legend cell align=left]
       
  1311 \addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data};
       
  1312 \end{axis}
       
  1313 \end{tikzpicture}
       
  1314   &
       
  1315 \begin{tikzpicture}
       
  1316 \begin{axis}[
       
  1317     xlabel={$n*1000$},
       
  1318     x label style={at={(1.05,-0.05)}},
       
  1319     %ylabel={time in secs},
       
  1320     enlargelimits=false,
       
  1321     xtick={0,5,...,30},
       
  1322     xmax=33,
       
  1323     ymax=9,
       
  1324     scaled ticks=false,
       
  1325     axis lines=left,
       
  1326     width=5cm,
       
  1327     height=4cm, 
       
  1328     legend entries={Java},  
       
  1329     legend pos=north west,
       
  1330     legend cell align=left]
       
  1331 \addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data};
       
  1332 \end{axis}
       
  1333 \end{tikzpicture}\\
       
  1334 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
       
  1335            of the form $\underbrace{aa..a}_{n}$.}
       
  1336 \end{tabular}    
       
  1337 \end{center}  
       
  1338 
       
  1339 
       
  1340 Java regex engine can match string of thousands of characters in a few milliseconds,
       
  1341 whereas our current algorithm gets excruciatingly slow on input of this size.
       
  1342 The running time in theory is linear, however it does not appear to be the 
       
  1343 case in an actual implementation. So it needs to be explored how to
       
  1344 make our algorithm faster on all inputs.  It could be the recursive calls that are
       
  1345 needed to manipulate bits that are causing the slow down. A possible solution
       
  1346 is to write recursive functions into tail-recusive form.
       
  1347 Another possibility would be to explore
       
  1348 again the connection to DFAs to speed up the algorithm on 
       
  1349 subcalls that are small enough. This is very much work in progress.
  1284 
  1350 
  1285 \section{Conclusion}
  1351 \section{Conclusion}
  1286 
  1352 
  1287 In this PhD-project we are interested in fast algorithms for regular
  1353 In this PhD-project we are interested in fast algorithms for regular
  1288 expression matching. While this seems to be a ``settled'' area, in
  1354 expression matching. While this seems to be a ``settled'' area, in