185 exhibit this super-linear behaviour. But unfortunately, such regular |
185 exhibit this super-linear behaviour. But unfortunately, such regular |
186 expressions are not just a few outliers. They are actually |
186 expressions are not just a few outliers. They are actually |
187 frequent enough to have a separate name created for |
187 frequent enough to have a separate name created for |
188 them---\emph{evil regular expressions}. In empiric work, Davis et al |
188 them---\emph{evil regular expressions}. In empiric work, Davis et al |
189 report that they have found thousands of such evil regular expressions |
189 report that they have found thousands of such evil regular expressions |
190 in the JavaScript and Python ecosystems \cite{Davis18}. |
190 in the JavaScript and Python ecosystems \cite{Davis18}. Static analysis |
|
191 approach that is both sound and complete exists\cite{17Bir}, but the running |
|
192 time on certain examples in the RegExLib and Snort regular expressions |
|
193 libraries is unacceptable. Therefore the problem of efficiency still remains. |
191 |
194 |
192 This superlinear blowup in matching algorithms sometimes causes |
195 This superlinear blowup in matching algorithms sometimes causes |
193 considerable grief in real life: for example on 20 July 2016 one evil |
196 considerable grief in real life: for example on 20 July 2016 one evil |
194 regular expression brought the webpage |
197 regular expression brought the webpage |
195 \href{http://stackexchange.com}{Stack Exchange} to its |
198 \href{http://stackexchange.com}{Stack Exchange} to its |
1272 The $\textit{LHS}$ of the above equation is the bitcode we want. This |
1275 The $\textit{LHS}$ of the above equation is the bitcode we want. This |
1273 would prove that our simplified version of regular expression still |
1276 would prove that our simplified version of regular expression still |
1274 contains all the bitcodes needed. The task here is to find a way to |
1277 contains all the bitcodes needed. The task here is to find a way to |
1275 compute the correct $v'$. |
1278 compute the correct $v'$. |
1276 |
1279 |
1277 The second task is to speed up the more aggressive simplification. |
1280 The second task is to speed up the more aggressive simplification. Currently |
1278 Currently it is slower than the original naive simplification by Ausaf |
1281 it is slower than the original naive simplification by Ausaf and Urban (the |
1279 and Urban (the naive version as implemented by Ausaf and Urban of |
1282 naive version as implemented by Ausaf and Urban of course can ``explode'' in |
1280 course can ``explode'' in some cases). So it needs to be explored how to |
1283 some cases). It is therefore not surprising that the speed is also much slower |
1281 make our algorithm faster on all inputs. Our possibility would be to |
1284 than regular expression engines in popular programming languages such as Java |
1282 explore again the connection to DFAs. This is very much work in |
1285 and Python on most inputs that are linear. For example, just by rewriting the |
1283 progress. |
1286 example regular expression in the beginning of this report $(a^*)^*\,b$ into |
|
1287 $a^*\,b$ would eliminate the ambiguity in the matching and make the time |
|
1288 for matching linear with respect to the input string size. This allows the |
|
1289 DFA approach to become blindingly fast, and dwarf the speed of our current |
|
1290 implementation. For example, here is a comparison of Java regex engine |
|
1291 and our implementation on this example. |
|
1292 |
|
1293 \begin{center} |
|
1294 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
1295 \begin{tikzpicture} |
|
1296 \begin{axis}[ |
|
1297 xlabel={$n*1000$}, |
|
1298 x label style={at={(1.05,-0.05)}}, |
|
1299 ylabel={time in secs}, |
|
1300 enlargelimits=false, |
|
1301 xtick={0,5,...,30}, |
|
1302 xmax=33, |
|
1303 ymax=9, |
|
1304 scaled ticks=true, |
|
1305 axis lines=left, |
|
1306 width=5cm, |
|
1307 height=4cm, |
|
1308 legend entries={Bitcoded Algorithm}, |
|
1309 legend pos=north west, |
|
1310 legend cell align=left] |
|
1311 \addplot[red,mark=*, mark options={fill=white}] table {bad-scala.data}; |
|
1312 \end{axis} |
|
1313 \end{tikzpicture} |
|
1314 & |
|
1315 \begin{tikzpicture} |
|
1316 \begin{axis}[ |
|
1317 xlabel={$n*1000$}, |
|
1318 x label style={at={(1.05,-0.05)}}, |
|
1319 %ylabel={time in secs}, |
|
1320 enlargelimits=false, |
|
1321 xtick={0,5,...,30}, |
|
1322 xmax=33, |
|
1323 ymax=9, |
|
1324 scaled ticks=false, |
|
1325 axis lines=left, |
|
1326 width=5cm, |
|
1327 height=4cm, |
|
1328 legend entries={Java}, |
|
1329 legend pos=north west, |
|
1330 legend cell align=left] |
|
1331 \addplot[cyan,mark=*, mark options={fill=white}] table {good-java.data}; |
|
1332 \end{axis} |
|
1333 \end{tikzpicture}\\ |
|
1334 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
1335 of the form $\underbrace{aa..a}_{n}$.} |
|
1336 \end{tabular} |
|
1337 \end{center} |
|
1338 |
|
1339 |
|
1340 Java regex engine can match string of thousands of characters in a few milliseconds, |
|
1341 whereas our current algorithm gets excruciatingly slow on input of this size. |
|
1342 The running time in theory is linear, however it does not appear to be the |
|
1343 case in an actual implementation. So it needs to be explored how to |
|
1344 make our algorithm faster on all inputs. It could be the recursive calls that are |
|
1345 needed to manipulate bits that are causing the slow down. A possible solution |
|
1346 is to write recursive functions into tail-recusive form. |
|
1347 Another possibility would be to explore |
|
1348 again the connection to DFAs to speed up the algorithm on |
|
1349 subcalls that are small enough. This is very much work in progress. |
1284 |
1350 |
1285 \section{Conclusion} |
1351 \section{Conclusion} |
1286 |
1352 |
1287 In this PhD-project we are interested in fast algorithms for regular |
1353 In this PhD-project we are interested in fast algorithms for regular |
1288 expression matching. While this seems to be a ``settled'' area, in |
1354 expression matching. While this seems to be a ``settled'' area, in |