ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 601 ce4e5151a836
parent 600 fd068f39ac23
child 602 46db6ae66448
   149 
   149 
   150 
   150 
   151 \def\zeroable{\textit{zeroable}}
   151 \def\zeroable{\textit{zeroable}}
   152 \def\nub{\textit{nub}}
   152 \def\nub{\textit{nub}}
   153 \def\filter{\textit{filter}}
   153 \def\filter{\textit{filter}}
   154 \def\not{\textit{not}}
   154 %\def\not{\textit{not}}
   155 
   155 
   156 
   156 
   157 
   157 
   158 \def\RZERO{\mathbf{0}_r }
   158 \def\RZERO{\mathbf{0}_r }
   159 \def\RONE{\mathbf{1}_r}
   159 \def\RONE{\mathbf{1}_r}
   190 %This part is about regular expressions, Brzozowski derivatives,
   190 %This part is about regular expressions, Brzozowski derivatives,
   191 %and a bit-coded lexing algorithm with proven correctness and time bounds.
   191 %and a bit-coded lexing algorithm with proven correctness and time bounds.
   192 
   192 
   193 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   193 %TODO: look up snort rules to use here--give readers idea of what regexes look like
   194 
   194 
   195 \begin{figure}
   195 
   196 \centering
   196 
       
   197 
       
   198 
       
   199 
       
Regular expressions are widely used in computer science: 
be it in text editors \parencite{atomEditor} with syntax highlighting and auto-completion;
in command-line tools like $\mathit{grep}$ that facilitate easy 
text-processing; in network intrusion
detection systems that reject suspicious traffic; or in compiler
front ends.
The majority of the solutions to these tasks 
involve lexing with regular 
expressions.
Given their usefulness and ubiquity, one would imagine that
modern regular expression matching implementations
are mature and fully studied.
Indeed, if one supplies a regular expression and a string to the regex engine of
a popular programming language, one can
obtain rich matching information in a very short time.
Some network intrusion detection systems
use regex engines that are able to process 
megabytes or even gigabytes of data per second \parencite{Turo_ov__2020}.
Unfortunately, this is not the case for \textbf{all} inputs.
%TODO: get source for SNORT/BRO's regex matching engine/speed
       
   219 
       
   220 \begin{figure}[p]
   197 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   221 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   198 \begin{tikzpicture}
   222 \begin{tikzpicture}
   199 \begin{axis}[
   223 \begin{axis}[
   200     xlabel={$n$},
   224     xlabel={$n$},
   201     x label style={at={(1.05,-0.05)}},
   225     x label style={at={(1.05,-0.05)}},
   255     legend pos=north west,
   279     legend pos=north west,
   256     legend cell align=left]
   280     legend cell align=left]
   257 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   281 \addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
   258 \end{axis}
   282 \end{axis}
   259 \end{tikzpicture}\\
   283 \end{tikzpicture}\\
   260 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
   284 \begin{tikzpicture}
   261            of the form $\underbrace{aa..a}_{n}$.}
   285 \begin{axis}[
       
   286     xlabel={$n$},
       
   287     x label style={at={(1.05,-0.05)}},
       
   288     ylabel={time in secs},
       
   289     enlargelimits=false,
       
   290     xtick={0,5,...,30},
       
   291     xmax=33,
       
   292     ymax=35,
       
   293     ytick={0,5,...,30},
       
   294     scaled ticks=false,
       
   295     axis lines=left,
       
   296     width=5cm,
       
   297     height=4cm, 
       
   298     legend entries={Dart},  
       
   299     legend pos=north west,
       
   300     legend cell align=left]
       
   301 \addplot[green,mark=*, mark options={fill=white}] table {re-dart.data};
       
   302 \end{axis}
       
   303 \end{tikzpicture}
       
   304   &
       
   305 \begin{tikzpicture}
       
   306 \begin{axis}[
       
   307     xlabel={$n$},
       
   308     x label style={at={(1.05,-0.05)}},
       
   309     %ylabel={time in secs},
       
   310     enlargelimits=false,
       
   311     xtick={0,5,...,30},
       
   312     xmax=33,
       
   313     ymax=35,
       
   314     ytick={0,5,...,30},
       
   315     scaled ticks=false,
       
   316     axis lines=left,
       
   317     width=5cm,
       
   318     height=4cm, 
       
   319     legend entries={Swift},  
       
   320     legend pos=north west,
       
   321     legend cell align=left]
       
   322 \addplot[purple,mark=*, mark options={fill=white}] table {re-swift.data};
       
   323 \end{axis}
       
   324 \end{tikzpicture}
       
   325   & \\
       
   326 \multicolumn{3}{c}{Graphs}
   262 \end{tabular}    
   327 \end{tabular}    
   263 \caption{aStarStarb} \label{fig:aStarStarb}
   328 \caption{Graphs showing runtime for matching $(a^*)^*\,b$ with strings 
   264 \end{figure}
   329            of the form $\protect\underbrace{aa..a}_{n}$ in various existing regular expression libraries.
   265 
   The reason for their superlinear behaviour is that they use a depth-first search:
   if the string does not match, the engine backtracks and explores all remaining possibilities. 
   267 
   332 }\label{fig:aStarStarb}
   268 
   333 \end{figure}\afterpage{\clearpage}
   269 
       
   334 
   291 Take $(a^*)^*\,b$ and ask whether
   335 Take $(a^*)^*\,b$ and ask whether
   292 strings of the form $aa..a$ match this regular
   336 strings of the form $aa..a$ match this regular
   293 expression. Obviously this is not the case---the expected $b$ in the last
   337 expression. Obviously this is not the case---the expected $b$ in the last
   294 position is missing. One would expect that modern regular expression
   338 position is missing. One would expect that modern regular expression
length, say around 30 $a$'s, one discovers that 
this decision takes an unexpectedly long time to finish, given the simplicity of the problem.
This is clearly exponential behaviour, and 
is triggered by some relatively simple regex patterns, as the graphs
in Figure~\ref{fig:aStarStarb} show.
   302 
Java 9 and later versions
improve this behaviour, but the matching is still slow compared 
with the approach we are going to use.
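This behaviour is easy to reproduce with the backtracking regex facilities that ship with
mainstream languages. The following Scala snippet, which simply delegates to
\texttt{java.util.regex}, is a minimal sketch of the kind of measurement behind
Figure~\ref{fig:aStarStarb}; the exact numbers will of course depend on the JVM and
Java version used:
\begin{verbatim}
// Illustrative measurement sketch (not the benchmark code used for the graphs).
// Time how long the JDK's backtracking matcher takes on (a*)*b for strings
// of n a's; these can never match, since the final b is missing.
object CatastrophicBacktracking {
  def seconds[A](block: => A): Double = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1e9
  }

  def main(args: Array[String]): Unit = {
    for (n <- 5 to 30 by 5) {
      val input = "a" * n
      val t = seconds { input.matches("(a*)*b") }  // uses java.util.regex
      println(f"n = $n%2d   time = $t%.3f s")
    }
  }
}
\end{verbatim}
The running time grows roughly by a constant factor with each additional $a$, which is
the exponential growth visible in the graphs.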
   305 
   349 
\ChristianComment{Superlinear: I just leave out the explanation, 
which I find would distract from the flow if used. Plus, if I just say exponential
here, the 2016 StackExchange event was not exponential but just quadratic, so that would be 
inaccurate.}
       
   310 
   352 
This superlinear blowup in regular expression engines
has repeatedly caused grief in real life.
   313 For example, on 20 July 2016 one evil
   355 For example, on 20 July 2016 one evil
   314 regular expression brought the webpage
   356 regular expression brought the webpage
   328 2019. A poorly written regular expression exhibited exponential
   370 2019. A poorly written regular expression exhibited exponential
behaviour and exhausted the CPUs that serve HTTP traffic. Although the outage
had several causes, at its heart was a regular expression that
was used to monitor network
traffic.\footnote{\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/} (Last accessed in 2022)}
   333 %TODO: data points for some new versions of languages
       
These problems with regular expressions 
are not isolated events that happen
very occasionally, but are actually widespread.
They occur so often that they have been given a 
name: Regular Expression Denial of Service (ReDoS)
They therefore concluded that evil regular expressions
are ``more than a parlour trick'': they are a problem that
requires
more research attention.
   347 
   388 
   348 
Regular expressions and regular expression matchers 
have of course been studied for many, many years.
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher~\cite{Verbatim}.
This is relevant work with which we will later compare
the derivative-based matcher we are going to present.
There is also newer work called
Verbatim++~\cite{Verbatimpp}, which does not use derivatives but automata instead.
For such an automata-based approach, a problem is posed by bounded repetitions ($r^{n}$,
$r^{\ldots m}$, $r^{n\ldots}$ and $r^{n\ldots m}$).
These occur often in practical use, for example in the Snort XML definitions.
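To give a concrete sense of these constructors, writing $L(r)$ for the language of a
regular expression $r$, a bounded repetition simply constrains how many times its body
may be iterated; for instance
\[
	L(a^{3}) = \{aaa\}, \qquad
	L(a^{2\ldots 4}) = \{aa,\; aaa,\; aaaa\}.
\]
Unfolding such counters into explicit copies of the body, as one would have to do with
the basic constructors alone, blows up the size of the expression (think of $a^{1000}$),
which is why they deserve special treatment.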
       
   400 
       
   401 
       
The problems are not limited to slowness on certain 
inputs.
Another problem with these libraries is that they
offer no correctness guarantee.
In some cases, they either fail to produce a lexing result when a match exists,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex
\begin{center}
	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
       
The correct $\POSIX$ match for the above is 
the entire string $ababa$, 
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
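Spelled out, the POSIX decomposition of this match is
\[
	ababa \;=\; \underbrace{ab}_{[0,2)} \; \underbrace{aba}_{[2,5)},
\]
where the first Kleene star iteration matches $ab$ and the second matches $aba$.
The only other way of splitting the whole string into iterations of $(aba + ab + a)$
is $[ab][ab][a]$, which POSIX rejects because the second iteration is required to
match as much of the remaining input as possible (maximum munch).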
   373 
   421 
Kuklewicz \parencite{KuklewiczHaskell} commented that most regex libraries do not
correctly implement the POSIX (maximum-munch)
rule of regular expression matching.
   377 
   425 
As Grathwohl \parencite{grathwohl2014crash} wrote,
   379 \begin{center}
   427 \begin{quote}
   380 	``The POSIX strategy is more complicated than the greedy because of the dependence on information about the length of matched strings in the various subexpressions.''
   428 	The POSIX strategy is more complicated than the 
   381 \end{center}
   429 	greedy because of the dependence on information about 
   382 
   430 	the length of matched strings in the various subexpressions.
   383 
   431 \end{quote}
       
   432 %\noindent
To summarise the above, regular expressions are important.
They are popular, and the regex library functions of programming languages
are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second phase is the lexing phase, in which the input string 
$s$ is read and the data structure
representing the regex $r$ is operated on. 
We represent the time
it takes by $P_2(r, s)$.\\
   427 
       
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps, 
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later. 
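A classic instance of this exponential blow-up (not necessarily the example used later)
is the family of expressions
\[
	(a + b)^* \, a \, \underbrace{(a + b) \cdots (a + b)}_{n-1}
\]
which match exactly the strings over $\{a, b\}$ whose $n$-th-last character is an $a$.
Intuitively, a DFA has to remember the last $n$ characters it has read, so it needs at
least $2^{n}$ states, while the size of the expression only grows linearly in $n$.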
   438 
       
   439 
       
   440 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   486 For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
   441 expressions like $r^n$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$.
   487 expressions like $r^n$ into 
       
   488 \[
       
	\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}.
       
   490 \]
$P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound blows up
to $|r|\cdot 2^{|s|}$.
   445 %on the input
   494 %on the input
   446 %And when calculating the time complexity of the matching algorithm,
   495 %And when calculating the time complexity of the matching algorithm,
   447 %we are assuming that each input reading step requires constant time.
   496 %we are assuming that each input reading step requires constant time.
   448 %which translates to that the number of 
   497 %which translates to that the number of 
   449 %states active and transitions taken each time is bounded by a
   498 %states active and transitions taken each time is bounded by a
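To make the $|r|\cdot|s|$ bound more concrete, the following Scala sketch simulates an
NFA breadth-first: it reads the input only once and keeps a \emph{set} of currently
active states, so each input character touches at most as many states as the automaton
has. The automaton below is a hand-written toy NFA for the language of $(a^*)^*\,b$
(which is just $a^*b$); it is only meant as an illustration and, for simplicity, has no
$\epsilon$-transitions:
\begin{verbatim}
// Breadth-first NFA simulation: read the input once, keeping a set of
// active states.  Each of the |s| input characters touches at most |Q|
// states, giving the |r| * |s| bound; no backtracking is involved.
object NfaSim {
  type State = Int

  def step(delta: Map[(State, Char), Set[State]],
           states: Set[State], c: Char): Set[State] =
    states.flatMap(q => delta.getOrElse((q, c), Set.empty[State]))

  def accepts(delta: Map[(State, Char), Set[State]],
              start: Set[State], fin: Set[State], s: String): Boolean =
    s.foldLeft(start)((qs, c) => step(delta, qs, c)).exists(fin.contains)

  def main(args: Array[String]): Unit = {
    // toy NFA for the language of (a*)*b, i.e. a*b (hand-written for
    // illustration, not derived from the regular expression):
    // state 0 loops on 'a' and moves to the accepting state 1 on 'b'
    val delta = Map((0, 'a') -> Set(0), (0, 'b') -> Set(1))
    println(accepts(delta, Set(0), Set(1), "a" * 1000 + "b")) // true
    println(accepts(delta, Set(0), Set(1), "a" * 1000))       // false, still fast
  }
}
\end{verbatim}
A backtracking matcher, in contrast, explores one alternative at a time and may revisit
the same input position exponentially often, which is where the $|r|\cdot 2^{|s|}$ bound
above comes from.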
   636 % Java and Python both support back-references, but shows
   685 % Java and Python both support back-references, but shows
   637 %catastrophic backtracking behaviours on inputs without back-references(
   686 %catastrophic backtracking behaviours on inputs without back-references(
   638 %when the language is still regular).
   687 %when the language is still regular).
   639  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
   688  %TODO: test performance of Rust on (((((a*a*)b*)b){20})*)c  baabaabababaabaaaaaaaaababaaaababababaaaabaaabaaaaaabaabaabababaababaaaaaaaaababaaaababababaaaaaaaaaaaaac
   640  %TODO: verify the fact Rust does not allow 1000+ reps
   689  %TODO: verify the fact Rust does not allow 1000+ reps
   641 \ChristianComment{Comment required: Java 17 updated graphs? Is it ok to still use Java 8 graphs?}
       
   642 
   690 
   643 
   691 
So we have practical implementations 
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt