ChengsongTanPhdThesis/Chapters/Introduction.tex
changeset 605 ed53ce26ecb6
parent 604 16d67f9c07d4
child 606 99b530103464
can occur
and (ii) why we choose our 
approach (Brzozowski derivatives and formal proofs).


\section{Regex, and the Problems with Regex Matchers}
Regular expressions and regular expression matchers 
have of course been studied for many, many years.
Theoretical results in automata theory say
that basic regular expression matching should be linear
w.r.t.\ the input.
This assumes that the regular expression
$r$ was pre-processed and turned into a
deterministic finite automaton (DFA) before matching,
which can be exponential in size~\cite{Sakarovitch2009}.
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
To differentiate, people use the word \emph{regex} to refer
to those expressions with richer constructs, while reserving the
term \emph{regular expression}
for the more traditional meaning in formal language theory.
We follow this convention 
in this thesis.
In the future, we aim to support all the popular features of regexes, 
but for this work we mainly look at regular expressions.
Regexes come with a lot of constructs
beyond the basic ones
that make it more convenient for 
programmers to write regular expressions.
Depending on the types of these constructs,
the task of matching and lexing with them
involves different increases in complexity.
Some of those constructs are syntactic sugar, that is,
simply shorthand notations
that save the programmers a few keystrokes.
These will not cause trouble for regex libraries.

\noindent
For example, the
non-binary alternative involving three or more choices:
\[
	(a | b | c) \stackrel{means}{=} ((a + b) + c),
\]
the range operator $-$ used to express the alternative
of all characters between its operands in a concise way:
\[
	[0-9] \stackrel{means}{=} (0 | 1 | \ldots | 9 ) \; \text{(all digits)}
\]
and the
wildcard character $.$ used to refer to any single character:
\[
	. \stackrel{means}{=} [0-9a-zA-Z+-()*\&\ldots]
\]

\subsection{Bounded Repetitions}
Some of those constructs do make the expressions much
more compact.
For example, the bounded regular expressions
(where $n$ and $m$ are constant natural numbers)
$r^{\{n\}}$, $r^{\{\ldots m\}}$, $r^{\{n\ldots \}}$ and $r^{\{n\ldots m\}}$,
defined as 
\begin{center}
	\begin{tabular}{lcl}
		$L \; r^{\{n\}}$ & $\dn$ & $(L \; r)^n$\\
		$L \; r^{\{\ldots m\}}$ & $\dn$ & $\bigcup_{0 \leq i \leq m}. (L \; r)^i$\\
		$L \; r^{\{n\ldots \}}$ & $\dn$ & $\bigcup_{n \leq i}. (L \; r)^i$\\
		$L \; r^{\{n \ldots m\}}$ & $\dn$ & $\bigcup_{n \leq i \leq m}. (L \; r)^i$
	\end{tabular}
\end{center}
are exponentially smaller than
their unfolded forms: for example $r^{\{n\}}$
as opposed to
\[
	\underbrace{r\ldots r}_\text{n copies of r}.
\]

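These set-level definitions can be experimented with directly on finite languages; the following Python sketch (my own illustration, with ad hoc helper names not taken from the thesis) implements $(L\;r)^n$ and the bounded unions above for languages given as finite sets of strings:

```python
# Illustration of the language-level definitions of bounded repetitions,
# for languages represented as finite sets of strings.

def concat(l1, l2):
    """Language concatenation L1 . L2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}

def power(l, n):
    """(L r)^n, with the convention L^0 = {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result

def upto(l, m):
    """L(r^{..m}) = union of L^i for 0 <= i <= m."""
    return set().union(*(power(l, i) for i in range(m + 1)))

def between(l, n, m):
    """L(r^{n..m}) = union of L^i for n <= i <= m."""
    return set().union(*(power(l, i) for i in range(n, m + 1)))

L_r = {"a", "bb"}                 # the language of some hypothetical r
print(sorted(power(L_r, 2)))      # ['aa', 'abb', 'bba', 'bbbb']
print("" in upto(L_r, 3))         # True: zero repetitions are allowed
print("a" in between(L_r, 2, 3))  # False: at least two repetitions
```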
The problem here is that tools based on the classic notion of
automata need to expand $r^{n}$ into $n$ connected 
copies of the automaton for $r$. This leads to very inefficient matching
algorithms or algorithms that consume large amounts of memory.
Implementations using $\DFA$s will
either become excruciatingly slow 
(for example Verbatim++\cite{Verbatimpp}) or get
out-of-memory errors (for example $\mathit{LEX}$ and 
$\mathit{JFLEX}$\footnote{which are lexer generators
in C and JAVA that generate $\mathit{DFA}$-based
lexers. The user provides a set of regular expressions
and configurations to them, and then 
gets an output program encoding a minimized $\mathit{DFA}$
that can be compiled and run. 
When given the above countdown regular expression,
a small $n$ (a few dozen) would result in a 
determinised automaton
with millions of states.}) under large counters.
A classic example is the regular expression $(a+b)^*  a (a+b)^{n}$
where the minimal DFA requires at least $2^{n+1}$ states.
For example, when $n$ is equal to 2,
an $\mathit{NFA}$ describing it would look like:
\begin{center}
\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
   \node[state,initial] (q_0)   {$q_0$}; 
   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
   \node[state, accepting, red] (q_3) [right=of q_2] {$q_3$};
    \path[->] 
    (q_0) edge  node {a} (q_1)
    	  edge [loop below] node {a,b} ()
    (q_1) edge  node  {a,b} (q_2)
    (q_2) edge  node  {a,b} (q_3);
\end{tikzpicture}
\end{center}
which requires at least $2^3$ states
for its subset construction.\footnote{The 
red states are ``countdown states'' which count down 
the number of characters needed in addition to the current
string to make a successful match.
For example, state $q_1$ indicates a match that has
gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
and just consumed the ``delimiter'' $a$ in the middle, and 
because the subset construction during determinisation will generate
all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
Generalizing this to regular expressions with larger
bounded repetition numbers, we have that
regexes shaped like $r^*ar^{\{n\}}$, when converted to $\mathit{DFA}$s,
would require at least $2^{n+1}$ states, if $r$ itself contains
more than 1 string.
This is to represent all the different 
scenarios in which ``countdown'' states are active.}
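The $2^{n+1}$ blow-up claimed above can also be checked computationally; the sketch below (written for this discussion, not a thesis artefact) hard-codes the NFA of $(a+b)^*a(a+b)^{n}$ and counts the distinct state sets reached by the subset construction:

```python
# Count the states the subset construction produces for the NFA of
# (a|b)* a (a|b)^{n}.  NFA states: 0 loops on a,b (and on 'a' also
# guesses the "delimiter", moving to 1); i steps to i+1 on a,b;
# n+1 is the accepting state.

def dfa_size(n):
    def step(states, c):
        new = set()
        for q in states:
            if q == 0:
                new.add(0)          # stay in the (a|b)* loop
                if c == 'a':
                    new.add(1)      # guess this 'a' as the delimiter
            elif q <= n:
                new.add(q + 1)      # advance a countdown state
        return frozenset(new)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                     # explore all reachable state sets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

print(dfa_size(2))   # 8  = 2^{2+1}
print(dfa_size(3))   # 16 = 2^{3+1}
```

Every combination of active countdown states is reachable (it is determined by where the last few $a$'s occurred in the input), which is exactly why no smaller deterministic automaton exists.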
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will compare it later on
with the derivative-based matcher we are going to present.
There is also some newer work called
Verbatim++\cite{Verbatimpp}, which does not use derivatives, 
but deterministic finite automata instead.

Bounded repetitions are very important because they
tend to occur a lot in practical use,
for example in the regex library RegExLib,
in the rules library of Snort, as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al.~\cite{xml2015},
more than half of the 
XSDs they found have bounded regular expressions in them.
Often the counters are quite large, the largest up to ten million. 
An example XSD they gave
was:
\begin{verbatim}
<sequence minOccurs="0" maxOccurs="65535">
    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
</sequence>
\end{verbatim}
This can be seen as the expression 
$(ab^{\{2\ldots 12\}})^{\{0 \ldots 65535\}}$, where $a$ and $b$ are themselves
regular expressions 
satisfying certain constraints (such as 
satisfying the floating point number format).

It is therefore quite unsatisfying that 
some regular expression matching libraries
impose ad hoc limits
on bounded regular expressions:
for example, in the regular expression matching library of the Go
language the regular expression $a^{\{1001\}}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
for being too big. These problems can of course be solved in matching algorithms where 
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.

In the work reported in \cite{CSL2022} and here, 
we add better support using derivatives
for bounded regular expressions $r^{\{n\}}$.
The results
extend straightforwardly to
repetitions with an interval such as 
$r^{\{n\ldots m\}}$.
The merit of Brzozowski derivatives
on this problem is that
they can be naturally extended to support bounded repetitions.
Moreover these extensions are still made up of only
inductive datatypes and recursive functions,
making them handy to reason about in theorem provers.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.

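As a toy special case of this naturalness, a counter can simply be carried along and decremented, instead of unfolding $r^{\{n\}}$ into $n$ copies; the sketch below (illustrative only, not the thesis algorithm) does this derivative-style for the single regular expression $a^{\{n\}}$:

```python
# Derivative-style matching of a{n} with a symbolic counter: the
# derivative of a{k} w.r.t. 'a' is a{k-1}, so one integer suffices
# and the expression is never unfolded into n copies.

def matches_a_n(n, s):
    remaining = n
    for c in s:
        if c != 'a' or remaining == 0:
            return False           # the derivative collapsed to ZERO
        remaining -= 1             # der a (a{k}) = a{k-1}
    return remaining == 0          # nullable iff the counter is used up

print(matches_a_n(1000000, "a" * 1000000))   # True, in a linear-time pass
print(matches_a_n(3, "aa"))                  # False, too few repetitions
```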
\subsection{Back-References}
The syntax and expressive power of regexes 
can go quickly beyond the regular language hierarchy
with back-references.
Coming back to the regular case, one way to simulate an $\mathit{NFA}$ on an input string
is to keep track of all active states after consuming 
a character, and to update that set of states iteratively.
This can be viewed as a breadth-first-search of the $\mathit{NFA}$
for a path terminating
at an accepting state.
Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
type of $\mathit{NFA}$ simulation and guarantee a linear runtime
in terms of input string length.
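The state-set simulation just described can be sketched in a few lines (an illustration with ad hoc names, not code from any of these engines):

```python
# Breadth-first NFA simulation: keep the set of active states and update
# it once per input character -- a single pass, linear in |s|.

def run_nfa(delta, start, accepting, s):
    """delta maps (state, char) to the set of successor states."""
    active = {start}
    for c in s:
        active = set().union(*(delta.get((q, c), set()) for q in active))
        if not active:                  # every branch is dead: reject early
            return False
    return bool(active & accepting)

# NFA for (a|b)* a : state 0 loops; on 'a' it also guesses the final a.
delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
print(run_nfa(delta, 0, {1}, "abba"))   # True
print(run_nfa(delta, 0, {1}, "abb"))    # False
```

Since the active set is bounded by the number of NFA states, each update costs a constant (in the input length), giving the linear runtime mentioned above.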
The other way to use $\mathit{NFA}$s for matching is choosing  
       
\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regexes are not limited to slowness on certain 
cases. 
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
A concrete example would be the regex
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct $\POSIX$ match for the above would be 
the entire string $ababa$, 
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield 
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
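This behaviour is easy to reproduce; for instance Python's backtracking \texttt{re} engine (one concrete illustration; regex101 hosts several engines) commits to the first alternative and then stops, instead of producing the POSIX longest match:

```python
# Python's re engine is backtracking-based, not POSIX: on the example
# above it matches "aba" with the first alternative and then stops,
# rather than returning the longest overall match "ababa".
import re

m = re.match(r'(aba|ab|a)*', 'ababa')
print(m.span())    # (0, 3): only "aba" is matched, not all of "ababa"
```

A POSIX-compliant matcher would instead report the span $(0, 5)$, splitting the string as $[ab][aba]$.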
       

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
	The POSIX strategy is more complicated than the 
	greedy because of the dependence on information about 
	the length of matched strings in the various subexpressions.
\end{quote}
       
   788 To summarise the above, regular expressions are important.
       
   789 They are popular and programming languages' library functions
       
   790 for them are very fast on non-catastrophic cases.
       
   791 But there are problems with current practical implementations.
       
   792 First thing is that the running time might blow up.
       
   793 The second problem is that they might be error-prone on certain
       
   794 very simple cases.
       
   795 In the next part of the chapter, we will look into reasons why 
       
   796 certain regex engines are running horribly slow on the "catastrophic"
       
   797 cases and propose a solution that addresses both of these problems
       
   798 based on Brzozowski and Sulzmann and Lu's work.
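To give a first taste of this approach, the core of Brzozowski's method fits in a few lines. The sketch below (in Python, with ad-hoc constructor names that are purely illustrative and not part of any later formal development) decides matching by repeatedly taking derivatives of the regex with respect to each input character:

```python
# A minimal, unoptimised Brzozowski-derivative matcher for the basic
# regexes ZERO | ONE | c | r1 + r2 | r1 . r2 | r*; regexes are tuples.
ZERO, ONE = ('zero',), ('one',)
def CHR(c):    return ('chr', c)
def ALT(r, s): return ('alt', r, s)
def SEQ(r, s): return ('seq', r, s)
def STAR(r):   return ('star', r)

def nullable(r):
    """Does r match the empty string?"""
    tag = r[0]
    if tag == 'zero': return False
    if tag == 'one':  return True
    if tag == 'chr':  return False
    if tag == 'alt':  return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':  return nullable(r[1]) and nullable(r[2])
    if tag == 'star': return True

def der(c, r):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ('zero', 'one'): return ZERO
    if tag == 'chr':  return ONE if r[1] == c else ZERO
    if tag == 'alt':  return ALT(der(c, r[1]), der(c, r[2]))
    if tag == 'seq':
        if nullable(r[1]):
            return ALT(SEQ(der(c, r[1]), r[2]), der(c, r[2]))
        return SEQ(der(c, r[1]), r[2])
    if tag == 'star': return SEQ(der(c, r[1]), r)

def matches(r, s):
    for c in s:
        r = der(c, r)
    return nullable(r)

# (aba + ab + a)* from the example earlier in this chapter:
r = STAR(ALT(SEQ(CHR('a'), SEQ(CHR('b'), CHR('a'))),
             ALT(SEQ(CHR('a'), CHR('b')), CHR('a'))))
print(matches(r, 'ababa'))   # True -- derivatives find the full match
```

Note that this sketch only decides membership; producing $\POSIX$ lexing values on top of derivatives, and doing so efficiently and provably correctly, is exactly what the rest of this thesis is about.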
       


\subsection{Different Phases of a Matching/Lexing Algorithm}


Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds a
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We denote the time
it takes by $P_2(r, s)$.\\
       
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition---a
deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
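The lexing phase of a $\mathit{DFA}$ is indeed nothing more than a table walk, one lookup per input character. A minimal sketch (the three-state DFA below, accepting strings over $\{a,b\}$ that end in $ab$, is a made-up example):

```python
# Running a DFA is a single pass over the input: one table lookup per
# character, hence O(|s|) once the transition table exists.
# Hypothetical DFA: states 0,1,2 over {a,b}, accepting strings ending in "ab".
delta = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 0,
}
accepting = {2}

def run_dfa(s, state=0):
    for c in s:                       # exactly |s| iterations,
        state = delta[(state, c)]     # one transition each
    return state in accepting

print(run_dfa('aab'))   # True
print(run_dfa('aba'))   # False
```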
       
But unfortunately in the worst case
$P_1(r) = O(2^{|r|})$. An example will be given later.
       
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
	\underbrace{r \cdots r}_{\text{$n$ copies of $r$}}.
\]
The time $P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$ if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
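The exponential backtracking behaviour can be observed concretely by instrumenting a naive backtracking matcher with a call counter. The matcher below is a toy illustration, not any production engine: on a sequence of $n$ copies of $(a + a)$ followed by $b$, run against the non-matching string $a^n$, the counter roughly doubles with each increase of $n$:

```python
# A naive backtracking matcher (continuation-passing style) for regexes
# written as nested tuples, instrumented with a global call counter.
calls = 0

def bt(r, s, i, k):
    """Try to match regex r against s starting at index i; k is the
    continuation receiving the index after a successful partial match."""
    global calls
    calls += 1
    tag = r[0]
    if tag == 'chr':
        return i < len(s) and s[i] == r[1] and k(i + 1)
    if tag == 'alt':
        return bt(r[1], s, i, k) or bt(r[2], s, i, k)
    if tag == 'seq':
        return bt(r[1], s, i, lambda j: bt(r[2], s, j, k))

def regex(n):
    # (a|a)(a|a)...(a|a)b with n copies of (a|a)
    r = ('chr', 'b')
    for _ in range(n):
        r = ('seq', ('alt', ('chr', 'a'), ('chr', 'a')), r)
    return r

counts = {}
for n in [5, 10, 15]:
    calls = 0
    bt(regex(n), 'a' * n, 0, lambda j, n=n: j == n)
    counts[n] = calls
    print(n, calls)   # the counter roughly doubles with each step of n
```

Every one of the $2^n$ ways of choosing alternatives consumes all $n$ characters before the final $b$ fails, which is exactly the exponential search space that real backtracking engines can fall into.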
       
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern  regex libraries in popular language engines
% often want to support much richer constructs than just
% sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these "extended" regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
       

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}


The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.
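A classic textbook illustration of this state explosion is the family $(a+b)^* \cdot a \cdot (a+b)^{\{n\}}$: a DFA must remember the last $n+1$ characters it has seen, and subset construction accordingly produces $2^{n+1}$ states. The sketch below (illustrative code, standard subset construction over the obvious $(n+2)$-state NFA) counts the reachable DFA states:

```python
def dfa_size(n):
    """Subset-construct a DFA from the obvious NFA for (a+b)* a (a+b){n}
    and count the reachable DFA states.  NFA states: 0 has a self-loop on
    a,b and moves to 1 on a; state i moves to i+1 on a,b for 1 <= i <= n;
    state n+1 (accepting) has no outgoing transitions."""
    def step(subset, c):
        out = set()
        for q in subset:
            if q == 0:
                out.add(0)        # stay in the (a+b)* loop
                if c == 'a':
                    out.add(1)    # guess: this 'a' starts the tail
            elif q <= n:
                out.add(q + 1)    # (a+b){n} accepts any character
        return frozenset(out)

    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:                   # explore all reachable subsets
        s = todo.pop()
        for c in 'ab':
            t = step(s, c)
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)

for n in [1, 3, 5]:
    print(n, dfa_size(n))   # 1 -> 4, 3 -> 16, 5 -> 64, i.e. 2^(n+1) states
```

So while the NFA grows only linearly in $n$, the constructed DFA grows exponentially, which is precisely the $P_1$ blow-up described above.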
       

\subsubsection{Problems with Bounded Repetitions}
       
   872 
So we have practical implementations
of regular expression matching/lexing which are fast
but do not come with any guarantees that they will not grind to a halt
or give wrong answers.
Our goal is to have a regex lexing algorithm that comes with