ChengsongTanPhdThesis/Chapters/Introduction.tex
with the help of formal proofs.
We offer a lexing algorithm based
on Brzozowski derivatives with certified correctness (in
Isabelle/HOL)
and a finiteness property.
Such properties guarantee the absence of
catastrophic backtracking in most cases.
We will give more details in the next sections
on (i) why the slow cases in Figure~\ref{fig:aStarStarb}
can occur
and (ii) why we choose our
approach (Brzozowski derivatives and formal proofs).

\section{Terminology, and the Problem with Bounded Repetitions}
Regular expressions and regular expression matchers
have of course been studied for many, many years.
       
Theoretical results in automata theory say
that basic regular expression matching should be linear
with respect to the input, provided that the regular expression
$r$ has been pre-processed and turned into a
deterministic finite automaton (DFA).
By basic we mean textbook definitions such as the one
below, involving only characters, alternatives,
sequences, and Kleene stars:
\[
	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
\]
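To make this textbook definition concrete, here is a minimal sketch of it as a Scala
datatype (the constructor names are our own, chosen for illustration, and need not
coincide with the implementation presented later in this thesis):
\begin{verbatim}
// a sketch of basic regular expressions as a Scala datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches only the empty string
case class CH(c: Char) extends Rexp              // matches the character c
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// example: (a + b)*
val r: Rexp = STAR(ALT(CH('a'), CH('b')))
\end{verbatim}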
       
Modern regular expression matchers used by programmers,
however,
support richer constructs such as bounded repetitions
and back-references.
The syntax and expressive power of those
matching engines
make ``regular expressions'' quite different from
their original meaning in formal language
theory.
To differentiate, people tend to use the word \emph{regex} to refer to
those expressions with richer constructs, and \emph{regular expression}
for the more traditional meaning.
For example, the PCRE standard (Perl Compatible Regular Expressions)
is such a regex syntax standard.
We follow this convention in this thesis.
We aim to support all the popular features of regexes in the future,
but for this work we mainly look at regular expressions.
       
\subsection{A Little Introduction to Regexes: Bounded Repetitions
and Back-references}
Regexes come with a lot of constructs
that make it more convenient for
programmers to write regular expressions.
Some of those constructs are simply syntactic sugar,
shorthand notations
that save the programmer a few keystrokes,
for example the
non-binary alternative involving three or more choices:
\[
	r = (a | b | c | \ldots | z)^*
\]
the range operator $-$, which stands for the alternative
of all characters between its operands:
\[
	r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)}
\]
and the
wildcard character $.$, which matches any character:
\[
	. = [0-9a-zA-Z+-()*&\ldots]
\]
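To illustrate that such sugar is indeed just notation, here is one possible desugaring
of the range operator in Scala, reusing the \texttt{Rexp} datatype sketched above
(again only an illustration, not the implementation used later):
\begin{verbatim}
// desugar a range like [0-9] into nested alternatives
def charRange(from: Char, to: Char): Rexp =
  (from to to).map(c => CH(c): Rexp).reduceLeft(ALT(_, _))

// [0-9] becomes CH('0') + CH('1') + ... + CH('9')
val digits = charRange('0', '9')
\end{verbatim}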
       
Some of those constructs do make the expressions much
more compact, but matching time can greatly increase.
For example, $r^{n}$ is exponentially more concise than
the expression $\underbrace{r \cdots r}_{\text{n copies of r}}$,
and therefore a naive algorithm that simply unfolds
$r^{n}$ into $\underbrace{r \cdots r}_{\text{n copies of r}}$
will suffer an exponential increase in runtime.
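The following sketch shows such a naive unfolding, again reusing the \texttt{Rexp}
datatype from above (illustrative only):
\begin{verbatim}
// naive unfolding of r^{n} into n sequenced copies of r
def unfold(r: Rexp, n: Int): Rexp =
  if (n == 0) ONE else SEQ(r, unfold(r, n - 1))

// the size of a regular expression, counting constructors
def size(r: Rexp): Int = r match {
  case ZERO | ONE | CH(_) => 1
  case ALT(r1, r2)        => 1 + size(r1) + size(r2)
  case SEQ(r1, r2)        => 1 + size(r1) + size(r2)
  case STAR(r1)           => 1 + size(r1)
}
\end{verbatim}
Here $\mathit{size}(\mathit{unfold}(r, n))$ grows linearly in $n$, whereas the
notation $r^{n}$ only grows with the number of digits of $n$.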
       
Some constructs can even raise the expressive
power to the non-regular realm, for example
the back-references;
others can make the matching time super-linear, for example
bounded repetitions, as we have discussed in the
previous section.
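For instance, a back-reference can force two parts of a string to be identical,
yielding the non-regular language of ``squares'' $\{ww \mid w \in \{a,b\}^+\}$.
The following Scala snippet (running the JVM's backtracking
\texttt{java.util.regex} engine) sketches this:
\begin{verbatim}
import java.util.regex.Pattern

// ([ab]+)\1 matches exactly the squares w w over {a, b}
val square = Pattern.compile("([ab]+)\\1")
println(square.matcher("abab").matches()) // true:  w = ab
println(square.matcher("aba").matches())  // false: not of the form w w
\end{verbatim}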
One of the most recent works in the context of lexing
is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
This is relevant work and we will later compare it with
the derivative-based matcher we are going to present.
There is also some newer work called
and moreover the resulting code still consists of only simple
recursive functions and inductive datatypes.
Finally, bounded regular expressions do not destroy our finite
boundedness property, which we shall prove later on.
\section{Back-references and The Terminology Regex}
A bounded repetition, usually written in the form
$r^{\{c\}}$ (where $c$ is a constant natural number),
denotes a regular expression accepting strings
that can be divided into $c$ substrings, where each
substring is in $r$.
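One way to state this formally is as the $c$-fold concatenation of the language of $r$:
\[
	L(r^{\{c\}}) = \{s_1 \ldots s_c \mid s_1 \in L(r), \ldots, s_c \in L(r)\}
\]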
more than one string.
This is to represent all the different
scenarios in which ``countdown'' states are active.
For those regexes, tools that use $\DFA$s will run
out of memory.
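A standard example of this blow-up (used here purely for illustration) is the
regular expression
\[
	(a + b)^* \cdot a \cdot (a + b)^{\{n\}},
\]
for which a $\DFA$ must remember which of the last $n+1$ characters were $a$'s,
and therefore needs at least $2^{n}$ states.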
       

The time cost of regex matching algorithms in general
involves two different phases, and different things can go wrong in
each phase:
$\DFA$s usually have problems in the first (construction) phase,
whereas $\NFA$s usually run into trouble
in the second phase.
       

\section{Error-prone POSIX Implementations}
The problems with practical implementations
of regexes are not limited to slowness on certain
cases.
Another problem with these libraries is that there
is no correctness guarantee.
In some cases, they either fail to generate a lexing result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
       
A concrete example would be the regex
\begin{center}
	$(aba + ab + a)^*$ and the string $ababa$.
\end{center}
The correct $\POSIX$ match for the above would be
the entire string $ababa$,
split into two Kleene star iterations, $[ab]$ and $[aba]$, at positions
$[0, 2)$ and $[2, 5)$
respectively.
But trying this out in regex101\parencite{regex101}
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $[a]$ at $[4, 5)$.
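This behaviour can also be reproduced programmatically. For example, the
following Scala snippet (running the JVM's backtracking
\texttt{java.util.regex} engine and skipping the empty matches it also reports)
prints the same two fragments:
\begin{verbatim}
import java.util.regex.Pattern

val m = Pattern.compile("(aba|ab|a)*").matcher("ababa")
while (m.find()) {
  if (m.group().nonEmpty)  // find() also yields empty matches; skip them
    println(s"${m.group()} at [${m.start}, ${m.end})")
}
// prints: aba at [0, 3)
//         a at [4, 5)
\end{verbatim}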
       

Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries do not
correctly implement the POSIX (maximum-munch)
rule of regular expression matching.
       

As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
	The POSIX strategy is more complicated than the
	greedy because of the dependence on information about
	the length of matched strings in the various subexpressions.
\end{quote}
%\noindent
       
To summarise the above, regular expressions are important.
They are popular, and programming languages' library functions
for them are very fast on non-catastrophic cases.
But there are problems with current practical implementations.
The first problem is that the running time might blow up.
The second problem is that they might be error-prone on certain
very simple cases.
In the next part of the chapter, we will look into the reasons why
certain regex engines run horribly slowly on the ``catastrophic''
cases, and propose a solution that addresses both of these problems,
based on the work of Brzozowski and of Sulzmann and Lu.
       

\subsection{Different Phases of a Matching/Lexing Algorithm}

       
Most lexing algorithms can be roughly divided into
two phases during their run.
The first phase is the ``construction'' phase,
in which the algorithm builds some
suitable data structure from the input regex $r$, so that
it can be easily operated on later.
We denote
the time cost of this phase by $P_1(r)$.
The second phase is the lexing phase, in which the input string
$s$ is read and the data structure
representing the regex $r$ is operated on.
We denote the time
it takes by $P_2(r, s)$.\\
       
For a $\mathit{DFA}$,
we have $P_2(r, s) = O(|s|)$,
because we take at most $|s|$ steps,
and each step takes
at most one transition:
a deterministic finite automaton
by definition has at most one state active and takes at most one
transition upon receiving an input symbol.
But unfortunately, in the worst case,
$P_1(r) = O(2^{|r|})$. An example will be given later.
       
For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
expressions like $r^n$ into
\[
	\underbrace{r \cdots r}_{\text{n copies of r}}.
\]
$P_2(r, s)$ is bounded by $O(|r|\cdot|s|)$ if we do not backtrack.
On the other hand, if backtracking is used, the worst-case time bound bloats
to $O(|r| \cdot 2^{|s|})$.
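The bounds discussed above can be summarised as follows:
\begin{center}
\begin{tabular}{l|c|c}
	& $P_1(r)$ & $P_2(r, s)$ \\
	\hline
	$\mathit{DFA}$ & $O(2^{|r|})$ (worst case) & $O(|s|)$ \\
	$\mathit{NFA}$, no backtracking & $O(|r|)$ & $O(|r|\cdot|s|)$ \\
	$\mathit{NFA}$, backtracking & $O(|r|)$ & $O(|r|\cdot 2^{|s|})$ \\
\end{tabular}
\end{center}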
       
%on the input
%And when calculating the time complexity of the matching algorithm,
%we are assuming that each input reading step requires constant time.
%which translates to that the number of
%states active and transitions taken each time is bounded by a
%constant $C$.
%But modern regex libraries in popular language engines
%often want to support much richer constructs than just
%sequences and Kleene stars,
%such as negation, intersection,
%bounded repetitions and back-references.
%And de-sugaring these ``extended'' regular expressions
%into basic ones might bloat the size exponentially.
%TODO: more reference for exponential size blowup on desugaring.
       

\subsection{Why $\mathit{DFA}$s can be slow in the first phase}

The good thing about $\mathit{DFA}$s is that once
generated, they are fast and stable, unlike
backtracking algorithms.
However, they do not scale well with bounded repetitions.
       

\subsubsection{Problems with Bounded Repetitions}


\subsubsection{Tools that use $\mathit{DFA}$s}
%TODO: more tools that use DFAs?
$\mathit{LEX}$ and $\mathit{JFLEX}$ are tools
in $C$ and $\mathit{JAVA}$ that generate $\mathit{DFA}$-based