ChengsongPhdThesis/ChengsongPhDThesis.tex
changeset 444 a7e98deebb5c
parent 443 c6351a96bf80
child 453 d68b9db4c9ec
equal deleted inserted replaced
443:c6351a96bf80 444:a7e98deebb5c
    36 \newcommand{\ZERO}{\mbox{\bf 0}}
    36 \newcommand{\ZERO}{\mbox{\bf 0}}
    37 \newcommand{\ONE}{\mbox{\bf 1}}
    37 \newcommand{\ONE}{\mbox{\bf 1}}
    38 \def\lexer{\mathit{lexer}}
    38 \def\lexer{\mathit{lexer}}
    39 \def\mkeps{\mathit{mkeps}}
    39 \def\mkeps{\mathit{mkeps}}
    40 
    40 
       
    41 \def\DFA{\textit{DFA}}
    41 \def\bmkeps{\textit{bmkeps}}
    42 \def\bmkeps{\textit{bmkeps}}
    42 \def\retrieve{\textit{retrieve}}
    43 \def\retrieve{\textit{retrieve}}
    43 \def\blexer{\textit{blexer}}
    44 \def\blexer{\textit{blexer}}
    44 \def\flex{\textit{flex}}
    45 \def\flex{\textit{flex}}
    45 \def\inj{\mathit{inj}}
    46 \def\inj{\mathit{inj}}
    53 \def\nullable{\mathit{nullable}}
    54 \def\nullable{\mathit{nullable}}
    54 \def\Z{\mathit{Z}}
    55 \def\Z{\mathit{Z}}
    55 \def\S{\mathit{S}}
    56 \def\S{\mathit{S}}
    56 \def\rup{r^\uparrow}
    57 \def\rup{r^\uparrow}
    57 
    58 
    58 
    59 \newcommand{\PDER}{\textit{PDER}}
       
    60 \newcommand{\flts}{\textit{flts}}
       
    61 \newcommand{\distinctBy}{\textit{distinctBy}}
       
    62 \newcommand{\map}{\textit{map}}
       
    63 \newcommand{\size}{\textit{size}}
    59 \def\awidth{\mathit{awidth}}
    64 \def\awidth{\mathit{awidth}}
    60 \def\pder{\mathit{pder}}
    65 \def\pder{\mathit{pder}}
    61 \def\maxterms{\mathit{maxterms}}
    66 \def\maxterms{\mathit{maxterms}}
    62 \def\bsimp{\mathit{bsimp}}
    67 \def\bsimp{\mathit{bsimp}}
    63 
    68 
    99   negation and back-references.
   104   negation and back-references.
   100 \end{abstract}
   105 \end{abstract}
   101 
   106 
   102 \section{Introduction}
   107 \section{Introduction}
   103 \subsection{Basic Regex Introduction}
   108 \subsection{Basic Regex Introduction}
       
   109 
   104 Suppose (basic) regular expressions are given by the following grammar:
   110 Suppose (basic) regular expressions are given by the following grammar:
   105 \[			r ::=   \ZERO \mid  \ONE
   111 \[			r ::=   \ZERO \mid  \ONE
   106 			 \mid  c  
   112 			 \mid  c  
   107 			 \mid  r_1 \cdot r_2
   113 			 \mid  r_1 \cdot r_2
   108 			 \mid  r_1 + r_2   
   114 			 \mid  r_1 + r_2   
   115 \begin{tabular}{lcr}
   121 \begin{tabular}{lcr}
   116 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
   122 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
   117 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
   123 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
   118 \end{tabular}
   124 \end{tabular}
   119 \end{center}
   125 \end{center}
   120 
   126 Omnipresent use of regexes in modern 
   121 
   127 software. 
   122 \subsection{The practical problem}
       
   123 Introduce the problem of matching a pattern with a string.
       
   124 Why important?
       
   125 Need to be fast, real-time, on large inputs.
       
   126 Examples: Snort, Bro, etc?
   128 Examples: Snort, Bro, etc?
   127 \subsubsection{The rules for network intrusion analysis tools }
   129 \subsubsection{The rules for network intrusion analysis tools }
   128 TODO: read rules libraries such as Snort and the explanation for some of the rules
   130 TODO: read rules libraries such as Snort and the explanation for some of the rules
   129 TODO: pcre/pcre2?
   131 TODO: pcre/pcre2?
   130 TODO: any other libraries?
   132 TODO: any other libraries?
   131 
   133 
       
   134 
       
   135 There has been many widely used libraries such as 
       
   136 Henry Spencer's regexp(3), RE2, etc.
       
   137 They are fast and successful, but ugly corner cases
       
   138 allowing the $\textit{ReDoS}$ attack exist, and
       
   139 is a non-negligible protion.
       
   140 \subsection{The practical problem}
       
   141 These corner cases either
       
   142 \begin{itemize}
       
   143 \item
       
   144 go unnoticed until they
       
   145 cause considerable grief in real life
       
   146 \item
       
   147 or force the regex library writers to pose
       
   148 restrictions on the input, limiting the 
       
   149 choice a programmer has when using regexes.
       
   150 \end{itemize}
       
   151 
       
   152 Motivation:
       
   153 We want some library that supports as many constructs as possible,
       
   154 but still gives formal guarantees on the correctness and running
       
   155 time.
   132 
   156 
   133 \subsection{Regexes that brought down CloudFlare}
   157 \subsection{Regexes that brought down CloudFlare}
   134 
   158 
   135 
   159 
   136 matching some string $s$ with a regex
   160 matching some string $s$ with a regex
   215 state explosion for bounded repetitions due to 
   239 state explosion for bounded repetitions due to 
   216 theoretic bottleneck of having to remember exactly what the
   240 theoretic bottleneck of having to remember exactly what the
   217 suffix up to length $n$ of input string is.
   241 suffix up to length $n$ of input string is.
   218 "Countdown States activation problem":
   242 "Countdown States activation problem":
   219 $.*a.{100}$ requires $2^100$ + DFA states.
   243 $.*a.{100}$ requires $2^100$ + DFA states.
       
   244 Example:
       
   245 Converting $((a|b )*b.{10}){3}$ to a $\DFA$
       
   246 gives the error:
       
   247 \begin{verbatim}
       
   248 147972 states before minimization, 79107 states in minimized DFA
       
   249 Old file "search.java" saved as "search.java~"
       
   250 Writing code to "search.java"
       
   251 
       
   252 Unexpected exception encountered. This indicates a bug in JFlex.
       
   253 Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new
       
   254 
       
   255 
       
   256 character value expected
       
   257 java.lang.IllegalArgumentException: character value expected
       
   258         at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
       
   259         at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
       
   260         at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
       
   261         at jflex.generator.Emitter.emit(Emitter.java:1369)
       
   262         at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
       
   263         at jflex.Main.generate(Main.java:320)
       
   264         at jflex.Main.main(Main.java:336)
       
   265 \end{verbatim}
       
   266 
   220 \subsubsection{variant of DFA's}
   267 \subsubsection{variant of DFA's}
   221 counting set automata 
   268 counting set automata 
   222 \\
   269 \\
   223 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it
   270 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it
   224 \\
   271 \\
   469 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))"
   516 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))"
   470 \end{verbatim}
   517 \end{verbatim}
   471 
   518 
   472 
   519 
   473 And we have a similar argument for the sequence case.
   520 And we have a similar argument for the sequence case.
   474 \subsection{stronger simplification}
   521 \subsection{stronger simplification needed}
       
   522 
       
   523 \subsubsection{Bounded List of Terms}
       
   524 We have seen that without simplification the size of $(a+aa)^*$
       
   525 grows exponentially and unbounded(where we omit certain nested
       
   526 parentheses among the four terms in the last explicitly written out regex):
       
   527 
       
   528 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
       
   529 \begin{center}
       
   530 \begin{tabular}{rll}
       
   531 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   532 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   533 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
       
   534 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   535 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
       
   536 \end{tabular}
       
   537 \end{center}
       
   538 
       
   539 But if we look closely at the regex
       
   540 \begin{center}
       
   541 \begin{tabular}{rll}
       
   542 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
       
   543 \end{tabular}
       
   544 \end{center}
       
   545 we realize that:
       
   546 \begin{itemize}
       
   547 \item
       
   548 The regex is  equivalent to an alternative taking a long-flattened list, 
       
   549 where each list is a sequence, and the second child of that sequence
       
   550 is always $(a+aa)^*$. In other words, the regex is a "linear combination"
       
   551 of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string).
       
   552 \item
       
   553 The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is 
       
   554 bounded because the first child $(a+aa) \backslash s$ can only be one of
       
   555 $(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$, 
       
   556 $(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$.
       
   557 \item
       
   558 With simplification we have that the regex is additionally reduced to,
       
   559 
       
   560 where each term $\bsimp((a+aa)\backslash s )  $
       
   561 is further reduced to only
       
   562 $\ONE$, $\ONE + a$ and $\ZERO$.
       
   563 
       
   564 
       
   565 \end{itemize}
       
   566 Generalizing this to any regular expression of the form 
       
   567 $\sum_{s\in L(r)} \bsimp(r\backslash s ) \cdot r^*$,
       
   568 we have the closed-form for star regex's string derivative as below:
       
   569 $\forall r \;s.\; \exists sl. \; s.t.\;\bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^* )$.
       
   570 
       
   571 The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by
       
   572 $\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$,
       
   573 which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$.
       
   574 This might give us a polynomial bound on the length of the list
       
   575 $\distinctBy[(\bsimp(r\backslash s')) \cdot r^* | {s'\in sl} ]$, if the terms in 
       
   576 $\distinctBy[(\bsimp (r\backslash s')) | {s' \in sl}]$ has a polynomial bound.
       
   577 This is unfortunately not true under our current $\distinctBy$ function:
       
   578 If we let $r_n = ( (aa)^* + (aaa)^* + \ldots + \underline{(a\ldots a)^*}{n \,a's}) $,
       
   579 then we have that $\textit{maxterms} r_n = \textit{sup} (\textit{length} [\bsimp(r_n\backslash s') | s' \in sl]) = 
       
   580 L.C.M(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEISA003418}
       
   581 this grows exponentially quickly. So we have found a regex $r_n$ where
       
   582 $\textit{maxterms} (r_n ^* \backslash s) \geq 2^{(n-1)}$.
       
   583 
       
   584 
       
   585 \subsubsection{stronger version of \distinctBy}
       
   586 \href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirove}
       
   587 has proven a linear bound on the number of terms in the "partial derivatives" of a regular expression:
       
   588 \begin{center}
       
   589 $\size (\PDER(r))  \leq \awidth (r)$.
       
   590 \end{center}
       
   591 
       
   592 The proof is by structural induction on the regular expression r.
       
   593 The hard cases are the sequence case $r_1\cdot r_2$ and star case $r^*$.
       
   594 The central idea that allows the induction to go through for this bound is on the inclusion:
       
   595 \begin{center}
       
   596 $\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$
       
   597 \end{center} 
       
   598 
       
   599 This way, 
       
   600 
       
   601 \begin{center}
       
   602 \begin{tabular}{lcl}
       
   603 $| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
       
   604 & $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
       
   605 & $=$ & $\awidth(a) + \awidth(b)$ \\
       
   606 & $=$ & $\awidth(a+b)$
       
   607 \end{tabular}
       
   608 \end{center} 
       
   609 
       
   610 we have that a compound regular expression $a\cdot b$'s subterms
       
   611  is bounded by its sub-expression's derivatives terms.
       
   612  
       
   613 This argument can be modified to bound the terms in 
       
   614 our  version of regex with strong simplification:
       
   615 \begin{center}
       
   616 \begin{tabular}{lcl}
       
   617 $| \maxterms (\bsimp  (a\cdot b) \backslash s)|$ & $=$ & $ |maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\
       
   618 & $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
       
   619 & $=$ & $\awidth(a) + \awidth(b)$ \\
       
   620 & $=$ & $\awidth(a+b)$
       
   621 \end{tabular}
       
   622 \end{center} 
       
   623 
       
   624 
   475 
   625 
   476 
   626 
   477 \subsection{cubic bound}
   627 \subsection{cubic bound}
   478 
   628 Bounding the regex's subterms by
   479 
   629 its alphabetic width.
   480 
   630 
   481 
   631 The entire expression's size can be bounded by
   482 
   632 number of subterms times each subterms' size.
   483 
   633 
   484 
   634 
   485 
   635 
   486 
   636 
   487 
   637 
   488 
   638 \section{Support for bounded repetitions and other constructs}
   489 \subsection{Too Detailed too basic? regex to NFA translation}
   639 Example: 
       
   640 $.*a.\{100\}$ after translation to $\DFA$ and minimization will 
       
   641 always take over $2^{100}$ states.
       
   642 
       
   643 \section{Towards a library with fast running time practically}
       
   644 
       
   645 registers and cache-related optimizations?
       
   646 JVM related optimizations?
       
   647 
       
   648 \section{Past Report materials}
   490 
   649 
   491 Deciding whether a string is in the language of the regex
   650 Deciding whether a string is in the language of the regex
   492 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
   651 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
   493 and simulate the running of it.
   652 and simulate the running of it.
   494 
   653