115 \begin{tabular}{lcr} |
121 \begin{tabular}{lcr} |
116 $\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ |
122 $\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ |
117 & & $\textit{else} \; \textit{output} \; \textit{NO}$ |
123 & & $\textit{else} \; \textit{output} \; \textit{NO}$ |
118 \end{tabular} |
124 \end{tabular} |
119 \end{center} |
125 \end{center} |
120 |
126 Omnipresent use of regexes in modern |
121 |
127 software. |
122 \subsection{The practical problem} |
|
123 Introduce the problem of matching a pattern with a string. |
|
124 Why important? |
|
125 Need to be fast, real-time, on large inputs. |
|
126 Examples: Snort, Bro, etc? |
128 Examples: Snort, Bro, etc? |
127 \subsubsection{The rules for network intrusion analysis tools } |
129 \subsubsection{The rules for network intrusion analysis tools } |
128 TODO: read rules libraries such as Snort and the explanation for some of the rules |
130 TODO: read rules libraries such as Snort and the explanation for some of the rules |
129 TODO: pcre/pcre2? |
131 TODO: pcre/pcre2? |
130 TODO: any other libraries? |
132 TODO: any other libraries? |
131 |
133 |
|
134 |
|
135 There has been many widely used libraries such as |
|
136 Henry Spencer's regexp(3), RE2, etc. |
|
137 They are fast and successful, but ugly corner cases |
|
138 allowing the $\textit{ReDoS}$ attack exist, and |
|
139 is a non-negligible protion. |
|
140 \subsection{The practical problem} |
|
141 These corner cases either |
|
142 \begin{itemize} |
|
143 \item |
|
144 go unnoticed until they |
|
145 cause considerable grief in real life |
|
146 \item |
|
147 or force the regex library writers to pose |
|
148 restrictions on the input, limiting the |
|
149 choice a programmer has when using regexes. |
|
150 \end{itemize} |
|
151 |
|
152 Motivation: |
|
153 We want some library that supports as many constructs as possible, |
|
154 but still gives formal guarantees on the correctness and running |
|
155 time. |
132 |
156 |
133 \subsection{Regexes that brought down CloudFlare} |
157 \subsection{Regexes that brought down CloudFlare} |
134 |
158 |
135 |
159 |
136 matching some string $s$ with a regex |
160 matching some string $s$ with a regex |
215 state explosion for bounded repetitions due to |
239 state explosion for bounded repetitions due to |
216 theoretic bottleneck of having to remember exactly what the |
240 theoretic bottleneck of having to remember exactly what the |
217 suffix up to length $n$ of input string is. |
241 suffix up to length $n$ of input string is. |
218 "Countdown States activation problem": |
242 "Countdown States activation problem": |
219 $.*a.{100}$ requires $2^100$ + DFA states. |
243 $.*a.{100}$ requires $2^100$ + DFA states. |
|
244 Example: |
|
245 Converting $((a|b )*b.{10}){3}$ to a $\DFA$ |
|
246 gives the error: |
|
247 \begin{verbatim} |
|
248 147972 states before minimization, 79107 states in minimized DFA |
|
249 Old file "search.java" saved as "search.java~" |
|
250 Writing code to "search.java" |
|
251 |
|
252 Unexpected exception encountered. This indicates a bug in JFlex. |
|
253 Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new |
|
254 |
|
255 |
|
256 character value expected |
|
257 java.lang.IllegalArgumentException: character value expected |
|
258 at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105) |
|
259 at jflex.generator.CountEmitter.emit(CountEmitter.java:116) |
|
260 at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530) |
|
261 at jflex.generator.Emitter.emit(Emitter.java:1369) |
|
262 at jflex.generator.LexGenerator.generate(LexGenerator.java:115) |
|
263 at jflex.Main.generate(Main.java:320) |
|
264 at jflex.Main.main(Main.java:336) |
|
265 \end{verbatim} |
|
266 |
220 \subsubsection{variant of DFA's} |
267 \subsubsection{variant of DFA's} |
221 counting set automata |
268 counting set automata |
222 \\ |
269 \\ |
223 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it |
270 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it |
224 \\ |
271 \\ |
469 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))" |
516 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))" |
470 \end{verbatim} |
517 \end{verbatim} |
471 |
518 |
472 |
519 |
473 And we have a similar argument for the sequence case. |
520 And we have a similar argument for the sequence case. |
474 \subsection{stronger simplification} |
521 \subsection{stronger simplification needed} |
|
522 |
|
523 \subsubsection{Bounded List of Terms} |
|
524 We have seen that without simplification the size of $(a+aa)^*$ |
|
525 grows exponentially and unbounded(where we omit certain nested |
|
526 parentheses among the four terms in the last explicitly written out regex): |
|
527 |
|
528 \def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} |
|
529 \begin{center} |
|
530 \begin{tabular}{rll} |
|
531 $(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
532 & $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
533 & $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ |
|
534 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
535 & $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) |
|
536 \end{tabular} |
|
537 \end{center} |
|
538 |
|
539 But if we look closely at the regex |
|
540 \begin{center} |
|
541 \begin{tabular}{rll} |
|
542 & & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ |
|
543 \end{tabular} |
|
544 \end{center} |
|
545 we realize that: |
|
546 \begin{itemize} |
|
547 \item |
|
548 The regex is equivalent to an alternative taking a long-flattened list, |
|
549 where each list is a sequence, and the second child of that sequence |
|
550 is always $(a+aa)^*$. In other words, the regex is a "linear combination" |
|
551 of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string). |
|
552 \item |
|
553 The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is |
|
554 bounded because the first child $(a+aa) \backslash s$ can only be one of |
|
555 $(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$, |
|
556 $(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$. |
|
557 \item |
|
558 With simplification we have that the regex is additionally reduced to, |
|
559 |
|
560 where each term $\bsimp((a+aa)\backslash s ) $ |
|
561 is further reduced to only |
|
562 $\ONE$, $\ONE + a$ and $\ZERO$. |
|
563 |
|
564 |
|
565 \end{itemize} |
|
566 Generalizing this to any regular expression of the form |
|
567 $\sum_{s\in L(r)} \bsimp(r\backslash s ) \cdot r^*$, |
|
568 we have the closed-form for star regex's string derivative as below: |
|
569 $\forall r \;s.\; \exists sl. \; s.t.\;\bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^* )$. |
|
570 |
|
571 The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by |
|
572 $\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$, |
|
573 which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$. |
|
574 This might give us a polynomial bound on the length of the list |
|
575 $\distinctBy[(\bsimp(r\backslash s')) \cdot r^* | {s'\in sl} ]$, if the terms in |
|
576 $\distinctBy[(\bsimp (r\backslash s')) | {s' \in sl}]$ has a polynomial bound. |
|
577 This is unfortunately not true under our current $\distinctBy$ function: |
|
578 If we let $r_n = ( (aa)^* + (aaa)^* + \ldots + \underline{(a\ldots a)^*}{n \,a's}) $, |
|
579 then we have that $\textit{maxterms} r_n = \textit{sup} (\textit{length} [\bsimp(r_n\backslash s') | s' \in sl]) = |
|
580 L.C.M(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEISA003418} |
|
581 this grows exponentially quickly. So we have found a regex $r_n$ where |
|
582 $\textit{maxterms} (r_n ^* \backslash s) \geq 2^{(n-1)}$. |
|
583 |
|
584 |
|
585 \subsubsection{stronger version of \distinctBy} |
|
586 \href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirove} |
|
587 has proven a linear bound on the number of terms in the "partial derivatives" of a regular expression: |
|
588 \begin{center} |
|
589 $\size (\PDER(r)) \leq \awidth (r)$. |
|
590 \end{center} |
|
591 |
|
592 The proof is by structural induction on the regular expression r. |
|
593 The hard cases are the sequence case $r_1\cdot r_2$ and star case $r^*$. |
|
594 The central idea that allows the induction to go through for this bound is on the inclusion: |
|
595 \begin{center} |
|
596 $\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$ |
|
597 \end{center} |
|
598 |
|
599 This way, |
|
600 |
|
601 \begin{center} |
|
602 \begin{tabular}{lcl} |
|
603 $| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ |
|
604 & $\leq$ & $| (\pder_{s@[c]} a ) \cdot b| + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ |
|
605 & $=$ & $\awidth(a) + \awidth(b)$ \\ |
|
606 & $=$ & $\awidth(a+b)$ |
|
607 \end{tabular} |
|
608 \end{center} |
|
609 |
|
610 we have that a compound regular expression $a\cdot b$'s subterms |
|
611 is bounded by its sub-expression's derivatives terms. |
|
612 |
|
613 This argument can be modified to bound the terms in |
|
614 our version of regex with strong simplification: |
|
615 \begin{center} |
|
616 \begin{tabular}{lcl} |
|
617 $| \maxterms (\bsimp (a\cdot b) \backslash s)|$ & $=$ & $ |maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\ |
|
618 & $\leq$ & $| (\pder_{s@[c]} a ) \cdot b| + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ |
|
619 & $=$ & $\awidth(a) + \awidth(b)$ \\ |
|
620 & $=$ & $\awidth(a+b)$ |
|
621 \end{tabular} |
|
622 \end{center} |
|
623 |
|
624 |
475 |
625 |
476 |
626 |
477 \subsection{cubic bound} |
627 \subsection{cubic bound} |
478 |
628 Bounding the regex's subterms by |
479 |
629 its alphabetic width. |
480 |
630 |
481 |
631 The entire expression's size can be bounded by |
482 |
632 number of subterms times each subterms' size. |
483 |
633 |
484 |
634 |
485 |
635 |
486 |
636 |
487 |
637 |
488 |
638 \section{Support for bounded repetitions and other constructs} |
489 \subsection{Too Detailed too basic? regex to NFA translation} |
639 Example: |
|
640 $.*a.\{100\}$ after translation to $\DFA$ and minimization will |
|
641 always take over $2^{100}$ states. |
|
642 |
|
643 \section{Towards a library with fast running time practically} |
|
644 |
|
645 registers and cache-related optimizations? |
|
646 JVM related optimizations? |
|
647 |
|
648 \section{Past Report materials} |
490 |
649 |
491 Deciding whether a string is in the language of the regex |
650 Deciding whether a string is in the language of the regex |
492 can be intuitively done by constructing an NFA\cite{Thompson_1968}: |
651 can be intuitively done by constructing an NFA\cite{Thompson_1968}: |
493 and simulate the running of it. |
652 and simulate the running of it. |
494 |
653 |