diff -r c6351a96bf80 -r a7e98deebb5c ChengsongPhdThesis/ChengsongPhDThesis.tex --- a/ChengsongPhdThesis/ChengsongPhDThesis.tex Tue Mar 08 00:50:40 2022 +0000 +++ b/ChengsongPhdThesis/ChengsongPhDThesis.tex Wed Mar 09 17:33:08 2022 +0000 @@ -38,6 +38,7 @@ \def\lexer{\mathit{lexer}} \def\mkeps{\mathit{mkeps}} +\def\DFA{\textit{DFA}} \def\bmkeps{\textit{bmkeps}} \def\retrieve{\textit{retrieve}} \def\blexer{\textit{blexer}} @@ -55,7 +56,11 @@ \def\S{\mathit{S}} \def\rup{r^\uparrow} - +\newcommand{\PDER}{\textit{PDER}} +\newcommand{\flts}{\textit{flts}} +\newcommand{\distinctBy}{\textit{distinctBy}} +\newcommand{\map}{\textit{map}} +\newcommand{\size}{\textit{size}} \def\awidth{\mathit{awidth}} \def\pder{\mathit{pder}} \def\maxterms{\mathit{maxterms}} @@ -101,6 +106,7 @@ \section{Introduction} \subsection{Basic Regex Introduction} + Suppose (basic) regular expressions are given by the following grammar: \[ r ::= \ZERO \mid \ONE \mid c @@ -117,12 +123,8 @@ & & $\textit{else} \; \textit{output} \; \textit{NO}$ \end{tabular} \end{center} - - -\subsection{The practical problem} -Introduce the problem of matching a pattern with a string. -Why important? -Need to be fast, real-time, on large inputs. +Omnipresent use of regexes in modern +software. Examples: Snort, Bro, etc? \subsubsection{The rules for network intrusion analysis tools } TODO: read rules libraries such as Snort and the explanation for some of the rules @@ -130,6 +132,28 @@ TODO: any other libraries? +There has been many widely used libraries such as +Henry Spencer's regexp(3), RE2, etc. +They are fast and successful, but ugly corner cases +allowing the $\textit{ReDoS}$ attack exist, and +is a non-negligible protion. +\subsection{The practical problem} +These corner cases either +\begin{itemize} +\item +go unnoticed until they +cause considerable grief in real life +\item +or force the regex library writers to pose +restrictions on the input, limiting the +choice a programmer has when using regexes. +\end{itemize} + +Motivation: +We want some library that supports as many constructs as possible, +but still gives formal guarantees on the correctness and running +time. + \subsection{Regexes that brought down CloudFlare} @@ -217,6 +241,29 @@ suffix up to length $n$ of input string is. "Countdown States activation problem": $.*a.{100}$ requires $2^100$ + DFA states. +Example: +Converting $((a|b )*b.{10}){3}$ to a $\DFA$ +gives the error: +\begin{verbatim} +147972 states before minimization, 79107 states in minimized DFA +Old file "search.java" saved as "search.java~" +Writing code to "search.java" + +Unexpected exception encountered. This indicates a bug in JFlex. +Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new + + +character value expected +java.lang.IllegalArgumentException: character value expected + at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105) + at jflex.generator.CountEmitter.emit(CountEmitter.java:116) + at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530) + at jflex.generator.Emitter.emit(Emitter.java:1369) + at jflex.generator.LexGenerator.generate(LexGenerator.java:115) + at jflex.Main.generate(Main.java:320) + at jflex.Main.main(Main.java:336) +\end{verbatim} + \subsubsection{variant of DFA's} counting set automata \\ @@ -471,22 +518,134 @@ And we have a similar argument for the sequence case. -\subsection{stronger simplification} +\subsection{stronger simplification needed} + +\subsubsection{Bounded List of Terms} +We have seen that without simplification the size of $(a+aa)^*$ +grows exponentially and unbounded(where we omit certain nested +parentheses among the four terms in the last explicitly written out regex): + +\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}} +\begin{center} +\begin{tabular}{rll} +$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\ +& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots) +\end{tabular} +\end{center} + +But if we look closely at the regex +\begin{center} +\begin{tabular}{rll} +& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\ +\end{tabular} +\end{center} +we realize that: +\begin{itemize} +\item +The regex is equivalent to an alternative taking a long-flattened list, +where each list is a sequence, and the second child of that sequence +is always $(a+aa)^*$. In other words, the regex is a "linear combination" +of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string). +\item +The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is +bounded because the first child $(a+aa) \backslash s$ can only be one of +$(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$, +$(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$. +\item +With simplification we have that the regex is additionally reduced to, + +where each term $\bsimp((a+aa)\backslash s ) $ +is further reduced to only +$\ONE$, $\ONE + a$ and $\ZERO$. + + +\end{itemize} +Generalizing this to any regular expression of the form +$\sum_{s\in L(r)} \bsimp(r\backslash s ) \cdot r^*$, +we have the closed-form for star regex's string derivative as below: +$\forall r \;s.\; \exists sl. \; s.t.\;\bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^* )$. + +The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by +$\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$, +which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$. +This might give us a polynomial bound on the length of the list +$\distinctBy[(\bsimp(r\backslash s')) \cdot r^* | {s'\in sl} ]$, if the terms in +$\distinctBy[(\bsimp (r\backslash s')) | {s' \in sl}]$ has a polynomial bound. +This is unfortunately not true under our current $\distinctBy$ function: +If we let $r_n = ( (aa)^* + (aaa)^* + \ldots + \underline{(a\ldots a)^*}{n \,a's}) $, +then we have that $\textit{maxterms} r_n = \textit{sup} (\textit{length} [\bsimp(r_n\backslash s') | s' \in sl]) = +L.C.M(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEISA003418} +this grows exponentially quickly. So we have found a regex $r_n$ where +$\textit{maxterms} (r_n ^* \backslash s) \geq 2^{(n-1)}$. + + +\subsubsection{stronger version of \distinctBy} +\href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirove} +has proven a linear bound on the number of terms in the "partial derivatives" of a regular expression: +\begin{center} +$\size (\PDER(r)) \leq \awidth (r)$. +\end{center} + +The proof is by structural induction on the regular expression r. +The hard cases are the sequence case $r_1\cdot r_2$ and star case $r^*$. +The central idea that allows the induction to go through for this bound is on the inclusion: +\begin{center} +$\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$ +\end{center} + +This way, + +\begin{center} +\begin{tabular}{lcl} +$| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ +& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b| + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ +& $=$ & $\awidth(a) + \awidth(b)$ \\ +& $=$ & $\awidth(a+b)$ +\end{tabular} +\end{center} + +we have that a compound regular expression $a\cdot b$'s subterms + is bounded by its sub-expression's derivatives terms. + +This argument can be modified to bound the terms in +our version of regex with strong simplification: +\begin{center} +\begin{tabular}{lcl} +$| \maxterms (\bsimp (a\cdot b) \backslash s)|$ & $=$ & $ |maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\ +& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b| + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\ +& $=$ & $\awidth(a) + \awidth(b)$ \\ +& $=$ & $\awidth(a+b)$ +\end{tabular} +\end{center} + + \subsection{cubic bound} +Bounding the regex's subterms by +its alphabetic width. + +The entire expression's size can be bounded by +number of subterms times each subterms' size. - - - +\section{Support for bounded repetitions and other constructs} +Example: +$.*a.\{100\}$ after translation to $\DFA$ and minimization will +always take over $2^{100}$ states. - +\section{Towards a library with fast running time practically} -\subsection{Too Detailed too basic? regex to NFA translation} +registers and cache-related optimizations? +JVM related optimizations? + +\section{Past Report materials} Deciding whether a string is in the language of the regex can be intuitively done by constructing an NFA\cite{Thompson_1968}: