ChengsongPhdThesis/ChengsongPhDThesis.tex
changeset 444 a7e98deebb5c
parent 443 c6351a96bf80
child 453 d68b9db4c9ec
--- a/ChengsongPhdThesis/ChengsongPhDThesis.tex	Tue Mar 08 00:50:40 2022 +0000
+++ b/ChengsongPhdThesis/ChengsongPhDThesis.tex	Wed Mar 09 17:33:08 2022 +0000
@@ -38,6 +38,7 @@
 \def\lexer{\mathit{lexer}}
 \def\mkeps{\mathit{mkeps}}
 
+\def\DFA{\textit{DFA}}
 \def\bmkeps{\textit{bmkeps}}
 \def\retrieve{\textit{retrieve}}
 \def\blexer{\textit{blexer}}
@@ -55,7 +56,11 @@
 \def\S{\mathit{S}}
 \def\rup{r^\uparrow}
 
-
+\newcommand{\PDER}{\textit{PDER}}
+\newcommand{\flts}{\textit{flts}}
+\newcommand{\distinctBy}{\textit{distinctBy}}
+\newcommand{\map}{\textit{map}}
+\newcommand{\size}{\textit{size}}
 \def\awidth{\mathit{awidth}}
 \def\pder{\mathit{pder}}
 \def\maxterms{\mathit{maxterms}}
@@ -101,6 +106,7 @@
 
 \section{Introduction}
 \subsection{Basic Regex Introduction}
+
 Suppose (basic) regular expressions are given by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c  
@@ -117,12 +123,8 @@
 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
 \end{tabular}
 \end{center}
-
-
-\subsection{The practical problem}
-Introduce the problem of matching a pattern with a string.
-Why important?
-Need to be fast, real-time, on large inputs.
+Omnipresent use of regexes in modern 
+software. 
 Examples: Snort, Bro, etc?
 \subsubsection{The rules for network intrusion analysis tools }
 TODO: read rules libraries such as Snort and the explanation for some of the rules
@@ -130,6 +132,28 @@
 TODO: any other libraries?
 
 
+There has been many widely used libraries such as 
+Henry Spencer's regexp(3), RE2, etc.
+They are fast and successful, but ugly corner cases
+allowing the $\textit{ReDoS}$ attack exist, and
+is a non-negligible protion.
+\subsection{The practical problem}
+These corner cases either
+\begin{itemize}
+\item
+go unnoticed until they
+cause considerable grief in real life
+\item
+or force the regex library writers to pose
+restrictions on the input, limiting the 
+choice a programmer has when using regexes.
+\end{itemize}
+
+Motivation:
+We want some library that supports as many constructs as possible,
+but still gives formal guarantees on the correctness and running
+time.
+
 \subsection{Regexes that brought down CloudFlare}
 
 
@@ -217,6 +241,29 @@
 suffix up to length $n$ of input string is.
 "Countdown States activation problem":
 $.*a.{100}$ requires $2^100$ + DFA states.
+Example:
+Converting $((a|b )*b.{10}){3}$ to a $\DFA$
+gives the error:
+\begin{verbatim}
+147972 states before minimization, 79107 states in minimized DFA
+Old file "search.java" saved as "search.java~"
+Writing code to "search.java"
+
+Unexpected exception encountered. This indicates a bug in JFlex.
+Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new
+
+
+character value expected
+java.lang.IllegalArgumentException: character value expected
+        at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
+        at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
+        at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
+        at jflex.generator.Emitter.emit(Emitter.java:1369)
+        at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
+        at jflex.Main.generate(Main.java:320)
+        at jflex.Main.main(Main.java:336)
+\end{verbatim}
+
 \subsubsection{variant of DFA's}
 counting set automata 
 \\
@@ -471,22 +518,134 @@
 
 
 And we have a similar argument for the sequence case.
-\subsection{stronger simplification}
+\subsection{stronger simplification needed}
+
+\subsubsection{Bounded List of Terms}
+We have seen that without simplification the size of $(a+aa)^*$
+grows exponentially and unbounded(where we omit certain nested
+parentheses among the four terms in the last explicitly written out regex):
+
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
+\end{tabular}
+\end{center}
+
+But if we look closely at the regex
+\begin{center}
+\begin{tabular}{rll}
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+\end{tabular}
+\end{center}
+we realize that:
+\begin{itemize}
+\item
+The regex is  equivalent to an alternative taking a long-flattened list, 
+where each list is a sequence, and the second child of that sequence
+is always $(a+aa)^*$. In other words, the regex is a "linear combination"
+of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string).
+\item
+The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is 
+bounded because the first child $(a+aa) \backslash s$ can only be one of
+$(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$, 
+$(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$.
+\item
+With simplification we have that the regex is additionally reduced to,
+
+where each term $\bsimp((a+aa)\backslash s )  $
+is further reduced to only
+$\ONE$, $\ONE + a$ and $\ZERO$.
+
+
+\end{itemize}
+Generalizing this to any regular expression of the form 
+$\sum_{s\in L(r)} \bsimp(r\backslash s ) \cdot r^*$,
+we have the closed-form for star regex's string derivative as below:
+$\forall r \;s.\; \exists sl. \; s.t.\;\bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^* )$.
+
+The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by
+$\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$,
+which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$.
+This might give us a polynomial bound on the length of the list
+$\distinctBy[(\bsimp(r\backslash s')) \cdot r^* | {s'\in sl} ]$, if the terms in 
+$\distinctBy[(\bsimp (r\backslash s')) | {s' \in sl}]$ has a polynomial bound.
+This is unfortunately not true under our current $\distinctBy$ function:
+If we let $r_n = ( (aa)^* + (aaa)^* + \ldots + \underline{(a\ldots a)^*}{n \,a's}) $,
+then we have that $\textit{maxterms} r_n = \textit{sup} (\textit{length} [\bsimp(r_n\backslash s') | s' \in sl]) = 
+L.C.M(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEISA003418}
+this grows exponentially quickly. So we have found a regex $r_n$ where
+$\textit{maxterms} (r_n ^* \backslash s) \geq 2^{(n-1)}$.
+
+
+\subsubsection{stronger version of \distinctBy}
+\href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirove}
+has proven a linear bound on the number of terms in the "partial derivatives" of a regular expression:
+\begin{center}
+$\size (\PDER(r))  \leq \awidth (r)$.
+\end{center}
+
+The proof is by structural induction on the regular expression r.
+The hard cases are the sequence case $r_1\cdot r_2$ and star case $r^*$.
+The central idea that allows the induction to go through for this bound is on the inclusion:
+\begin{center}
+$\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$
+\end{center} 
+
+This way, 
+
+\begin{center}
+\begin{tabular}{lcl}
+$| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $=$ & $\awidth(a) + \awidth(b)$ \\
+& $=$ & $\awidth(a+b)$
+\end{tabular}
+\end{center} 
+
+we have that a compound regular expression $a\cdot b$'s subterms
+ is bounded by its sub-expression's derivatives terms.
+ 
+This argument can be modified to bound the terms in 
+our  version of regex with strong simplification:
+\begin{center}
+\begin{tabular}{lcl}
+$| \maxterms (\bsimp  (a\cdot b) \backslash s)|$ & $=$ & $ |maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\
+& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $=$ & $\awidth(a) + \awidth(b)$ \\
+& $=$ & $\awidth(a+b)$
+\end{tabular}
+\end{center} 
+
+
 
 
 \subsection{cubic bound}
+Bounding the regex's subterms by
+its alphabetic width.
+
+The entire expression's size can be bounded by
+number of subterms times each subterms' size.
 
 
 
 
 
-
-
-
+\section{Support for bounded repetitions and other constructs}
+Example: 
+$.*a.\{100\}$ after translation to $\DFA$ and minimization will 
+always take over $2^{100}$ states.
 
-
+\section{Towards a library with fast running time practically}
 
-\subsection{Too Detailed too basic? regex to NFA translation}
+registers and cache-related optimizations?
+JVM related optimizations?
+
+\section{Past Report materials}
 
 Deciding whether a string is in the language of the regex
 can be intuitively done by constructing an NFA\cite{Thompson_1968}: