lexing: comparison ChengsongPhdThesis/ChengsongPhDThesis.tex

equal deleted inserted replaced

-:c6351a96bf80
+:a7e98deebb5c
 \newcommand{\ZERO}{\mbox{\bf 0}}
 \newcommand{\ONE}{\mbox{\bf 1}}
 \def\lexer{\mathit{lexer}}
 \def\mkeps{\mathit{mkeps}}
+\def\DFA{\textit{DFA}}
 \def\bmkeps{\textit{bmkeps}}
 \def\retrieve{\textit{retrieve}}
 \def\blexer{\textit{blexer}}
 \def\flex{\textit{flex}}
 \def\inj{\mathit{inj}}
 \def\nullable{\mathit{nullable}}
 \def\Z{\mathit{Z}}
 \def\S{\mathit{S}}
 \def\rup{r^\uparrow}
+\newcommand{\PDER}{\textit{PDER}}
+\newcommand{\flts}{\textit{flts}}
+\newcommand{\distinctBy}{\textit{distinctBy}}
+\newcommand{\map}{\textit{map}}
+\newcommand{\size}{\textit{size}}
 \def\awidth{\mathit{awidth}}
 \def\pder{\mathit{pder}}
 \def\maxterms{\mathit{maxterms}}
 \def\bsimp{\mathit{bsimp}}
 negation and back-references.
 \end{abstract}
 \section{Introduction}
 \subsection{Basic Regex Introduction}
 Suppose (basic) regular expressions are given by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c
 			 \mid  r_1 \cdot r_2
 			 \mid  r_1 + r_2
 \begin{tabular}{lcr}
 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
 \end{tabular}
 \end{center}
+Omnipresent use of regexes in modern
+software.
-\subsection{The practical problem}
-Introduce the problem of matching a pattern with a string.
-Why important?
-Need to be fast, real-time, on large inputs.
 Examples: Snort, Bro, etc?
 \subsubsection{The rules for network intrusion analysis tools }
 TODO: read rules libraries such as Snort and the explanation for some of the rules
 TODO: pcre/pcre2?
 TODO: any other libraries?
+There has been many widely used libraries such as
+Henry Spencer's regexp(3), RE2, etc.
+They are fast and successful, but ugly corner cases
+allowing the $\textit{ReDoS}$ attack exist, and
+is a non-negligible protion.
+\subsection{The practical problem}
+These corner cases either
+\begin{itemize}
+\item
+go unnoticed until they
+cause considerable grief in real life
+\item
+or force the regex library writers to pose
+restrictions on the input, limiting the
+choice a programmer has when using regexes.
+\end{itemize}
+Motivation:
+We want some library that supports as many constructs as possible,
+but still gives formal guarantees on the correctness and running
+time.
 \subsection{Regexes that brought down CloudFlare}
 matching some string $s$ with a regex
 state explosion for bounded repetitions due to
 theoretic bottleneck of having to remember exactly what the
 suffix up to length $n$ of input string is.
 "Countdown States activation problem":
 $.*a.{100}$ requires $2^100$ + DFA states.
+Example:
+Converting $((a|b )*b.{10}){3}$ to a $\DFA$
+gives the error:
+\begin{verbatim}
+147972 states before minimization, 79107 states in minimized DFA
+Old file "search.java" saved as "search.java~"
+Writing code to "search.java"
+Unexpected exception encountered. This indicates a bug in JFlex.
+Please consider filing an issue at http://github.com/jflex-de/jflex/issues/new
+character value expected
+java.lang.IllegalArgumentException: character value expected
+at jflex.generator.PackEmitter.emitUC(PackEmitter.java:105)
+at jflex.generator.CountEmitter.emit(CountEmitter.java:116)
+at jflex.generator.Emitter.emitDynamicInit(Emitter.java:530)
+at jflex.generator.Emitter.emit(Emitter.java:1369)
+at jflex.generator.LexGenerator.generate(LexGenerator.java:115)
+at jflex.Main.generate(Main.java:320)
+at jflex.Main.main(Main.java:336)
+\end{verbatim}
 \subsubsection{variant of DFA's}
 counting set automata
 \\
 TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it
 \\
 (card (rexp_enum (Suc (N + rsize (RSTAR r0))))) * (Suc (N + rsize (RSTAR r0)))"
 \end{verbatim}
 And we have a similar argument for the sequence case.
-\subsection{stronger simplification}
+\subsection{stronger simplification needed}
+\subsubsection{Bounded List of Terms}
+We have seen that without simplification the size of $(a+aa)^*$
+grows exponentially and unbounded(where we omit certain nested
+parentheses among the four terms in the last explicitly written out regex):
+\def\ll{\stackrel{\_\backslash{} a}{\longrightarrow}}
+\begin{center}
+\begin{tabular}{rll}
+$(a + aa)^*$ & $\ll$ & $(\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* \;+\; (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & $(\ZERO + \ZERO{}a + \ZERO) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^* \;+\; $\\
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+& $\ll$ & \ldots \hspace{15mm}(regular expressions of sizes 98, 169, 283, 468, 767, \ldots)
+\end{tabular}
+\end{center}
+But if we look closely at the regex
+\begin{center}
+\begin{tabular}{rll}
+& & $\qquad(\ZERO + \ZERO{}a + \ONE) \cdot (a + aa)^* + (\ONE + \ONE{}a) \cdot (a + aa)^*$\\
+\end{tabular}
+\end{center}
+we realize that:
+\begin{itemize}
+\item
+The regex is  equivalent to an alternative taking a long-flattened list,
+where each list is a sequence, and the second child of that sequence
+is always $(a+aa)^*$. In other words, the regex is a "linear combination"
+of terms of the form $(a+aa)\backslash s \cdot (a+aa)^*$ ($s$ is any string).
+\item
+The number of different terms of the shape $(a+aa) \backslash s \cdot (a+aa)^*$ is
+bounded because the first child $(a+aa) \backslash s$ can only be one of
+$(\ZERO + \ZERO{}a + \ZERO)$, $(\ZERO + \ZERO{}a + \ONE)$,
+$(\ONE + \ONE{}a)$ and $(\ZERO + \ZERO{}a)$.
+\item
+With simplification we have that the regex is additionally reduced to,
+where each term $\bsimp((a+aa)\backslash s )  $
+is further reduced to only
+$\ONE$, $\ONE + a$ and $\ZERO$.
+\end{itemize}
+Generalizing this to any regular expression of the form
+$\sum_{s\in L(r)} \bsimp(r\backslash s ) \cdot r^*$,
+we have the closed-form for star regex's string derivative as below:
+$\forall r \;s.\; \exists sl. \; s.t.\;\bsimp(r^* \backslash s) = \bsimp(\sum_{s'\in sl}(r\backslash s') \cdot r^* )$.
+The regex $\bsimp(\sum_{s' \in sl}(r\backslash s') \cdot r^*)$ is bounded by
+$\distinctBy(\flts(\sum_{s'\in sl}(\bsimp(r \backslash s')) \cdot r^*))$,
+which again is bounded by $\distinctBy(\sum_{s'\in sl}(\bsimp(r\backslash s')) \cdot r^*)$.
+This might give us a polynomial bound on the length of the list
+$\distinctBy[(\bsimp(r\backslash s')) \cdot r^* | {s'\in sl} ]$, if the terms in
+$\distinctBy[(\bsimp (r\backslash s')) | {s' \in sl}]$ has a polynomial bound.
+This is unfortunately not true under our current $\distinctBy$ function:
+If we let $r_n = ( (aa)^* + (aaa)^* + \ldots + \underline{(a\ldots a)^*}{n \,a's}) $,
+then we have that $\textit{maxterms} r_n = \textit{sup} (\textit{length} [\bsimp(r_n\backslash s') | s' \in sl]) =
+L.C.M(1,\ldots, n)$. According to \href{http://oeis.org/A003418}{OEISA003418}
+this grows exponentially quickly. So we have found a regex $r_n$ where
+$\textit{maxterms} (r_n ^* \backslash s) \geq 2^{(n-1)}$.
+\subsubsection{stronger version of \distinctBy}
+\href{https://www.researchgate.net/publication/340874991_Partial_derivatives_of_regular_expressions_and_finite_automaton_constructions}{Antimirove}
+has proven a linear bound on the number of terms in the "partial derivatives" of a regular expression:
+\begin{center}
+$\size (\PDER(r))  \leq \awidth (r)$.
+\end{center}
+The proof is by structural induction on the regular expression r.
+The hard cases are the sequence case $r_1\cdot r_2$ and star case $r^*$.
+The central idea that allows the induction to go through for this bound is on the inclusion:
+\begin{center}
+$\pder_{s@[c]} (a\cdot b) \subseteq (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))$
+\end{center}
+This way,
+\begin{center}
+\begin{tabular}{lcl}
+$| \pder_{s@[c]} (a\cdot b) |$ & $ \leq$ & $ | (\pder_{s@[c]} a ) \cdot b \cup (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $=$ & $\awidth(a) + \awidth(b)$ \\
+& $=$ & $\awidth(a+b)$
+\end{tabular}
+\end{center}
+we have that a compound regular expression $a\cdot b$'s subterms
+is bounded by its sub-expression's derivatives terms.
+This argument can be modified to bound the terms in
+our  version of regex with strong simplification:
+\begin{center}
+\begin{tabular}{lcl}
+$| \maxterms (\bsimp  (a\cdot b) \backslash s)|$ & $=$ & $ |maxterms(\bsimp( (a\backslash s \cdot b) + \sum_{s'\in sl}(b\backslash s') ))|$\\
+& $\leq$ & $| (\pder_{s@[c]} a ) \cdot b|  + | (\bigcup_{s' \in Suf(s@[c])} (\pder_{s'} \; b))|$\\
+& $=$ & $\awidth(a) + \awidth(b)$ \\
+& $=$ & $\awidth(a+b)$
+\end{tabular}
+\end{center}
 \subsection{cubic bound}
+Bounding the regex's subterms by
+its alphabetic width.
+The entire expression's size can be bounded by
+number of subterms times each subterms' size.
+\section{Support for bounded repetitions and other constructs}
-\subsection{Too Detailed too basic? regex to NFA translation}
+Example:
+$.*a.\{100\}$ after translation to $\DFA$ and minimization will
+always take over $2^{100}$ states.
+\section{Towards a library with fast running time practically}
+registers and cache-related optimizations?
+JVM related optimizations?
+\section{Past Report materials}
 Deciding whether a string is in the language of the regex
 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
 and simulate the running of it.

changeset 444	a7e98deebb5c
parent 443	c6351a96bf80
child 453	d68b9db4c9ec