# HG changeset patch # User Chengsong # Date 1646265191 0 # Node ID 0856fbf8bda74786dea8ed85d54f8171f706d450 # Parent a5376206fd529454803327ef6bde31b110b6eb50 bonestruct diff -r a5376206fd52 -r 0856fbf8bda7 ChengsongPhdThesis/ChengsongPhDThesis.tex --- a/ChengsongPhdThesis/ChengsongPhDThesis.tex Wed Mar 02 23:23:19 2022 +0000 +++ b/ChengsongPhdThesis/ChengsongPhDThesis.tex Wed Mar 02 23:53:11 2022 +0000 @@ -95,9 +95,39 @@ \end{abstract} \section{Introduction} -\subsection{Practical Example of Regex} +\subsection{Basic Regex Introduction} +Suppose (basic) regular expressions are given by the following grammar: +\[ r ::= \ZERO \mid \ONE + \mid c + \mid r_1 \cdot r_2 + \mid r_1 + r_2 + \mid r^* +\] + +Problem of matching: -%TODO: read rules libraries and the explanation for some of the rules +\begin{center} +\begin{tabular}{lcr} +$\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ +& & $\textit{else} \; \textit{output} \; \textit{NO}$ +\end{tabular} +\end{center} + + +\subsection{The practical problem} +Introduce the problem of matching a pattern with a string. +Why important? +Need to be fast, real-time, on large inputs. +Examples: Snort, Bro, etc? +\subsubsection{The rules for network intrusion analysis tools } +TODO: read rules libraries such as Snort and the explanation for some of the rules +TODO: pcre/pcre2? +TODO: any other libraries? + + +\subsection{Regexes that brought down CloudFlare} + + matching some string $s$ with a regex \begin{verbatim} @@ -111,37 +141,75 @@ %Checking whether there is some malicious code %in the network data blocks being routed. %If so, discard the data and identify the sender for future alert. +\section{Existing approaches} +\subsection{Shortcomings of different methods} -\subsection{The problem: Efficient Matching and Lexing} -A programmer writes patterns to process texts, -where a regex is a structured symbolic pattern -specifying what the string should be like. + +\subsubsection{ NFA's} +Can be slow especially when many states are active. -The above regex looks complicated, but can be -described by some basic constructs: +\subsubsection{DFAs} +The tool JFLEX uses it. +Advantages: super fast on most regexes \\ +TODO: show it being fast on a lot of inputs\\ +Disavantages: +state explosion for bounded repetitions due to +theoretic bottleneck of having to remember exactly what the +suffix up to length $n$ of input string is. +"Countdown States activation problem": +$.*a.{100}$ requires $2^100$ + DFA states. +\subsubsection{variant of DFA's} +counting set automata +\\ +TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it +\\ +TODO: find weakness of such counting set automata? +\\ +Other variants? + +\subsubsection{NFA and Regex: isomorphic structure} +TODO: define mathematically an isomorphism?\\ + -Suppose (basic) regular expressions are given by the following grammar: -\[ r ::= \ZERO \mid \ONE - \mid c - \mid r_1 \cdot r_2 - \mid r_1 + r_2 - \mid r^* -\] +\subsubsection{variants of NFA's} +How about acting on regular expressions themselves? Certain copies represent verbose info--that they will always match the same string--prune away! + +\subsection{Brzozowski's derivatives} + +\subsection{Sulzmann and Lu's algorithm} + +\subsection{Bit-coded algorithm} ++bitcodes! +Built on top of derivatives, but with auxiliary bits +\subsection{Correctness Proof} +unproven by Sulzmann and Lu +by Ausaf and Urban + +\section{My Work} + +\subsection{an improved version of bit-coded algorithm: with simp!} -\noindent -The intended meaning of the constructors is as follows: $\ZERO$ -cannot match any string, $\ONE$ can match the empty string, the -character regular expression $c$ can match the character $c$, and so -on. +\subsection{a correctness proof for bitcoded} + +\subsection{finiteness proof } + +\subsection{stronger simplification} + + +\subsection{cubic bound} + -and the underlying algorithmic problem is: -\begin{center} -\begin{tabular}{lcr} -$\textit{Match}(r, s)$ & $ = $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\ -& & $\textit{else} \; \textit{output} \; \textit{NO}$ -\end{tabular} -\end{center} + + + + + + + + + +\subsection{Too Detailed too basic? regex to NFA translation} Deciding whether a string is in the language of the regex can be intuitively done by constructing an NFA\cite{Thompson_1968}: