lexing: comparison ChengsongPhdThesis/ChengsongPhDThesis.tex

equal deleted inserted replaced

-:a5376206fd52
+:0856fbf8bda7
 looked at extended regular expressions, such as bounded repetitions,
 negation and back-references.
 \end{abstract}
 \section{Introduction}
-\subsection{Practical Example of Regex}
+\subsection{Basic Regex Introduction}
-%TODO: read rules libraries and the explanation for some of the rules
-matching some string $s$ with a regex
-\begin{verbatim}
-(?:(?:\"|'|\]|\}|\\|\d|
-(?:nan|infinity|true|false|null|undefined|symbol|math)
-|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
-\end{verbatim}
-%Could be from a network intrusion detection algorithm.
-%Checking whether there is some malicious code
-%in the network data blocks being routed.
-%If so, discard the data and identify the sender for future alert.
-\subsection{The problem: Efficient Matching and Lexing}
-A programmer writes patterns to process texts,
-where a regex is a structured symbolic pattern
-specifying what the string should be like.
-The above regex looks complicated, but can be
-described by some basic constructs:
 Suppose (basic) regular expressions are given by the following grammar:
 \[			r ::=   \ZERO \mid  \ONE
 			 \mid  c
 			 \mid  r_1 \cdot r_2
 			 \mid  r_1 + r_2
 			 \mid r^*
 \]
-\noindent
+Problem of matching:
-The intended meaning of the constructors is as follows: $\ZERO$
-cannot match any string, $\ONE$ can match the empty string, the
-character regular expression $c$ can match the character $c$, and so
-on.
-and the underlying algorithmic problem is:
 \begin{center}
 \begin{tabular}{lcr}
 $\textit{Match}(r, s)$ & $ =  $ & $\textit{if}\; s \in L(r)\; \textit{output} \; \textit{YES}$\\
 &				& $\textit{else} \; \textit{output} \; \textit{NO}$
 \end{tabular}
 \end{center}
+\subsection{The practical problem}
+Introduce the problem of matching a pattern with a string.
+Why important?
+Need to be fast, real-time, on large inputs.
+Examples: Snort, Bro, etc?
+\subsubsection{The rules for network intrusion analysis tools }
+TODO: read rules libraries such as Snort and the explanation for some of the rules
+TODO: pcre/pcre2?
+TODO: any other libraries?
+\subsection{Regexes that brought down CloudFlare}
+matching some string $s$ with a regex
+\begin{verbatim}
+(?:(?:\"|'|\]|\}|\\|\d|
+(?:nan|infinity|true|false|null|undefined|symbol|math)
+|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
+\end{verbatim}
+%Could be from a network intrusion detection algorithm.
+%Checking whether there is some malicious code
+%in the network data blocks being routed.
+%If so, discard the data and identify the sender for future alert.
+\section{Existing approaches}
+\subsection{Shortcomings of different methods}
+\subsubsection{ NFA's}
+Can be slow especially when many states are active.
+\subsubsection{DFAs}
+The tool JFLEX uses it.
+Advantages: super fast on most regexes \\
+TODO: show it being fast on a lot of inputs\\
+Disavantages:
+state explosion for bounded repetitions due to
+theoretic bottleneck of having to remember exactly what the
+suffix up to length $n$ of input string is.
+"Countdown States activation problem":
+$.*a.{100}$ requires $2^100$ + DFA states.
+\subsubsection{variant of DFA's}
+counting set automata
+\\
+TODO: microsoft 2020 oopsla CsA work, need to add bibli entry, and read, understand key novelty, learn to benchmark like it
+\\
+TODO: find weakness of such counting set automata?
+\\
+Other variants?
+\subsubsection{NFA and Regex: isomorphic structure}
+TODO: define mathematically an isomorphism?\\
+\subsubsection{variants of NFA's}
+How about acting on regular expressions themselves? Certain copies represent verbose info--that they will always match the same string--prune away!
+\subsection{Brzozowski's derivatives}
+\subsection{Sulzmann and Lu's algorithm}
+\subsection{Bit-coded algorithm}
++bitcodes!
+Built on top of derivatives, but with auxiliary bits
+\subsection{Correctness Proof}
+unproven by Sulzmann  and Lu
+by Ausaf and Urban
+\section{My Work}
+\subsection{an improved version of bit-coded algorithm: with simp!}
+\subsection{a correctness proof for bitcoded}
+\subsection{finiteness proof }
+\subsection{stronger simplification}
+\subsection{cubic bound}
+\subsection{Too Detailed too basic? regex to NFA translation}
 Deciding whether a string is in the language of the regex
 can be intuitively done by constructing an NFA\cite{Thompson_1968}:
 and simulate the running of it:
 \begin{figure}

changeset 440	0856fbf8bda7
parent 438	a73b2e553804
child 441	426a93160f4a