--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 29 14:05:42 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Fri Sep 30 01:47:33 2022 +0100
@@ -577,7 +577,10 @@
Bounded repetitions are very important because they
tend to occur a lot in practical use.
For example in the regex library RegExLib,
-the rules library of Snort, as well as in XML Schema definitions (XSDs).
+the rules library of Snort \cite{Snort1999}\footnote{
+Snort is a network intrusion detection (NID) tool
+for monitoring network traffic.},
+as well as in XML Schema definitions (XSDs).
According to Bj\"{o}rklund et al \cite{xml2015},
more than half of the
XSDs they found have bounded regular expressions in them.
@@ -707,9 +710,6 @@
The backtracking method is employed in regex libraries
that support \emph{back-references}, for example
in Java and Python.
-The syntax and expressive power of regexes
-go beyond the regular language hierarchy
-with back-references.
%\section{Back-references and The Terminology Regex}
%When one constructs an $\NFA$ out of a regular expression
@@ -766,15 +766,46 @@
A concrete example
for back-references would be
\begin{center}
-$((a|b|c|\ldots|z)^*)\backslash 1$,
+$(.^*)\backslash 1$,
\end{center}
which would match
strings that can be split into two identical halves,
-for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
-Back-reference is a regex construct
-that programmers found useful, but not exactly
-regular any more.
-In fact, that allows the regex construct to express
+for example $\mathit{foofoo}$, $\mathit{ww}$, etc.
+Note that this is different from
+repeating the sub-expression verbatim like
+\begin{center}
+ $(.^*)(.^*)$,
+\end{center}
+which does not impose any restrictions on what strings the second
+sub-expression $.^*$
+might match.
+Another example of back-references would be
+\begin{center}
+$(.)(.)\backslash 2\backslash 1$
+\end{center}
+which expresses four-character palindromes
+like $abba$, $x??x$ etc.
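+Back-references behave, for instance, as follows in Python's
+\texttt{re} module (a small sketch for illustration only; the example
+strings are chosen arbitrarily):
+\begin{verbatim}
+import re
+
+# (.*)\1 : some text followed by an exact copy of itself
+print(bool(re.fullmatch(r'(.*)\1', 'foofoo')))    # True
+print(bool(re.fullmatch(r'(.*)\1', 'foobar')))    # False
+
+# (.)(.)\2\1 : four-character palindromes
+print(bool(re.fullmatch(r'(.)(.)\2\1', 'abba')))  # True
+print(bool(re.fullmatch(r'(.)(.)\2\1', 'abab')))  # False
+\end{verbatim}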
+
+Back-references are a regex construct
+that programmers find quite useful.
+According to Becchi and Crowley~\cite{Becchi08},
+6\% of Snort rules (up until 2008) make use of them.
+The most common use of back-references
+is to express well-formed HTML files,
+where back-references come in handy for matching
+a pair of opening and closing tags, as in
+\begin{center}
+ $\langle html \rangle \ldots \langle / html \rangle$
+\end{center}
+A regex describing such a format
+could be
+\begin{center}
+ $\langle (.^+) \rangle \ldots \langle / \backslash 1 \rangle$
+\end{center}
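+A rough Python sketch of such a tag-matching pattern (simplified and
+purely illustrative; it is by no means a full treatment of HTML) could be:
+\begin{verbatim}
+import re
+
+# \1 forces the closing tag to repeat whatever the opening tag captured.
+tag_pair = re.compile(r'<(\w+)>.*</\1>')
+print(bool(tag_pair.fullmatch('<html>...</html>')))   # True
+print(bool(tag_pair.fullmatch('<html>...</body>')))   # False
+\end{verbatim}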
+Despite being useful, back-references take the syntax and
+expressive power of regexes beyond the regular languages.
+In fact, they allow regexes to express
languages that cannot be contained in context-free
languages either.
For example, the back-reference $(a^*)b\backslash 1 b \backslash 1$
@@ -783,20 +814,37 @@
Such a language is contained in the context-sensitive hierarchy
of formal languages.
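+A quick Python check (a sketch with arbitrarily chosen strings)
+illustrates that, as a full match, this pattern only accepts strings of
+the form $a^n b a^n b a^n$:
+\begin{verbatim}
+import re
+
+p = re.compile(r'(a*)b\1b\1')
+print(bool(p.fullmatch('aabaabaa')))    # True:  a^2 b a^2 b a^2
+print(bool(p.fullmatch('aabaabaaa')))   # False: the three a-blocks differ
+\end{verbatim}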
Solving the back-reference expressions matching problem
-is NP-complete\parencite{alfred2014algorithms} and a non-bactracking,
+is NP-complete~\parencite{alfred2014algorithms}.
+A non-backtracking,
efficient solution is not known to exist.
-%TODO:read a bit more about back reference algorithms
+Regex libraries supporting back-references, such as PCRE, therefore have to
+resort to a depth-first search algorithm that backtracks.
+The unreasonable part is that even for
+regexes not involving back-references, there is still
+a (non-negligible) chance they might backtrack super-linearly.
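+For instance, the following Python sketch (timings are machine- and
+version-dependent; the regex and string are the classic textbook example)
+exhibits this super-linear behaviour even though no back-reference is
+involved:
+\begin{verbatim}
+import re, time
+
+# A classic "evil" regex: no back-references, but a backtracking
+# engine may try exponentially many ways of splitting the a's
+# before concluding that there is no match.
+evil = re.compile(r'(a+)+b')
+s = 'a' * 25                  # note: no 'b' at the end
+
+start = time.time()
+evil.match(s)                 # fails, but only after heavy backtracking
+print(f'{time.time() - start:.2f}s')
+\end{verbatim}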
-It seems that languages like Java and Python made the trade-off
-to support back-references at the expense of having to backtrack,
-even in the case of regexes not involving back-references.\\
+\subsection{Summary of the Catastrophic Backtracking Problem}
Summing these up, we can categorise existing
-practical regex libraries into the ones with linear
-time guarantees like Go and Rust, which impose restrictions
+practical regex libraries into two kinds:
+(i) the ones with linear-time
+guarantees, like Go and Rust; the cost is that
+they impose restrictions
on the user input (not allowing back-references,
-bounded repetitions cannot exceed 1000 etc.), and ones
- that allows the programmer much freedom, but grinds to a halt
- in some non-negligible portion of cases.
+bounded repetitions cannot exceed a fixed counter limit, etc.);
+(ii) those
+that allow large bounded regular expressions and back-references
+at the expense of using a backtracking algorithm.
+The latter can grind to a halt
+on some very simple cases, making them vulnerable to
+ReDoS attacks.
+
+
+We would like to have regex engines that can
+deal with the regular part of regexes
+(e.g.~bounded repetitions) more
+efficiently.
+We also want to make sure that they do so correctly.
+It turns out that this aim is not so easy to achieve.
%TODO: give examples such as RE2 GOLANG 1000 restriction, rust no repetitions
% For example, the Rust regex engine claims to be linear,
% but does not support lookarounds and back-references.
@@ -819,16 +867,25 @@
\section{Error-prone POSIX Implementations}
-The problems with practical implementations
-of reare not limited to slowness on certain
-cases.
-Another thing about these libraries is that there
-is no correctness guarantee.
-In some cases, they either fail to generate a lexing result when there exists a match,
+When there are multiple ways of matching a string
+with a regular expression, a matcher needs to
+disambiguate.
+The rule deciding which particular match to pick
+is called the disambiguation strategy.
+The more intuitive strategy is called POSIX,
+which always chooses the longest initial match.
+An alternative is the greedy strategy,
+which commits to each sub-match as early as possible
+instead of searching for the longest overall match.
+The POSIX standard is widely adopted in many operating systems.
+However, many implementations (including the C libraries
+used by Linux and OS X distributions) contain bugs
+or do not meet the specification they claim to adhere to.
+In some cases, they either fail to generate a lexing
+result when there exists a match,
or give results that are inconsistent with the $\POSIX$ standard.
-A concrete example would be the regex
+A concrete example, given by Fowler \cite{fowler2003}, is the regex
\begin{center}
- $(aba + ab + a)* \text{and the string} ababa$
+	$(aba + ab + a)^*$ \quad and the string \quad $ababa$
\end{center}
The correct $\POSIX$ match for the above would be
with the entire string $ababa$,
@@ -839,117 +896,127 @@
with different language engines would yield
the same two fragmented matches: $[aba]$ at $[0, 3)$
and $a$ at $[4, 5)$.
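+This fragmented behaviour can be reproduced with, for instance, Python's
+\texttt{re} module, which follows a PCRE-style (greedy) rather than POSIX
+disambiguation strategy (a small sketch for illustration only):
+\begin{verbatim}
+import re
+
+# Python's re uses a leftmost/greedy strategy, not POSIX
+# longest-match disambiguation, so the match gets fragmented.
+spans = [m.span() for m in re.finditer(r'(?:aba|ab|a)*', 'ababa')
+         if m.group()]
+print(spans)   # expected: [(0, 3), (4, 5)] rather than the POSIX [(0, 5)]
+\end{verbatim}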
-
-Kuklewicz\parencite{KuklewiczHaskell} commented that most regex libraries are not
+Fowler \cite{fowler2003} and Kuklewicz \cite{KuklewiczHaskell}
+commented that most regex libraries are not
correctly implementing the POSIX (maximum-munch)
rule of regular expression matching.
-
As Grathwohl\parencite{grathwohl2014crash} wrote,
\begin{quote}
- The POSIX strategy is more complicated than the
+ ``The POSIX strategy is more complicated than the
greedy because of the dependence on information about
- the length of matched strings in the various subexpressions.
+ the length of matched strings in the various subexpressions.''
\end{quote}
%\noindent
-To summarise the above, regular expressions are important.
-They are popular and programming languages' library functions
-for them are very fast on non-catastrophic cases.
-But there are problems with current practical implementations.
-First thing is that the running time might blow up.
-The second problem is that they might be error-prone on certain
-very simple cases.
-In the next part of the chapter, we will look into reasons why
-certain regex engines are running horribly slow on the "catastrophic"
-cases and propose a solution that addresses both of these problems
-based on Brzozowski and Sulzmann and Lu's work.
-
-
-\subsection{Different Phases of a Matching/Lexing Algorithm}
+The implementation complexity of POSIX rules also comes from
+the fact that the specification is not very clear.
+There are many informal summaries of this disambiguation
+strategy, which are often quite long and delicate.
+For example, Kuklewicz described the POSIX rule as
+\begin{quote}
+	``
+	\begin{itemize}
+		\item
+regular expressions (REs) take the leftmost starting match, and the longest match starting there
+\item
+earlier subpatterns have leftmost-longest priority over later subpatterns
+\item
+higher-level subpatterns have leftmost-longest priority over their component subpatterns
+\item
+REs have right associative concatenation which can be changed with parenthesis
+\item
+parenthesized subexpressions return the match from their last usage
+\item
+text of component subexpressions must be contained in the text of the
+higher-level subexpressions
+\item
+if ``p'' and ``q'' can never match the same text then ``p|q'' and ``q|p'' are equivalent, up to trivial renumbering of captured subexpressions
+\item
+if ``p'' in ``p*'' is used to capture non-empty text then additional repetitions of ``p'' will not capture an empty string''
+\end{itemize}
+\end{quote}
+The text above
+is trying to capture something very precise,
+and is crying out for a formalisation.
-Most lexing algorithms can be roughly divided into
-two phases during its run.
-The first phase is the "construction" phase,
-in which the algorithm builds some
-suitable data structure from the input regex $r$, so that
-it can be easily operated on later.
-We denote
-the time cost for such a phase by $P_1(r)$.
-The second phase is the lexing phase, when the input string
-$s$ is read and the data structure
-representing that regex $r$ is being operated on.
-We represent the time
-it takes by $P_2(r, s)$.\\
-For $\mathit{DFA}$,
-we have $P_2(r, s) = O( |s| )$,
-because we take at most $|s|$ steps,
-and each step takes
-at most one transition--
-a deterministic-finite-automata
-by definition has at most one state active and at most one
-transition upon receiving an input symbol.
-But unfortunately in the worst case
-$P_1(r) = O(exp^{|r|})$. An example will be given later.
-For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
-expressions like $r^n$ into
-\[
- \underbrace{r \cdots r}_{\text{n copies of r}}.
-\]
-The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-On the other hand, if backtracking is used, the worst-case time bound bloats
-to $|r| * 2^{|s|}$.
-%on the input
-%And when calculating the time complexity of the matching algorithm,
-%we are assuming that each input reading step requires constant time.
-%which translates to that the number of
-%states active and transitions taken each time is bounded by a
-%constant $C$.
-%But modern regex libraries in popular language engines
-% often want to support much richer constructs than just
-% sequences and Kleene stars,
-%such as negation, intersection,
-%bounded repetitions and back-references.
-%And de-sugaring these "extended" regular expressions
-%into basic ones might bloat the size exponentially.
-%TODO: more reference for exponential size blowup on desugaring.
+%\subsection{Different Phases of a Matching/Lexing Algorithm}
+%
+%
+%Most lexing algorithms can be roughly divided into
+%two phases during its run.
+%The first phase is the "construction" phase,
+%in which the algorithm builds some
+%suitable data structure from the input regex $r$, so that
+%it can be easily operated on later.
+%We denote
+%the time cost for such a phase by $P_1(r)$.
+%The second phase is the lexing phase, when the input string
+%$s$ is read and the data structure
+%representing that regex $r$ is being operated on.
+%We represent the time
+%it takes by $P_2(r, s)$.\\
+%For $\mathit{DFA}$,
+%we have $P_2(r, s) = O( |s| )$,
+%because we take at most $|s|$ steps,
+%and each step takes
+%at most one transition--
+%a deterministic-finite-automata
+%by definition has at most one state active and at most one
+%transition upon receiving an input symbol.
+%But unfortunately in the worst case
+%$P_1(r) = O(exp^{|r|})$. An example will be given later.
+%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold
+%expressions like $r^n$ into
+%\[
+% \underbrace{r \cdots r}_{\text{n copies of r}}.
+%\]
+%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
+%On the other hand, if backtracking is used, the worst-case time bound bloats
+%to $|r| * 2^{|s|}$.
+%%on the input
+%%And when calculating the time complexity of the matching algorithm,
+%%we are assuming that each input reading step requires constant time.
+%%which translates to that the number of
+%%states active and transitions taken each time is bounded by a
+%%constant $C$.
+%%But modern regex libraries in popular language engines
+%% often want to support much richer constructs than just
+%% sequences and Kleene stars,
+%%such as negation, intersection,
+%%bounded repetitions and back-references.
+%%And de-sugaring these "extended" regular expressions
+%%into basic ones might bloat the size exponentially.
+%%TODO: more reference for exponential size blowup on desugaring.
+%
+%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
+%
+%
+%The good things about $\mathit{DFA}$s is that once
+%generated, they are fast and stable, unlike
+%backtracking algorithms.
+%However, they do not scale well with bounded repetitions.
+%
+%\subsubsection{Problems with Bounded Repetitions}
+%
+%
-\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
-
-
-The good things about $\mathit{DFA}$s is that once
-generated, they are fast and stable, unlike
-backtracking algorithms.
-However, they do not scale well with bounded repetitions.
-
-\subsubsection{Problems with Bounded Repetitions}
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-So we have practical implementations
-on regular expression matching/lexing which are fast
-but do not come with any guarantees that it will not grind to a halt
-or give wrong answers.
-Our goal is to have a regex lexing algorithm that comes with
+To summarise, we need regex libraries that are both fast
+and correct.
+Moreover, that correctness needs to be built on a precise
+model of what POSIX disambiguation is.
+We propose a solution that addresses both problems,
+based on the work of Brzozowski, Sulzmann and Lu, and Ausaf and Urban.
+The end result is a regular expression lexing algorithm that comes with
\begin{itemize}
\item
-proven correctness
+a proven correctness theorem with respect to the POSIX specification
+given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
\item
-proven non-catastrophic properties
+a proven property stating that the algorithm's internal data structures
+remain finitely bounded in size,
\item
-easy extensions to
-constructs like
- bounded repetitions, negation, lookarounds, and even back-references.
+and an extension to
+the bounded repetitions construct, with the correctness and finiteness properties
+maintained.
\end{itemize}
\section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
--- a/ChengsongTanPhdThesis/example.bib Thu Sep 29 14:05:42 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib Fri Sep 30 01:47:33 2022 +0100
@@ -5,6 +5,29 @@
%% Saved with string encoding Unicode (UTF-8)
+
+@article{fowler2003,
+ title={An interpretation of the POSIX regex standard},
+ author={Fowler, Glenn},
+  journal={URL: https://web.archive.org/web/20050408073627/http://www.research.att.com/\~{}gsf/testregex/re-interpretation.html},
+ year={2003}
+}
+
+
+@inproceedings{Snort1999,
+author = {Roesch, Martin},
+title = {Snort - Lightweight Intrusion Detection for Networks},
+year = {1999},
+publisher = {USENIX Association},
+address = {USA},
+abstract = {Network intrusion detection systems (NIDS) are an important part of any network security architecture. They provide a layer of defense which monitors network traffic for predefined suspicious activity or patterns, and alert system administrators when potential hostile traffic is detected. Commercial NIDS have many differences, but Information Systems departments must face the commonalities that they share such as significant system footprint, complex deployment and high monetary cost. Snort was designed to address these issues.},
+booktitle = {Proceedings of the 13th USENIX Conference on System Administration},
+pages = {229–238},
+numpages = {10},
+location = {Seattle, Washington},
+series = {LISA '99}
+}
+
@inproceedings{Becchi08,
author = {Becchi, Michela and Crowley, Patrick},
year = {2008},
@@ -26,7 +49,7 @@
}
-@unpublished{CSL22
+@unpublished{CSL2022,
author = "Chengsong Tan and Christian Urban",
title = "POSIX Lexing with Bitcoded Derivatives",
note = "submitted",