lexing: changeset 608:37b6fd310a16

--- a/ChengsongTanPhdThesis/Chapters/Inj.tex	Fri Sep 30 01:47:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Inj.tex	Sat Oct 01 12:06:46 2022 +0100
@@ -473,6 +473,7 @@
 derivatives on regular expressions and
 derivatives on a set of strings:
 \begin{lemma}\label{derDer}
+	\mbox{}
 	\begin{itemize}
 		\item
 $\textit{Der} \; c \; L(r) = L (r\backslash c)$
@@ -638,8 +639,10 @@
 later, but as a preliminary we have to first define 
 the datatype for lexing results, 
 called \emph{value} or
-sometimes also \emph{lexical value}.  Values and regular
-expressions correspond to each other as illustrated in the following
+sometimes also \emph{lexical value}.  
+Values and regular
+expressions correspond to each other 
+as illustrated in the following
 table:
 
 \begin{center}

--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Fri Sep 30 01:47:33 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Oct 01 12:06:46 2022 +0100
@@ -579,7 +579,12 @@
 For example in the regex library RegExLib,
 the rules library of Snort \cite{Snort1999}\footnote{
 Snort is a network intrusion detection (NID) tool
-for monitoring network traffic.}, 
+for monitoring network traffic.
+The network security community curates a list
+of malicious patterns written as regexes,
+which is used by Snort's detection engine
+to match against network traffic for any hostile
+activities such as buffer overflow attacks.}, 
 as well as in XML Schema definitions (XSDs).
 According to Bj\"{o}rklund et al \cite{xml2015},
 more than half of the 
@@ -683,7 +688,7 @@
 extend straightforwardly to
 repetitions with an interval such as 
 $r^{\{n\ldots m\}}$.
-The merit of Brzozowski derivatives
+The merit of Brzozowski derivatives (more on this later)
 on this problem is that
 it can be naturally extended to support bounded repetitions.
 Moreover these extensions are still made up of only
@@ -808,7 +813,7 @@
 In fact, they allow the regex construct to express 
 languages that cannot be contained in context-free
 languages either.
-For example, the back-reference $((a^*)b\backslash1 b \backslash 1$
+For example, the back-reference $(a^*)b\backslash1 b \backslash 1$
 expresses the language $\{a^n b a^n b a^n\mid n \in \mathbb{N}\}$,
 which cannot be expressed by context-free grammars\parencite{campeanu2003formal}.
 Such a language is contained in the context-sensitive hierarchy
@@ -817,7 +822,8 @@
 is NP-complete\parencite{alfred2014algorithms}.
 A non-bactracking,
 efficient solution is not known to exist.
-Regex libraries supporting back-references such as PCRE therefore have to
+Regex libraries supporting back-references such as 
+PCRE \cite{pcre} therefore have to
 revert to a depth-first search algorithm that backtracks.
 The unreasonable part with them, is that even in the case of 
 regexes not involving back-references, there is still
@@ -911,7 +917,8 @@
 the specification being not very clear.
 There are many informal summaries of this disambiguation
 strategy, which are often quite long and delicate.
-For example Kuklewicz described the POSIX rule as
+For example Kuklewicz \cite{KuklewiczHaskell} 
+described the POSIX rule as
 \begin{quote}
 	``
 	\begin{itemize}
@@ -936,88 +943,38 @@
 The text above 
 is trying to capture something very precise,
 and is crying out for formalising.
-
+Ausaf et al. \cite{AusafDyckhoffUrban2016}
+are the first to describe such a formalised POSIX
+specification in Isabelle/HOL.
+They then formally proved the correctness of
+a lexing algorithm by Sulzmann and Lu \cite{Sulzmann2014}
+based on that specification.
 
-%\subsection{Different Phases of a Matching/Lexing Algorithm}
-%
-%
-%Most lexing algorithms can be roughly divided into 
-%two phases during its run.
-%The first phase is the "construction" phase,
-%in which the algorithm builds some  
-%suitable data structure from the input regex $r$, so that
-%it can be easily operated on later.
-%We denote
-%the time cost for such a phase by $P_1(r)$.
-%The second phase is the lexing phase, when the input string 
-%$s$ is read and the data structure
-%representing that regex $r$ is being operated on. 
-%We represent the time
-%it takes by $P_2(r, s)$.\\
-%For $\mathit{DFA}$,
-%we have $P_2(r, s) = O( |s| )$,
-%because we take at most $|s|$ steps, 
-%and each step takes
-%at most one transition--
-%a deterministic-finite-automata
-%by definition has at most one state active and at most one
-%transition upon receiving an input symbol.
-%But unfortunately in the  worst case
-%$P_1(r) = O(exp^{|r|})$. An example will be given later. 
-%For $\mathit{NFA}$s, we have $P_1(r) = O(|r|)$ if we do not unfold 
-%expressions like $r^n$ into 
-%\[
-%	\underbrace{r \cdots r}_{\text{n copies of r}}.
-%\]
-%The $P_2(r, s)$ is bounded by $|r|\cdot|s|$, if we do not backtrack.
-%On the other hand, if backtracking is used, the worst-case time bound bloats
-%to $|r| * 2^{|s|}$.
-%%on the input
-%%And when calculating the time complexity of the matching algorithm,
-%%we are assuming that each input reading step requires constant time.
-%%which translates to that the number of 
-%%states active and transitions taken each time is bounded by a
-%%constant $C$.
-%%But modern  regex libraries in popular language engines
-%% often want to support much richer constructs than just
-%% sequences and Kleene stars,
-%%such as negation, intersection, 
-%%bounded repetitions and back-references.
-%%And de-sugaring these "extended" regular expressions 
-%%into basic ones might bloat the size exponentially.
-%%TODO: more reference for exponential size blowup on desugaring. 
-%
-%\subsection{Why $\mathit{DFA}s$ can be slow in the first phase}
-%
-%
-%The good things about $\mathit{DFA}$s is that once
-%generated, they are fast and stable, unlike
-%backtracking algorithms. 
-%However, they do not scale well with bounded repetitions.
-%
-%\subsubsection{Problems with Bounded Repetitions}
-%
-%
-
-To summarise, we need regex libraries that are both fast
-and correct.
-And that correctness needs to be built on a precise
-model of what POSIX disambiguation is.
-We propose a solution that addresses both problems
-based on Brzozowski, Sulzmann and Lu and Ausaf and Urban's work.
+In this thesis,
+we propose a solution to catastrophic
+backtracking and error-prone matchers--a formally verified
+regular expression lexing algorithm
+that is both fast
+and correct by extending Ausaf et al.'s work.
 The end result is a regular expression lexing algorithm that comes with 
 \begin{itemize}
 \item
 a proven correctness theorem according to POSIX specification 
-given by Ausaf and Urban \cite{AusafDyckhoffUrban2016},
+given by Ausaf et al. \cite{AusafDyckhoffUrban2016}, 
 \item 
-a proven property saying that the algorithm's internal data structure will
+a proven complexity-related property saying that the 
+algorithm's internal data structure will
 remain finite,
 \item
 and extension to
 the bounded repetitions construct with the correctness and finiteness property
 maintained.
  \end{itemize}
+In the next section we will very briefly
+introduce Brzozowski derivatives.
+We will give a taste to what they 
+are like and why they are suitable for regular expression
+matching and lexing.
  
 \section{Our Solution--Formal Specification of POSIX and Brzozowski Derivatives}
 We propose Brzozowski derivatives on regular expressions as

--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex	Sat Oct 01 12:06:46 2022 +0100
@@ -0,0 +1,25 @@
+% Chapter Template
+
+\chapter{Related Work} % Main chapter title
+
+\label{RelatedWork} 
+
+In this chapter, we introduce
+the work relevant to this thesis.
+
+
+\section{Work on Back-References}
+We introduced back-references
+in chapter \ref{Introduction}.
+This is a quite deep problem,
+with theoretical work on them being
+fairly recent.
+
+Campaneu gave 
+
+
+See \cite{AHO1990255} for a survey
+of these works and comparison between different
+flavours (e.g. whether references can be circular, 
+can labels be repeated etc.) of back-references syntax.
+

--- a/ChengsongTanPhdThesis/example.bib	Fri Sep 30 01:47:33 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Sat Oct 01 12:06:46 2022 +0100
@@ -6,6 +6,29 @@
 
 %% Saved with string encoding Unicode (UTF-8) 
 
+
+
+
+%% back-references--------------------
+@incollection{AHO1990255,
+title = {CHAPTER 5 - Algorithms for Finding Patterns in Strings},
+editor = {JAN {VAN LEEUWEN}},
+booktitle = {Algorithms and Complexity},
+publisher = {Elsevier},
+address = {Amsterdam},
+pages = {255-300},
+year = {1990},
+series = {Handbook of Theoretical Computer Science},
+isbn = {978-0-444-88071-0},
+doi = {https://doi.org/10.1016/B978-0-444-88071-0.50010-2},
+url = {https://www.sciencedirect.com/science/article/pii/B9780444880710500102},
+author = {Alfred V. AHO},
+abstract = {Publisher Summary
+This chapter discusses the algorithms for solving string-matching problems that have proven useful for text-editing and text-processing applications. String pattern matching is an important problem that occurs in many areas of science and information processing. In computing, it occurs naturally as part of data processing, text editing, term rewriting, lexical analysis, and information retrieval. Many text editors and programming languages have facilities for matching strings. In biology, string-matching problems arise in the analysis of nucleic acids and protein sequences, and in the investigation of molecular phylogeny. String matching is also one of the central and most widely studied problems in theoretical computer science. The simplest form of the problem is to locate an occurrence of a keyword as a substring in a sequence of characters, which is called the input string. For example, the input string queueing contains the keyword ueuei as a substring. Even for this problem, several innovative, theoretically interesting algorithms have been devised that run significantly faster than the obvious brute-force method.}
+}
+
+%% back-references--------------------
+
 @article{fowler2003,
   title={An interpretation of the POSIX regex standard},
   author={Fowler, Glenn},
@@ -206,6 +229,18 @@
 	year = {2021},
 	bdsk-url-1 = {https://github.com/doyensec/regexploit}}
 
+@software{pcre,
+	author = {Philip Hazel},
+	date-added = {1997-01-01 00:00:00 +0000},
+	date-modified = {2021-06-14 00:00:00 +0000},
+	keywords = {Perl Compatible Regular Expressions},
+	month = {June},
+	title = {PCRE},
+	url = {https://www.pcre.org/original/doc/html/},
+	year = {2021},
+	bdsk-url-1 = {https://www.pcre.org/original/doc/html/}
+}
+
 @misc{KuklewiczHaskell,
 	author = {Kuklewicz},
 	keywords = {Buggy C POSIX Lexing Libraries},

--- a/ChengsongTanPhdThesis/main.tex	Fri Sep 30 01:47:33 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Sat Oct 01 12:06:46 2022 +0100
@@ -448,6 +448,7 @@
 \include{Chapters/Finite} 
 \include{Chapters/Cubic}
 \include{Chapters/Further}
+\include{Chapters/RelatedWork}
 
 %----------------------------------------------------------------------------------------
 %	THESIS CONTENT - APPENDICES

author	Chengsong
	Sat, 01 Oct 2022 12:06:46 +0100
changeset 608	37b6fd310a16
parent 607	e6fc9b72c0e3
child 609	61139fdddae0

ChengsongTanPhdThesis/Chapters/Inj.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/Introduction.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/Chapters/RelatedWork.tex		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/example.bib		file \| annotate \| diff \| comparison \| revisions
ChengsongTanPhdThesis/main.tex		file \| annotate \| diff \| comparison \| revisions