--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Wed Sep 28 01:28:01 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 29 14:05:42 2022 +0100
@@ -584,12 +584,12 @@
Often the counters are quite large, the largest up to ten million.
An example XSD they gave
was:
-\begin{verbatim}
-<sequence minOccurs="0" maxOccurs="65535">
- <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
- <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-</sequence>
-\end{verbatim}
+%\begin{verbatim}
+%<sequence minOccurs="0" maxOccurs="65535">
+% <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+% <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+%</sequence>
+%\end{verbatim}
This can be seen as the expression
$(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
regular expressions
@@ -604,9 +604,74 @@
language the regular expression $a^{1001}$ is not permitted, because no counter
can be above 1000, and in the built-in Rust regular expression library
expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big. These problems can of course be solved in matching algorithms where
+for being too big.
+As Becchi and Crawley\cite{Becchi08} have pointed out,
+the reason for these restrictions
+are that they simulate a non-deterministic finite
+automata (NFA) with a breadth-first search.
+This way the number of active states could
+be equal to the counter number.
+When the counters are large,
+the memory requirement could become
+infeasible, and the pattern needs to be rejected straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+
+ \node (q0) [state, initial] {$0$};
+ \node (q1) [state, right = of q0] {$1$};
+ \node (q2) [state, right = of q1] {$2$};
+ \node (qdots) [right = of q2] {$\ldots$};
+ \node (qn) [state, right = of qdots] {$n$};
+ \node (qn1) [state, right = of qn] {$n+1$};
+ \node (qn2) [state, right = of qn1] {$n+2$};
+ \node (qn3) [state, accepting, right = of qn2] {$n+3$};
+
+\path [-stealth, thick]
+ (q0) edge [loop above] node {a} ()
+ (q0) edge node {a} (q1)
+ (q1) edge node {.} (q2)
+ (q2) edge node {.} (qdots)
+ (qdots) edge node {.} (qn)
+ (qn) edge node {.} (qn1)
+ (qn1) edge node {b} (qn2)
+ (qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto]
+% \node[state,initial] (q_0) {$0$};
+% \node[state, ] (q_1) [right=of q_0] {$1$};
+% \node[state, ] (q_2) [right=of q_1] {$2$};
+% \node[state,
+% \node[state, accepting, ](q_3) [right=of q_2] {$3$};
+% \path[->]
+% (q_0) edge node {a} (q_1)
+% edge [loop below] node {a,b} ()
+% (q_1) edge node {a,b} (q_2)
+% (q_2) edge node {a,b} (q_3);
+%\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+ that NFA simulation can consume large
+ amounts of memory: $.^*a.^{\{n\}}bc$ matching
+ strings of the form $aaa\ldots aaaabc$.
+ When traversing in a breadth-first manner,
+all states from 0 till $n+1$ will become active.}
+\end{figure}
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
+%in terms of input string length.
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where
automata go beyond the classic notion and for instance include explicit
counters \cite{Turo_ov__2020}.
+These solutions can be quite effective,
+with the ability to process
+gigabytes of string input per second
+even with large counters \cite{Becchi08}.
+But formally reasoning about these automata can be challenging
+and un-intuitive.
+Therefore, correctness and runtime claims made about these solutions need to be
+taken with a grain of salt.
In the work reported in \cite{CSL2022} and here,
we add better support using derivatives
@@ -629,16 +694,21 @@
boundedness property, which we shall prove later on.
+
+
+
\subsection{Back-References}
-%Some constructs can even raise the expressive
-%power to the non-regular realm, for example
-%the back-references.
-%bounded repetitions, as we have discussed in the
-%previous section.
-%This super-linear behaviour of the
-%regex matching engines we have?
+The other way to simulate an $\mathit{NFA}$ for matching is choosing
+a single transition each time, keeping all the other options in
+a queue or stack, and backtracking if that choice eventually
+fails. This method, often called a "depth-first-search",
+is efficient in a lot of cases, but could end up
+with exponential run time.
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
The syntax and expressive power of regexes
-can go quickly beyond the regular language hierarchy
+go beyond the regular language hierarchy
with back-references.
%\section{Back-references and The Terminology Regex}
@@ -652,60 +722,56 @@
%This can be viewed as a breadth-first-search of the $\mathit{NFA}$
%for a path terminating
%at an accepting state.
-Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-type of $\mathit{NFA}$ simulation and guarantees a linear runtime
-in terms of input string length.
-%TODO:try out these lexers
-The other way to use $\mathit{NFA}$ for matching is choosing
-a single transition each time, keeping all the other options in
-a queue or stack, and backtracking if that choice eventually
-fails. This method, often called a "depth-first-search",
-is efficient in a lot of cases, but could end up
-with exponential run time.\\
-%TODO:COMPARE java python lexer speed with Rust and Go
-The reason behind backtracking algorithms in languages like
-Java and Python is that they support back-references.
-\subsubsection{Back References}
-If we have a regular expression like this (the sequence
+
+
+
+Given a regular expression like this (the sequence
operator is omitted for brevity):
\begin{center}
- $r_1(r_2(r_3r_4))$
+ $r_1r_2r_3r_4$
\end{center}
-We could label sub-expressions of interest
+one could label sub-expressions of interest
by parenthesizing them and giving
them a number by the order in which their opening parentheses appear.
One possible way of parenthesizing and labelling is given below:
\begin{center}
$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
\end{center}
-$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
-by 1 to 4. $1$ would refer to the entire expression
-$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
-These sub-expressions are called "capturing groups".
-We can use the following syntax to denote that we want a string just matched by a
-sub-expression (capturing group) to appear at a certain location again,
-exactly as it was:
+The sub-expressions
+$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
+by 1 to 4, and can be ``referred back'' by their respective numbers.
+%These sub-expressions are called "capturing groups".
+To do so, we use the syntax $\backslash i$
+to denote that we want the sub-string
+of the input just matched by the i-th
+sub-expression to appear again,
+exactly the same as it first appeared:
\begin{center}
$\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots
\underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
\end{center}
-The backslash and number $i$ are used to denote such
-so-called "back-references".
-Let $e$ be an expression made of regular expressions
-and back-references. $e$ contains the expression $e_i$
-as its $i$-th capturing group.
-The semantics of back-reference can be recursively
-written as:
+%The backslash and number $i$ are the
+%so-called "back-references".
+%Let $e$ be an expression made of regular expressions
+%and back-references. $e$ contains the expression $e_i$
+%as its $i$-th capturing group.
+%The semantics of back-reference can be recursively
+%written as:
+%\begin{center}
+% \begin{tabular}{c}
+% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+% \end{tabular}
+%\end{center}
+A concrete example
+for back-references would be
\begin{center}
- \begin{tabular}{c}
- $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
- $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
- \end{tabular}
+$((a|b|c|\ldots|z)^*)\backslash 1$,
\end{center}
-The concrete example
-$((a|b|c|\ldots|z)^*)\backslash 1$
-would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
-Back-reference is a construct in the "regex" standard
+which would match
+strings that can be split into two identical halves,
+for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
+Back-reference is a regex construct
that programmers found useful, but not exactly
regular any more.
In fact, that allows the regex construct to express
--- a/ChengsongTanPhdThesis/example.bib Wed Sep 28 01:28:01 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib Thu Sep 29 14:05:42 2022 +0100
@@ -5,6 +5,14 @@
%% Saved with string encoding Unicode (UTF-8)
+@inproceedings{Becchi08,
+author = {Becchi, Michela and Crowley, Patrick},
+year = {2008},
+month = {01},
+pages = {25},
+title = {Extending finite automata to efficiently match Perl-compatible regular expressions},
+doi = {10.1145/1544012.1544037}
+}
@book{
Sakarovitch2009,