new
authorChengsong
Thu, 29 Sep 2022 14:05:42 +0100
changeset 606 99b530103464
parent 605 ed53ce26ecb6
child 607 e6fc9b72c0e3
new
ChengsongTanPhdThesis/Chapters/Introduction.tex
ChengsongTanPhdThesis/example.bib
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Wed Sep 28 01:28:01 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Thu Sep 29 14:05:42 2022 +0100
@@ -584,12 +584,12 @@
 Often the counters are quite large, the largest up to ten million. 
 An example XSD they gave
 was:
-\begin{verbatim}
-<sequence minOccurs="0" maxOccurs="65535">
-    <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
-    <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
-</sequence>
-\end{verbatim}
+%\begin{verbatim}
+%<sequence minOccurs="0" maxOccurs="65535">
+%  <element name="TimeIncr" type="mpeg7:MediaIncrDurationType"/>
+%  <element name="MotionParams" type="float" minOccurs="2" maxOccurs="12"/>
+%</sequence>
+%\end{verbatim}
 This can be seen as the expression 
 $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves
 regular expressions 
@@ -604,9 +604,74 @@
 language the regular expression $a^{1001}$ is not permitted, because no counter
 can be above 1000, and in the built-in Rust regular expression library
 expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message
-for being too big. These problems can of course be solved in matching algorithms where 
+for being too big. 
+As Becchi and Crawley\cite{Becchi08}  have pointed out,
+the reason for these restrictions
+are that they simulate a non-deterministic finite
+automata (NFA) with a breadth-first search.
+This way the number of active states could
+be equal to the counter number.
+When the counters are large, 
+the memory requirement could become
+infeasible, and the pattern needs to be rejected straight away.
+\begin{figure}[H]
+\begin{center}
+\begin{tikzpicture} [node distance = 2cm, on grid, auto]
+ 
+    	\node (q0) [state, initial] {$0$};
+	\node (q1) [state, right = of q0] {$1$};
+	\node (q2) [state, right = of q1] {$2$};
+	\node (qdots) [right = of q2] {$\ldots$};
+	\node (qn) [state, right = of qdots] {$n$};
+	\node (qn1) [state, right = of qn] {$n+1$};
+	\node (qn2) [state, right = of qn1] {$n+2$};
+	\node (qn3) [state, accepting, right = of qn2] {$n+3$}; 
+ 
+\path [-stealth, thick]
+	(q0) edge [loop above] node {a} ()
+    (q0) edge node {a}   (q1) 
+    (q1) edge node {.}   (q2)
+    (q2) edge node {.}   (qdots)
+    (qdots) edge node {.} (qn)
+    (qn) edge node {.} (qn1)
+    (qn1) edge node {b} (qn2)
+    (qn2) edge node {$c$} (qn3);
+\end{tikzpicture}
+%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
+%   \node[state,initial] (q_0)   {$0$}; 
+%   \node[state, ] (q_1) [right=of q_0] {$1$}; 
+%   \node[state, ] (q_2) [right=of q_1] {$2$}; 
+%   \node[state,
+%   \node[state, accepting, ](q_3) [right=of q_2] {$3$};
+%    \path[->] 
+%    (q_0) edge  node {a} (q_1)
+%    	  edge [loop below] node {a,b} ()
+%    (q_1) edge  node  {a,b} (q_2)
+%    (q_2) edge  node  {a,b} (q_3);
+%\end{tikzpicture}
+\end{center}
+\caption{The example given by Becchi and Crawley
+	that NFA simulation can consume large
+	amounts of memory: $.^*a.^{\{n\}}bc$ matching
+	strings of the form $aaa\ldots aaaabc$.
+	When traversing in a breadth-first manner,
+all states from 0 till $n+1$ will become active.}
+\end{figure}
+%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
+%type of $\mathit{NFA}$ simulation and guarantees a linear runtime
+%in terms of input string length.
+%TODO:try out these lexers
+These problems can of course be solved in matching algorithms where 
 automata go beyond the classic notion and for instance include explicit
 counters \cite{Turo_ov__2020}.
+These solutions can be quite effective,
+with the ability to process
+gigabytes of string input per second
+even with large counters \cite{Becchi08}.
+But formally reasoning about these automata can be challenging
+and un-intuitive.
+Therefore, correctness and runtime claims made about these solutions need to be
+taken with a grain of salt.
 
 In the work reported in \cite{CSL2022} and here, 
 we add better support using derivatives
@@ -629,16 +694,21 @@
 boundedness property, which we shall prove later on.
 
 
+
+
+
 \subsection{Back-References}
-%Some constructs can even raise the expressive
-%power to the non-regular realm, for example
-%the back-references.
-%bounded repetitions, as we have discussed in the 
-%previous section.
-%This super-linear behaviour of the 
-%regex matching engines we have?
+The other way to simulate an $\mathit{NFA}$ for matching is choosing  
+a single transition each time, keeping all the other options in 
+a queue or stack, and backtracking if that choice eventually 
+fails. This method, often called a  "depth-first-search", 
+is efficient in a lot of cases, but could end up
+with exponential run time.
+The backtracking method is employed in regex libraries
+that support \emph{back-references}, for example
+in Java and Python.
 The syntax and expressive power of regexes 
-can go quickly beyond the regular language hierarchy
+go beyond the regular language hierarchy
 with back-references.
 %\section{Back-references and The Terminology Regex}
 
@@ -652,60 +722,56 @@
 %This can be viewed as a breadth-first-search of the $\mathit{NFA}$
 %for a path terminating
 %at an accepting state.
-Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this
-type of $\mathit{NFA}$ simulation and guarantees a linear runtime
-in terms of input string length.
-%TODO:try out these lexers
-The other way to use $\mathit{NFA}$ for matching is choosing  
-a single transition each time, keeping all the other options in 
-a queue or stack, and backtracking if that choice eventually 
-fails. This method, often called a  "depth-first-search", 
-is efficient in a lot of cases, but could end up
-with exponential run time.\\
-%TODO:COMPARE java python lexer speed with Rust and Go
-The reason behind backtracking algorithms in languages like
-Java and Python is that they support back-references.
-\subsubsection{Back References}
-If we have a regular expression like this (the sequence
+
+
+
+Given a regular expression like this (the sequence
 operator is omitted for brevity):
 \begin{center}
-	$r_1(r_2(r_3r_4))$
+	$r_1r_2r_3r_4$
 \end{center}
-We could label sub-expressions of interest 
+one could label sub-expressions of interest 
 by parenthesizing them and giving 
 them a number by the order in which their opening parentheses appear.
 One possible way of parenthesizing and labelling is given below:
 \begin{center}
 	$\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$
 \end{center}
-$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled
-by 1 to 4. $1$ would refer to the entire expression 
-$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc.
-These sub-expressions are called "capturing groups".
-We can use the following syntax to denote that we want a string just matched by a 
-sub-expression (capturing group) to appear at a certain location again, 
-exactly as it was:
+The sub-expressions
+$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled
+by 1 to 4, and can be ``referred back'' by their respective numbers. 
+%These sub-expressions are called "capturing groups".
+To do so, we use the syntax $\backslash i$ 
+to denote that we want the sub-string 
+of the input just matched by the i-th
+sub-expression to appear again, 
+exactly the same as it first appeared: 
 \begin{center}
 $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots 
 \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$
 \end{center}
-The backslash and number $i$ are used to denote such 
-so-called "back-references".
-Let $e$ be an expression made of regular expressions 
-and back-references. $e$ contains the expression $e_i$
-as its $i$-th capturing group.
-The semantics of back-reference can be recursively
-written as:
+%The backslash and number $i$ are the
+%so-called "back-references".
+%Let $e$ be an expression made of regular expressions 
+%and back-references. $e$ contains the expression $e_i$
+%as its $i$-th capturing group.
+%The semantics of back-reference can be recursively
+%written as:
+%\begin{center}
+%	\begin{tabular}{c}
+%		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
+%		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
+%	\end{tabular}
+%\end{center}
+A concrete example
+for back-references would be
 \begin{center}
-	\begin{tabular}{c}
-		$L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\
-		$s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$
-	\end{tabular}
+$((a|b|c|\ldots|z)^*)\backslash 1$,
 \end{center}
-The concrete example
-$((a|b|c|\ldots|z)^*)\backslash 1$
-would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\
-Back-reference is a construct in the "regex" standard
+which would match 
+strings that can be split into two identical halves,
+for example $\mathit{bobo}$, $\mathit{weewee}$ and etc.
+Back-reference is a regex construct 
 that programmers found useful, but not exactly 
 regular any more.
 In fact, that allows the regex construct to express 
--- a/ChengsongTanPhdThesis/example.bib	Wed Sep 28 01:28:01 2022 +0100
+++ b/ChengsongTanPhdThesis/example.bib	Thu Sep 29 14:05:42 2022 +0100
@@ -5,6 +5,14 @@
 
 
 %% Saved with string encoding Unicode (UTF-8) 
+@inproceedings{Becchi08,
+author = {Becchi, Michela and Crowley, Patrick},
+year = {2008},
+month = {01},
+pages = {25},
+title = {Extending finite automata to efficiently match Perl-compatible regular expressions},
+doi = {10.1145/1544012.1544037}
+}
 
 @book{
 	Sakarovitch2009,