more
authorChengsong
Sun, 25 Sep 2022 01:40:12 +0100
changeset 604 16d67f9c07d4
parent 603 370fe1dde7c7
child 605 ed53ce26ecb6
more
ChengsongTanPhdThesis/Chapters/Introduction.tex
--- a/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sat Sep 24 00:49:10 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex	Sun Sep 25 01:40:12 2022 +0100
@@ -397,16 +397,86 @@
 on Brzozowski derivatives with certified correctness (in 
 Isabelle/HOL)
 and finiteness property.
-Such properties guarantees the absence of 
+Such properties guarantee the absence of 
 catastrophic backtracking in most cases.
-We will give more details why we choose our 
-approach (Brzozowski derivatives and formal proofs)
-in the next sections.
+We will give more details in the next sections
+on (i) why the slow cases in graph \ref{fig:aStarStarb}
+can occur
+and (ii) why we choose our 
+approach (Brzozowski derivatives and formal proofs).
 
 
-\section{The Problem with Bounded Repetitions}
+\section{Terminology, and the Problem with Bounded Repetitions}
 Regular expressions and regular expression matchers 
 have of course been studied for many, many years.
+Theoretical results in automata theory says
+that basic regular expression matching should be linear
+w.r.t the input, provided that the regular expression
+$r$ had been pre-processed and turned into a
+deterministic finite automata (DFA).
+By basic we mean textbook definitions such as the one
+below, involving only characters, alternatives,
+sequences, and Kleene stars:
+\[
+	r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^*
+\]
+Modern regular expression matchers used by programmers,
+however,
+support richer constructs such as bounded repetitions
+and back-references.
+The syntax and expressive power of those 
+matching engines
+make ``regular expressions'' quite different from 
+their original meaning in the formal languages
+theory.
+To differentiate, people tend to use the word \emph{regex} to refer
+those expressions with richer constructs, and regular expressions
+for the more traditional meaning.
+For example, the PCRE standard (Peral Compatible Regular Expressions)
+is such a regex syntax standard.
+We follow this convention in this thesis.
+We aim to support all the popular features of regexes in the future,
+but for this work we mainly look at regular expressions.
+
+\subsection{A Little Introduction to Regexes: Bounded Repetitions
+and Back-references}
+Regexes come with a lot of constructs
+that makes it more convenient for 
+programmers to write regular expressions.
+Some of those constructs are syntactic sugars that are
+simply short hand notations
+that save the programmers a few keystrokes,
+for example the
+non-binary alternative involving three or more choices:
+\[
+	r = (a | b | c | \ldots | z)^*
+\]
+, the range operator $-$ which means the alternative
+of all characters between its operands:
+\[
+	r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)}
+\]
+and the 
+wildcard character $.$ meaning any character
+\[
+	. = [0-9a-zA-Z+-()*&\ldots]
+
+\]
+Some of those constructs do make the expressions much
+more compact, and matching time could be greatly increase.
+For example, $r^{n}$ is exponentially more concise compared with
+the expression $\underbrace{r}_\text{n copies of r}$,
+and therefore a naive algorithm that simply unfolds
+$r^{n}$ into $\underbrace{r}_\text{n copies of r}$
+will suffer exponential runtime increase.
+Some constructs can even raise the expressive
+power to the non-regular realm, for example
+the back-references.
+
+bounded repetitions, as we have discussed in the 
+previous section.
+This super-linear behaviour of the 
+regex matching engines we have?
 One of the most recent work in the context of lexing
 is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}.
 This is relevant work and we will compare later on
@@ -472,10 +542,60 @@
 Finally, bounded regular expressions do not destroy our finite
 boundedness property, which we shall prove later on.
 
-\section{The Terminology Regex, and why Backtracking?}
-Shouldn't regular expression matching be linear?
-How can one explain the super-linear behaviour of the 
-regex matching engines we have?
+\section{Back-references and The Terminology Regex}
+
+
+Bounded repetitions, usually written in the form
+$r^{\{c\}}$ (where $c$ is a constant natural number),
+denotes a regular expression accepting strings
+that can be divided into $c$ substrings, where each 
+substring is in $r$. 
+For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
+an $\mathit{NFA}$ describing it would look like:
+\begin{center}
+\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
+   \node[state,initial] (q_0)   {$q_0$}; 
+   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
+   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
+   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
+    \path[->] 
+    (q_0) edge  node {a} (q_1)
+    	  edge [loop below] node {a,b} ()
+    (q_1) edge  node  {a,b} (q_2)
+    (q_2) edge  node  {a,b} (q_3);
+\end{tikzpicture}
+\end{center}
+The red states are "countdown states" which counts down 
+the number of characters needed in addition to the current
+string to make a successful match.
+For example, state $q_1$ indicates a match that has
+gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
+and just consumed the "delimiter" $a$ in the middle, and 
+need to match 2 more iterations of $(a|b)$ to complete.
+State $q_2$ on the other hand, can be viewed as a state
+after $q_1$ has consumed 1 character, and just waits
+for 1 more character to complete.
+$q_3$ is the last state, requiring 0 more character and is accepting.
+Depending on the suffix of the
+input string up to the current read location,
+the states $q_1$ and $q_2$, $q_3$
+may or may
+not be active, independent from each other.
+A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
+contain at least $2^3$ non-equivalent states that cannot be merged, 
+because the subset construction during determinisation will generate
+all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
+Generalizing this to regular expressions with larger
+bounded repetitions number, we have that
+regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
+would require at least $2^{n+1}$ states, if $r$ contains
+more than 1 string.
+This is to represent all different 
+scenarios which "countdown" states are active.
+For those regexes, tools that uses $\DFA$s will get
+out of memory errors.
+
+
 The time cost of regex matching algorithms in general
 involve two different phases, and different things can go differently wrong on 
 these phases.
@@ -589,55 +709,10 @@
 However, they do not scale well with bounded repetitions.
 
 \subsubsection{Problems with Bounded Repetitions}
-Bounded repetitions, usually written in the form
-$r^{\{c\}}$ (where $c$ is a constant natural number),
-denotes a regular expression accepting strings
-that can be divided into $c$ substrings, where each 
-substring is in $r$. 
-For the regular expression $(a|b)^*a(a|b)^{\{2\}}$,
-an $\mathit{NFA}$ describing it would look like:
-\begin{center}
-\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] 
-   \node[state,initial] (q_0)   {$q_0$}; 
-   \node[state, red] (q_1) [right=of q_0] {$q_1$}; 
-   \node[state, red] (q_2) [right=of q_1] {$q_2$}; 
-   \node[state, accepting, red](q_3) [right=of q_2] {$q_3$};
-    \path[->] 
-    (q_0) edge  node {a} (q_1)
-    	  edge [loop below] node {a,b} ()
-    (q_1) edge  node  {a,b} (q_2)
-    (q_2) edge  node  {a,b} (q_3);
-\end{tikzpicture}
-\end{center}
-The red states are "countdown states" which counts down 
-the number of characters needed in addition to the current
-string to make a successful match.
-For example, state $q_1$ indicates a match that has
-gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$,
-and just consumed the "delimiter" $a$ in the middle, and 
-need to match 2 more iterations of $(a|b)$ to complete.
-State $q_2$ on the other hand, can be viewed as a state
-after $q_1$ has consumed 1 character, and just waits
-for 1 more character to complete.
-$q_3$ is the last state, requiring 0 more character and is accepting.
-Depending on the suffix of the
-input string up to the current read location,
-the states $q_1$ and $q_2$, $q_3$
-may or may
-not be active, independent from each other.
-A $\mathit{DFA}$ for such an $\mathit{NFA}$ would
-contain at least $2^3$ non-equivalent states that cannot be merged, 
-because the subset construction during determinisation will generate
-all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$.
-Generalizing this to regular expressions with larger
-bounded repetitions number, we have that
-regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s
-would require at least $2^{n+1}$ states, if $r$ contains
-more than 1 string.
-This is to represent all different 
-scenarios which "countdown" states are active.
-For those regexes, tools that uses $\DFA$s will get
-out of memory errors.
+
+
+
+
 
 \subsubsection{Tools that uses $\mathit{DFA}$s}
 %TODO:more tools that use DFAs?