# HG changeset patch # User Chengsong # Date 1664066412 -3600 # Node ID 16d67f9c07d4a0bc34c94b70f99db5377abab06b # Parent 370fe1dde7c705044883dc25ea77734b6a377c2e more diff -r 370fe1dde7c7 -r 16d67f9c07d4 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Sep 24 00:49:10 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sun Sep 25 01:40:12 2022 +0100 @@ -397,16 +397,86 @@ on Brzozowski derivatives with certified correctness (in Isabelle/HOL) and finiteness property. -Such properties guarantees the absence of +Such properties guarantee the absence of catastrophic backtracking in most cases. -We will give more details why we choose our -approach (Brzozowski derivatives and formal proofs) -in the next sections. +We will give more details in the next sections +on (i) why the slow cases in graph \ref{fig:aStarStarb} +can occur +and (ii) why we choose our +approach (Brzozowski derivatives and formal proofs). -\section{The Problem with Bounded Repetitions} +\section{Terminology, and the Problem with Bounded Repetitions} Regular expressions and regular expression matchers have of course been studied for many, many years. +Theoretical results in automata theory says +that basic regular expression matching should be linear +w.r.t the input, provided that the regular expression +$r$ had been pre-processed and turned into a +deterministic finite automata (DFA). +By basic we mean textbook definitions such as the one +below, involving only characters, alternatives, +sequences, and Kleene stars: +\[ + r ::= \ZERO | \ONE | c | r_1 + r_2 | r_1 \cdot r_2 | r^* +\] +Modern regular expression matchers used by programmers, +however, +support richer constructs such as bounded repetitions +and back-references. +The syntax and expressive power of those +matching engines +make ``regular expressions'' quite different from +their original meaning in the formal languages +theory. +To differentiate, people tend to use the word \emph{regex} to refer +those expressions with richer constructs, and regular expressions +for the more traditional meaning. +For example, the PCRE standard (Peral Compatible Regular Expressions) +is such a regex syntax standard. +We follow this convention in this thesis. +We aim to support all the popular features of regexes in the future, +but for this work we mainly look at regular expressions. + +\subsection{A Little Introduction to Regexes: Bounded Repetitions +and Back-references} +Regexes come with a lot of constructs +that makes it more convenient for +programmers to write regular expressions. +Some of those constructs are syntactic sugars that are +simply short hand notations +that save the programmers a few keystrokes, +for example the +non-binary alternative involving three or more choices: +\[ + r = (a | b | c | \ldots | z)^* +\] +, the range operator $-$ which means the alternative +of all characters between its operands: +\[ + r = [0-9a-zA-Z] \; \text{(all alpha-numeric characters)} +\] +and the +wildcard character $.$ meaning any character +\[ + . = [0-9a-zA-Z+-()*&\ldots] + +\] +Some of those constructs do make the expressions much +more compact, and matching time could be greatly increase. +For example, $r^{n}$ is exponentially more concise compared with +the expression $\underbrace{r}_\text{n copies of r}$, +and therefore a naive algorithm that simply unfolds +$r^{n}$ into $\underbrace{r}_\text{n copies of r}$ +will suffer exponential runtime increase. +Some constructs can even raise the expressive +power to the non-regular realm, for example +the back-references. + +bounded repetitions, as we have discussed in the +previous section. +This super-linear behaviour of the +regex matching engines we have? One of the most recent work in the context of lexing is the Verbatim lexer by Egolf, Lasser and Fisher\cite{Verbatim}. This is relevant work and we will compare later on @@ -472,10 +542,60 @@ Finally, bounded regular expressions do not destroy our finite boundedness property, which we shall prove later on. -\section{The Terminology Regex, and why Backtracking?} -Shouldn't regular expression matching be linear? -How can one explain the super-linear behaviour of the -regex matching engines we have? +\section{Back-references and The Terminology Regex} + + +Bounded repetitions, usually written in the form +$r^{\{c\}}$ (where $c$ is a constant natural number), +denotes a regular expression accepting strings +that can be divided into $c$ substrings, where each +substring is in $r$. +For the regular expression $(a|b)^*a(a|b)^{\{2\}}$, +an $\mathit{NFA}$ describing it would look like: +\begin{center} +\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] + \node[state,initial] (q_0) {$q_0$}; + \node[state, red] (q_1) [right=of q_0] {$q_1$}; + \node[state, red] (q_2) [right=of q_1] {$q_2$}; + \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; + \path[->] + (q_0) edge node {a} (q_1) + edge [loop below] node {a,b} () + (q_1) edge node {a,b} (q_2) + (q_2) edge node {a,b} (q_3); +\end{tikzpicture} +\end{center} +The red states are "countdown states" which counts down +the number of characters needed in addition to the current +string to make a successful match. +For example, state $q_1$ indicates a match that has +gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, +and just consumed the "delimiter" $a$ in the middle, and +need to match 2 more iterations of $(a|b)$ to complete. +State $q_2$ on the other hand, can be viewed as a state +after $q_1$ has consumed 1 character, and just waits +for 1 more character to complete. +$q_3$ is the last state, requiring 0 more character and is accepting. +Depending on the suffix of the +input string up to the current read location, +the states $q_1$ and $q_2$, $q_3$ +may or may +not be active, independent from each other. +A $\mathit{DFA}$ for such an $\mathit{NFA}$ would +contain at least $2^3$ non-equivalent states that cannot be merged, +because the subset construction during determinisation will generate +all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. +Generalizing this to regular expressions with larger +bounded repetitions number, we have that +regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s +would require at least $2^{n+1}$ states, if $r$ contains +more than 1 string. +This is to represent all different +scenarios which "countdown" states are active. +For those regexes, tools that uses $\DFA$s will get +out of memory errors. + + The time cost of regex matching algorithms in general involve two different phases, and different things can go differently wrong on these phases. @@ -589,55 +709,10 @@ However, they do not scale well with bounded repetitions. \subsubsection{Problems with Bounded Repetitions} -Bounded repetitions, usually written in the form -$r^{\{c\}}$ (where $c$ is a constant natural number), -denotes a regular expression accepting strings -that can be divided into $c$ substrings, where each -substring is in $r$. -For the regular expression $(a|b)^*a(a|b)^{\{2\}}$, -an $\mathit{NFA}$ describing it would look like: -\begin{center} -\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] - \node[state,initial] (q_0) {$q_0$}; - \node[state, red] (q_1) [right=of q_0] {$q_1$}; - \node[state, red] (q_2) [right=of q_1] {$q_2$}; - \node[state, accepting, red](q_3) [right=of q_2] {$q_3$}; - \path[->] - (q_0) edge node {a} (q_1) - edge [loop below] node {a,b} () - (q_1) edge node {a,b} (q_2) - (q_2) edge node {a,b} (q_3); -\end{tikzpicture} -\end{center} -The red states are "countdown states" which counts down -the number of characters needed in addition to the current -string to make a successful match. -For example, state $q_1$ indicates a match that has -gone past the $(a|b)^*$ part of $(a|b)^*a(a|b)^{\{2\}}$, -and just consumed the "delimiter" $a$ in the middle, and -need to match 2 more iterations of $(a|b)$ to complete. -State $q_2$ on the other hand, can be viewed as a state -after $q_1$ has consumed 1 character, and just waits -for 1 more character to complete. -$q_3$ is the last state, requiring 0 more character and is accepting. -Depending on the suffix of the -input string up to the current read location, -the states $q_1$ and $q_2$, $q_3$ -may or may -not be active, independent from each other. -A $\mathit{DFA}$ for such an $\mathit{NFA}$ would -contain at least $2^3$ non-equivalent states that cannot be merged, -because the subset construction during determinisation will generate -all the elements in the power set $\mathit{Pow}\{q_1, q_2, q_3\}$. -Generalizing this to regular expressions with larger -bounded repetitions number, we have that -regexes shaped like $r^*ar^{\{n\}}$ when converted to $\mathit{DFA}$s -would require at least $2^{n+1}$ states, if $r$ contains -more than 1 string. -This is to represent all different -scenarios which "countdown" states are active. -For those regexes, tools that uses $\DFA$s will get -out of memory errors. + + + + \subsubsection{Tools that uses $\mathit{DFA}$s} %TODO:more tools that use DFAs?