# HG changeset patch # User Chengsong # Date 1664456742 -3600 # Node ID 99b530103464ba669a7fdebc0055c9184236d06e # Parent ed53ce26ecb6835a8e1fc48309980a45082177ef new diff -r ed53ce26ecb6 -r 99b530103464 ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Wed Sep 28 01:28:01 2022 +0100 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Thu Sep 29 14:05:42 2022 +0100 @@ -584,12 +584,12 @@ Often the counters are quite large, the largest up to ten million. An example XSD they gave was: -\begin{verbatim} - - - - -\end{verbatim} +%\begin{verbatim} +% +% +% +% +%\end{verbatim} This can be seen as the expression $(ab^{2\ldots 12})^{0 \ldots 65535}$, where $a$ and $b$ are themselves regular expressions @@ -604,9 +604,74 @@ language the regular expression $a^{1001}$ is not permitted, because no counter can be above 1000, and in the built-in Rust regular expression library expressions such as $a^{\{1000\}\{100\}\{5\}}$ give an error message -for being too big. These problems can of course be solved in matching algorithms where +for being too big. +As Becchi and Crawley\cite{Becchi08} have pointed out, +the reason for these restrictions +are that they simulate a non-deterministic finite +automata (NFA) with a breadth-first search. +This way the number of active states could +be equal to the counter number. +When the counters are large, +the memory requirement could become +infeasible, and the pattern needs to be rejected straight away. +\begin{figure}[H] +\begin{center} +\begin{tikzpicture} [node distance = 2cm, on grid, auto] + + \node (q0) [state, initial] {$0$}; + \node (q1) [state, right = of q0] {$1$}; + \node (q2) [state, right = of q1] {$2$}; + \node (qdots) [right = of q2] {$\ldots$}; + \node (qn) [state, right = of qdots] {$n$}; + \node (qn1) [state, right = of qn] {$n+1$}; + \node (qn2) [state, right = of qn1] {$n+2$}; + \node (qn3) [state, accepting, right = of qn2] {$n+3$}; + +\path [-stealth, thick] + (q0) edge [loop above] node {a} () + (q0) edge node {a} (q1) + (q1) edge node {.} (q2) + (q2) edge node {.} (qdots) + (qdots) edge node {.} (qn) + (qn) edge node {.} (qn1) + (qn1) edge node {b} (qn2) + (qn2) edge node {$c$} (qn3); +\end{tikzpicture} +%\begin{tikzpicture}[shorten >=1pt,node distance=2cm,on grid,auto] +% \node[state,initial] (q_0) {$0$}; +% \node[state, ] (q_1) [right=of q_0] {$1$}; +% \node[state, ] (q_2) [right=of q_1] {$2$}; +% \node[state, +% \node[state, accepting, ](q_3) [right=of q_2] {$3$}; +% \path[->] +% (q_0) edge node {a} (q_1) +% edge [loop below] node {a,b} () +% (q_1) edge node {a,b} (q_2) +% (q_2) edge node {a,b} (q_3); +%\end{tikzpicture} +\end{center} +\caption{The example given by Becchi and Crawley + that NFA simulation can consume large + amounts of memory: $.^*a.^{\{n\}}bc$ matching + strings of the form $aaa\ldots aaaabc$. + When traversing in a breadth-first manner, +all states from 0 till $n+1$ will become active.} +\end{figure} +%Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this +%type of $\mathit{NFA}$ simulation and guarantees a linear runtime +%in terms of input string length. +%TODO:try out these lexers +These problems can of course be solved in matching algorithms where automata go beyond the classic notion and for instance include explicit counters \cite{Turo_ov__2020}. +These solutions can be quite effective, +with the ability to process +gigabytes of string input per second +even with large counters \cite{Becchi08}. +But formally reasoning about these automata can be challenging +and un-intuitive. +Therefore, correctness and runtime claims made about these solutions need to be +taken with a grain of salt. In the work reported in \cite{CSL2022} and here, we add better support using derivatives @@ -629,16 +694,21 @@ boundedness property, which we shall prove later on. + + + \subsection{Back-References} -%Some constructs can even raise the expressive -%power to the non-regular realm, for example -%the back-references. -%bounded repetitions, as we have discussed in the -%previous section. -%This super-linear behaviour of the -%regex matching engines we have? +The other way to simulate an $\mathit{NFA}$ for matching is choosing +a single transition each time, keeping all the other options in +a queue or stack, and backtracking if that choice eventually +fails. This method, often called a "depth-first-search", +is efficient in a lot of cases, but could end up +with exponential run time. +The backtracking method is employed in regex libraries +that support \emph{back-references}, for example +in Java and Python. The syntax and expressive power of regexes -can go quickly beyond the regular language hierarchy +go beyond the regular language hierarchy with back-references. %\section{Back-references and The Terminology Regex} @@ -652,60 +722,56 @@ %This can be viewed as a breadth-first-search of the $\mathit{NFA}$ %for a path terminating %at an accepting state. -Languages like $\mathit{Go}$ and $\mathit{Rust}$ use this -type of $\mathit{NFA}$ simulation and guarantees a linear runtime -in terms of input string length. -%TODO:try out these lexers -The other way to use $\mathit{NFA}$ for matching is choosing -a single transition each time, keeping all the other options in -a queue or stack, and backtracking if that choice eventually -fails. This method, often called a "depth-first-search", -is efficient in a lot of cases, but could end up -with exponential run time.\\ -%TODO:COMPARE java python lexer speed with Rust and Go -The reason behind backtracking algorithms in languages like -Java and Python is that they support back-references. -\subsubsection{Back References} -If we have a regular expression like this (the sequence + + + +Given a regular expression like this (the sequence operator is omitted for brevity): \begin{center} - $r_1(r_2(r_3r_4))$ + $r_1r_2r_3r_4$ \end{center} -We could label sub-expressions of interest +one could label sub-expressions of interest by parenthesizing them and giving them a number by the order in which their opening parentheses appear. One possible way of parenthesizing and labelling is given below: \begin{center} $\underset{1}{(}r_1\underset{2}{(}r_2\underset{3}{(}r_3)\underset{4}{(}r_4)))$ \end{center} -$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$, $r_4$ are labelled -by 1 to 4. $1$ would refer to the entire expression -$(r_1(r_2(r_3)(r_4)))$, $2$ referring to $r_2(r_3)(r_4)$, etc. -These sub-expressions are called "capturing groups". -We can use the following syntax to denote that we want a string just matched by a -sub-expression (capturing group) to appear at a certain location again, -exactly as it was: +The sub-expressions +$r_1r_2r_3r_4$, $r_1r_2r_3$, $r_3$ and $r_4$ are labelled +by 1 to 4, and can be ``referred back'' by their respective numbers. +%These sub-expressions are called "capturing groups". +To do so, we use the syntax $\backslash i$ +to denote that we want the sub-string +of the input just matched by the i-th +sub-expression to appear again, +exactly the same as it first appeared: \begin{center} $\ldots\underset{\text{i-th lparen}}{(}{r_i})\ldots \underset{s_i \text{ which just matched} \;r_i}{\backslash i}$ \end{center} -The backslash and number $i$ are used to denote such -so-called "back-references". -Let $e$ be an expression made of regular expressions -and back-references. $e$ contains the expression $e_i$ -as its $i$-th capturing group. -The semantics of back-reference can be recursively -written as: +%The backslash and number $i$ are the +%so-called "back-references". +%Let $e$ be an expression made of regular expressions +%and back-references. $e$ contains the expression $e_i$ +%as its $i$-th capturing group. +%The semantics of back-reference can be recursively +%written as: +%\begin{center} +% \begin{tabular}{c} +% $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ +% $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ +% \end{tabular} +%\end{center} +A concrete example +for back-references would be \begin{center} - \begin{tabular}{c} - $L ( e \cdot \backslash i) = \{s @ s_i \mid s \in L (e)\quad s_i \in L(r_i)$\\ - $s_i\; \text{match of ($e$, $s$)'s $i$-th capturing group string}\}$ - \end{tabular} +$((a|b|c|\ldots|z)^*)\backslash 1$, \end{center} -The concrete example -$((a|b|c|\ldots|z)^*)\backslash 1$ -would match the string like $\mathit{bobo}$, $\mathit{weewee}$ and etc.\\ -Back-reference is a construct in the "regex" standard +which would match +strings that can be split into two identical halves, +for example $\mathit{bobo}$, $\mathit{weewee}$ and etc. +Back-reference is a regex construct that programmers found useful, but not exactly regular any more. In fact, that allows the regex construct to express diff -r ed53ce26ecb6 -r 99b530103464 ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Wed Sep 28 01:28:01 2022 +0100 +++ b/ChengsongTanPhdThesis/example.bib Thu Sep 29 14:05:42 2022 +0100 @@ -5,6 +5,14 @@ %% Saved with string encoding Unicode (UTF-8) +@inproceedings{Becchi08, +author = {Becchi, Michela and Crowley, Patrick}, +year = {2008}, +month = {01}, +pages = {25}, +title = {Extending finite automata to efficiently match Perl-compatible regular expressions}, +doi = {10.1145/1544012.1544037} +} @book{ Sakarovitch2009,