ninems/ninems.tex
changeset 63 d3c22f809dde
parent 62 5b10d83b0834
child 64 afd0d702a4fe
--- a/ninems/ninems.tex	Sat Jul 06 21:09:45 2019 +0100
+++ b/ninems/ninems.tex	Sat Jul 06 23:34:27 2019 +0100
@@ -2,16 +2,19 @@
 \usepackage{graphic}
 \usepackage{data}
 \usepackage{tikz-cd}
-\usepackage{algorithm}
+%\usepackage{algorithm}
 \usepackage{amsmath}
 \usepackage[noend]{algpseudocode}
 \usepackage{enumitem}
+
+\definecolor{darkblue}{rgb}{0,0,0.6}
+\hypersetup{colorlinks=true,allcolors=darkblue}
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
 %\usepackage{listings}
 % \usepackage{amsthm}
-% \usepackage{hyperref}
+%\usepackage{hyperref}
 % \usepackage[margin=0.5in]{geometry}
 %\usepackage{pmboxdraw}
  
@@ -181,10 +184,10 @@
 in the JavaScript and Python ecosystems \cite{Davis18}.
 
 This exponential blowup in matching algorithms sometimes causes
-considerable disturbance in real life: for example on 20 July 2016 one evil
+considerable grief in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 massive amounts of CPU-resources and because of this the web servers
@@ -197,7 +200,8 @@
 are not covered by the classical automata theory, and in this more
 general setting there are quite a few research questions still
 unanswered and fast algorithms still need to be developed (for example
-how to include bounded repetitions, negation and  back-references).
+how to treat bounded repetitions, negation and back-references
+efficiently).
 
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
@@ -396,16 +400,17 @@
 
 \noindent
 There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
-$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
-values is to encode parse trees. To see this, suppose a \emph{flatten}
-operation, written $|v|$, which we can use to extract the underlying
-string of a value $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
-(\textit{Char y})|$ is the string $xy$. We omit the straightforward
-definition of flatten. Using flatten, we can describe how values encode
-parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
-matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
-respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
-is contained in the sub-structure of $v_1$ and $v_2$. 
+$\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
+sequence regular expression and so on. The idea of values is to encode
+parse trees. To see this, consider a \emph{flatten} operation, written
+$|v|$, which we use to extract the underlying string of a value
+$v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
+y})|$ is the string $xy$.  Using flatten, we can describe how values
+encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
+|v_2|$ matches the regex $r_1 \cdot r_2$, whereby $r_1$ matches the
+substring $|v_1|$ and $r_2$ the substring $|v_2|$. Exactly how these
+two are matched is contained in the sub-structure of $v_1$ and $v_2$. 
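+
+\noindent
+The definition of flatten is by a straightforward recursion on the
+shape of values. For the value constructors mentioned so far, its
+clauses are the expected ones (the clause for star-values simply
+concatenates the strings of all iterations):
+
+\begin{center}
+\begin{tabular}{lcl}
+$|\Empty|$         & $=$ & $[]$\\
+$|\Char\,c|$       & $=$ & $[c]$\\
+$|\Left\,v|$       & $=$ & $|v|$\\
+$|\Right\,v|$      & $=$ & $|v|$\\
+$|\Seq\,v_1\,v_2|$ & $=$ & $|v_1| \,@\, |v_2|$\\
+\end{tabular}
+\end{center}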
 
 To give a concrete example of how values work, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
@@ -450,7 +455,7 @@
 \noindent
 For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
+\ldots c_n$. In the first phase, we build the derivatives $r_1$, $r_2$, \ldots according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
 obtain the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
@@ -481,23 +486,19 @@
 important later on.
 
 After this, we inject back the characters one by one in order to build
-the parse tree $v_i$ for how the regex $r_i$ matches the string
-$s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
-get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
-
-We need a function that
-reverses the "chopping off" for values which we did in the
-first phase for derivatives. The corresponding function is
-called $\textit{inj}$ (short for injection). This function takes three
-arguments: the first one is a regular expression ${r_{i-1}}$, 
-before the character is chopped off, the second is  ${c_{i-1}}$,
-the character
-we
-want to inject and the third argument is the value $textit{v_i}$, where we
-will inject the character(it corresponds to the regular expression
- after the character
-has been chopped off). The result of this function is a
-new value. The definition of $\textit{inj}$ is as follows: 
+the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_n$) from the previous parse tree $v_{i+1}$. After
+injecting back $n$ characters, we get the parse tree for how $r_0$
+matches $s$. For this, Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called $\textit{inj}$ (short for injection);
+it takes three arguments: the first one is the regular expression
+${r_{i-1}}$ before the character is chopped off, the second is the
+character ${c_{i-1}}$ we want to inject, and the third is the value
+$\textit{v_i}$, into which one wants to inject the character (it
+corresponds to the regular expression after the character has been
+chopped off). The result of this function is a new value. The
+definition of $\textit{inj}$ is as follows: 
 
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
@@ -511,54 +512,59 @@
 \end{tabular}
 \end{center}
 
-\noindent This definition is by recursion on the regular
-expressions' and values' shapes. 
-To understands this definition in a more vivid way, the reader 
-might imagine this: when we do the derivative on regular expression 
-$r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
-This leaves a "hole" on $r_i$. In order to calculate a value for
-$r_{i-1}$, we need to locate where that hole is. We can find this location
-informtation by comparing $r_{i-1}$ and $v_i$.
-For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
-is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that 
-\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
+\noindent This definition is by recursion on the ``shape'' of regular
+expressions and values. To understand this definition better, consider
+the situation when we build the derivative of the regular expression $r_{i-1}$.
+For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
+``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
+locate where that hole is. We can find this location information by
+comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
+$r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
+v_b))$, we know immediately that 
+%
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
+
+\noindent
 because, if $r_a$ were not nullable, we would have
-\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
-the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
-$v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
-$Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
+
+\noindent
+and the value $v_i$ would have to be of shape $\Seq(\ldots)$, contradicting
+the fact that $v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$
+is of shape $\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
 branch is taken instead of the right one. We have therefore found out 
-that the hole will be on $r_a$. So we recursively call $\textit{inj} 
-r_a c v_a$ to fill that hole in $v_a$. After injection, the value 
-$v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj}
-r_a c v_a, v_b)$.
+that the hole is in $r_a$. So we recursively call $\inj\, 
+r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value 
+$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
 Other clauses can be understood in a similar way.
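+
+To see how the two phases fit together, here is a minimal Scala sketch
+of the algorithm described so far. The datatype and function names are
+our own rendering of the standard presentation, not verbatim code by
+Sulzmann and Lu (for example, the value constructor $\Seq$ is written
+\texttt{Sequ} below to avoid a clash with Scala's built-in
+\texttt{Seq}):
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp
+case object ONE extends Rexp
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// can r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO        => false
+  case ONE         => true
+  case CHAR(_)     => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_)     => true
+}
+
+// Brzozowski derivative of r w.r.t. the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO        => ZERO
+  case ONE         => ZERO
+  case CHAR(d)     => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
+}
+
+// parse tree for how a nullable r matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE         => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_)     => Stars(Nil)
+}
+
+// inject c back into a value for r\c, giving a value for r
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (CHAR(d), Empty)                 => Chr(d)
+  case (ALT(r1, _), Left(v1))           => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2))          => Right(inj(r2, c, v2))
+  case (SEQ(r1, _), Sequ(v1, v2))       => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2))         => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (STAR(r1), Sequ(v1, Stars(vs)))  => Stars(inj(r1, c, v1) :: vs)
+}
+
+// first phase (derivatives) followed by second phase (injections)
+def lex(r: Rexp, s: List[Char]): Val = s match {
+  case Nil =>
+    if (nullable(r)) mkeps(r) else throw new Exception("not matched")
+  case c :: cs => inj(r, c, lex(der(c, r), cs))
+}
+\end{verbatim}
+
+\noindent
+For instance, \texttt{lex(r, "abc".toList)} builds the derivatives on
+the way down the recursion and performs the injections on the way back
+up, yielding a parse tree for how \texttt{r} matches \texttt{abc}.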
 
-To understand what is going on, it might be best to do some
-example calculations and compare them with Figure~\eqref{graph:2}.
-% A correctness proof using induction can be routinely established.
-Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to
-match it against the string $abc$(when $abc$ is written as a regular
-expression, the most standard way of expressing it should be $a \cdot (b
-\cdot c)$. We omit the parentheses and dots here for readability). By
-POSIX rules the lexer should go for the longest matching, i.e. it should
-match the string $abc$ in one star iteration, using the longest string
-$abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this
-sub-expression for conciseness). Here is how the lexer achieves a parse
-tree for this matching. First, build successive derivatives until we
-exhaust the string, as illustrated here(we simplified some regular
-expressions like $0 \cdot b$ to $0$ for conciseness. We also omit 
-some parentheses if they are clear from the context.):
+Let us do some further examples involving $\inj$: Suppose we have the
+regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
+the string $abc$ (when $abc$ is written as a regular expression, the
+standard way of expressing it is $a \cdot (b \cdot c)$; we omit the
+parentheses and dots here for readability). By POSIX rules the lexer
+should go for the longest matching, i.e.~it should match the string
+$abc$ in one star iteration, using the longest string $abc$ in the
+sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
+for conciseness). Here is how the lexer constructs a parse tree for
+this matching. First, we build successive derivatives until we exhaust
+the string, as illustrated below (we simplified some regular
+expressions like $0 \cdot b$ to $0$ for conciseness; we also omit some
+parentheses if they are clear from the context):
 %Similarly, we allow
 %$\textit{ALT}$ to take a list of regular expressions as an argument
 %instead of just 2 operands to reduce the nested depth of
 %$\textit{ALT}$
 \begin{center}
-\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
-\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\] 
-\[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
-\]
+\begin{tabular}{lcl}
+$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$\\
+      & $\xrightarrow{\backslash b}$ & $r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^*$\\
+      & $\xrightarrow{\backslash c}$ & $r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) + $\\ 
+      &                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
+\end{tabular}
 \end{center}
+
+\noindent
 Now, when $nullable$ gives a $yes$ on $r_3$, we call $mkeps$ 
 to construct a parse tree for how $r_3$ matched the string $abc$. 
 $mkeps$ gives the following value $v_3$: 
@@ -612,47 +618,51 @@
 not change because of later derivatives,
 there is no point in traversing this information twice. This leads to a new approach of lexing: if we store the information for parse trees in the corresponding regular expression pieces, update this information when we do derivative operations on them, and collect the information when finished with derivatives and call $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an $n$-step backward transformation. This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
 
-In the next section, we shall focus on the bit-coded algorithm and the natural
-process of simplification of regular expressions using bit-codes, which is needed in
+In the next section, we shall focus on the bit-coded algorithm and the
+process of simplification of regular expressions. This is needed in
 order to obtain \emph{fast} versions of Brzozowski's, and Sulzmann
-and Lu's algorithms.  This is where the PhD-project hopes to advance
-the state-of-the-art.
+and Lu's algorithms.  This is where the PhD-project aims to advance the
+state-of-the-art.
 
 
 \section{Simplification of Regular Expressions}
-Using bit-codes to guide  parsing is not a new idea.
-It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives.
+
+
+Using bit-codes to guide parsing is not a novel idea. It was applied to
+context free grammars and then adapted by Henglein and Nielson for
+efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and
+Lu took this a step further by integrating bitcodes into derivatives.
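+
+To give a first idea of what this integration looks like, here is a
+sketch of a datatype for such annotated regular expressions, reusing
+the Scala conventions from the earlier sketch (the names, and the
+choice of bits \texttt{Z} and \texttt{S}, are our assumptions; the
+details are given in the next section). The point is only that every
+constructor carries a list of bits recording the parse-tree information
+accumulated so far:
+
+\begin{verbatim}
+// bits recording parse-tree choices (Z ~ Left, S ~ Right)
+abstract class Bit
+case object Z extends Bit
+case object S extends Bit
+
+// annotated regular expressions: like Rexp, but each constructor
+// carries the bits accumulated by previous derivative steps
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: List[Bit]) extends ARexp
+case class ACHAR(bs: List[Bit], c: Char) extends ARexp
+case class AALT(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASEQ(bs: List[Bit], r1: ARexp, r2: ARexp) extends ARexp
+case class ASTAR(bs: List[Bit], r: ARexp) extends ARexp
+\end{verbatim}
+
+\noindent
+Taking derivatives of such annotated expressions updates the bit-lists,
+so that once the string is exhausted the parse tree can be read off
+directly, in one pass.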
 
-The argument for complicating the data structures from basic regular expressions to those with bitcodes
-is that we can introduce simplification without making the algorithm crash or overly complex to reason about.
-The reason why we need simplification is due to the shortcoming of the previous algorithm. 
-The main drawback of building successive derivatives according to
-Brzozowski's definition is that they can grow very quickly in size.
-This is mainly due to the fact that the derivative operation generates
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
-if implemented naively both algorithms by Brzozowski and by Sulzmann
-and Lu are excruciatingly slow. For example when starting with the
-regular expression $(a + aa)^*$ and building 12 successive derivatives
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree). Operations like
-derivative and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. Fortunately, one can simplify regular expressions after
-each derivative step. Various simplifications of regular expressions
-are possible, such as the simplifications of $\ZERO + r$,
-$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
-$r$. These simplifications do not affect the answer for whether a
-regular expression matches a string or not, but fortunately also do
-not affect the POSIX strategy of how regular expressions match
-strings---although the latter is much harder to establish. Some
-initial results in this regard have been obtained in
-\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
-is a very tight bound for the size. Such a tight bound is suggested by
-work of Antimirov who proved that (partial) derivatives can be bound
-by the number of characters contained in the initial regular
-expression \cite{Antimirov95}.
+The argument for complicating the data structures from basic regular
+expressions to those with bitcodes is that we can introduce
+simplification without making the algorithm crash or overly complex to
+reason about. The reason why we need simplification is a shortcoming of
+the previous algorithm: the main drawback of building successive
+derivatives according to Brzozowski's definition is that they can grow
+very quickly in size. This is mainly due to the fact that the derivative
+operation often generates ``useless'' $\ZERO$s and $\ONE$s in
+derivatives.  As a result, if implemented naively, both algorithms by
+Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example,
+when starting with the regular expression $(a + aa)^*$ and building 12
+successive derivatives w.r.t.~the character $a$, one obtains a
+derivative regular expression with more than 8000 nodes (when viewed as
+a tree). Operations like derivative and $\nullable$ need to traverse
+such trees and, consequently, the bigger the size of the derivative, the
+slower the algorithm. Fortunately, one can simplify regular expressions
+after each derivative step. Various simplifications of regular
+expressions are possible, such as the simplifications of $\ZERO + r$, $r
++ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
+simplifications do not affect the answer for whether a regular
+expression matches a string or not, but fortunately also do not affect
+the POSIX strategy of how regular expressions match strings---although
+the latter is much harder to establish. Some initial results in this
+regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
+what has not been achieved yet is a very tight bound for the size. Such
+a tight bound is suggested by the work of Antimirov, who proved that
+(partial) derivatives can be bounded by the number of characters contained
+in the initial regular expression \cite{Antimirov95}.
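+
+As a concrete sketch of such a simplification step, the following
+Scala function (reusing the datatypes from the earlier sketch; again
+our own rendering, not code from the cited papers) performs the
+rewrites listed above after each derivative:
+
+\begin{verbatim}
+// rewrite ZERO+r, r+ZERO, ONE.r, r.ONE and r+r to simpler forms;
+// also ZERO.r and r.ZERO to ZERO
+def simp(r: Rexp): Rexp = r match {
+  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, s2) => s2
+    case (s1, ZERO) => s1
+    case (s1, s2)   => if (s1 == s2) s1 else ALT(s1, s2)
+  }
+  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+    case (ZERO, _)  => ZERO
+    case (_, ZERO)  => ZERO
+    case (ONE, s2)  => s2
+    case (s1, ONE)  => s1
+    case (s1, s2)   => SEQ(s1, s2)
+  }
+  case _ => r
+}
+
+// a matcher that keeps derivatives small by simplifying at each step
+def matches(r: Rexp, s: List[Char]): Boolean =
+  nullable(s.foldLeft(r)((r1, c) => simp(der(c, r1))))
+\end{verbatim}
+
+\noindent
+For the matcher this is sound as it stands; for the lexer, however, the
+parse trees have to be repaired after such rewrites, which is what the
+bit-codes are designed to make possible in a single pass.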
 
-Antimirov defined the "partial derivatives" of regular expressions to be this:
+Antimirov defined the \emph{partial derivatives} of regular expressions as follows:
 %TODO definition of partial derivatives
 \begin{center}
 \begin{tabular}{lcl}