cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:5b10d83b0834
+:d3c22f809dde
 \documentclass[a4paper,UKenglish]{lipics}
 \usepackage{graphic}
 \usepackage{data}
 \usepackage{tikz-cd}
-\usepackage{algorithm}
+%\usepackage{algorithm}
 \usepackage{amsmath}
 \usepackage[noend]{algpseudocode}
 \usepackage{enumitem}
+\definecolor{darkblue}{rgb}{0,0,0.6}
+\hypersetup{colorlinks=true,allcolors=darkblue}
 % \documentclass{article}
 %\usepackage[utf8]{inputenc}
 %\usepackage[english]{babel}
 %\usepackage{listings}
 % \usepackage{amsthm}
-% \usepackage{hyperref}
+%\usepackage{hyperref}
 % \usepackage[margin=0.5in]{geometry}
 %\usepackage{pmboxdraw}
 \title{POSIX Regular Expression Matching and Lexing}
 \author{Chengsong Tan}
 them---\emph{evil regular expressions}. In empiric work, Davis et al
 report that they have found thousands of such evil regular expressions
 in the JavaScript and Python ecosystems \cite{Davis18}.
 This exponential blowup in matching algorithms sometimes causes
-considerable disturbance in real life: for example on 20 July 2016 one evil
+considerable grief in real life: for example on 20 July 2016 one evil
 regular expression brought the webpage
 \href{http://stackexchange.com}{Stack Exchange} to its
-knees.\footnote{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
+knees.\footnote{\url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}}
 In this instance, a regular expression intended to just trim white
 spaces from the beginning and the end of a line actually consumed
 massive amounts of CPU-resources and because of this the web servers
 ground to a halt. This happened when a post with 20,000 white spaces was
 submitted, but importantly the white spaces were neither at the
 that many ``real life'' regular expression matching engines do not use
 DFAs for matching. This is because they support regular expressions that
 are not covered by the classical automata theory, and in this more
 general setting there are quite a few research questions still
 unanswered and fast algorithms still need to be developed (for example
-how to include bounded repetitions, negation and  back-references).
+how to treat bounded repetitions, negation and  back-references
+efficiently).
 There is also another under-researched problem to do with regular
 expressions and lexing, i.e.~the process of breaking up strings into
 sequences of tokens according to some regular expressions. In this
 setting one is not just interested in whether or not a regular
 	\end{tabular}
 \end{center}
 \noindent
 There is no value  corresponding to $\ZERO$; $\Empty$ corresponds to
-$\ONE$; $\Seq$ to the sequence regular expression and so on. The idea of
+$\ONE$; $\Char$ to the character regular expression; $\Seq$ to the
-values is to encode parse trees. To see this, suppose a \emph{flatten}
+sequence regular expression and so on. The idea of values is to encode
-operation, written $|v|$, which we can use to extract the underlying
+parse trees. To see this, suppose a \emph{flatten} operation, written
-string of a value $v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \,
+$|v|$. We use this function to extract the underlying string of a value
-(\textit{Char y})|$ is the string $xy$. We omit the straightforward
+$v$. For example, $|\mathit{Seq} \, (\textit{Char x}) \, (\textit{Char
-definition of flatten. Using flatten, we can describe how values encode
+y})|$ is the string $xy$.  Using flatten, we can describe how values
-parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @ |v_2|$
+encode parse trees: $\Seq\,v_1\, v_2$ encodes how the string $|v_1| @
-matches the regex $r_1 \cdot r_2$: $r_1$ matches the substring $|v_1|$ and,
+|v_2|$ matches the regex $r_1 \cdot r_2$ whereby $r_1$ matches the
-respectively, $r_2$ matches the substring $|v_2|$. Exactly how these two are matched
+substring $|v_1|$ and, respectively, $r_2$ matches the substring
-is contained in the sub-structure of $v_1$ and $v_2$.
+$|v_2|$. Exactly how these two are matched is contained in the
+sub-structure of $v_1$ and $v_2$.
 To give a concrete example of how value works, consider the string $xy$
 and the regular expression $(x + (y + xy))^*$. We can view this regular
 expression as a tree and if the string $xy$ is matched by two Star
 ``iterations'', then the $x$ is matched by the left-most alternative in
 \end{equation}
 \noindent
 For convenience, we shall employ the following notations: the regular expression we
 start with is $r_0$, and the given string $s$ is composed of characters $c_0 c_1
-\ldots c_n$. First, we build the derivatives $r_1$, $r_2$, \ldots  according to
+\ldots c_n$. In  the first phase, we build the derivatives $r_1$, $r_2$, \ldots  according to
 the characters $c_0$, $c_1$,\ldots  until we exhaust the string and
 obtain at the derivative $r_n$. We test whether this derivative is
 $\textit{nullable}$ or not. If not, we know the string does not match
 $r$ and no value needs to be generated. If yes, we start building the
 parse tree incrementally by \emph{injecting} back the characters into
 also that in case of alternatives we give preference to the
 regular expression on the left-hand side. This will become
 important later on.
 After this, we inject back the characters one by one in order to build
-the parse tree $v_i$ for how the regex $r_i$ matches the string
+the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$
-$s_i$ ($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After injecting back $n$ characters, we
+($s_i = c_i \ldots c_n$ ) from the previous parse tree $v_{i+1}$. After
-get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
+injecting back $n$ characters, we get the parse tree for how $r_0$
+matches $s$. For this Sulzmann and Lu defined a function that reverses
-We need a function that
+the ``chopping off'' of characters during the derivative phase. The
-reverses the "chopping off" for values which we did in the
+corresponding function is called $\textit{inj}$; it takes three
-first phase for derivatives. The corresponding function is
+arguments: the first one is a regular expression ${r_{i-1}}$, before the
-called $\textit{inj}$ (short for injection). This function takes three
+character is chopped off, the second is a character ${c_{i-1}}$, the
-arguments: the first one is a regular expression ${r_{i-1}}$,
+character we want to inject and the third argument is the value
-before the character is chopped off, the second is  ${c_{i-1}}$,
+$textit{v_i}$, into which one wants to inject the character (it
-the character
+corresponds to the regular expression after the character has been
-we
+chopped off). The result of this function is a new value. The definition
-want to inject and the third argument is the value $textit{v_i}$, where we
+of $\textit{inj}$ is as follows:
-will inject the character(it corresponds to the regular expression
-after the character
-has been chopped off). The result of this function is a
-new value. The definition of $\textit{inj}$ is as follows:
 \begin{center}
 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
 $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
 $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
 $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
 \end{tabular}
 \end{center}
-\noindent This definition is by recursion on the regular
+\noindent This definition is by recursion on the ``shape'' of regular
-expressions' and values' shapes.
+expressions and values. To understands this definition better consider
-To understands this definition in a more vivid way, the reader
+the situation when we build the derivative on regular expression $r_{i-1}$.
-might imagine this: when we do the derivative on regular expression
+For this we chop off a character from $r_{i-1}$ to form $r_i$. This leaves a
-$r_{i-1}$, we chop off a character from $r_{i-1}$ to form $r_i$.
+``hole'' in $r_i$. In order to calculate a value for $r_{i-1}$, we need to
-This leaves a "hole" on $r_i$. In order to calculate a value for
+locate where that hole is. We can find this location information by
-$r_{i-1}$, we need to locate where that hole is. We can find this location
+comparing $r_{i-1}$ and $v_i$. For instance, if $r_{i-1}$ is of shape
-informtation by comparing $r_{i-1}$ and $v_i$.
+$r_a \cdot r_b$, and $v_i$ is of shape $\textit{Left}(\textit{Seq}(v_a,
-For instance, if $r_{i-1}$ is of shape $r_a \cdot r_b$, and $v_i$
+v_b))$, we know immediately that
-is of shape $\textit{Left}(\textit{Seq}(v_a, v_b))$, we know immediately that
+%
-\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c\],
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b \,+\, r_b\backslash c,\]
+\noindent
 otherwise if $r_a$ is not nullable,
-\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b\],
+\[ (r_a \cdot r_b)\backslash c = (r_a\backslash c) \cdot r_b,\]
-the value $v_i$ should be of shape $Seq(\ldots)$, contradicting the fact that
-$v_i$ is actually $Left(\ldots)$. Furthermore, since $v_i$ is of shape
+\noindent
-$Left(\ldots)$ instead of $Right(\ldots)$, we know that the left
+the value $v_i$ should be of shape $\Seq(\ldots)$, contradicting the fact that
+$v_i$ is actually $\Left(\ldots)$. Furthermore, since $v_i$ is of shape
+$\Left(\ldots)$ instead of $\Right(\ldots)$, we know that the left
 branch is taken instead of the right one. We have therefore found out
-that the hole will be on $r_a$. So we recursively call $\textit{inj}
+that the hole will be on $r_a$. So we recursively call $\inj\,
-r_a c v_a$ to fill that hole in $v_a$. After injection, the value
+r_a\,c\,v_a$ in order to fill that hole in $v_a$. After injection, the value
-$v_i$ for $r_i = r_a \cdot r_b$ should be $\textit{Seq}(\textit{inj}
+$v_i$ for $r_i = r_a \cdot r_b$ should be $\Seq\,(\inj\,r_a\,c\,v_a)\,v_b$.
-r_a c v_a, v_b)$.
 Other clauses can be understood in a similar way.
-To understand what is going on, it might be best to do some
+Let us do some further examples involving $\inj$: Suppose we have the
-example calculations and compare them with Figure~\eqref{graph:2}.
+regular expression $((((a+b)+ab)+c)+abc)^*$ and want to match it against
-% A correctness proof using induction can be routinely established.
+the string $abc$ (when $abc$ is written as a regular expression, the most
-Suppose we have a regular expression $((((a+b)+ab)+c)+abc)^*$ and want to
+standard way of expressing it should be $a \cdot (b \cdot c)$. We omit
-match it against the string $abc$(when $abc$ is written as a regular
+the parentheses and dots here for readability). By POSIX rules the lexer
-expression, the most standard way of expressing it should be $a \cdot (b
+should go for the longest matching, i.e. it should match the string
-\cdot c)$. We omit the parentheses and dots here for readability). By
+$abc$ in one star iteration, using the longest string $abc$ in the
-POSIX rules the lexer should go for the longest matching, i.e. it should
+sub-expression $a+b+ab+c+abc$ (we use $r$ to denote this sub-expression
-match the string $abc$ in one star iteration, using the longest string
+for conciseness). Here is how the lexer achieves a parse tree for this
-$abc$ in the sub-expression $a+b+ab+c+abc$(we use $r$ to denote this
+matching. First, build successive derivatives until we exhaust the
-sub-expression for conciseness). Here is how the lexer achieves a parse
+string, as illustrated here (we simplified some regular expressions like
-tree for this matching. First, build successive derivatives until we
+$0 \cdot b$ to $0$ for conciseness; we also omit some parentheses if
-exhaust the string, as illustrated here(we simplified some regular
+they are clear from the context):
-expressions like $0 \cdot b$ to $0$ for conciseness. We also omit
-some parentheses if they are clear from the context.):
 %Similarly, we allow
 %$\textit{ALT}$ to take a list of regular expressions as an argument
 %instead of just 2 operands to reduce the nested depth of
 %$\textit{ALT}$
 \begin{center}
-\[ r^* \xrightarrow{\backslash a} r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^* \xrightarrow{\backslash b}\]
+\begin{tabular}{lcl}
-\[r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^* \xrightarrow{\backslash c}\]
+$r^*$ & $\xrightarrow{\backslash a}$ & $r_1 = (1+0+1 \cdot b + 0 + 1 \cdot b \cdot c) \cdot r^*$\\
-\[r_3 = (( 0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) +((0+1+0  + 0 + 0) \cdot r^*+(0+0+0  + 1 + 0) \cdot r^* )
+& $\xrightarrow{\backslash b}$ & $r_2 = (0+0+1 \cdot 1 + 0 + 1 \cdot 1 \cdot c) \cdot r^* +(0+1+0  + 0 + 0) \cdot r^*$\\
-\]
+& $\xrightarrow{\backslash c}$ & $r_3 = ((0+0+0 + 0 + 1 \cdot 1 \cdot 1) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^*) + $\\
+&                              & $\phantom{r_3 = (} ((0+1+0  + 0 + 0) \cdot r^* + (0+0+0  + 1 + 0) \cdot r^* )$
+\end{tabular}
 \end{center}
+\noindent
 Now instead when $nullable$ gives a $yes$ on $r_3$, we  call $mkeps$
 to construct a parse tree for how $r_3$ matched the string $abc$.
 $mkeps$ gives the following value $v_3$:
 \[$Left(Left(Seq(Right(Seq(Empty, Seq(Empty, Empty))), Stars [])))$\]
 The outer 2 $Left$ tells us the first nullable part  hides in the term
 Which one of these alternatives is chosen later does not affect their relative position because our algorithm
 does not change this order. If this parsing information can be determined and does
 not change because of later derivatives,
 there is no point in traversing this information twice. This leads to a new approach of lexing-- if we store the information for parse trees  in the corresponding regular expression pieces, update this information when we do derivative operation on them, and collect the information when finished with derivatives and calling $mkeps$ for deciding which branch is POSIX, we can generate the parse tree in one pass, instead of doing an n-step backward transformation.This leads to Sulzmann and Lu's novel idea of using bit-codes on derivatives.
-In the next section, we shall focus on the bit-coded algorithm and the natural
+In the next section, we shall focus on the bit-coded algorithm and the
-process of simplification of regular expressions using bit-codes, which is needed in
+process of simplification of regular expressions. This is needed in
 order to obtain \emph{fast} versions of the Brzozowski's, and Sulzmann
-and Lu's algorithms.  This is where the PhD-project hopes to advance
+and Lu's algorithms.  This is where the PhD-project aims to advance the
-the state-of-the-art.
+state-of-the-art.
 \section{Simplification of Regular Expressions}
-Using bit-codes to guide  parsing is not a new idea.
-It was applied to context free grammars and then adapted by Henglein and Nielson for efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and Lu took a step further by integrating bitcodes into derivatives.
+Using bit-codes to guide  parsing is not a novel idea. It was applied to
-The argument for complicating the data structures from basic regular expressions to those with bitcodes
+context free grammars and then adapted by Henglein and Nielson for
-is that we can introduce simplification without making the algorithm crash or overly complex to reason about.
+efficient regular expression parsing \cite{nielson11bcre}. Sulzmann and
-The reason why we need simplification is due to the shortcoming of the previous algorithm.
+Lu took a step further by integrating bitcodes into derivatives.
-The main drawback of building successive derivatives according to
-Brzozowski's definition is that they can grow very quickly in size.
+The argument for complicating the data structures from basic regular
-This is mainly due to the fact that the derivative operation generates
+expressions to those with bitcodes is that we can introduce
-often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result,
+simplification without making the algorithm crash or overly complex to
-if implemented naively both algorithms by Brzozowski and by Sulzmann
+reason about. The reason why we need simplification is due to the
-and Lu are excruciatingly slow. For example when starting with the
+shortcoming of the previous algorithm. The main drawback of building
-regular expression $(a + aa)^*$ and building 12 successive derivatives
+successive derivatives according to Brzozowski's definition is that they
-w.r.t.~the character $a$, one obtains a derivative regular expression
+can grow very quickly in size. This is mainly due to the fact that the
-with more than 8000 nodes (when viewed as a tree). Operations like
+derivative operation generates often ``useless'' $\ZERO$s and $\ONE$s in
-derivative and $\nullable$ need to traverse such trees and
+derivatives.  As a result, if implemented naively both algorithms by
-consequently the bigger the size of the derivative the slower the
+Brzozowski and by Sulzmann and Lu are excruciatingly slow. For example
-algorithm. Fortunately, one can simplify regular expressions after
+when starting with the regular expression $(a + aa)^*$ and building 12
-each derivative step. Various simplifications of regular expressions
+successive derivatives w.r.t.~the character $a$, one obtains a
-are possible, such as the simplifications of $\ZERO + r$,
+derivative regular expression with more than 8000 nodes (when viewed as
-$r + \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just
+a tree). Operations like derivative and $\nullable$ need to traverse
-$r$. These simplifications do not affect the answer for whether a
+such trees and consequently the bigger the size of the derivative the
-regular expression matches a string or not, but fortunately also do
+slower the algorithm. Fortunately, one can simplify regular expressions
-not affect the POSIX strategy of how regular expressions match
+after each derivative step. Various simplifications of regular
-strings---although the latter is much harder to establish. Some
+expressions are possible, such as the simplifications of $\ZERO + r$, $r
-initial results in this regard have been obtained in
++ \ZERO$, $\ONE\cdot r$, $r \cdot \ONE$, and $r + r$ to just $r$. These
-\cite{AusafDyckhoffUrban2016}. However, what has not been achieved yet
+simplifications do not affect the answer for whether a regular
-is a very tight bound for the size. Such a tight bound is suggested by
+expression matches a string or not, but fortunately also do not affect
-work of Antimirov who proved that (partial) derivatives can be bound
+the POSIX strategy of how regular expressions match strings---although
-by the number of characters contained in the initial regular
+the latter is much harder to establish. Some initial results in this
-expression \cite{Antimirov95}.
+regard have been obtained in \cite{AusafDyckhoffUrban2016}. However,
+what has not been achieved yet is a very tight bound for the size. Such
-Antimirov defined the "partial derivatives" of regular expressions to be this:
+a tight bound is suggested by work of Antimirov who proved that
+(partial) derivatives can be bound by the number of characters contained
+in the initial regular expression \cite{Antimirov95}.
+Antimirov defined the \emph{partial derivatives} of regular expressions to be this:
 %TODO definition of partial derivatives
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{pder} \; c \; 0$ & $\dn$ & $\emptyset$\\
 $\textit{pder} \; c \; 1$ & $\dn$ & $\emptyset$ \\

changeset 63	d3c22f809dde
parent 62	5b10d83b0834
child 64	afd0d702a4fe