# HG changeset patch # User Christian Urban # Date 1562143722 -3600 # Node ID 8063792920ef08926c03a8209e537037b91e8143 # Parent 7d18745dd7c9d9e28c44c387e5c051cdc44c3e19 proof reading diff -r 7d18745dd7c9 -r 8063792920ef ninems/ninems.tex --- a/ninems/ninems.tex Tue Jul 02 22:24:27 2019 +0100 +++ b/ninems/ninems.tex Wed Jul 03 09:48:42 2019 +0100 @@ -65,13 +65,16 @@ Brzozowski introduced in 1964 a beautifully simple algorithm for regular expression matching based on the notion of derivatives of regular expressions. In 2014, Sulzmann and Lu extended this - algorithm to not just give a YES/NO answer for whether or not a regular - expression matches a string, but in case it matches also \emph{how} - it matches the string. This is important for applications such as - lexing (tokenising a string). The problem is to make the algorithm - by Sulzmann and Lu fast on all inputs without breaking its - correctness. We have already developed some simplification rules, but have not shown that they - preserve the correctness. We also have not yet looked at extended regular expressions. + algorithm to not just give a YES/NO answer for whether or not a + regular expression matches a string, but in case it matches also + \emph{how} it matches the string. This is important for + applications such as lexing (tokenising a string). The problem is to + make the algorithm by Sulzmann and Lu fast on all inputs without + breaking its correctness. We have already developed some + simplification rules for this, but have not proved that they + preserve the correctness of the algorithm. We also have not yet + looked at extended regular expressions, such as bounded repetitions, + negation and back-references. \end{abstract} \section{Introduction} @@ -242,7 +245,7 @@ \section{The Algorithms by Brzozowski, and Sulzmann and Lu} -Suppose basic regular expressions are given by the following grammar:\\ +Suppose (basic) regular expressions are given by the following grammar: \[ r ::= \ZERO \mid \ONE \mid c \mid r_1 \cdot r_2 @@ -251,15 +254,18 @@ \] \noindent -The intended meaning of the regular expressions is as usual: $\ZERO$ +The intended meaning of the constructors is as usual: $\ZERO$ cannot match any string, $\ONE$ can match the empty string, the character regular expression $c$ can match the character $c$, and so -on. The brilliant contribution by Brzozowski is the notion of +on. + +The brilliant contribution by Brzozowski is the notion of \emph{derivatives} of regular expressions. The idea behind this notion is as follows: suppose a regular expression $r$ can match a string of the form $c\!::\! s$ (that is a list of characters starting with $c$), what does the regular expression look like that can match -just $s$? Brzozowski gave a neat answer to this question. He started with the definition of $nullable$: +just $s$? Brzozowski gave a neat answer to this question. He started +with the definition of $nullable$: \begin{center} \begin{tabular}{lcl} $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ @@ -295,8 +301,8 @@ %Assuming the classic notion of a %\emph{language} of a regular expression, written $L(\_)$, t -The main -property of the derivative operation is that +\noindent +The main property of the derivative operation is that \begin{center} $c\!::\!s \in L(r)$ holds @@ -316,37 +322,46 @@ For the moment however, we focus only on the usual basic regular expressions. -Now if we want to find out whether a string $s$ -matches with a regular expression $r$, build the derivatives of $r$ -w.r.t.\ (in succession) all the characters of the string $s$. Finally, -test whether the resulting regular expression can match the empty -string. If yes, then $r$ matches $s$, and no in the negative -case. - -For this we can generalise the derivative operation for strings like this: +Now if we want to find out whether a string $s$ matches with a regular +expression $r$, build the derivatives of $r$ w.r.t.\ (in succession) +all the characters of the string $s$. Finally, test whether the +resulting regular expression can match the empty string. If yes, then +$r$ matches $s$, and no in the negative case. To implement this idea +we can generalise the derivative operation to strings like this: \begin{center} \begin{tabular}{lcl} $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ -$r \backslash \epsilon $ & $\dn$ & $r$ +$r \backslash [\,] $ & $\dn$ & $r$ \end{tabular} \end{center} + \noindent -Using the above definition we obtain a simple and elegant regular +Using this definition we obtain a simple and elegant regular expression matching algorithm: \[ match\;s\;r \;\dn\; nullable(r\backslash s) \] -This algorithm can be illustrated as follows: + +\noindent +Pictorially this algorithm can be illustrated as follows: + +\begin{center} \begin{tikzcd}\label{graph:*} r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"nullable?"] & Yes/No \end{tikzcd} -where we bnuild the successive derivative until we exhaust the string. +\end{center} + +\noindent +where we start with a regular expression $r_0$, build successive derivatives +until we exhaust the string and then use \textit{nullable} to test whether the +result can match the empty string. It can be relatively easily shown that this +matcher is correct. One limitation, however, of Brzozowski's algorithm is that it only produces a YES/NO answer for whether a string is being matched by a regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm to allow generation of an actual matching, called a -\emph{value}. +\emph{value}. \begin{center} \begin{tabular}{c@{\hspace{20mm}}c}