--- a/ninems/ninems.tex Tue Jul 02 00:14:42 2019 +0100
+++ b/ninems/ninems.tex Tue Jul 02 11:22:39 2019 +0100
@@ -215,8 +215,8 @@
matching, only relatively recently have precise definitions of what
POSIX matching actually means been formalised
\cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means matching the longest initial substring and
-in the case of a tie, the initial submatch is chosen according to some priorities attached to the
+POSIX matching means matching the longest initial substring.
+In the case of a tie, the initial submatch is chosen according to some priorities attached to the
regular expressions (e.g.~keywords have a higher priority than
identifiers). This sounds rather simple, but according to Grathwohl et
al.~\cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
@@ -291,6 +291,16 @@
\noindent
+For us the main advantage is that derivatives can be
+straightforwardly implemented in any functional programming language,
+and are easily definable and reasoned about in theorem provers---the
+definitions just consist of inductive datatypes and simple recursive
+functions. Moreover, the notion of derivatives can be easily
+generalised to cover extended regular expression constructors such as
+the negation of a regular expression, written $\neg\,r$, or bounded repetitions
+(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
+straightforwardly realised within the classic automata approach.
+
%Assuming the classic notion of a
@@ -309,7 +319,8 @@
w.r.t.\ (in succession) all the characters of the string $s$. Finally,
test whether the resulting regular expression can match the empty
-string. If yes, then $r$ matches $s$, and no in the negative
-case.\\
+string. If yes, then $r$ matches $s$; otherwise it does not.
+
We can generalise the derivative operation for strings like this:
\begin{center}
\begin{tabular}{lcl}
@@ -317,27 +328,12 @@
$r \backslash \epsilon $ & $\dn$ & $r$
\end{tabular}
\end{center}
- For us the main advantage is that derivatives can be
-straightforwardly implemented in any functional programming language,
-and are easily definable and reasoned about in theorem provers---the
-definitions just consist of inductive datatypes and simple recursive
-functions. Moreover, the notion of derivatives can be easily
-generalised to cover extended regular expression constructors such as
-the not-regular expression, written $\neg\,r$, or bounded repetitions
-(for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
-straightforwardly realised within the classic automata approach.
-
-
-We obtain a simple and elegant regular
+\noindent
+Using the above definition we obtain a simple and elegant regular
expression matching algorithm:
-\begin{definition}{matcher}
\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]
-\end{definition}
-
-
-
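The matcher can be implemented directly from these definitions. Below is a minimal Python sketch; the tuple encoding of regular expressions (`('0',)` for $\ZERO$, `('1',)` for $\ONE$, `('c', ch)` for characters, `('+', r1, r2)` for alternatives, `('.', r1, r2)` for sequences, `('*', r)` for star) is our own assumption, not the paper's notation.

```python
def nullable(r):
    # can r match the empty string?
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])

def der(c, r):
    # Brzozowski derivative of r w.r.t. the character c
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        d = ('.', der(c, r[1]), r[2])
        return ('+', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == '*': return ('.', der(c, r[1]), r)

def ders(s, r):
    # derivative w.r.t. a string: r\epsilon = r, r\(c::s) = (r\c)\s
    for c in s:
        r = der(c, r)
    return r

def match(s, r):
    return nullable(ders(s, r))
```
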
One limitation, however, of Brzozowski's algorithm is that it only
produces a YES/NO answer for whether a string is matched by a
regular expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this
@@ -372,6 +368,7 @@
\noindent
Here we align regular expressions and values of the same shape to illustrate the correspondence between them.
+
The flatten notation $|v|$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, (\mathit{Char}\,c) \, (\mathit{Char}\,d)| = cd$. We omit this straightforward definition.
Values are a way of expressing parse trees (the tree structures that tell us how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched is recorded in the sub-structures of $v_1$ and $v_2$.
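The flatten operation can be sketched as follows, assuming values are encoded as tuples (`('Empty',)`, `('Chr', c)`, `('Left', v)`, `('Right', v)`, `('Seq', v1, v2)`, `('Stars', [v1, ..., vn])` — an encoding of our own choosing, for illustration only):

```python
def flat(v):
    # |v|: extract the characters stored in a value to form a string
    tag = v[0]
    if tag == 'Empty': return ""
    if tag == 'Chr': return v[1]
    if tag in ('Left', 'Right'): return flat(v[1])
    if tag == 'Seq': return flat(v[1]) + flat(v[2])
    if tag == 'Stars': return "".join(flat(w) for w in v[1])
```
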
@@ -414,8 +411,9 @@
r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n
\end{tikzcd}
+
We shall briefly explain this process.\\ For convenience we use the following notation: the regular expression $r$ used for matching is also called $r_0$, and the string $s$ is composed of $n$ characters $c_0 c_1 \ldots c_{n-1}$.
-First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
+First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ... until we get the final derivative $r_n$. We then test whether $r_n$ is $nullable$. If not, we know immediately that the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$ (which stands for make epsilon, i.e.\ make the parse tree for how the empty word matches the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
After this, we inject back the characters one by one, in reverse order to the one in which they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$ ($s_i$ means the string $s$ with the first $i$ characters chopped off) from the previous parse tree $v_{i+1}$. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
An inductive proof can be routinely established.
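The whole process can be sketched in Python as follows (a sketch under our own tuple encodings of regexes and values, not the authors' code; `der` and `nullable` are the standard Brzozowski operations):

```python
def nullable(r):
    tag = r[0]
    if tag in ('0', 'c'): return False
    if tag in ('1', '*'): return True
    if tag == '+': return nullable(r[1]) or nullable(r[2])
    if tag == '.': return nullable(r[1]) and nullable(r[2])

def der(c, r):
    tag = r[0]
    if tag in ('0', '1'): return ('0',)
    if tag == 'c': return ('1',) if r[1] == c else ('0',)
    if tag == '+': return ('+', der(c, r[1]), der(c, r[2]))
    if tag == '.':
        d = ('.', der(c, r[1]), r[2])
        return ('+', d, der(c, r[2])) if nullable(r[1]) else d
    if tag == '*': return ('.', der(c, r[1]), r)

def mkeps(r):
    # parse tree for how a nullable r matches the empty string
    tag = r[0]
    if tag == '1': return ('Empty',)
    if tag == '+':
        return ('Left', mkeps(r[1])) if nullable(r[1]) else ('Right', mkeps(r[2]))
    if tag == '.': return ('Seq', mkeps(r[1]), mkeps(r[2]))
    if tag == '*': return ('Stars', [])

def inject(r, c, v):
    # turn a value for r\c into a value for r, adding back the character c
    tag = r[0]
    if tag == 'c': return ('Chr', r[1])
    if tag == '+':
        if v[0] == 'Left': return ('Left', inject(r[1], c, v[1]))
        return ('Right', inject(r[2], c, v[1]))
    if tag == '.':
        if v[0] == 'Seq': return ('Seq', inject(r[1], c, v[1]), v[2])
        if v[0] == 'Left':   # from the alternative created when r1 is nullable
            return ('Seq', inject(r[1], c, v[1][1]), v[1][2])
        return ('Seq', mkeps(r[1]), inject(r[2], c, v[1]))
    if tag == '*':           # v = Seq(v1, Stars vs): one more iteration
        return ('Stars', [inject(r[1], c, v[1])] + v[2][1])

def lex(r, s):
    if s == "":
        return mkeps(r) if nullable(r) else None
    v = lex(der(s[0], r), s[1:])
    return None if v is None else inject(r, s[0], v)
```
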
@@ -500,7 +498,7 @@
where $\Z$ and $\S$ are arbitrary names for the bits in the
bitsequences.
Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start of a non-empty star iteration into $\Z$, and the border where a star terminates into $\S$ (matching the $bs \,@\, [\S]$ case of $\textit{bmkeps}$ below). This conversion is lossy: it throws away the character information and does not mark the boundary between the two operands of a sequence. Moreover, from the bitcode alone we cannot even tell whether the $\S$s and $\Z$s stand for $Left/Right$ or for $Stars$. The reason for choosing this compact way of storing information is that bitsequences are small and can be easily moved around during the lexing process. In order to recover a value from a bitcode, we need the regular expression as extra information and decode the bits back into a value:\\
-\begin{definition}[Bitdecoding of Values]\mbox{}
+%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
@@ -527,12 +525,12 @@
\textit{else}\;\textit{None}$
\end{tabular}
\end{center}
-\end{definition}
+%\end{definition}
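The encode/decode pair can be sketched as follows. This is an illustration under our own encodings (bits as `'Z'`/`'S'` list entries, values and regexes as tuples) and under the convention that an empty star is coded as $[\S]$, consistent with $\textit{bmkeps}$ below; it is not the authors' code.

```python
def code(v):
    # encode a value as a bitsequence; characters are thrown away
    tag = v[0]
    if tag in ('Empty', 'Chr'): return []
    if tag == 'Left':  return ['Z'] + code(v[1])
    if tag == 'Right': return ['S'] + code(v[1])
    if tag == 'Seq':   return code(v[1]) + code(v[2])
    if tag == 'Stars':
        if v[1] == []: return ['S']              # star terminates
        return ['Z'] + code(v[1][0]) + code(('Stars', v[1][1:]))

def decode1(bs, r):
    # decode': returns (value, remaining bits); r supplies the lost structure
    tag = r[0]
    if tag == '1': return ('Empty',), bs
    if tag == 'c': return ('Chr', r[1]), bs
    if tag == '+':
        v, rest = decode1(bs[1:], r[1] if bs[0] == 'Z' else r[2])
        return (('Left', v) if bs[0] == 'Z' else ('Right', v)), rest
    if tag == '.':
        v1, bs1 = decode1(bs, r[1])
        v2, bs2 = decode1(bs1, r[2])
        return ('Seq', v1, v2), bs2
    if tag == '*':
        if bs[0] == 'S': return ('Stars', []), bs[1:]
        v, bs1 = decode1(bs[1:], r[1])
        vs, bs2 = decode1(bs1, r)
        return ('Stars', [v] + vs[1]), bs2

def decode(bs, r):
    v, rest = decode1(bs, r)
    return v if rest == [] else None
```

Note how decode only succeeds when the bitsequence is consumed exactly, mirroring the $\textit{if}\;bs' = []$ test in the definition above.
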
To do lexing using annotated regular expressions, we shall first transform the
usual (un-annotated) regular expressions into annotated regular
expressions:\\
-\begin{definition}
+%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
$(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
@@ -547,9 +545,9 @@
$\textit{STAR}\;[]\,r^\uparrow$\\
\end{tabular}
\end{center}
-\end{definition}
+%\end{definition}
Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as the one we defined earlier for ordinary regular expressions, except that we take special care of the bits that store the parse tree information:\\
-\begin{definition}{bder}
+%\begin{definition}{bder}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
@@ -569,9 +567,9 @@
(\textit{STAR}\,[]\,r)$
\end{tabular}
\end{center}
-\end{definition}
+%\end{definition}
This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running the bitcoded version of $mkeps$, called $\textit{bmkeps}$:
-\begin{definition}[\textit{bmkeps}]\mbox{}
+%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
$\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
@@ -585,10 +583,17 @@
$bs \,@\, [\S]$
\end{tabular}
\end{center}
-\end{definition}
-and then decode the bits using the regular expression. The whole process looks like this:\\
-r
-\\
+%\end{definition}
+and then decode the bits using the regular expression. After putting these pieces together, the whole process looks like this:\\
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{blexer}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
+ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
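Putting these pieces together, $\textit{blexer}$ might be sketched in Python as follows. Again, the tuple encodings of annotated regexes (`('AONE', bs)`, `('ACHAR', bs, c)`, etc.) and the `'Z'`/`'S'` bits are our own assumptions for illustration; `decode1` is the decoding function from above.

```python
def fuse(bs, a):
    # prepend bits bs onto the top-level bits of annotated regex a
    if a[0] == 'AZERO': return a
    return (a[0], bs + a[1]) + a[2:]

def internalise(r):
    # r^up: attach empty bitsequences, mark alternatives with Z/S
    tag = r[0]
    if tag == '0': return ('AZERO',)
    if tag == '1': return ('AONE', [])
    if tag == 'c': return ('ACHAR', [], r[1])
    if tag == '+':
        return ('AALT', [], fuse(['Z'], internalise(r[1])),
                            fuse(['S'], internalise(r[2])))
    if tag == '.': return ('ASEQ', [], internalise(r[1]), internalise(r[2]))
    if tag == '*': return ('ASTAR', [], internalise(r[1]))

def bnullable(a):
    tag = a[0]
    if tag in ('AZERO', 'ACHAR'): return False
    if tag in ('AONE', 'ASTAR'): return True
    if tag == 'AALT': return bnullable(a[2]) or bnullable(a[3])
    if tag == 'ASEQ': return bnullable(a[2]) and bnullable(a[3])

def bmkeps(a):
    # collect the bits of the empty-string match
    tag = a[0]
    if tag == 'AONE': return a[1]
    if tag == 'AALT':
        return a[1] + (bmkeps(a[2]) if bnullable(a[2]) else bmkeps(a[3]))
    if tag == 'ASEQ': return a[1] + bmkeps(a[2]) + bmkeps(a[3])
    if tag == 'ASTAR': return a[1] + ['S']

def bder(c, a):
    tag = a[0]
    if tag in ('AZERO', 'AONE'): return ('AZERO',)
    if tag == 'ACHAR': return ('AONE', a[1]) if a[2] == c else ('AZERO',)
    if tag == 'AALT': return ('AALT', a[1], bder(c, a[2]), bder(c, a[3]))
    if tag == 'ASEQ':
        if bnullable(a[2]):
            return ('AALT', a[1], ('ASEQ', [], bder(c, a[2]), a[3]),
                    fuse(bmkeps(a[2]), bder(c, a[3])))
        return ('ASEQ', a[1], bder(c, a[2]), a[3])
    if tag == 'ASTAR':
        return ('ASEQ', a[1], fuse(['Z'], bder(c, a[2])), ('ASTAR', [], a[2]))

def decode1(bs, r):
    tag = r[0]
    if tag == '1': return ('Empty',), bs
    if tag == 'c': return ('Chr', r[1]), bs
    if tag == '+':
        v, rest = decode1(bs[1:], r[1] if bs[0] == 'Z' else r[2])
        return (('Left', v) if bs[0] == 'Z' else ('Right', v)), rest
    if tag == '.':
        v1, bs1 = decode1(bs, r[1])
        v2, bs2 = decode1(bs1, r[2])
        return ('Seq', v1, v2), bs2
    if tag == '*':
        if bs[0] == 'S': return ('Stars', []), bs[1:]
        v, bs1 = decode1(bs[1:], r[1])
        vs, bs2 = decode1(bs1, r)
        return ('Stars', [v] + vs[1]), bs2

def blexer(r, s):
    a = internalise(r)
    for c in s:
        a = bder(c, a)
    if not bnullable(a): return None
    v, rest = decode1(bmkeps(a), r)
    return v if rest == [] else None
```
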
The main point of the bitsequences and annotated regular expressions
is that we can apply rather aggressive (in terms of size)