cst_tests: comparison ninems/ninems.tex

equal deleted inserted replaced

-:97c9ac95194d
+:17d8e7599a01
 ``fire''---so is it an identifier or a keyword?  While in applications
 there is a well-known strategy to decide these questions, called POSIX
 matching, only relatively recently precise definitions of what POSIX
 matching actually means have been formalised
 \cite{AusafDyckhoffUrban2016,OkuiSuzuki2010,Vansummeren2006}. Roughly,
-POSIX matching means matching the longest initial substring and
+POSIX matching means matching the longest initial substring.
-in the case of a tie, the initial submatch is chosen according to some priorities attached to the
+In the case of a tie, the initial submatch is chosen according to some priorities attached to the
 regular expressions (e.g.~keywords have a higher priority than
 identifiers). This sounds rather simple, but according to Grathwohl et
 al \cite[Page 36]{CrashCourse2014} this is not the case. They wrote:
 \begin{quote}
 \end{tabular}
 \end{center}
 \noindent
-%Assuming the classic notion of a
-%\emph{language} of a regular expression, written $L(\_)$, t
-The main
-property of the derivative operation is that
-\begin{center}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
-\end{center}
-\noindent
-So if we want to find out whether a string $s$
-matches with a regular expression $r$, build the derivatives of $r$
-w.r.t.\ (in succession) all the characters of the string $s$. Finally,
-test whether the resulting regular expression can match the empty
-string.  If yes, then $r$ matches $s$, and no in the negative
-case.\\
-We can generalise the derivative operation for strings like this:
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
-$r \backslash \epsilon $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
 For us the main advantage is that derivatives can be
 straightforwardly implemented in any functional programming language,
 and are easily definable and reasoned about in theorem provers---the
 definitions just consist of inductive datatypes and simple recursive
 functions. Moreover, the notion of derivatives can be easily
 the not-regular expression, written $\neg\,r$, or bounded repetitions
 (for example $r^{\{n\}}$ and $r^{\{n..m\}}$), which cannot be so
 straightforwardly realised within the classic automata approach.
-We obtain a simple and elegant regular
+%Assuming the classic notion of a
+%\emph{language} of a regular expression, written $L(\_)$, t
+The main
+property of the derivative operation is that
+\begin{center}
+$c\!::\!s \in L(r)$ holds
+if and only if $s \in L(r\backslash c)$.
+\end{center}
+\noindent
+So if we want to find out whether a string $s$
+matches with a regular expression $r$, build the derivatives of $r$
+w.r.t.\ (in succession) all the characters of the string $s$. Finally,
+test whether the resulting regular expression can match the empty
+string.  If yes, then $r$ matches $s$, and no in the negative
+case.
+We can generalise the derivative operation for strings like this:
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
+$r \backslash \epsilon $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+\noindent
+Using the above definition we obtain a simple and elegant regular
 expression matching algorithm:
-\begin{definition}{matcher}
 \[
 match\;s\;r \;\dn\; nullable(r\backslash s)
 \]
-\end{definition}
 One limitation, however, of Brzozowski's algorithm is that it only
 produces a YES/NO answer for whether a string is being matched by a
 regular expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this
 algorithm to allow generation of an actual matching, called a
 \emph{value}.
 	\end{tabular}
 \end{center}
 \noindent
 Here we put the regular expression and values of the same shape on the same level to illustrate the corresponding relation between them.
 The flatten notation $| v |$ means extracting the characters in the value $v$ to form a string. For example, $|\mathit{Seq} \, \mathit{Char(c)} \, \mathit{Char(d)}|$ = $cd$. We omit this straightforward definition.
 Values are a way of expressing parse trees(the tree structure that tells how a sub-regex matches a substring). For example, $\Seq\,v_1\, v_2$ tells us how the string $|v_1| \cdot |v_2|$ matches the regex $r_1 \cdot r_2$: $r_1$ matches $|v_1|$ and $r_2$ matches $|v_2|$. Exactly how these two are matched are contained in the sub-structure of $v_1$ and $v_2$.
 To give a concrete example of how value works, consider the string $xy$ and the
 regular expression $(x + (y + xy))^*$. We can view this regular
 \end{tikzcd}
 \begin{tikzcd}
 r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n
 \end{tikzcd}
 We shall briefly explain this interesting process.\\ For the convenience of explanation, we have the following notations: the regular expression $r$ used for matching is also called $r_0$ and the string $s$ is composed of $n$ characters $c_0 c_1 ... c_{n-1}$.
-First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for the empty word epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
+First, we do the derivative operation on $r_0$, $r_1$, ..., using characters $c_0$, $c_1$, ...  until we get the final derivative $r_n$.We test whether it is $nullable$ or not. If no we know immediately the string does not match the regex. If yes, we start building the parse tree incrementally. We first call $mkeps$(which stands for make epsilon--make the parse tree for how the empty word matched the empty regular expression epsilon) to construct the parse tree $v_n$ for how the final derivative $r_n$ matches the empty string:
 After this, we inject back the characters one by one, in reverse order as they were chopped off, to build the parse tree $v_i$ for how the regex $r_i$ matches the string $s_i$($s_i$ means the string s with the first $i$ characters being chopped off) from the previous parse tree. After $n$ transformations, we get the parse tree for how $r_0$ matches $s$, exactly as we wanted.
 An inductive proof can be routinely established.
 We omit the details of injection function, which is provided by Sulzmann and Lu's paper \cite{Sulzmann2014}. Rather, we shall focus next on the
 process of simplification of regular expressions, which is needed in
 \end{tabular}
 \end{center}
 where $\Z$ and $\S$ are arbitrary names for the bits in the
 bitsequences.
 Here code encodes a value into a bitsequence by converting Left into $\Z$, Right into $\S$, the start point of a non-empty star iteration into $\S$, and the border where a local star terminates into $\Z$. This conversion is apparently lossy, as it throws away the character information, and does not decode the boundary between the two operands of the sequence constructor. Moreover, with only the bitcode we cannot even tell whether the $\S$s and $\Z$s are for $Left/Right$ or $Stars$. The reason for choosing this compact way of storing information is that the relatively small size of bits can be easily moved around during the lexing process. In order to recover the bitcode back into values, we will need the regular expression as the extra information and decode them back into value:\\
-\begin{definition}[Bitdecoding of Values]\mbox{}
+%\begin{definition}[Bitdecoding of Values]\mbox{}
 \begin{center}
 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
 $\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
 \textit{else}\;\textit{None}$
 \end{tabular}
 \end{center}
-\end{definition}
+%\end{definition}
 To do lexing using annotated regular expressions, we shall first transform the
 usual (un-annotated) regular expressions into annotated regular
 expressions:\\
-\begin{definition}
+%\begin{definition}
 \begin{center}
 \begin{tabular}{lcl}
 $(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
 $(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
 $(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
 $\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
 $(r^*)^\uparrow$ & $\dn$ &
 $\textit{STAR}\;[]\,r^\uparrow$\\
 \end{tabular}
 \end{center}
-\end{definition}
+%\end{definition}
 Then we do successive derivative operations on the annotated regular expression. This derivative operation is the same as what we previously have for the simple regular expressions, except that we take special care of the bits to store the parse tree information:\\
-\begin{definition}{bder}
+%\begin{definition}{bder}
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $(\textit{ZERO})\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{ONE}\;bs)\backslash c$ & $\dn$ & $\textit{ZERO}$\\
 $(\textit{CHAR}\;bs\,d)\backslash c$ & $\dn$ &
 $(\textit{STAR}\,bs\,a)\backslash c$ & $\dn$ &
 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\backslash c))\,
 (\textit{STAR}\,[]\,r)$
 \end{tabular}
 \end{center}
-\end{definition}
+%\end{definition}
 This way, we do not have to use an injection function and a second phase, but instead only need to collect the bits while running $mkeps$:
-\begin{definition}[\textit{bmkeps}]\mbox{}
+%\begin{definition}[\textit{bmkeps}]\mbox{}
 \begin{center}
 \begin{tabular}{lcl}
 $\textit{bmkeps}\,(\textit{ONE}\,bs)$ & $\dn$ & $bs$\\
 $\textit{bmkeps}\,(\textit{ALT}\,bs\,a_1\,a_2)$ & $\dn$ &
 $\textit{if}\;\textit{bnullable}\,a_1$\\
 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
 $\textit{bmkeps}\,(\textit{STAR}\,bs\,a)$ & $\dn$ &
 $bs \,@\, [\S]$
 \end{tabular}
 \end{center}
-\end{definition}
+%\end{definition}
-and then decode the bits using the regular expression. The whole process looks like this:\\
+and then decode the bits using the regular expression. After putting these pieces together, the whole process looks like this:\\
-r
+\begin{center}
-\\
+\begin{tabular}{lcl}
+$\textit{blexer}\;r\,s$ & $\dn$ &
+$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
+& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+& & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
 The main point of the bitsequences and annotated regular expressions
 is that we can apply rather aggressive (in terms of size)
 simplification rules in order to keep derivatives small.

changeset 37	17d8e7599a01
parent 36	97c9ac95194d
child 38	b5363c0dcd99