cst_tests: comparison etnms/etnms.tex

equal deleted inserted replaced

-:e0db3242d8b5
+:b1e365afa29c
 \newcommand{\ONE}{\mbox{\bf 1}}
 \def\erase{\textit{erase}}
 \def\bders{\textit{bders}}
 \def\lexer{\mathit{lexer}}
 \def\blexer{\textit{blexer}}
+\def\fuse{\textit{fuse}}
+\def\flatten{\textit{flatten}}
+\def\map{\textit{map}}
 \def\blexers{\mathit{blexer\_simp}}
 \def\simp{\mathit{simp}}
 \def\mkeps{\mathit{mkeps}}
 \def\bmkeps{\textit{bmkeps}}
 \def\inj{\mathit{inj}}
 formally established their correctness. We have also not yet looked
 at extended regular expressions, such as bounded repetitions,
 negation and back-references.
 \end{abstract}
+\section{Recapitulation of Concepts From the Last Report}
+\subsection{The Algorithm by Brzozowski based on Derivatives of Regular
+Expressions}
+Suppose (basic) regular expressions are given by the following grammar:
+\[			r ::=   \ZERO \mid  \ONE
+			 \mid  c
+			 \mid  r_1 \cdot r_2
+			 \mid  r_1 + r_2
+			 \mid r^*
+\]
+\noindent
+The ingenious contribution by Brzozowski is the notion of
+\emph{derivatives} of regular expressions.
+\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+	\end{center}
+\begin{center}
+\begin{tabular}{lcl}
+		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
+		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+		$d \backslash c$     & $\dn$ &
+		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
+	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+\end{tabular}
+\end{center}
+%Assuming the classic notion of a
+%\emph{language} of a regular expression, written $L(\_)$, t
+\noindent
+The main property of the derivative operation is that
+\begin{center}
+$c\!::\!s \in L(r)$ holds
+if and only if $s \in L(r\backslash c)$.
+\end{center}
+\noindent
+Now we can generalise the derivative operation to strings like this:
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
+$r \backslash [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+\noindent
+and then define as  regular-expression matching algorithm:
+\[
+match\;s\;r \;\dn\; nullable(r\backslash s)
+\]
+\noindent
+This algorithm looks graphically as follows:
+\begin{equation}\label{graph:*}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
+\end{tikzcd}
+\end{equation}
+\noindent
+where we start with  a regular expression  $r_0$, build successive
+derivatives until we exhaust the string and then use \textit{nullable}
+to test whether the result can match the empty string. It can  be
+relatively  easily shown that this matcher is correct  (that is given
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
+\subsection{Values and the Algorithm by Sulzmann and Lu}
+One limitation of Brzozowski's algorithm is that it only produces a
+YES/NO answer for whether a string is being matched by a regular
+expression.  Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
+to allow generation of an actual matching, called a \emph{value} or
+sometimes also \emph{lexical value}.  These values and regular
+expressions correspond to each other as illustrated in the following
+table:
+\begin{center}
+	\begin{tabular}{c@{\hspace{20mm}}c}
+		\begin{tabular}{@{}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
+			$r$ & $::=$  & $\ZERO$\\
+			& $\mid$ & $\ONE$   \\
+			& $\mid$ & $c$          \\
+			& $\mid$ & $r_1 \cdot r_2$\\
+			& $\mid$ & $r_1 + r_2$   \\
+			\\
+			& $\mid$ & $r^*$         \\
+		\end{tabular}
+		&
+		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
+			$v$ & $::=$  & \\
+			&        & $\Empty$   \\
+			& $\mid$ & $\Char(c)$          \\
+			& $\mid$ & $\Seq\,v_1\, v_2$\\
+			& $\mid$ & $\Left(v)$   \\
+			& $\mid$ & $\Right(v)$  \\
+			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
+		\end{tabular}
+	\end{tabular}
+\end{center}
+\noindent
+The contribution of Sulzmann and Lu is an extension of Brzozowski's
+algorithm by a second phase (the first phase being building successive
+derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
+is generated in case the regular expression matches  the string.
+Pictorially, the Sulzmann and Lu algorithm is as follows:
+\begin{ceqn}
+\begin{equation}\label{graph:2}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+\noindent
+For convenience, we shall employ the following notations: the regular
+expression we start with is $r_0$, and the given string $s$ is composed
+of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
+left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
+to the characters $c_0$, $c_1$  until we exhaust the string and obtain
+the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$ and no value needs to be generated. If yes, we start building the
+values incrementally by \emph{injecting} back the characters into the
+earlier values $v_n, \ldots, v_0$. This is the second phase of the
+algorithm from the right to left. For the first value $v_n$, we call the
+function $\textit{mkeps}$, which builds the lexical value
+for how the empty string has been matched by the (nullable) regular
+expression $r_n$. This function is defined as
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps(r_{1}+r_{2})$	& $\dn$
+			& \textit{if} $\nullable(r_{1})$\\
+			& & \textit{then} $\Left(\mkeps(r_{1}))$\\
+			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
+			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+		\end{tabular}
+	\end{center}
+\noindent
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. For this Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value. The
+definition of $\textit{inj}$ is as follows:
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+$\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+$\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+$\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+$\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+\noindent This definition is by recursion on the ``shape'' of regular
+expressions and values.
+\subsection{Simplification of Regular Expressions}
+The main drawback of building successive derivatives according
+to Brzozowski's definition is that they can grow very quickly in size.
+This is mainly due to the fact that the derivative operation generates
+often ``useless'' $\ZERO$s and $\ONE$s in derivatives.  As a result, if
+implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
+are excruciatingly slow. For example when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\textit{der}$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
+Fortunately, one can simplify regular expressions after each derivative
+step. Various simplifications of regular expressions are possible, such
+as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
+\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
+affect the answer for whether a regular expression matches a string or
+not, but fortunately also do not affect the POSIX strategy of how
+regular expressions match strings---although the latter is much harder
+to establish. Some initial results in this regard have been
+obtained in \cite{AusafDyckhoffUrban2016}.
+If we want the size of derivatives in Sulzmann and Lu's algorithm to
+stay below this bound, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + bc$ can be opened up to get $a\cdot c +  b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
+to achieve the same size bound as that of the partial derivatives.
+In order to implement the idea of ``spilling out alternatives'' and to
+make them compatible with the $\text{inj}$-mechanism, we use
+\emph{bitcodes}. Bits and bitcodes (lists of bits) are just:
+\begin{center}
+		$b ::=   S \mid  Z \qquad
+bs ::= [] \mid b:bs
+$
+\end{center}
+\noindent
+The $S$ and $Z$ are arbitrary names for the bits in order to avoid
+confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+bit-lists) can be used to encode values (or incomplete values) in a
+compact form. This can be straightforwardly seen in the following
+coding function from values to bitcodes:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{code}(\Empty)$ & $\dn$ & $[]$\\
+$\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
+$\textit{code}(\Left\,v)$ & $\dn$ & $\Z :: code(v)$\\
+$\textit{code}(\Right\,v)$ & $\dn$ & $\S :: code(v)$\\
+$\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
+$\textit{code}(\Stars\,[])$ & $\dn$ & $[\Z]$\\
+$\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $\S :: code(v) \;@\;
+code(\Stars\,vs)$
+\end{tabular}
+\end{center}
+\noindent
+Here $\textit{code}$ encodes a value into a bitcodes by converting
+$\Left$ into $\Z$, $\Right$ into $\S$, the start point of a non-empty
+star iteration into $\S$, and the border where a local star terminates
+into $\Z$. This coding is lossy, as it throws away the information about
+characters, and also does not encode the ``boundary'' between two
+sequence values. Moreover, with only the bitcode we cannot even tell
+whether the $\S$s and $\Z$s are for $\Left/\Right$ or $\Stars$. The
+reason for choosing this compact way of storing information is that the
+relatively small size of bits can be easily manipulated and ``moved
+around'' in a regular expression. In order to recover values, we will
+need the corresponding regular expression as an extra information. This
+means the decoding function is defined as:
+%\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
+$\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
+$\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
+$\textit{decode}'\,(\Z\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
+(\Left\,v, bs_1)$\\
+$\textit{decode}'\,(\S\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
+(\Right\,v, bs_1)$\\
+$\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
+$\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
+& &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
+& &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
+$\textit{decode}'\,(\Z\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+$\textit{decode}'\,(\S\!::\!bs)\,(r^*)$ & $\dn$ &
+$\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+& &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
+& &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
+$\textit{decode}\,bs\,r$ & $\dn$ &
+$\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
+& & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
+\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+%\end{definition}
+Sulzmann and Lu's integrated the bitcodes into regular expressions to
+create annotated regular expressions \cite{Sulzmann2014}.
+\emph{Annotated regular expressions} are defined by the following
+grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{a}$ & $::=$  & $\textit{ZERO}$\\
+& $\mid$ & $\textit{ONE}\;\;bs$\\
+& $\mid$ & $\textit{CHAR}\;\;bs\,c$\\
+& $\mid$ & $\textit{ALTS}\;\;bs\,as$\\
+& $\mid$ & $\textit{SEQ}\;\;bs\,a_1\,a_2$\\
+& $\mid$ & $\textit{STAR}\;\;bs\,a$
+\end{tabular}
+\end{center}
+%(in \textit{ALTS})
+\noindent
+where $bs$ stands for bitcodes, $a$  for $\bold{a}$nnotated regular
+expressions and $as$ for a list of annotated regular expressions.
+The alternative constructor($\textit{ALTS}$) has been generalized to
+accept a list of annotated regular expressions rather than just 2.
+We will show that these bitcodes encode information about
+the (POSIX) value that should be generated by the Sulzmann and Lu
+algorithm.
+To do lexing using annotated regular expressions, we shall first
+transform the usual (un-annotated) regular expressions into annotated
+regular expressions. This operation is called \emph{internalisation} and
+defined as follows:
+%\begin{definition}
+\begin{center}
+\begin{tabular}{lcl}
+$(\ZERO)^\uparrow$ & $\dn$ & $\textit{ZERO}$\\
+$(\ONE)^\uparrow$ & $\dn$ & $\textit{ONE}\,[]$\\
+$(c)^\uparrow$ & $\dn$ & $\textit{CHAR}\,[]\,c$\\
+$(r_1 + r_2)^\uparrow$ & $\dn$ &
+$\textit{ALTS}\;[]\,List((\textit{fuse}\,[\Z]\,r_1^\uparrow),\,
+(\textit{fuse}\,[\S]\,r_2^\uparrow))$\\
+$(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+$\textit{SEQ}\;[]\,r_1^\uparrow\,r_2^\uparrow$\\
+$(r^*)^\uparrow$ & $\dn$ &
+$\textit{STAR}\;[]\,r^\uparrow$\\
+\end{tabular}
+\end{center}
+%\end{definition}
+\noindent
+We use up arrows here to indicate that the basic un-annotated regular
+expressions are ``lifted up'' into something slightly more complex. In the
+fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
+attach bits to the front of an annotated regular expression. Its
+definition is as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{fuse}\;bs\,(\textit{ZERO})$ & $\dn$ & $\textit{ZERO}$\\
+$\textit{fuse}\;bs\,(\textit{ONE}\,bs')$ & $\dn$ &
+$\textit{ONE}\,(bs\,@\,bs')$\\
+$\textit{fuse}\;bs\,(\textit{CHAR}\,bs'\,c)$ & $\dn$ &
+$\textit{CHAR}\,(bs\,@\,bs')\,c$\\
+$\textit{fuse}\;bs\,(\textit{ALTS}\,bs'\,as)$ & $\dn$ &
+$\textit{ALTS}\,(bs\,@\,bs')\,as$\\
+$\textit{fuse}\;bs\,(\textit{SEQ}\,bs'\,a_1\,a_2)$ & $\dn$ &
+$\textit{SEQ}\,(bs\,@\,bs')\,a_1\,a_2$\\
+$\textit{fuse}\;bs\,(\textit{STAR}\,bs'\,a)$ & $\dn$ &
+$\textit{STAR}\,(bs\,@\,bs')\,a$
+\end{tabular}
+\end{center}
+\noindent
+After internalising the regular expression, we perform successive
+derivative operations on the annotated regular expressions. This
+derivative operation is the same as what we had previously for the
+basic regular expressions, except that we beed to take care of
+the bitcodes:
+%\begin{definition}{bder}
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
+$(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\
+$(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
+$\textit{if}\;c=d\; \;\textit{then}\;
+\textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\
+$(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
+$\textit{ALTS}\;bs\,(as.map(\backslash c))$\\
+$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+$\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
+& &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
+$(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
+$\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
+(\textit{STAR}\,[]\,r)$
+\end{tabular}
+\end{center}
+%\end{definition}
+\noindent
+For instance, when we unfold $\textit{STAR} \; bs \; a$ into a sequence,
+we need to attach an additional bit $Z$ to the front of $r \backslash c$
+to indicate that there is one more star iteration. Also the $SEQ$ clause
+is more subtle---when $a_1$ is $\textit{bnullable}$ (here
+\textit{bnullable} is exactly the same as $\textit{nullable}$, except
+that it is for annotated regular expressions, therefore we omit the
+definition). Assume that $bmkeps$ correctly extracts the bitcode for how
+$a_1$ matches the string prior to character $c$ (more on this later),
+then the right branch of $ALTS$, which is $fuse \; bmkeps \;  a_1 (a_2
+\backslash c)$ will collapse the regular expression $a_1$(as it has
+already been fully matched) and store the parsing information at the
+head of the regular expression $a_2 \backslash c$ by fusing to it. The
+bitsequence $bs$, which was initially attached to the head of $SEQ$, has
+now been elevated to the top-level of $ALTS$, as this information will be
+needed whichever way the $SEQ$ is matched---no matter whether $c$ belongs
+to $a_1$ or $ a_2$. After building these derivatives and maintaining all
+the lexing information, we complete the lexing by collecting the
+bitcodes using a generalised version of the $\textit{mkeps}$ function
+for annotated regular expressions, called $\textit{bmkeps}$:
+%\begin{definition}[\textit{bmkeps}]\mbox{}
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{bmkeps}\,(\textit{ONE}\;bs)$ & $\dn$ & $bs$\\
+$\textit{bmkeps}\,(\textit{ALTS}\;bs\,a::as)$ & $\dn$ &
+$\textit{if}\;\textit{bnullable}\,a$\\
+& &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
+& &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(\textit{ALTS}\;bs\,as)$\\
+$\textit{bmkeps}\,(\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ &
+$bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
+$\textit{bmkeps}\,(\textit{STAR}\;bs\,a)$ & $\dn$ &
+$bs \,@\, [\S]$
+\end{tabular}
+\end{center}
+%\end{definition}
+\noindent
+This function completes the value information by travelling along the
+path of the regular expression that corresponds to a POSIX value and
+collecting all the bitcodes, and using $S$ to indicate the end of star
+iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
+decode them, we get the value we expect. The corresponding lexing
+algorithm looks as follows:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{blexer}\;r\,s$ & $\dn$ &
+$\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
+& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+& & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+\noindent
+In this definition $\_\backslash s$ is the  generalisation  of the derivative
+operation from characters to strings (just like the derivatives for un-annotated
+regular expressions).
+The main point of the bitcodes and annotated regular expressions is that
+we can apply rather aggressive (in terms of size) simplification rules
+in order to keep derivatives small. We have developed such
+``aggressive'' simplification rules and generated test data that show
+that the expected bound can be achieved. Obviously we could only
+partially cover  the search space as there are infinitely many regular
+expressions and strings.
+One modification we introduced is to allow a list of annotated regular
+expressions in the \textit{ALTS} constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our  simplification function
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on.
+%Is it $ALTS$ or $ALTS$?}\\
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
+&&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
+&&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
+&&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+&&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  \textit{SEQ} \; bs \; a_1' \;  a_2'$ \\
+$\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
+&&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
+&&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
+&&$\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$\\
+$\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+\end{tabular}
+\end{center}
+\noindent
+The simplification does a pattern matching on the regular expression.
+When it detected that the regular expression is an alternative or
+sequence, it will try to simplify its children regular expressions
+recursively and then see if one of the children turn into $\ZERO$ or
+$\ONE$, which might trigger further simplification at the current level.
+The most involved part is the $\textit{ALTS}$ clause, where we use two
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
+$\textit{ALTS}$ and reduce as many duplicates as possible. Function
+$\textit{distinct}$  keeps the first occurring copy only and remove all later ones
+when detected duplicates. Function $\textit{flatten}$ opens up nested \textit{ALTS}.
+Its recursive definition is given below:
+\begin{center}
+\begin{tabular}{@{}lcl@{}}
+$\textit{flatten} \; (\textit{ALTS}\;bs\,as) :: as'$ & $\dn$ & $(\textit{map} \;
+(\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
+$\textit{flatten} \; \textit{ZERO} :: as'$ & $\dn$ & $ \textit{flatten} \;  as' $ \\
+$\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; as'$ \quad(otherwise)
+\end{tabular}
+\end{center}
+\noindent
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
+Suppose we apply simplification after each derivative step, and view
+these two operations as an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can use the previous natural
+extension from derivative w.r.t.~character to derivative
+w.r.t.~string:%\comment{simp in  the [] case?}
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
+$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+\noindent
+we obtain an optimised version of the algorithm:
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+$\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+& & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+& & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+& & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+\noindent
+This algorithm keeps the regular expression size small, for example,
+with this simplification our previous $(a + aa)^*$ example's 8000 nodes
+will be reduced to just 6 and stays constant, no matter how long the
+input string is.
 \section{Introduction}
 While we believe derivatives of regular expressions, written
 $r\backslash s$, are a beautiful concept (in terms of ease of
 implementing them in functional programming languages and in terms of
 This property allows one to rewrite an induction hypothesis like
 \begin{center}
 $ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
 \end{center}
+\subsubsection{Retrieve Function}
+The crucial point is to find the
+$\textit{POSIX}$  information of a regular expression and how it is modified,
+augmented and propagated
+during simplification in parallel with the regular expression that
+has not been simplified in the subsequent derivative operations.  To aid this,
+we use the helper function retrieve described by Sulzmann and Lu:
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
+$\textit{retrieve}\,(\textit{ONE}\,bs)\,\Empty$ & $\dn$ & $bs$\\
+$\textit{retrieve}\,(\textit{CHAR}\,bs\,c)\,(\Char\,d)$ & $\dn$ & $bs$\\
+$\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Left\,v)$ & $\dn$ &
+$bs \,@\, \textit{retrieve}\,a\,v$\\
+$\textit{retrieve}\,(\textit{ALTS}\,bs\,a::as)\,(\Right\,v)$ & $\dn$ &
+$bs \,@\, \textit{retrieve}\,(\textit{ALTS}\,bs\,as)\,v$\\
+$\textit{retrieve}\,(\textit{SEQ}\,bs\,a_1\,a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
+$bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
+$\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,[])$ & $\dn$ &
+$bs \,@\, [\S]$\\
+$\textit{retrieve}\,(\textit{STAR}\,bs\,a)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
+\multicolumn{3}{l}{
+\hspace{3cm}$bs \,@\, [\Z] \,@\, \textit{retrieve}\,a\,v\,@\,
+\textit{retrieve}\,(\textit{STAR}\,[]\,a)\,(\Stars\,vs)$}\\
+\end{tabular}
+\end{center}
+%\comment{Did not read further}\\
+This function assembles the bitcode
+%that corresponds to a lexical value for how
+%the current derivative matches the suffix of the string(the characters that
+%have not yet appeared, but will appear as the successive derivatives go on.
+%How do we get this "future" information? By the value $v$, which is
+%computed by a pass of the algorithm that uses
+%$inj$ as described in the previous section).
+using information from both the derivative regular expression and the
+value. Sulzmann and Lu poroposed this function, but did not prove
+anything about it. Ausaf and Urban used it to connect the bitcoded
+algorithm to the older algorithm by the following equation:
+\begin{center} $inj \;a\; c \; v = \textit{decode} \; (\textit{retrieve}\;
+	 (r^\uparrow)\backslash_{simp} \,c)\,v)$
+\end{center}
+\noindent
+whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
+and Urban also used this fact to prove  the correctness of bitcoded
+algorithm without simplification.  Our purpose of using this, however,
+is to establish
+\begin{center}
+$ \textit{retrieve} \;
+a \; v \;=\; \textit{retrieve}  \; (\textit{simp}\,a) \; v'.$
+\end{center}
+The idea is that using $v'$, a simplified version of $v$ that had gone
+through the same simplification step as $\textit{simp}(a)$, we are able
+to extract the bitcode that gives the same parsing information as the
+unsimplified one. However, we noticed that constructing such a  $v'$
+from $v$ is not so straightforward. The point of this is that  we might
+be able to finally bridge the gap by proving
 \noindent\rule[1.5ex]{\linewidth}{4pt}
 There is no mention of retrieve yet .... this is the second trick in the
 existing proof. I am not sure whether you need to explain annotated regular
 expressions much earlier - maybe before the ``existing proof'' section, or
 \end{center}
 \noindent
 $f$ rectifies $r\backslash s$ so the value $\mkeps(f(r\backslash s))$ becomes
 something simpler
 to make the retrieve function defined.\\
+\subsubsection{Ways to Rectify Value}
 One way to do this is to prove the following:
 \begin{center}
 $\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
 \end{center}
 \noindent
 $\erase$ in the above equality means to remove the bit-codes
 in an annotated regular expression and only keep the original
 regular expression(just like "erasing" the bits). Its definition is omitted.
 $\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
 are very closely related, but not identical.
+\subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}
 For example, let $r$ be the regular expression
 $(a+b)(a+a*)$ and $s$  be the string $aa$, then
 both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
 are $\ONE + a^*$. However, without $\erase$
 \begin{center}
 whereas
 \begin{center}
 $\simp(\rup\backslash  s)$ is equal to $(_{00}\ONE +_{011}a^*)$
 \end{center}
 \noindent
-For the sake of visual simplicity, we use numbers to denote the bits
+(For the sake of visual simplicity, we use numbers to denote the bits
 in bitcodes as we have previously defined for annotated
 regular expressions. $\S$ is replaced by
-subscript $_1$ and $\Z$ by $_0$.
+subscript $_1$ and $\Z$ by $_0$.)
-Two "rules" might be inferred from the above example.
+What makes the difference?
-First, after erasing the bits the two regular expressions
+%Two "rules" might be inferred from the above example.
-are exactly the same: both become $1+a^*$. Here the
-function $\simp$ exhibits the "one in the end equals many times
+%First, after erasing the bits the two regular expressions
-at the front"
+%are exactly the same: both become $1+a^*$. Here the
-property: one simplification in the end causes the
+%function $\simp$ exhibits the "one in the end equals many times
-same regular expression structure as
+%at the front"
-successive simplifications done alongside derivatives.
+%property: one simplification in the end causes the
-$\rup\backslash_{simp} \, s$ unfolds to
+%same regular expression structure as
-$\simp((\simp(r\backslash a))\backslash a)$
+%successive simplifications done alongside derivatives.
-and $\simp(\rup\backslash s)$ unfolds to
+%$\rup\backslash_{simp} \, s$ unfolds to
-$\simp((r\backslash a)\backslash a)$. The one simplification
+%$\simp((\simp(r\backslash a))\backslash a)$
-in the latter causes the resulting regular expression to
+%and $\simp(\rup\backslash s)$ unfolds to
-become $1+a^*$, exactly the same as the former with
+%$\simp((r\backslash a)\backslash a)$. The one simplification
-two simplifications.
+%in the latter causes the resulting regular expression to
+%become $1+a^*$, exactly the same as the former with
-Second, the bit-codes are different, but they are essentially
+%two simplifications.
-the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
-inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the
+%Second, the bit-codes are different, but they are essentially
-same as that of $\rup\backslash \, s$. And this difference
+%the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
-does not matter when we try to apply $\bmkeps$ or $\retrieve$
+%inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the
-to it. This seems a good news if we want to use $\retrieve$
+%same as that of $\rup\backslash \, s$. And this difference
-to prove things.
+%does not matter when we try to apply $\bmkeps$ or $\retrieve$
+%to it. This seems a good news if we want to use $\retrieve$
-If we look into the difference above, we could see that the
+%to prove things.
-difference is not fundamental: the bits are just being moved
-around in a way that does not hurt the correctness.
+%If we look into the difference above, we could see that the
+%difference is not fundamental: the bits are just being moved
+%around in a way that does not hurt the correctness.
 During the first derivative operation,
 $\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$  is
-in the form of a sequence regular expression with the first
+in the form of a sequence regular expression with
-part being nullable.
+two components, the first
+one $\ONE + \ZERO$ being nullable.
 Recall the simplification function definition:
 \begin{center}
 \begin{tabular}{@{}lcl@{}}
 $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
 \end{tabular}
 \end{center}
 \noindent
-If we call $\simp$ on $\rup$, just as $\backslash_{simp}$
+and the difinition of $\flatten$:
+\begin{center}
+\begin{tabular}{c c c}
+$\flatten \; []$ & $\dn$ & $[]$\\
+$\flatten \; \ZERO::rs$ & $\dn$ & $rs$\\
+$\flatten \;(_{\textit{bs}_1}\oplus \textit{rs}_1 ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
+$\flatten \; r :: rs$ & $\dn$ & $r::\flatten(rs)$
+\end{tabular}
+\end{center}
+\noindent
+If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$
 requires, then we would go throught the third clause of
 the sequence case:$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$.
-The $\ZERO$ of $(_0\ONE  + \ZERO)$ is simplified away and
+The $\ZERO$ of $(_0\ONE  + \ZERO)$ is thrown away
-$_0\ONE$ merged into $_0a  +  _1a^*$ by simply
+by $\flatten$ and
+$_0\ONE$ merged into $(_0a  +  _1a^*)$ by simply
 putting its bits($_0$) to the front of the second component:
 ${\bf_0}(_0a  +  _1a^*)$.
 After a second derivative operation,
 namely, $(_0(_0a  +  _1a^*))\backslash a$, we get
 $
 _0(_0 \ONE  +  _1(_1\ONE \cdot a^*))
 $, and this simplifies to $_0(_0 \ONE  +  _{11} a^*)$
 by the third clause of the alternative case:
 $\quad\textit{case} \;  as' \Rightarrow  \textit{ALTS}\;bs\;as'$.
-The outmost bit $_0$ remains unchanged and stays with
+The outmost bit $_0$ stays with
-the outmost regular expression. However, things are a bit
+the outmost regular expression, rather than being fused to
-different when it comes to $\simp(\rup\backslash \, s)$, because
+its child regular expressions, as what we will later see happens
-without simplification,  first term of the sequence
+to $\simp(\rup\backslash \, s)$.
-$\rup\backslash a=(_0\ONE  + \ZERO)(_0a  +  _1a^*)$
+If we choose to not simplify in the midst of derivative operations,
-is not merged into the second component
+but only do it at the end after the string has been exhausted,
-and is nullable.
+namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
+then at the {\bf second} derivative of
+$(\rup\backslash a)\bf{\backslash a}$
+we will go throught the clause of $\backslash$:
+\begin{center}
+\begin{tabular}{lcl}
+$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+$(when \; \textit{bnullable}\,a_1)$\\
+					       & &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
+\end{tabular}
+\end{center}
+because
+$\rup\backslash a = (_0\ONE  + \ZERO)(_0a  +  _1a^*)$
+is a sequence
+with the first component being nullable
+(unsimplified, unlike the first round of running$\backslash_{simp}$).
 Therefore $((_0\ONE  + \ZERO)(_0a  +  _1a^*))\backslash a$ splits into
 $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$.
 After these two successive derivatives without simplification,
 we apply $\simp$ to this regular expression, which goes through
-the alternative clause, and each component of $([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$ will be simplified, giving us the list:$[\ZERO, _0(_0\ONE  + _{11}a^*]$,this
+the alternative clause, and each component of
-list is then flattened--for
+$([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$
-$([(\ZERO + \ZERO)\cdot(_0a  +  _1a^*)]$ it will be simplified into $\ZERO$
+will be simplified, giving us the list:$[\ZERO, _0(_0\ONE  + _{11}a^*)]$
-and then thrown away by $\textit{flatten}$; $ _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$
+This list is then "flattened"--$\ZERO$ will be
-becomes  $ _{00}\ONE  + _{011}a^*]))$ because flatten opens up the alternative
+thrown away by $\textit{flatten}$; $ _0(_0\ONE  + _{11}a^*)$
-$\ONE + a^*$ and fuses the front bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$
+is opened up to make the list consisting of two separate elements
-and get $_{00}$ and $_{011}$.
+$_{00}\ONE$ and $_{011}a^*$, note that $flatten$
-%CONSTRUCTION SITE
+$\fuse$s the bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
+In a nutshell, the order of simplification causes
+the bits to be moved differently.
+\subsubsection{A Failed Attempt To Remedy the Problem Above}
+A simple class of regular expressions and strings
+pairs can be deduced
+from the above example which
+trigger the difference between
+$\rup\backslash_{simp} \, s$
+and  $\simp(\rup\backslash s)$:
+\[
+D = (c_1c_2, \{r_1\cdot r_2 \mid \text{$ c_1\in L(r_1), r_1 \; not \; \nullable, c_2 \in L(r_2),
+r_2 \backslash c_2$ is of shape alternative  and $c_1c_2 \notin L(r_1)$}\})
+\]
 and we can use this to construct a set of examples based
 on this type of behaviour of two operations:
 $r_1$
 that is to say, despite the bits are being moved around on the regular expression
 (difference in bits), the structure of the (unannotated)regular expression
 after one simplification is exactly the same after the
 same sequence of derivative operations
 regardless of whether we did simplification
 along the way.
+One way would be to give a function that calls
+%CONSTRUCTION SITE
+fuse is the culprit: it causes the order in which alternatives
+are opened up to be remembered and finally the difference
+appear in $\simp(\rup \backslash s)$ and $\rup \backslash{simp} \,s$.
+but we have to use them as they are essential in the simplification:
+flatten needs them.
 However, without erase the above equality does not hold:
 for the regular expression
 $(a+b)(a+a*)$,
 if we do derivative with respect to string $aa$,
 we get

changeset 107	b1e365afa29c
parent 106	e0db3242d8b5
child 108	0a0c551bb368