--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/etnms/20200205.tex Wed Feb 05 22:55:21 2020 +0000
@@ -0,0 +1,1836 @@
+\documentclass[a4paper,UKenglish]{lipics}
+\usepackage{graphicx}
+\usepackage{data}
+\usepackage{tikz-cd}
+\usepackage{tikz}
+\usetikzlibrary{graphs}
+\usetikzlibrary{graphdrawing}
+\usegdlibrary{trees}
+%\usepackage{algorithm}
+\usepackage{amsmath}
+\usepackage{xcolor}
+\usepackage[noend]{algpseudocode}
+\usepackage{enumitem}
+\usepackage{nccmath}
+\usepackage{soul}
+
+\definecolor{darkblue}{rgb}{0,0,0.6}
+\hypersetup{colorlinks=true,allcolors=darkblue}
+\newcommand{\comment}[1]%
+{{\color{red}$\Rightarrow$}\marginpar{\raggedright\small{\bf\color{red}#1}}}
+
+% \documentclass{article}
+%\usepackage[utf8]{inputenc}
+%\usepackage[english]{babel}
+%\usepackage{listings}
+% \usepackage{amsthm}
+%\usepackage{hyperref}
+% \usepackage[margin=0.5in]{geometry}
+%\usepackage{pmboxdraw}
+
+\title{POSIX Regular Expression Matching and Lexing}
+\author{Chengsong Tan}
+\affil{King's College London\\
+London, UK\\
+\texttt{chengsong.tan@kcl.ac.uk}}
+\authorrunning{Chengsong Tan}
+\Copyright{Chengsong Tan}
+
+\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
+\newcommand{\ZERO}{\mbox{\bf 0}}
+\newcommand{\ONE}{\mbox{\bf 1}}
+\def\erase{\textit{erase}}
+\def\bders{\textit{bders}}
+\def\lexer{\mathit{lexer}}
+\def\blexer{\textit{blexer}}
+\def\fuse{\textit{fuse}}
+\def\flatten{\textit{flatten}}
+\def\map{\textit{map}}
+\def\blexers{\mathit{blexer\_simp}}
+\def\simp{\mathit{simp}}
+\def\mkeps{\mathit{mkeps}}
+\def\bmkeps{\textit{bmkeps}}
+\def\inj{\mathit{inj}}
+\def\Empty{\mathit{Empty}}
+\def\Left{\mathit{Left}}
+\def\Right{\mathit{Right}}
+\def\Stars{\mathit{Stars}}
+\def\Char{\mathit{Char}}
+\def\Seq{\mathit{Seq}}
+\def\Der{\mathit{Der}}
+\def\nullable{\mathit{nullable}}
+\def\Z{\mathit{Z}}
+\def\S{\mathit{S}}
+\def\flex{\textit{flex}}
+\def\rup{r^\uparrow}
+\def\retrieve{\textit{retrieve}}
+\def\AALTS{\textit{AALTS}}
+\def\AONE{\textit{AONE}}
+%\theoremstyle{theorem}
+%\newtheorem{theorem}{Theorem}
+%\theoremstyle{lemma}
+%\newtheorem{lemma}{Lemma}
+%\newcommand{\lemmaautorefname}{Lemma}
+%\theoremstyle{definition}
+%\newtheorem{definition}{Definition}
+\algnewcommand\algorithmicswitch{\textbf{switch}}
+\algnewcommand\algorithmiccase{\textbf{case}}
+\algnewcommand\algorithmicassert{\texttt{assert}}
+\algnewcommand\Assert[1]{\State \algorithmicassert(#1)}%
+% New "environments"
+\algdef{SE}[SWITCH]{Switch}{EndSwitch}[1]{\algorithmicswitch\ #1\ \algorithmicdo}{\algorithmicend\ \algorithmicswitch}%
+\algdef{SE}[CASE]{Case}{EndCase}[1]{\algorithmiccase\ #1}{\algorithmicend\ \algorithmiccase}%
+\algtext*{EndSwitch}%
+\algtext*{EndCase}%
+
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+ Brzozowski introduced in 1964 a beautifully simple algorithm for
+ regular expression matching based on the notion of derivatives of
+ regular expressions. In 2014, Sulzmann and Lu extended this
+ algorithm to not just give a YES/NO answer for whether or not a
+  regular expression matches a string, but, in case it does, to also
+  answer with \emph{how} it matches the string. This is important for
+ applications such as lexing (tokenising a string). The problem is to
+ make the algorithm by Sulzmann and Lu fast on all inputs without
+ breaking its correctness. Being fast depends on a complete set of
+ simplification rules, some of which
+ have been put forward by Sulzmann and Lu. We have extended their
+ rules in order to obtain a tight bound on the size of regular expressions.
+ We have tested these extended rules, but have not
+ formally established their correctness. We have also not yet looked
+ at extended regular expressions, such as bounded repetitions,
+ negation and back-references.
+\end{abstract}
+
+\section{Recapitulation of Concepts From the Last Report}
+
+\subsection*{Regular Expressions and Derivatives}
+
+Suppose (basic) regular expressions are given by the following grammar:
+
+\[ r ::= \ZERO \mid \ONE
+ \mid c
+ \mid r_1 \cdot r_2
+ \mid r_1 + r_2
+ \mid r^*
+\]
+
+\noindent
+The ingenious contribution of Brzozowski is the notion of \emph{derivatives} of
+regular expressions, written~$\_ \backslash \_$. It uses the auxiliary notion of
+$\nullable$ defined below.
+
+\begin{center}
+ \begin{tabular}{lcl}
+ $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\
+ $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\
+ $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+ $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+ $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\
+ \end{tabular}
+ \end{center}
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\
+ $\ONE \backslash c$ & $\dn$ & $\ZERO$\\
+ $d \backslash c$ & $\dn$ &
+ $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
+ & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+ & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+ $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\
+\end{tabular}
+\end{center}
+
+%Assuming the classic notion of a
+%\emph{language} of a regular expression, written $L(\_)$, t
+
+\noindent
+The main property of the derivative operation is that
+
+\begin{center}
+$c\!::\!s \in L(r)$ holds
+if and only if $s \in L(r\backslash c)$.
+\end{center}
+
+\noindent
+We can generalise the derivative operation shown above for single characters
+to strings as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
+$r \backslash [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+and then define Brzozowski's regular-expression matching algorithm as:
+
+\[
+match\;s\;r \;\dn\; nullable(r\backslash s)
+\]
+
+\noindent
+Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
+this algorithm can be presented graphically as follows:
+
+\begin{equation}\label{graph:*}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
+\end{tikzcd}
+\end{equation}
+
+\noindent
+where we start with a regular expression $r_0$, build successive
+derivatives until we exhaust the string and then use \textit{nullable}
+to test whether the result can match the empty string. It can be
+relatively easily shown that this matcher is correct (that is, given
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
+
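+This matcher can be implemented almost verbatim in a functional
+language. The following Scala sketch is not taken from our actual
+code, but mirrors the definitions above one-to-one (the constructor
+names \texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALT},
+\texttt{SEQ} and \texttt{STAR} are our renderings of the grammar):
+
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                 // matches nothing
+case object ONE extends Rexp                  // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
+case class STAR(r: Rexp) extends Rexp
+
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// derivative w.r.t. a string: fold der over the characters
+def ders(s: List[Char], r: Rexp): Rexp = s match {
+  case Nil => r
+  case c :: cs => ders(cs, der(c, r))
+}
+
+def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
+\end{verbatim}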
+
+\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
+
+One limitation of Brzozowski's algorithm is that it only produces a
+YES/NO answer for whether a string is matched by a regular
+expression. Sulzmann and Lu~\cite{Sulzmann2014} extended this algorithm
+to allow generation of an actual matching, called a \emph{value} or
+sometimes also \emph{lexical value}. These values and regular
+expressions correspond to each other as illustrated in the following
+table:
+
+
+\begin{center}
+ \begin{tabular}{c@{\hspace{20mm}}c}
+ \begin{tabular}{@{}rrl@{}}
+ \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
+ $r$ & $::=$ & $\ZERO$\\
+ & $\mid$ & $\ONE$ \\
+ & $\mid$ & $c$ \\
+ & $\mid$ & $r_1 \cdot r_2$\\
+ & $\mid$ & $r_1 + r_2$ \\
+ \\
+ & $\mid$ & $r^*$ \\
+ \end{tabular}
+ &
+ \begin{tabular}{@{\hspace{0mm}}rrl@{}}
+ \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
+ $v$ & $::=$ & \\
+ & & $\Empty$ \\
+ & $\mid$ & $\Char(c)$ \\
+ & $\mid$ & $\Seq\,v_1\, v_2$\\
+ & $\mid$ & $\Left(v)$ \\
+ & $\mid$ & $\Right(v)$ \\
+ & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
+ \end{tabular}
+ \end{tabular}
+\end{center}
+
+\noindent
+The contribution of Sulzmann and Lu is an extension of Brzozowski's
+algorithm by a second phase (the first phase being building successive
+derivatives---see \eqref{graph:*}). In this second phase, a POSIX value
+is generated in case the regular expression matches the string.
+Pictorially, the Sulzmann and Lu algorithm is as follows:
+
+\begin{ceqn}
+\begin{equation}\label{graph:2}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed]
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+\noindent
+For convenience, we shall employ the following notations: the regular
+expression we start with is $r_0$, and the given string $s$ is composed
+of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the
+left to right, we build the derivatives $r_1$, $r_2$, \ldots according
+to the characters $c_0$, $c_1$ until we exhaust the string and obtain
+the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$ and no value needs to be generated. If yes, we start building the
+values incrementally by \emph{injecting} back the characters into the
+earlier values $v_n, \ldots, v_0$. This is the second phase of the
+algorithm from the right to left. For the first value $v_n$, we call the
+function $\textit{mkeps}$, which builds the lexical value
+for how the empty string has been matched by the (nullable) regular
+expression $r_n$. This function is defined as
+
+ \begin{center}
+ \begin{tabular}{lcl}
+ $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\
+ $\mkeps(r_{1}+r_{2})$ & $\dn$
+ & \textit{if} $\nullable(r_{1})$\\
+ & & \textit{then} $\Left(\mkeps(r_{1}))$\\
+ & & \textit{else} $\Right(\mkeps(r_{2}))$\\
+ $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+   $\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
+ \end{tabular}
+ \end{center}
+
+
+\noindent
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. For this Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first is the regular
+expression ${r_{i-1}}$ before the character is chopped off, the second
+is the character ${c_{i-1}}$ we want to inject, and the third is the
+value ${v_i}$ into which the character is injected (it corresponds to
+the regular expression after the character has been chopped off). The
+result of this function is a new value. The
+definition of $\textit{inj}$ is as follows:
+
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+  $\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent This definition is by recursion on the ``shape'' of regular
+expressions and values.
+
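+\noindent
+The following Scala sketch shows one way to render values, $\mkeps$
+and $\textit{inj}$; it reuses the \texttt{Rexp} datatype from the
+sketch above and is again only a close rendering of the definitions,
+not our verbatim code:
+
+\begin{verbatim}
+abstract class Val
+case object Empty extends Val
+case class Chr(c: Char) extends Val
+case class Sequ(v1: Val, v2: Val) extends Val
+case class Left(v: Val) extends Val
+case class Right(v: Val) extends Val
+case class Stars(vs: List[Val]) extends Val
+
+// value for how a nullable regular expression matches the empty string
+def mkeps(r: Rexp): Val = r match {
+  case ONE => Empty
+  case ALT(r1, r2) =>
+    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
+  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
+  case STAR(_) => Stars(Nil)
+}
+
+// inject the character c back into the value v
+def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
+  case (CHAR(d), Empty) => Chr(d)
+  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
+  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
+  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
+  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
+  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
+}
+\end{verbatim}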
+
+\subsection*{Simplification of Regular Expressions}
+
+The main drawback of building successive derivatives according
+to Brzozowski's definition is that they can grow very quickly in size.
+This is mainly due to the fact that the derivative operation generates
+often ``useless'' $\ZERO$s and $\ONE$s in derivatives. As a result, if
+implemented naively both algorithms by Brzozowski and by Sulzmann and Lu
+are excruciatingly slow. For example when starting with the regular
+expression $(a + aa)^*$ and building 12 successive derivatives
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree). Operations like
+$\textit{der}$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm.
+
+Fortunately, one can simplify regular expressions after each derivative
+step. Various simplifications of regular expressions are possible, such
+as the simplification of $\ZERO + r$, $r + \ZERO$, $\ONE\cdot r$, $r
+\cdot \ONE$, and $r + r$ to just $r$. These simplifications do not
+affect the answer for whether a regular expression matches a string or
+not, but fortunately also do not affect the POSIX strategy of how
+regular expressions match strings---although the latter is much harder
+to establish. Some initial results in this regard have been
+obtained in \cite{AusafDyckhoffUrban2016}.
+
+If we want the size of derivatives in Sulzmann and Lu's algorithm to
+stay below the size bound of Antimirov's partial derivatives, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
+to achieve the same size bound as that of the partial derivatives.
+
+In order to implement the idea of ``spilling out alternatives'' and to
+make it compatible with the $\textit{inj}$-mechanism, we use
+\emph{bitcodes}. They were first introduced by Sulzmann and Lu.
+Here bits and bitcodes (lists of bits) are defined as:
+
+\begin{center}
+ $b ::= 1 \mid 0 \qquad
+bs ::= [] \mid b:bs
+$
+\end{center}
+
+\noindent
+The $1$ and $0$ are not in bold in order to avoid
+confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+bit-lists) can be used to encode values (or incomplete values) in a
+compact form. This can be straightforwardly seen in the following
+coding function from values to bitcodes:
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
+ $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
+ $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
+ $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+ $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
+ $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
+ $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+ code(\Stars\,vs)$
+\end{tabular}
+\end{center}
+
+\noindent
+Here $\textit{code}$ encodes a value into a bitcode by converting
+$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
+star iteration by $1$. The border where a local star terminates
+is marked by $0$. This coding is lossy, as it throws away the information about
+characters, and also does not encode the ``boundary'' between two
+sequence values. Moreover, with only the bitcode we cannot even tell
+whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+reason for choosing this compact way of storing information is that the
+relatively small size of bits can be easily manipulated and ``moved
+around'' in a regular expression. In order to recover values, we will
+need the corresponding regular expression as extra information. This
+means the decoding function is defined as:
+
+
+%\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
+ $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
+ $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
+ $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+ $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
+ (\Left\,v, bs_1)$\\
+ $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+ $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
+ (\Right\,v, bs_1)$\\
+ $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
+ $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
+ & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
+ & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
+ $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+ $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ &
+ $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+ & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
+ & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
+
+ $\textit{decode}\,bs\,r$ & $\dn$ &
+ $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
+ & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
+ \textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+%\end{definition}
+
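+\noindent
+As a sanity check, the coding and decoding functions can be sketched
+in Scala as follows, reusing the datatypes from the sketches above
+(\texttt{decodeAux} plays the role of $\textit{decode}'$ and bits are
+represented as plain integers):
+
+\begin{verbatim}
+type Bits = List[Int]
+
+def code(v: Val): Bits = v match {
+  case Empty => Nil
+  case Chr(_) => Nil
+  case Left(v1) => 0 :: code(v1)
+  case Right(v1) => 1 :: code(v1)
+  case Sequ(v1, v2) => code(v1) ::: code(v2)
+  case Stars(Nil) => List(0)
+  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
+}
+
+// decode a bitcode back into a value, guided by the regex r
+def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (r, bs) match {
+  case (ONE, _) => (Empty, bs)
+  case (CHAR(c), _) => (Chr(c), bs)
+  case (ALT(r1, _), 0 :: rest) =>
+    val (v, bs1) = decodeAux(rest, r1); (Left(v), bs1)
+  case (ALT(_, r2), 1 :: rest) =>
+    val (v, bs1) = decodeAux(rest, r2); (Right(v), bs1)
+  case (SEQ(r1, r2), _) =>
+    val (v1, bs1) = decodeAux(bs, r1)
+    val (v2, bs2) = decodeAux(bs1, r2)
+    (Sequ(v1, v2), bs2)
+  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
+  case (STAR(r1), 1 :: rest) =>
+    val (v, bs1) = decodeAux(rest, r1)
+    val (Stars(vs), bs2) = decodeAux(bs1, STAR(r1))
+    (Stars(v :: vs), bs2)
+}
+
+def decode(bs: Bits, r: Rexp): Option[Val] = decodeAux(bs, r) match {
+  case (v, Nil) => Some(v)
+  case _ => None
+}
+\end{verbatim}
+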
+Sulzmann and Lu integrated the bitcodes into regular expressions to
+create annotated regular expressions \cite{Sulzmann2014}.
+\emph{Annotated regular expressions} are defined by the following
+grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$}
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{a}$ & $::=$ & $\ZERO$\\
+ & $\mid$ & $_{bs}\ONE$\\
+ & $\mid$ & $_{bs}{\bf c}$\\
+ & $\mid$ & $_{bs}\oplus\,as$\\
+ & $\mid$ & $_{bs}a_1\cdot a_2$\\
+ & $\mid$ & $_{bs}a^*$
+\end{tabular}
+\end{center}
+%(in \textit{ALTS})
+
+\noindent
+where $bs$ stands for bitcodes, $a$ for \textbf{a}nnotated regular
+expressions and $as$ for a list of annotated regular expressions.
+The alternative constructor ($\oplus$) has been generalised to
+accept a list of annotated regular expressions rather than just two.
+We will show that these bitcodes encode information about
+the (POSIX) value that should be generated by the Sulzmann and Lu
+algorithm.
+
+
+To do lexing using annotated regular expressions, we shall first
+transform the usual (un-annotated) regular expressions into annotated
+regular expressions. This operation is called \emph{internalisation} and
+defined as follows:
+
+%\begin{definition}
+\begin{center}
+\begin{tabular}{lcl}
+ $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
+ $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
+ $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
+ $(r_1 + r_2)^\uparrow$ & $\dn$ &
+ $_{[]}\oplus[\textit{fuse}\,[0]\,r_1^\uparrow,\,
+ \textit{fuse}\,[1]\,r_2^\uparrow]$\\
+ $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+ $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
+ $(r^*)^\uparrow$ & $\dn$ &
+ $_{[]}(r^\uparrow)^*$\\
+\end{tabular}
+\end{center}
+%\end{definition}
+
+\noindent
+We use up arrows here to indicate that the basic un-annotated regular
+expressions are ``lifted up'' into something slightly more complex. In the
+fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
+attach bits to the front of an annotated regular expression. Its
+definition is as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+ $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+ $_{bs @ bs'}\ONE$\\
+ $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+ $_{bs@bs'}{\bf c}$\\
+ $\textit{fuse}\;bs\,_{bs'}\oplus\textit{as}$ & $\dn$ &
+ $_{bs@bs'}\oplus\textit{as}$\\
+ $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+ $_{bs@bs'}a_1 \cdot a_2$\\
+ $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+ $_{bs @ bs'}a^*$
+\end{tabular}
+\end{center}
+
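+\noindent
+In Scala, annotated regular expressions, $\textit{fuse}$ and
+internalisation can be sketched as follows (again only a rendering of
+the definitions above, with \texttt{Bits} and \texttt{Rexp} as in the
+earlier sketches):
+
+\begin{verbatim}
+abstract class ARexp
+case object AZERO extends ARexp
+case class AONE(bs: Bits) extends ARexp
+case class ACHAR(bs: Bits, c: Char) extends ARexp
+case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
+case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
+case class ASTAR(bs: Bits, a: ARexp) extends ARexp
+
+// attach bits to the front of an annotated regular expression
+def fuse(bs: Bits, a: ARexp): ARexp = a match {
+  case AZERO => AZERO
+  case AONE(bs1) => AONE(bs ::: bs1)
+  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
+  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
+  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
+  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
+}
+
+// lift an un-annotated regular expression (the up-arrow operation)
+def internalise(r: Rexp): ARexp = r match {
+  case ZERO => AZERO
+  case ONE => AONE(Nil)
+  case CHAR(c) => ACHAR(Nil, c)
+  case ALT(r1, r2) =>
+    AALTS(Nil, List(fuse(List(0), internalise(r1)),
+                    fuse(List(1), internalise(r2))))
+  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
+  case STAR(r1) => ASTAR(Nil, internalise(r1))
+}
+\end{verbatim}
+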
+\noindent
+After internalising the regular expression, we perform successive
+derivative operations on the annotated regular expressions. This
+derivative operation is the same as what we had previously for the
+basic regular expressions, except that we need to take care of
+the bitcodes:
+
+
+
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+ $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\
+ $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\
+ $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+ $\textit{if}\;c=d\; \;\textit{then}\;
+ _{bs}\ONE\;\textit{else}\;\ZERO$\\
+ $(_{bs}\oplus \;\textit{as})\,\backslash c$ & $\dn$ &
+ $_{bs}\oplus\;(\textit{as.map}(\backslash c))$\\
+ $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+ $\textit{if}\;\textit{bnullable}\,a_1$\\
+ & &$\textit{then}\;_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+ & &$\phantom{\textit{then},\;_{bs}\oplus\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+ & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
+       (_{[]}a^*)$
+\end{tabular}
+\end{center}
+
+%\end{definition}
+\noindent
+For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
+we need to unfold it into a sequence,
+and attach an additional bit $0$ to the front of $a \backslash c$
+to indicate that there is one more star iteration. The sequence clause
+is more subtle: the derivative splits into an alternative when $a_1$ is
+$\textit{bnullable}$ (here \textit{bnullable} is exactly the same as
+$\textit{nullable}$, except that it applies to annotated regular
+expressions; we therefore omit its definition). Assuming that
+$\textit{bmkeps}$ correctly extracts the bitcode for how
+$a_1$ matches the string prior to the character $c$ (more on this later),
+the right branch of the alternative, namely $\textit{fuse} \; (\bmkeps \; a_1)\, (a_2
+\backslash c)$, collapses the regular expression $a_1$ (as it has
+already been fully matched) and stores the lexing information at the
+head of the regular expression $a_2 \backslash c$ by fusing it there. The
+bitsequence $\textit{bs}$, which was initially attached to the
+first element of the sequence $a_1 \cdot a_2$, is
+now elevated to the top level of $\oplus$, as this information is
+needed whichever way the sequence is matched---no matter whether $c$ belongs
+to $a_1$ or $a_2$. After building these derivatives and maintaining all
+the lexing information, we complete the lexing by collecting the
+bitcodes using a generalised version of the $\textit{mkeps}$ function
+for annotated regular expressions, called $\textit{bmkeps}$:
+
+
+%\begin{definition}[\textit{bmkeps}]\mbox{}
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
+ $\textit{bmkeps}\,(_{bs}\oplus a::\textit{as})$ & $\dn$ &
+ $\textit{if}\;\textit{bnullable}\,a$\\
+ & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
+  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\oplus \textit{as})$\\
+ $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
+ $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
+ $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
+ $bs \,@\, [0]$
+\end{tabular}
+\end{center}
+%\end{definition}
+
+\noindent
+This function completes the value information by travelling along the
+path of the regular expression that corresponds to a POSIX value,
+collecting all the bitcodes and using $0$ to indicate the end of star
+iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
+decode them, we get the value we expect. The corresponding lexing
+algorithm looks as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{blexer}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
+ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+In this definition $\_\backslash s$ is the generalisation of the derivative
+operation from characters to strings (just like the derivatives for un-annotated
+regular expressions).
+
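+\noindent
+A Scala sketch of this phase of the algorithm is given below; it is
+not our verbatim code, but builds on the \texttt{ARexp} sketch above
+and follows the definitions in this section:
+
+\begin{verbatim}
+def bnullable(a: ARexp): Boolean = a match {
+  case AZERO => false
+  case AONE(_) => true
+  case ACHAR(_, _) => false
+  case AALTS(_, as) => as.exists(bnullable)
+  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
+  case ASTAR(_, _) => true
+}
+
+// collect the bitcode along the POSIX path of a bnullable regex
+def bmkeps(a: ARexp): Bits = a match {
+  case AONE(bs) => bs
+  case AALTS(bs, a1 :: as) =>
+    if (bnullable(a1)) bs ::: bmkeps(a1) else bmkeps(AALTS(bs, as))
+  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
+  case ASTAR(bs, _) => bs ::: List(0)
+}
+
+def bder(c: Char, a: ARexp): ARexp = a match {
+  case AZERO => AZERO
+  case AONE(_) => AZERO
+  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
+  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
+  case ASEQ(bs, a1, a2) =>
+    if (bnullable(a1))
+      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
+                     fuse(bmkeps(a1), bder(c, a2))))
+    else ASEQ(bs, bder(c, a1), a2)
+  case ASTAR(bs, a1) =>
+    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
+}
+
+def bders(s: List[Char], a: ARexp): ARexp = s match {
+  case Nil => a
+  case c :: cs => bders(cs, bder(c, a))
+}
+
+def blexer(r: Rexp, s: String): Option[Val] = {
+  val a = bders(s.toList, internalise(r))
+  if (bnullable(a)) decode(bmkeps(a), r) else None
+}
+\end{verbatim}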
+
+\subsection*{Our Simplification Rules}
+
+The main point of the bitcodes and annotated regular expressions is that
+we can apply rather aggressive (in terms of size) simplification rules
+in order to keep derivatives small. We have developed such
+``aggressive'' simplification rules and generated test data that show
+that the expected bound can be achieved. Obviously we could only
+partially cover the search space as there are infinitely many regular
+expressions and strings.
+
+One modification we introduced is to allow a list of annotated regular
+expressions in the $\oplus$ constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our simplification function
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on.
+%Is it $ALTS$ or $ALTS$?}\\
+
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+
+ $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
+ &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
+ &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
+ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
+
+ $\textit{simp} \; (_{bs}\oplus \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
+ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ &&$\quad\textit{case} \; as' \Rightarrow _{bs}\oplus \textit{as'}$\\
+
+ $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+\end{tabular}
+\end{center}
+
+\noindent
+The simplification function pattern-matches on the regular expression.
+When it detects that the regular expression is an alternative or a
+sequence, it simplifies its children regular expressions
+recursively and then checks whether one of the children has turned into $\ZERO$ or
+$\ONE$, which might trigger further simplification at the current level.
+The most involved part is the $\oplus$ clause, where we use two
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
+alternatives and remove as many duplicates as possible. The function
+$\textit{distinct}$ keeps only the first occurring copy of a regular expression
+and removes all later duplicates. The function $\textit{flatten}$ opens up nested $\oplus$s.
+Its recursive definition is given below:
+
+ \begin{center}
+ \begin{tabular}{@{}lcl@{}}
+  $\textit{flatten} \; []$ & $\dn$ & $[]$\\
+  $\textit{flatten} \; ((_{bs}\oplus \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
+     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; \textit{as'} $ \\
+  $\textit{flatten} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\
+    $\textit{flatten} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise)
+\end{tabular}
+\end{center}
+
+\noindent
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
+
+Suppose we apply simplification after each derivative step, and view
+these two operations as an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can use the previous natural
+extension from derivative w.r.t.~character to derivative
+w.r.t.~string:%\comment{simp in the [] case?}
+
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
+$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+With this we obtain an optimised version of the algorithm:
+
+ \begin{center}
+\begin{tabular}{lcl}
+ $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\
+ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+This algorithm keeps the regular expression size small: for example,
+with this simplification the derivative of our previous $(a + aa)^*$ example
+is reduced from more than 8000 nodes to just 6, and the size stays constant
+no matter how long the input string is.
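+
+\noindent
+The following Scala sketch renders these simplification rules,
+building on the earlier sketches. A few deviations are worth
+flagging: the bits sitting on a simplified $\ONE$ are fused as well,
+the symmetric $a \cdot \ONE$ clause is omitted for brevity, and
+duplicates are removed with Scala's structural \texttt{distinct}
+(a choice made purely for presentation purposes):
+
+\begin{verbatim}
+// open up nested alternatives, dropping ZEROs and fusing bits
+def flatten(as: List[ARexp]): List[ARexp] = as match {
+  case Nil => Nil
+  case AZERO :: rest => flatten(rest)
+  case AALTS(bs1, as1) :: rest => as1.map(fuse(bs1, _)) ::: flatten(rest)
+  case a :: rest => a :: flatten(rest)
+}
+
+def simp(a: ARexp): ARexp = a match {
+  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
+    case (AZERO, _) => AZERO
+    case (_, AZERO) => AZERO
+    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
+    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
+  }
+  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
+    case Nil => AZERO
+    case a1 :: Nil => fuse(bs, a1)
+    case as1 => AALTS(bs, as1)
+  }
+  case _ => a
+}
+
+// derivative w.r.t. a string, simplifying after each step
+def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
+  case Nil => a
+  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
+}
+
+def blexerSimp(r: Rexp, s: String): Option[Val] = {
+  val a = bdersSimp(s.toList, internalise(r))
+  if (bnullable(a)) decode(bmkeps(a), r) else None
+}
+\end{verbatim}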
+
+
+
+\section{Introduction}
+
+While we believe derivatives of regular expressions, written
+$r\backslash s$, are a beautiful concept (in terms of ease of
+implementing them in functional programming languages and in terms of
+reasoning about them formally), they have one major drawback: every
+derivative step can make regular expressions grow drastically in
+size. This in turn has a negative effect on the runtime of the
+corresponding lexing algorithms. Consider for example the regular
+expression $(a+aa)^*$ and the short string $aaaaaaaaaaaa$. The
+corresponding derivative already contains 8668 nodes, where we assume
+the derivative is given as a tree. The reason for the poor runtime of
+the derivative-based lexing algorithms is that they need to traverse
+such trees over and over again. The solution is to find a complete set
+of simplification rules that keep the sizes of derivatives uniformly
+small.
+
+For reasons beyond this report, it turns out that a complete set of
+simplification rules depends on values being encoded as
+bitsequences.\footnote{Values are the results the lexing algorithms
+ generate; they encode how a regular expression matched a string.} We
+already know that the lexing algorithm using bitsequences but
+\emph{without} simplification is correct, albeit horribly
+slow. Therefore for the past 6 months I have been trying to prove that the
+algorithm using bitsequences plus our simplification rules is
+also correct. Formally this amounts to showing that
+
+\begin{equation}\label{mainthm}
+\blexers \; r \; s = \blexer \;r\;s
+\end{equation}
+
+\noindent
+whereby $\blexers$ simplifies (makes derivatives smaller) in each
+step, whereas with $\blexer$ the size can grow exponentially. This
+would be an important milestone for my thesis, because we already
+have a very good idea how to establish that our set of simplification
+rules keeps the size of derivatives below a relatively tight bound.
+
+In order to prove the main theorem in \eqref{mainthm}, we need to prove that the
+two functions produce the same output. The definitions of these two functions
+are shown below.
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\textit{blexer}\;r\,s$ & $\dn$ &
+ $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\
+ & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+ & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\begin{center}
+\begin{tabular}{lcl}
+ $\blexers \; r \, s$ &$\dn$ &
+ $\textit{let} \; a = (r^\uparrow)\backslash_{simp}\, s\; \textit{in}$\\
+ & & $\; \; \textit{if} \; \textit{bnullable}(a)$\\
+ & & $\; \; \textit{then} \; \textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+ & & $\;\; \textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+\noindent
+In these definitions $(r^\uparrow)$ is the internalisation (coding) function,
+which is the same in both cases; similarly $\textit{decode}$ and $\textit{bmkeps}$
+are the same in both cases. Our main
+theorem~\eqref{mainthm} therefore boils down to proving the following
+two propositions (depending on which branch the if-else clause
+takes). They establish that the derivatives \emph{with} simplification
+do not change the computed result:
+
+\begin{itemize}
+\item[(a)] If a string $s$ is in the language $L(r)$, then \\
+$\textit{bmkeps}\, ((r^\uparrow)\backslash_{simp}\,s) = \textit{bmkeps}\, ((r^\uparrow)\backslash s)$;
+\item[(b)] if a string $s$ is \emph{not} in the language $L(r)$, then
+$\rup \backslash_{simp} \,s$ is not nullable.
+\end{itemize}
+
+\noindent
+We have already proved the second part in Isabelle. This is actually
+not too difficult because we can show that simplification does not
+change the language of simplified regular expressions.
+
+If we can prove the first part, that is the bitsequence algorithm with
+simplification produces the same result as the one without
+simplification, then we are done. Unfortunately that part requires
+more effort, because simplification does not only need to \emph{not}
+change the language, but also not change the value (that is the
+computed result).
+
+
+\subsubsection*{Existing Proof}
+
+For this we have started by looking at the original proof that
+established that the bitsequence algorithm produces the same result
+as the algorithm not using bitsequences. Formally this proof
+established
+
+\begin{equation}\label{lexer}
+\blexer \; (r^\uparrow) s = \lexer \;r \;s
+\end{equation}
+
+%\noindent
+%might provide us insight into proving
+%\begin{center}
+%$\blexer \; r^\uparrow \;s = \blexers \; r^\uparrow \;s$
+%\end{center}
+
+\noindent
+The proof uses two ``tricks''. One is that it uses a \flex-function
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{flex} \;r\; f\; (c\!::\!s) $ & $\dn$ & $\textit{flex} \; (r\backslash c) \;(\lambda v. f (inj \; r \; c \; v)) \;s$ \\
+$\textit{flex} \;r\; f\; [\,] $ & $\dn$ & $f$
+\end{tabular}
+\end{center}
+
+\noindent
+and then proves for the right-hand side in \eqref{lexer} that
+
+\begin{center}
+$\lexer \;r\; s = \flex \;r\;\textit{id}\; s \;(\mkeps \; (r\backslash s))$
+\end{center}
+
+\noindent
+The $\flex$-function essentially does lexing by
+stacking up injection functions while doing derivatives,
+explicitly showing the order in which the characters are
+injected back in each step.
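+
+\noindent
+As a small sketch in the same Scala style as before, $\flex$ and this
+alternative formulation of $\lexer$ can be rendered as:
+
+\begin{verbatim}
+// stack up injections while taking derivatives
+def flex(r: Rexp, f: Val => Val, s: List[Char]): Val => Val = s match {
+  case Nil => f
+  case c :: cs => flex(der(c, r), (v: Val) => f(inj(r, c, v)), cs)
+}
+
+// assumes s is in L(r), otherwise mkeps is not defined
+def lexer(r: Rexp, s: String): Val =
+  flex(r, (v: Val) => v, s.toList)(mkeps(ders(s.toList, r)))
+\end{verbatim}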
+
+%\noindent
+%$\flex$ focuses on the injections instead of the derivatives ,
+%compared to the original definition of $\lexer$, which puts equal
+%amount of emphasis on injection and derivative with respect to each
+%character:
+
+%\begin{center}
+%\begin{tabular}{lcl}
+%$\textit{lexer} \; r\; (c\!::\!s) $ & $\dn$ & $\textit{case} \; \lexer \; (r\backslash c) \;s \; %\textit{of}$ \\
+% & & $\textit{None} \; \Longrightarrow \; \textit{None}$\\
+% & & $\textbar \; v \; \Longrightarrow \; \inj \; r\;c\;v$\\
+%$\textit{lexer} \; r\; [\,] $ & $\dn$ & $\textit{if} \; \nullable (r) \; \textit{then} \; \mkeps% (r) \; \textit{else} \;None$
+%\end{tabular}
+%\end{center}
+
+\noindent
+The crux in the existing proof is how $\flex$ relates to injection, namely
+
+\begin{center}
+$\flex \; r \; id \; (s@[c]) \; v = \flex \; r \; id \; s \; (\inj \; (r\backslash s) \; c\; v)$.
+\end{center}
+
+\noindent
+This property allows one to rewrite an induction hypothesis like
+
+\begin{center}
+$ \flex \; r\; id\; (s@[c])\; v = \textit{decode} \;( \textit{retrieve}\; (\rup \backslash s) \; (\inj \; (r\backslash s) \;c\;v)\;) r$
+\end{center}
+
+
+\subsubsection{Retrieve Function}
+The crucial point is to find the
+$\textit{POSIX}$ information of a regular expression and to track how it is modified,
+augmented and propagated
+during simplification, in parallel with the regular expression that
+has not been simplified in the subsequent derivative operations. To aid this,
+we use the helper function $\textit{retrieve}$ described by Sulzmann and Lu:
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{2mm}}c@{\hspace{2mm}}l@{}}
+ $\textit{retrieve}\,(_{bs}\ONE)\,\Empty$ & $\dn$ & $bs$\\
+ $\textit{retrieve}\,(_{bs}{\bf c})\,(\Char\,d)$ & $\dn$ & $bs$\\
+ $\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Left\,v)$ & $\dn$ &
+ $bs \,@\, \textit{retrieve}\,a\,v$\\
+ $\textit{retrieve}\,(_{bs}\oplus a::as)\,(\Right\,v)$ & $\dn$ &
+ $\textit{bs} \,@\, \textit{retrieve}\,(_{[]}\oplus as)\,v$\\
+ $\textit{retrieve}\,(_{bs}a_1\cdot a_2)\,(\Seq\,v_1\,v_2)$ & $\dn$ &
+ $bs \,@\,\textit{retrieve}\,a_1\,v_1\,@\, \textit{retrieve}\,a_2\,v_2$\\
+ $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,[])$ & $\dn$ &
+ $bs \,@\, [0]$\\
+ $\textit{retrieve}\,(_{bs}a^*)\,(\Stars\,(v\!::\!vs))$ & $\dn$ &\\
+ \multicolumn{3}{l}{
+ \hspace{3cm}$bs \,@\, [1] \,@\, \textit{retrieve}\,a\,v\,@\,
+ \textit{retrieve}\,(_{[]}a^*)\,(\Stars\,vs)$}\\
+\end{tabular}
+\end{center}
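+
+\noindent
+A direct Scala rendering of $\retrieve$, reusing the earlier
+datatypes, can be sketched as:
+
+\begin{verbatim}
+// assumes v is a lexical value for the erased version of a
+def retrieve(a: ARexp, v: Val): Bits = (a, v) match {
+  case (AONE(bs), Empty) => bs
+  case (ACHAR(bs, _), Chr(_)) => bs
+  case (AALTS(bs, a1 :: _), Left(v1)) => bs ::: retrieve(a1, v1)
+  case (AALTS(bs, _ :: as), Right(v1)) =>
+    bs ::: retrieve(AALTS(Nil, as), v1)
+  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
+    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
+  case (ASTAR(bs, _), Stars(Nil)) => bs ::: List(0)
+  case (ASTAR(bs, a1), Stars(v1 :: vs)) =>
+    bs ::: List(1) ::: retrieve(a1, v1) :::
+    retrieve(ASTAR(Nil, a1), Stars(vs))
+}
+\end{verbatim}
+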
+This function assembles the bitcode
+%that corresponds to a lexical value for how
+%the current derivative matches the suffix of the string(the characters that
+%have not yet appeared, but will appear as the successive derivatives go on.
+%How do we get this "future" information? By the value $v$, which is
+%computed by a pass of the algorithm that uses
+%$inj$ as described in the previous section).
+using information from both the derivative regular expression and the
+value. Sulzmann and Lu proposed this function, but did not prove
+anything about it. Ausaf and Urban used it to connect the bitcoded
+algorithm to the older algorithm by the following equation:
+
+ \begin{center}
+ $\inj \;r\; c \; v = \textit{decode} \; (\textit{retrieve}\;
+ ((r^\uparrow)\backslash c)\;v)\;r$
+ \end{center}
+
+\noindent
+whereby $r^\uparrow$ stands for the internalised version of $r$. Ausaf
+and Urban also used this fact to prove the correctness of the bitcoded
+algorithm without simplification. Our purpose of using it, however,
+is to establish
+
+\begin{center}
+$ \textit{retrieve} \;
+a \; v \;=\; \textit{retrieve} \; (\textit{simp}\,a) \; v'.$
+\end{center}
+The idea is that using $v'$, a simplified version of $v$ that has gone
+through the same simplification steps as $\textit{simp}(a)$, we are able
+to extract the bitcode that gives the same parsing information as the
+unsimplified one. However, we noticed that constructing such a $v'$
+from $v$ is not so straightforward. The point is that we might
+be able to finally bridge the gap by proving such an equality.
+
+\noindent
+By using the above property of $\retrieve$, the $\textit{RHS}$ of the induction hypothesis becomes
+$\textit{decode}\, (\textit{retrieve}\, (\rup \backslash(s @ [c]))\; v)\; r$, and this gives the
+main lemma result:
+
+\begin{center}
+$ \flex \;r\; id \; (s@[c]) \; v =\textit{decode}\,(\textit{retrieve}\, (\rup \backslash (s@[c])) \;v)\; r$
+\end{center}
+
+
+
+
+\noindent
+To use this lemma result for our
+correctness proof, simply replace the $v$ in the
+$\textit{RHS}$ of the above equality with
+$\mkeps\;(r\backslash (s@[c]))$, and apply the lemma that
+
+\begin{center}
+$\textit{decode}\,(\bmkeps \; \rup)\; r = \textit{decode} \; (\textit{retrieve} \; \rup \; (\mkeps\,r)) \;r$
+\end{center}
+\noindent
+We get the correctness of our bit-coded algorithm:
+\begin{center}
+$\flex \;r\; id \; s \; (\mkeps \; (r\backslash s)) = \textit{decode}\,(\bmkeps \; (\rup\backslash s))\; r$
+\end{center}
+\noindent
+The bridge between the above chain of equalities
+is the use of $\retrieve$.
+If we want to use a similar technique for the
+simplified version of the algorithm,
+we face the problem that in the above
+equalities
+$\retrieve \; a \; v$ is not always defined.
+For example,
+$\retrieve \; _0(_1a+_0a) \; \Left(\Empty)$
+is defined, but not $\retrieve \; (_{01}a) \;\Left(\Empty)$,
+even though we can extract the same POSIX
+bits from the two annotated regular expressions.
+The latter case might occur when we try to retrieve from
+a simplified regular expression using the same value
+as for the unsimplified one.
+This is because $\Left(\Empty)$ corresponds to
+the regular expression structure $\ONE+r_2$ instead of
+$\ONE$.
+That means, if we
+want to prove that
+\begin{center}
+$\textit{decode} \; \bmkeps \; \rup\backslash s \; r = \textit{decode} \; \bmkeps \; \rup\backslash_{simp} s \; r$
+\end{center}
+\noindent
+holds by using $\retrieve$,
+we probably need to prove an equality like the one below:
+\begin{center}
+%$\retrieve \; \rup\backslash_{simp} s \; \mkeps(r\backslash_{simp} s)=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
+$\retrieve \; \rup\backslash_{simp} s \; \mkeps(f(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
+\end{center}
+\noindent
+Here $f$ rectifies $r\backslash s$ so that the value $\mkeps(f(r\backslash s))$ becomes
+ something simpler,
+making the $\retrieve$ function defined.
+\subsubsection{Ways to Rectify Value}
+One way to do this is to prove the following:
+\begin{center}
+$\retrieve \; \rup\backslash_{simp} s \; \mkeps(\simp(r\backslash s))=\textit{retrieve} \; \rup\backslash s \; \mkeps(r\backslash s)$
+\end{center}
+\noindent
+The reason why we choose $\simp$ as $f$ is that
+$\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash \, s)$
+have the same shape:
+\begin{center}
+$\erase (\rup\backslash_{simp} \, s) = \erase(\simp(\rup\backslash s))$
+\end{center}
+
+\noindent
+$\erase$ in the above equality means removing the bit-codes
+from an annotated regular expression and keeping only the original
+regular expression (just like ``erasing'' the bits). Its definition is omitted.
+$\rup\backslash_{simp} \, s$ and $\simp(\rup\backslash s)$
+are very closely related, but not identical.
+\subsubsection{Example for $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$}
+For example, let $r$ be the regular expression
+$(a+b)(a+a^*)$ and $s$ be the string $aa$, then
+both $\erase (\rup\backslash_{simp} \, s)$ and $\erase (\simp (\rup\backslash s))$
+are $\ONE + a^*$. However, without $\erase$
+\begin{center}
+$\rup\backslash_{simp} \, s$ is equal to $_0(_0\ONE +_{11}a^*)$
+\end{center}
+\noindent
+whereas
+\begin{center}
+$\simp(\rup\backslash s)$ is equal to $(_{00}\ONE +_{011}a^*)$
+\end{center}
+\noindent
+(For the sake of visual simplicity, we use numbers to denote the bits
+in bitcodes as we have previously defined for annotated
+regular expressions. $\S$ is replaced by
+subscript $_1$ and $\Z$ by $_0$.)
+
+What makes the difference?
+
+%Two "rules" might be inferred from the above example.
+
+%First, after erasing the bits the two regular expressions
+%are exactly the same: both become $1+a^*$. Here the
+%function $\simp$ exhibits the "one in the end equals many times
+%at the front"
+%property: one simplification in the end causes the
+%same regular expression structure as
+%successive simplifications done alongside derivatives.
+%$\rup\backslash_{simp} \, s$ unfolds to
+%$\simp((\simp(r\backslash a))\backslash a)$
+%and $\simp(\rup\backslash s)$ unfolds to
+%$\simp((r\backslash a)\backslash a)$. The one simplification
+%in the latter causes the resulting regular expression to
+%become $1+a^*$, exactly the same as the former with
+%two simplifications.
+
+%Second, the bit-codes are different, but they are essentially
+%the same: if we push the outmost bits ${\bf_0}(_0\ONE +_{11}a^*)$ of $\rup\backslash_{simp} \, s$
+%inside then we get $(_{00}\ONE +_{011}a^*)$, exactly the
+%same as that of $\rup\backslash \, s$. And this difference
+%does not matter when we try to apply $\bmkeps$ or $\retrieve$
+%to it. This seems a good news if we want to use $\retrieve$
+%to prove things.
+
+%If we look into the difference above, we could see that the
+%difference is not fundamental: the bits are just being moved
+%around in a way that does not hurt the correctness.
+During the first derivative operation,
+$\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ is
+in the form of a sequence regular expression with
+two components, the first
+one $\ONE + \ZERO$ being nullable.
+Recall the simplification function definition:
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+
+ $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
+   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow  _{bs}a_1' \cdot a_2'$ \\
+
+ $\textit{simp} \; (_{bs}\oplus \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
+  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
+   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\oplus \textit{as'}$\\
+
+ $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+\end{tabular}
+\end{center}
+
+\noindent
+
+and the definition of $\flatten$:
+ \begin{center}
+ \begin{tabular}{c c c}
+ $\flatten \; []$ & $\dn$ & $[]$\\
+ $\flatten \; (\ZERO::rs)$ & $\dn$ & $\flatten \; rs$\\
+ $\flatten \;((_{\textit{bs}_1}\oplus \textit{rs}_1) ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) \,@\, \flatten(rs)$\\
+ $\flatten \; (r :: rs)$ & $\dn$ & $r::\flatten(rs)$
+ \end{tabular}
+ \end{center}
+
+ \noindent
+If we call $\simp$ on $\rup\backslash a$, just as $\backslash_{simp}$
+requires, then we go through the third clause of
+the sequence case: $\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$.
+The $\ZERO$ in $(_0\ONE + \ZERO)$ is thrown away
+by $\flatten$ and
+$_0\ONE$ is merged into $(_0a + _1a^*)$ by simply
+putting its bits ($_0$) to the front of the second component:
+ ${\bf_0}(_0a + _1a^*)$.
+ After a second derivative operation,
+ namely, $(_0(_0a + _1a^*))\backslash a$, we get
+ $
+ _0(_0 \ONE + _1(_1\ONE \cdot a^*))
+ $, and this simplifies to $_0(_0 \ONE + _{11} a^*)$
+ by the third clause of the alternative case:
+ $\textit{case} \; as' \Rightarrow \,_{bs}\oplus \textit{as'}$.
+The outmost bit $_0$ stays with
+the outmost regular expression, rather than being fused to
+its child regular expressions, as we will later see happen
+to $\simp(\rup\backslash \, s)$.
+If we choose to not simplify in the midst of derivative operations,
+but only do it at the end after the string has been exhausted,
+namely, $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
+then at the {\bf second} derivative of
+$(\rup\backslash a)\,{\bf \backslash a}$
+we go through the sequence clause of $\backslash$:
+\begin{center}
+\begin{tabular}{lcl}
+$(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+  $(\textit{when} \; \textit{bnullable}\,a_1)$\\
+  & &$_{bs}\oplus\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+  & &$\phantom{_{bs}\oplus\,[}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+\end{tabular}
+\end{center}
+
+because
+$\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$
+is a sequence
+with the first component being nullable
+(unsimplified, unlike in the first round of running $\backslash_{simp}$).
+Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into
+$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$.
+After these two successive derivatives without simplification,
+we apply $\simp$ to this regular expression, which goes through
+the alternative clause, and each component of
+$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE  + _1[_1\ONE \cdot a^*]))$
+is simplified, giving us the list $[\ZERO,\; _0(_0\ONE + _{11}a^*)]$.
+This list is then ``flattened''---the $\ZERO$ is
+thrown away by $\textit{flatten}$; $ _0(_0\ONE + _{11}a^*)$
+is opened up into two separate elements
+$_{00}\ONE$ and $_{011}a^*$. Note that $\flatten$
+$\fuse$s the bit(s) $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
+The order of simplification, which impacts the order in which alternatives
+are opened up, causes
+the bits to be moved differently.
+
+ \subsubsection{A Failed Attempt To Remedy the Problem Above}
+A simple class of regular expression and string
+pairs $(r, s)$, which trigger the difference between
+$\rup\backslash_{simp} \, s$
+and $\simp(\rup\backslash s)$,
+can be deduced from the above example:
+\begin{center}
+\begin{tabular}{lcl}
+$D =\{ (r_1 \cdot r_2,\; [c_1c_2]) \mid $ & $\simp(r_2) = r_2, \simp(r_1 \backslash c_1) = \ONE,$\\
+ $r_1 \; \textit{not} \; \nullable, c_2 \in L(r_2),$ & $\exists \textit{rs},\textit{bs}.\; r_2 \backslash c_2 = \,_{bs}{\oplus rs},$\\
+$\exists \textit{rs}_1. \; \simp(r_2 \backslash c_2) = \,_{bs}{\oplus \textit{rs}_1}$ & $\textit{and} \;\simp(r_1 \backslash [c_1c_2]) = \ZERO\}$\\
+\end{tabular}
+\end{center}
+We take a pair $(r, \;s)$ from the set $D$.
+
+Now we compute ${\bf \rup \backslash_{simp} s}$:
+\begin{center}
+\begin{tabular}{lcl}
+$(r_1\cdot r_2)\backslash_{simp} \, [c_1c_2]$ & $= \simp\left[ \big(\simp\left[ \left( r_1\cdot r_2 \right) \backslash c_1\right] \big)\backslash c_2\right]$\\
+ & $= \simp\left[ \big(\simp \left[ \left(r_1 \backslash c_1\right) \cdot r_2 \right] \big) \backslash c_2 \right]$\\
+ & $= \simp \left[ (\fuse \; \bmkeps(r_1\backslash c_1) \; \simp(r_2) ) \backslash c_2 \right]$,\\
+ & $= \simp \left[ (\fuse \; \bmkeps(r_1\backslash c_1) \; r_2 ) \backslash c_2 \right]$,
+\end{tabular}
+\end{center}
+\noindent
+From the definition of $D$ we know that $r_1 \backslash c_1$ is nullable; therefore
+$\bmkeps(r_1\backslash c_1)$ returns a bitcode, which we shall call
+ $\textit{bs}_2$.
+The above term can be rewritten as
+\begin{center}
+$ \simp \left[ \fuse \; \textit{bs}_2\; r_2 \backslash c_2 \right]$,
+\end{center}
+which is equal to
+\begin{center}
+$\simp \left[ \fuse \; \textit{bs}_2 \; _{bs}{\oplus rs} \right]$\\
+$=\simp \left[ \; _{bs_2++bs}{\oplus rs} \right]$\\
+$= \; _{bs_2++bs}{\oplus \textit{rs}_1} $
+\end{center}
+\noindent
+by using the properties of the set $D$ again
+and again. (The reason why we impose so many conditions
+on the pair $(r,s)$ is that they allow us to
+rewrite the terms easily and construct the difference.)
+
+Now we compute ${\bf \simp(\rup \backslash s)}$:
+\begin{center}
+$\simp \big[(r_1\cdot r_2) \backslash [c_1c_2] \big]= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]$
+\end{center}
+\noindent
+Again, using the properties above, we obtain
+the following chain of equalities:
+\begin{center}
+$\simp(\rup \backslash s)= \simp \left[ ((r_1 \cdot r_2 )\backslash c_1) \backslash c_2 \right]= \simp\left[ \big(\left(r_1 \backslash c_1\right) \cdot r_2 \big) \backslash c_2 \right]$\\
+$= \simp \left[ \oplus[\big( \left(r_1 \backslash c_1\right) \backslash c_2 \big) \cdot r_2 \; , \; \fuse \; (\bmkeps \;r_1\backslash c_1) \; r_2 \backslash c_2 ] \right]$,
+\end{center}
+\noindent
+As before, we write $\textit{bs}_2$ for the bitcode returned by
+$\bmkeps(r_1\backslash c_1)$.
+Also, $\simp(r_2 \backslash c_2)$ is
+$_{bs}\oplus \textit{rs}_1$,
+and $( \left(r_1 \backslash c_1\right) \backslash c_2 \cdot r_2)$
+simplifies to $\ZERO$,
+so the above term can be expanded as
+\begin{center}
+\begin{tabular}{l}
+$\textit{distinct}(\flatten[\ZERO\;, \; _{\textit{bs}_2++\textit{bs}}\oplus \textit{rs}_1] ) \; \textit{match} $ \\
+ $\textit{case} \; [] \Rightarrow \ZERO$ \\
+  $\textit{case} \; a :: [] \Rightarrow \fuse \; \textit{bs} \; a$ \\
+ $\textit{case} \; as' \Rightarrow _{[]}\oplus as'$\\
+\end{tabular}
+\end{center}
+\noindent
+Applying the definition of $\flatten$, we get
+\begin{center}
+$_{[]}\oplus (\textit{map} \; (\fuse \, (\textit{bs}_2 ++ \textit{bs})) \; \textit{rs}_1)$
+\end{center}
+\noindent
+compared to the result
+\begin{center}
+$ \; _{bs_2++bs}{\oplus \textit{rs}_1} $
+\end{center}
+\noindent
+Note how these two regular expressions only
+differ in terms of the position of the bits
+$\textit{bs}_2++\textit{bs}$. They are the same otherwise.
+What caused this difference to happen?
+The culprit is the $\flatten$ function, which spills
+out the bitcodes in the inner alternatives when
+there exists an outer alternative.
+Note how the absence of simplification
+caused $\simp(\rup \backslash s)$ to
+generate the nested alternatives structure:
+\begin{center}
+$ \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
+\end{center}
+and this will always trigger $\flatten$ to
+spill out the inner alternative's bitcode $\textit{bs}$,
+whereas when
+simplification is done along the way,
+the structure of nested alternatives is never created (we can
+actually prove that the simplification function never allows nested
+alternatives to happen; more on this later).
+
+What if we do not allow the function $\simp$
+to fuse out the bits when it is unnecessary?
+For example, for the above regular expression, we might
+just delete the outer layer of the alternative
+\begin{center}
+\st{$ {\oplus[\ZERO \;,}$} $_{bs}\oplus \textit{rs}$ \st{$]$}
+\end{center}
+and get $_{bs}\oplus \textit{rs}$ instead, without
+fusing the bits $\textit{bs}$ inside to every element
+of $\textit{rs}$.
+This idea can be realized by making the following
+changes to the $\simp$-function:
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+
+ $\textit{simp}' \; (_{\textit{bs}}(a_1 \cdot a_2))$ & $\dn$ & $\textit{as} \; \simp \; \textit{was} \; \textit{before} $ \\
+
+ $\textit{simp}' \; (_{bs}\oplus as)$ & $\dn$ & \st{$\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $} \\
+ &&\st{$\quad\textit{case} \; [] \Rightarrow \ZERO$} \\
+ &&\st{$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$} \\
+ &&\st{$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$}\\
+ &&$\textit{if}(\textit{hollowAlternatives}( \textit{map \; simp \; as}))$\\
+ &&$\textit{then} \; \fuse \; \textit{bs}\; \textit{extractAlt}(\textit{map} \; \simp \; \textit{as} )$\\
+ &&$\textit{else} \; \simp(_{bs} \oplus \textit{as})$\\
+
+
+ $\textit{simp}' \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+\end{tabular}
+\end{center}
+
\noindent
where $\textit{hollowAlternatives}$ and $\textit{extractAlt}$ are defined by:
\begin{center}
$\textit{hollowAlternatives}( \textit{rs}) \dn
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1) \in \textit{rs}. \; \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r $\\
$\textit{extractAlt}( \textit{rs}) \dn \textit{if}\big(
\exists r = (_{\textit{bs}_1}\oplus \textit{rs}_1) \in \textit{rs}. \; \forall r' \in \textit{rs}, \;
\textit{either} \; r' = \ZERO \; \textit{or} \; r' = r \big)\; \textit{then} \; \textit{return} \; r$
\end{center}
+\noindent
Basically, $\textit{hollowAlternatives}$ detects a
list of regular expressions of the shape
\begin{center}
$ \oplus[\ZERO \;, \; _{bs}\oplus \textit{rs} ]$
\end{center}
in which case we can simply lift the
inner regular expression $_{bs}\oplus \textit{rs}$
to the outermost level
and throw away the useless $\ZERO$s and
the outer $\oplus$ wrapper.
With this new definition of $\simp$,
for the example where $r$ is the regular expression
$(a+b)(a+a^*)$ and $s$ is the string $aa$,
the problem of $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
is resolved.
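To experiment with this modified simplification, the helper
functions can be sketched in Scala (a language close to the functional
pseudocode used above). The datatype and all names below are
illustrative assumptions for testing purposes, not the verified
Isabelle definitions:
\begin{verbatim}
// annotated regular expressions with bitcode annotations,
// written as lists of Ints (0 and 1)
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: List[Int]) extends ARexp
case class ACHAR(bs: List[Int], c: Char) extends ARexp
case class ASEQ(bs: List[Int], r1: ARexp, r2: ARexp) extends ARexp
case class AALTS(bs: List[Int], rs: List[ARexp]) extends ARexp
case class ASTAR(bs: List[Int], r: ARexp) extends ARexp

// fuse prepends a bitcode onto the annotation of a regex
def fuse(bs: List[Int], r: ARexp): ARexp = r match {
  case AZERO             => AZERO
  case AONE(bs1)         => AONE(bs ::: bs1)
  case ACHAR(bs1, c)     => ACHAR(bs ::: bs1, c)
  case ASEQ(bs1, r1, r2) => ASEQ(bs ::: bs1, r1, r2)
  case AALTS(bs1, rs)    => AALTS(bs ::: bs1, rs)
  case ASTAR(bs1, r1)    => ASTAR(bs ::: bs1, r1)
}

// true iff rs consists of ZEROs plus one distinct alternative
def hollowAlternatives(rs: List[ARexp]): Boolean = {
  val nonZeros = rs.filter(_ != AZERO).distinct
  nonZeros.length == 1 && nonZeros.head.isInstanceOf[AALTS]
}

// extract that single inner alternative; only called when
// hollowAlternatives holds, hence the unchecked .get
def extractAlt(rs: List[ARexp]): ARexp =
  rs.find(_.isInstanceOf[AALTS]).get
\end{verbatim}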
+
+Unfortunately this causes new problems:
+for the counterexample where
+\begin{center}
+$r$ is the regular expression
+$(ab+(a^*+aa))$ and $s$ is the string $aa$
+\end{center}
+
+\noindent
+$\rup\backslash_{simp} \, s$ is equal to
+$ _1(_{011}a^* + _1\ONE) $ whereas
+$ \simp(\rup\backslash s) = (_{1011}a^* + _{11}\ONE)$.
+This discrepancy does not appear for the old
+version of $\simp$.
+
+
To understand where the bits are moved, it is instructive to return
to the earlier example where $r$ is $(a+b)(a+a^*)$ and $s$ is the
string $aa$.
During the first derivative operation,
$\rup\backslash a=(_0\ONE + \ZERO)(_0a + _1a^*)$ is
a sequence regular expression with
two components, of which the first,
$_0\ONE + \ZERO$, is nullable.
+Recall the simplification function definition:
+\begin{center}
+ \begin{tabular}{@{}lcl@{}}
+
+ $\textit{simp} \; (\textit{SEQ}\;bs\,a_1\,a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\
+ &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\
+ &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\
+ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow \textit{SEQ} \; bs \; a_1' \; a_2'$ \\
+
+ $\textit{simp} \; (\textit{ALTS}\;bs\,as)$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{map simp as})) \; \textit{match} $ \\
+ &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\
+ &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\
+ &&$\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$\\
+
+ $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$
+\end{tabular}
+\end{center}
+
\noindent
and the definition of $\flatten$:
 \begin{center}
 \begin{tabular}{c c c}
 $\flatten \; []$ & $\dn$ & $[]$\\
 $\flatten \; (\ZERO::rs)$ & $\dn$ & $\flatten(rs)$\\
 $\flatten \;((_{\textit{bs}_1}\oplus \textit{rs}_1) ::rs)$ & $\dn$ & $(\map \, (\fuse \, \textit{bs}_1) \,\textit{rs}_1) ::: \flatten(rs)$\\
 $\flatten \; (r :: rs)$ & $\dn$ & $r::\flatten(rs)$
 \end{tabular}
 \end{center}
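The two displayed definitions translate into the following Scala
sketch, reusing the \texttt{ARexp} datatype and \texttt{fuse} from the
earlier sketch; the symmetric clause for $\ONE$ as the second component
of a sequence is left out here, as it plays no role in the traces below:
\begin{verbatim}
// flatten: drop ZEROs and spill out nested alternatives,
// fusing their bits onto their children
def flatten(rs: List[ARexp]): List[ARexp] = rs match {
  case Nil                     => Nil
  case AZERO :: rest           => flatten(rest)
  case AALTS(bs1, rs1) :: rest => rs1.map(fuse(bs1, _)) ::: flatten(rest)
  case r :: rest               => r :: flatten(rest)
}

def simp(r: ARexp): ARexp = r match {
  case ASEQ(bs, r1, r2) => (simp(r1), simp(r2)) match {
    case (AZERO, _)      => AZERO
    case (_, AZERO)      => AZERO
    case (AONE(bs1), s2) => fuse(bs ::: bs1, s2) // the ONE's bits travel along
    case (s1, s2)        => ASEQ(bs, s1, s2)
  }
  case AALTS(bs, rs) => flatten(rs.map(simp)).distinct match {
    case Nil     => AZERO
    case List(a) => fuse(bs, a)
    case as      => AALTS(bs, as)
  }
  case _ => r
}
\end{verbatim}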
+
\noindent
If we call $\simp$ on $\rup\backslash a$,
we go through the third clause of
the sequence case: $\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$.
The $\ZERO$ in $(_0\ONE + \ZERO)$ is thrown away
by $\flatten$, and
$_0\ONE$ is merged into $(_0a + _1a^*)$ by simply
putting its bits ($_0$) at the front of the second component, giving
 ${\bf_0}(_0a + _1a^*)$.
+ After a second derivative operation,
+ namely, $(_0(_0a + _1a^*))\backslash a$, we get
+ $
+ _0(_0 \ONE + _1(_1\ONE \cdot a^*))
+ $, and this simplifies to $_0(_0 \ONE + _{11} a^*)$
+ by the third clause of the alternative case:
+ $\quad\textit{case} \; as' \Rightarrow \textit{ALTS}\;bs\;as'$.
The outermost bit $_0$ stays with
the outermost regular expression, rather than being fused to
its child regular expressions, as we will shortly see happen
to $\simp(\rup\backslash \, s)$.
If we choose not to simplify in the midst of derivative operations,
but only to do so at the end after the string has been exhausted,
namely, computing $\simp(\rup\backslash \, s)=\simp((\rup\backslash a)\backslash a)$,
then at the {\bf second} derivative of
$(\rup\backslash a)\bf{\backslash a}$
we go through the following clause of $\backslash$:
+\begin{center}
+\begin{tabular}{lcl}
+$(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
 $(\textit{when} \; \textit{bnullable}\,a_1)$\\
+ & &$\textit{ALTS}\,bs\,List(\;\;(\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
+ & &$(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))\;\;)$\\
+\end{tabular}
+\end{center}
+
because
$\rup\backslash a = (_0\ONE + \ZERO)(_0a + _1a^*)$
is a sequence
whose first component is nullable
(it is still unsimplified, unlike in the first round of running $\backslash_{simp}$).
Therefore $((_0\ONE + \ZERO)(_0a + _1a^*))\backslash a$ splits into
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$.
After these two successive derivatives without simplification,
we apply $\simp$ to this regular expression, which goes through
the alternative clause, where each component of
$([(\ZERO + \ZERO)\cdot(_0a + _1a^*)] + _0( _0\ONE + _1[_1\ONE \cdot a^*]))$
is simplified, giving us the list $[\ZERO, \; _0(_0\ONE + _{11}a^*)]$.
This list is then ``flattened'': the $\ZERO$ is
thrown away by $\textit{flatten}$, while $ _0(_0\ONE + _{11}a^*)$
is opened up into two separate elements,
$_{00}\ONE$ and $_{011}a^*$; note that $\flatten$
$\fuse$s the bit $_0$ to the front of $_0\ONE $ and $_{11}a^*$.
The order of simplification, which determines when alternatives
are opened up, thus causes
the bits to be moved differently.
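For reference, the derivative clause quoted above sits inside the
bitcoded derivative roughly as follows; $\textit{bnullable}$ and
$\textit{bmkeps}$ are sketched along the usual lines and, like the rest,
are illustrative rather than the verified definitions:
\begin{verbatim}
def bnullable(r: ARexp): Boolean = r match {
  case AZERO           => false
  case AONE(_)         => true
  case ACHAR(_, _)     => false
  case ASEQ(_, r1, r2) => bnullable(r1) && bnullable(r2)
  case AALTS(_, rs)    => rs.exists(bnullable)
  case ASTAR(_, _)     => true
}

// partial: only defined on bnullable regexes
def bmkeps(r: ARexp): List[Int] = r match {
  case AONE(bs)         => bs
  case ASEQ(bs, r1, r2) => bs ::: bmkeps(r1) ::: bmkeps(r2)
  case AALTS(bs, rs)    => bs ::: bmkeps(rs.find(bnullable).get)
  case ASTAR(bs, _)     => bs ::: List(1)
}

def bder(c: Char, r: ARexp): ARexp = r match {
  case AZERO        => AZERO
  case AONE(_)      => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))        // the clause displayed above
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case AALTS(bs, rs) => AALTS(bs, rs.map(bder(c, _)))
  case ASTAR(bs, a)  => ASEQ(bs, fuse(List(0), bder(c, a)), ASTAR(Nil, a))
}
\end{verbatim}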
+
+
+
We have changed the algorithm to suppress the old
counterexample, but this gives rise to new counterexamples.
This dilemma means that the amendment is not a successful
attempt at making $\rup\backslash_{simp} \, s = \simp(\rup\backslash s)$
hold for every possible regular expression and string.
+\subsection{Properties of the Function $\simp$}
+
We have proved in Isabelle quite a few properties
of the $\simp$-function, which help the proof to go forward;
we list them here to aid comprehension.

To start, we need some auxiliary notation,
which is quite basic and only written out here
for clarity.

$\textit{sub}(r)$ computes the set of
sub-regular expressions of $r$ (the clauses for the
remaining constructors are analogous):
\begin{center}
$\textit{sub}(\ONE) \dn \{\ONE\}$\\
$\textit{sub}(r_1 \cdot r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1 \cdot r_2\}$\\
$\textit{sub}(r_1 + r_2) \dn \textit{sub}(r_1) \cup \textit{sub}(r_2) \cup \{r_1+r_2\}$\\
\end{center}
$\textit{good}(r)$ states that $r$ is not $\ZERO$ and contains no nested alternatives:
\begin{center}
$\textit{good}(r) \dn r \neq \ZERO \; \land \;
\forall r' \in \textit{sub}(r), \; \textit{if} \; r' = \; _{bs_1}\oplus \textit{rs}_1, \;
\textit{then} \; \nexists r'' \in \textit{rs}_1 \; s.t.\;
r'' = \; _{bs_2}\oplus \textit{rs}_2 $
\end{center}
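In code form the predicate might look as follows (a sketch; the star
clause is our assumption, since the displayed $\textit{sub}$ omits it):
\begin{verbatim}
// no alternative inside r directly contains another alternative
def noNestedAlts(r: ARexp): Boolean = r match {
  case AALTS(_, rs) =>
    !rs.exists(_.isInstanceOf[AALTS]) && rs.forall(noNestedAlts)
  case ASEQ(_, r1, r2) => noNestedAlts(r1) && noNestedAlts(r2)
  case ASTAR(_, r1)    => noNestedAlts(r1)
  case _               => true
}

def good(r: ARexp): Boolean = r != AZERO && noNestedAlts(r)
\end{verbatim}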
+
The properties are mainly the following:
+\begin{itemize}
+\item
+\begin{center}
+$\simp(\simp(r)) = \simp(r)$
+\end{center}
+\item
+\begin{center}
$\textit{if} \; r = \simp(r') \; \textit{then} \; \textit{good}(r) $
+\end{center}
+\end{itemize}
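Before attempting the Isabelle proofs, such properties can be
sanity-checked on randomly generated annotated regular expressions; a
hypothetical test harness (the pool of test regexes is assumed) could be:
\begin{verbatim}
def checkSimpProps(pool: List[ARexp]): Boolean =
  pool.forall(r => simp(simp(r)) == simp(r)) &&   // idempotence
  pool.forall(r => good(simp(r)))                 // simp yields good terms
\end{verbatim}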
\subsection{The Contains Relation}
$\retrieve$ is too strong a function for our purposes in that
it extracts only one bitcode instead of a set of them.
Therefore we try to define another relation (a predicate)
to capture the fact that a regular expression may have its bits
moved around but still contain all the bits needed.
The contains relation, written $\gg$, takes a regular expression
and a bitcode as its two arguments in infix form
and returns a truth value.

In other words, from the set of regular expression and
bitcode pairs
$\textit{RV} = \{(r, \textit{bs}) \mid r \; \text{is a regular expression},\; \textit{bs} \; \text{is a bitcode}\}$,
those that satisfy the following rules are in the set
$\textit{RV}_{\textit{Contains}}$.
Unlike the $\retrieve$
function, which takes two arguments $r$ and $v$ and
produces a single answer $\textit{bs}$, the relation
associates with a single regular expression $r$ all the bitcodes that
can be generated by $r$.
+\begin{center}
+\begin{tabular}{llclll}
+& & & $_{bs}\ONE$ & $\gg$ & $\textit{bs}$\\
+& & & $_{bs}{\bf c}$ & $\gg$ & $\textit{bs}$\\
+$\textit{if} \; r_1 \gg \textit{bs}_1$ & $r_2 \; \gg \textit{bs}_2$
+& $\textit{then}$ &
+ $_{bs}{r_1 \cdot r_2}$ &
+ $\gg$ &
+ $\textit{bs} @ \textit{bs}_1 @ \textit{bs}_2$\\
+
+ $\textit{if} \; r \gg \textit{bs}_1$ & & $\textit{then}$ &
 $_{bs}\oplus(r :: \textit{rs})$ &
+ $\gg$ &
+ $\textit{bs} @ \textit{bs}_1 $\\
+
+ $\textit{if} \; _{bs}(\oplus \textit{rs}) \gg \textit{bs} @ \textit{bs}_1$ & & $\textit{then}$ &
 $_{bs}\oplus(r :: \textit{rs})$ &
+ $\gg$ &
+ $\textit{bs} @ \textit{bs}_1 $\\
+
+ $\textit{if} \; r \gg \textit{bs}_1\; \textit{and}$ & $_{bs}r^* \gg \textit{bs} @ \textit{bs}_2$ & $\textit{then}$ &
+ $_{bs}r^* $ &
+ $\gg$ &
+ $\textit{bs} @ [0] @ \textit{bs}_1@ \textit{bs}_2 $\\
+
+ & & & $_{bs}r^*$ & $\gg$ & $\textit{bs} @ [1]$\\
+\end{tabular}
+\end{center}
It is a predicate in the sense that given
a regular expression and a bitcode, it
returns true or false, depending on whether
or not the regular expression can actually produce that
bitcode. If the predicate returns true, then
we say that the regular expression $r$ contains
the bitcode $\textit{bs}$, written
$r \gg \textit{bs}$.
The $\gg$ relation with the
regular expression $r$ may also be seen as a
machine that takes the derivative of $r$
with respect to all strings simultaneously, collecting
the bits by going through the regular expression tree
 structure in a depth-first manner, regardless of whether
 the part being traversed is nullable or not.
 It puts all bits that can be produced on such a traversal
 into a set.
+ For example, if we are given the regular expression
+$((a+b)(c+d))^*$, the tree structure may be written as
+\begin{center}
\begin{tikzpicture}
\graph[tree layout, nodes={draw, circle}] {
star/"$*$" ->
  {seq/"$\cdot$" -> {
    alt1/"$+$" ->
      {a , b},
    alt2/"$+$" ->
      {c , d}
    }
  }
};
\end{tikzpicture}
+\end{center}
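For testing purposes the $\gg$ relation can be approximated by
enumerating the set of bitcodes a regular expression contains, cutting
off star unfoldings at a fixed depth so that the set stays finite. The
following is an illustrative sketch, not part of the formal development:
\begin{verbatim}
def bits(r: ARexp, depth: Int): Set[List[Int]] = r match {
  case AZERO        => Set()
  case AONE(bs)     => Set(bs)
  case ACHAR(bs, _) => Set(bs)
  case ASEQ(bs, r1, r2) =>
    for (b1 <- bits(r1, depth); b2 <- bits(r2, depth))
      yield bs ::: b1 ::: b2
  case AALTS(bs, rs) =>
    rs.toSet.flatMap((a: ARexp) => bits(a, depth).map(bs ::: _))
  case ASTAR(bs, r1) =>
    if (depth == 0) Set(bs ::: List(1))        // only the stop bit
    else Set(bs ::: List(1)) ++                // stop, or unfold once more
      (for (b1 <- bits(r1, depth - 1);
            rest <- bits(ASTAR(Nil, r1), depth - 1))
        yield bs ::: List(0) ::: b1 ::: rest)
}
\end{verbatim}
Then $r \gg \textit{bs}$ holds whenever $\textit{bs}$ appears in
\texttt{bits(r, d)} for a sufficiently large depth \texttt{d}.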
\subsection{The $\textit{ders}_2$ Function}
If we want to prove the result
\begin{center}
 $ \textit{blexer}\_{simp}(r, \; s) = \textit{blexer}(r, \; s)$
\end{center}
by induction
on the structure of the regular expression,
then we need to handle the case $r_1 \cdot r_2$,
and for this it would be good if we could express $(r_1 \cdot r_2) \backslash s$
in terms of $r_1 \backslash s$ and $r_2 \backslash s$.
This naturally leads to the $\textit{ders}_2$ function,
which performs a ``convolution'' on $r_1$ and $r_2$ using the string
$s$.
+It is based on the observation that the derivative of $r_1 \cdot r_2$
+with respect to a string $s$ can actually be written in an "explicit form"
+composed of $r_1$'s derivative of $s$ and $r_2$'s derivative of $s$.
This can be illustrated by the following calculation:
+\begin{center}
+ $ (r_1 \cdot r_2) \backslash [c_1c_2] = (\textit{if} \; \nullable(r_1)\; \textit{then} \; ((r_1 \backslash c_1) \cdot r_2 + r_2 \backslash c_1) \; \textit{else} \; (r_1\backslash c_1) \cdot r_2) \backslash c_2$
+\end{center}
which can also be written in a ``convoluted sum''
format:
\begin{center}
 $ (r_1 \cdot r_2) \backslash [c_1c_2] = \sum{r_1 \backslash s_i \cdot r_2 \backslash s_j}$
\end{center}
More precisely, the sum ranges over the ways of splitting the string:
\begin{center}
 $ (r_1 \cdot r_2) \backslash [c_1c_2] = \sum_{s_i @ s_j = [c_1c_2]}{(r_1 \backslash s_i) \cdot (r_2 \backslash s_j)}$
\end{center}
Note that this differs from the previous form in that
it calculates the results $r_1\backslash s_i$ and $r_2 \backslash s_j$ first and then glues them together
through nested alternatives, whereas the $(r_1 \cdot r_2) \backslash s$ procedure,
used by the algorithm $\lexer$, produces the
resulting alternatives regular expression as a whole, rather than
generating each of the child nodes one by one
in a single recursive call dedicated to that
very expression.
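A sketch of $\textit{ders}_2$ for plain (unannotated) regular
expressions is given below; the datatype and
$\textit{der}$/$\textit{ders}$/$\textit{nullable}$ are the standard
textbook definitions, repeated so that the sketch is self-contained.
The sum is realised as a left-nested alternative, so the result agrees
with $(r_1 \cdot r_2)\backslash s$ only up to the grouping of the
alternatives:
\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO    => ZERO
  case ONE     => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

def ders(s: List[Char], r: Rexp): Rexp =
  s.foldLeft(r)((acc, c) => der(c, acc))

// (r1 . r2)\s as a sum over the splits s = s1 @ s2: the term r2\s2
// contributes whenever r1\s1 is nullable (with s2 non-empty)
def ders2(s: List[Char], r1: Rexp, r2: Rexp): Rexp = {
  val tails = for {
    i <- (0 until s.length).toList
    if nullable(ders(s.take(i), r1))
  } yield ders(s.drop(i), r2)
  tails.foldLeft(SEQ(ders(s, r1), r2): Rexp)(ALT(_, _))
}
\end{verbatim}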
+\section{Conclusion}
Under the exhaustive tests we believe the main
result holds, yet a proof still seems elusive.
We have tried out different approaches and
established quite a few properties of the function $\simp$.
The counterexamples where $\rup\backslash_{simp} \, s \neq \simp(\rup\backslash s)$
are also valuable in the sense that
they show why the two sides are not equal and what
the subtle differences are between a
regular expression simplified along the way and a
regular expression simplified only at the final moment.
We are almost there, but a last step is needed to make the proof work.
Hopefully in the next few weeks we will be able to find one.
The remainder of this section records our ongoing attempts.
One observation is that $\fuse$ is the culprit: it bakes the order in which alternatives
are opened up into the bitcodes, and this is where the difference
between $\simp(\rup \backslash s)$ and $\rup \backslash_{simp} \,s$ finally
shows up. But we have to use $\fuse$, because it is essential to the simplification:
$\flatten$ needs it.
+
+
+
\subsection{Another Proof Strategy}
Instead of insisting on the syntactic equality
$\rup\backslash_{simp} \, s = \simp(\rup\backslash s)$, we can try to
relate the two regular expressions through the values or bitcodes they
produce. One idea is a function $\textit{vsimp}$ that
simplifies the value in lockstep with the
regular expression; its precise definition is still under construction.
+
The problem then is to find a suitable value to guide $\retrieve$.
Recall that we used $\retrieve$ for the key induction step in the
correctness proof of $\blexer$:
\begin{center}
$\textit{decode} \; (\retrieve \; (r\backslash (s @ [c])) \; v) \; r =
\textit{decode} \; (\retrieve \; (r\backslash s) \; (\inj \; (r\backslash s) \; c \; v)) \; r$
\end{center}
Here $\textit{decode}$ recovers a value that corresponds to a
(possibly partial) match from the bits, the bits are extracted by
$\retrieve$, and the key value $v$ that guides $\retrieve$ is one of
$\mkeps \; (r\backslash s)$, $\inj \; r \; c \; (\mkeps \; (r\backslash s))$,
$\inj(\inj(\ldots))$, and so on.
For the simplified regular expressions we would
need something like $\textit{vsimp}$ to produce a value suitable
for decoding, so that
\begin{center}
$\textit{Some}\,(\flex \; r\; \textit{id}\; (s@[c]) \; v) =
\textit{Some}\,(\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; v))$.
\end{center}
Another way, which Christian came up with and which might circumvent the
problem of finding a suitable value, is to not state the $\textit{vsimp}$
function at all, but instead to collect all bitcodes that a regular expression
is able to produce into a set,
and to prove that both $r\backslash s$ and its simplified counterpart can
produce the bits that correspond to the POSIX value.
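For reference, $\retrieve$ itself can be sketched as follows, mirroring
the clauses of the $\gg$ rules above; the value datatype is the usual
one, and the names \texttt{Left}/\texttt{Right} shadow the Scala
standard library's in this scope:
\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

// partial: the value must fit the shape of the regex
def retrieve(r: ARexp, v: Val): List[Int] = (r, v) match {
  case (AONE(bs), Empty)      => bs
  case (ACHAR(bs, _), Chr(_)) => bs
  case (AALTS(bs, a :: as), Left(v1))  => bs ::: retrieve(a, v1)
  case (AALTS(bs, a :: as), Right(v1)) => bs ::: retrieve(AALTS(Nil, as), v1)
  case (ASEQ(bs, a1, a2), Sequ(v1, v2)) =>
    bs ::: retrieve(a1, v1) ::: retrieve(a2, v2)
  case (ASTAR(bs, a), Stars(Nil)) => bs ::: List(1)
  case (ASTAR(bs, a), Stars(v1 :: vs)) =>
    bs ::: List(0) ::: retrieve(a, v1) ::: retrieve(ASTAR(Nil, a), Stars(vs))
}
\end{verbatim}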
+
+
If we define an equivalence relation $\sim_{m\epsilon}$ between two regular expressions
$r_1$ and $r_2$ as follows:
\begin{center}
$r_1 \sim_{m\epsilon} r_2 \iff \bmkeps(r_1)= \bmkeps(r_2)$
\end{center}
(in other words, $r_1$ and $r_2$ produce the same output under the function $\bmkeps$),
then the first goal
might be restated as a relation between the regular expressions
produced by feeding the same initial regular expression $r$ and string $s$ to the
two functions $\textit{ders}$ and $\textit{ders\_simp}$:
\begin{center}
$(r^\uparrow)\backslash_{simp}\, s \sim_{m\epsilon} (r^\uparrow)\backslash s$.
\end{center}
We tried to establish equivalences between regular expressions
arising from longer interleavings of derivative and simplification steps,
such as $dddr$ and $dddsr$ (where $d$ stands for a derivative step and
$s$ for a simplification step applied to $r$),
but right now we are only able to establish $dsr \sim_{m\epsilon} dr$, using structural induction on $r$.
Equivalences involving multiple derivative operations are harder to prove.
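In code, the relation is simply equality of the $\bmkeps$ outputs (a
sketch; both arguments are assumed to be bnullable, so that $\bmkeps$
is defined on them):
\begin{verbatim}
def mEpsEquiv(r1: ARexp, r2: ARexp): Boolean =
  bmkeps(r1) == bmkeps(r2)
\end{verbatim}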
Two attempts have been made along these lines.

(1) The first is induction on the number of derivative operations (or, in
other words, on the length of the string $s$). The inductive hypothesis
was initially specified as: for an arbitrary regular expression $r$ and
for all strings $s$ in the language of $r$ whose length does not exceed
$n$, we have $\textit{ders} \; s \; r \sim_{m\epsilon} \textit{ders\_simp} \; s \; r$.
The proof goal is then the same statement for strings whose length does not exceed $n+1$.
The problem here is that although we can easily break down
a string $s$ of length $n+1$ into $s_1@[c]$, it is not easy
to use the induction hypothesis as a stepping stone, because $s_1$ may well not be
in the language $L(r)$. This prevents us from obtaining the fact that
$\textit{ders} \; s_1 \; r \sim_{m\epsilon} \textit{ders\_simp} \; s_1 \; r$.
Further exploration is needed to amend this hypothesis so that it also covers the
situation where $r\backslash s_1$ is not nullable.
For example, what information (bits?
values?) can be extracted
from the regular expression $\textit{ders} \; s_1 \; r$, so that we can compute or predict the possible
result of $\bmkeps$ after another derivative operation? And what function $f$ can
carry out this task? The plan can be
pictured more directly as follows:
find a function $f$ such that
\begin{center}
\begin{tabular}{lcl}
$f(\textit{bders} \; s_1 \; r)$ & $=$ & $\textit{re}_1$\\
$f(\textit{bders\_simp} \; s_1 \; r)$ & $=$ & $\textit{re}_2$\\
$\bmkeps(\textit{bders} \; s \; r)$ & $=$ & $g(\textit{re}_1,c)$\\
$\bmkeps(\textit{bders\_simp} \; s \; r)$ & $=$ & $g(\textit{re}_2,c)$\\
\end{tabular}
\end{center}
with $g(\textit{re}_1,c) = g(\textit{re}_2,c)$.
The inductive hypothesis would then be: for all strings $s_1$ of length $\leq n$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_1$ and
$f(\textit{bders\_simp} \; s_1 \; r) = \textit{re}_2$.
Proving this would be a lemma for the main proof, which states
$\bmkeps(\textit{bders} \; s \; r) = g(\textit{re}_1,c)$ and
$\bmkeps(\textit{bders\_simp} \; s \; r) = g(\textit{re}_2,c)$
for $s = s_1@[c]$.
For the lemma to be provable, $f$ needs to capture a recursive property:
it needs to store not only the ``nullable after one more character'' information,
but also the ``nullable after two more characters'' information,
and so on, so that it can predict what $f$ computes after a further derivative operation;
in other words, it needs to be ``infinitely recursive''.
To prove the lemma, that is, to get from\\
``for all strings $s_1$ of length $\leq n$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_1$ and
$f(\textit{bders\_simp} \; s_1 \; r) = \textit{re}_2$''\\
to\\
``for all strings $s_1$ of length $\leq n+1$,
$f(\textit{bders} \; s_1 \; r) = \textit{re}_3$ and
$f(\textit{bders\_simp} \; s_1 \; r) = \textit{re}_4$'',\\
it might be best to construct an auxiliary function $h$ such that
\begin{center}
\begin{tabular}{lcl}
$h(\textit{re}_1, c)$ & $=$ & $\textit{re}_3$\\
$h(\textit{re}_2, c)$ & $=$ & $\textit{re}_4$\\
\end{tabular}
\end{center}
where $\textit{re}_3 = f(\textit{bder} \; c \; (\textit{bders} \; s_1 \; r))$ and
$\textit{re}_4 = f(\simp(\textit{bder} \; c \; (\textit{bders\_simp} \; s_1 \; r)))$.
The key point here is that we are not only interested in what $\textit{bders} \; s \; r$ produces under
$\bmkeps$, but also in how it behaves after one further derivative operation and then $\bmkeps$, after two
further derivative operations, and so on. In essence, we are preserving the regular expression
itself under the function $f$, albeit in a less compact form than the regular expression itself: we are
not just recording but also interpreting what the regular expression matches.
In other words, we need to prove properties of $\textit{bders\_simp} \; s \; r$ beyond the $\bmkeps$ result,
i.e., not just about the nullable parts, but also about those parts still expecting
remaining characters.
(2) The second attempt stems from the observation that
$\erase(sdddddr) = \erase(sdsdsdsr)$;
that is to say, although the bits are moved around on the regular expression
(the difference in bits), the structure of the (unannotated) regular expression
after one final simplification is exactly the same after the
same sequence of derivative operations,
regardless of whether we simplified
along the way.
However, without $\erase$ the above equality does not hold:
for the regular expression
$(a+b)(a+a^*)$,
if we take the derivative with respect to the string $aa$,
then $sddr$ does not equal $sdsdr$,
as the earlier counterexample showed.
This equivalence class method might still have the potential of proving the main result,
but it is not yet complete;
in parallel we tried the other method using $\retrieve$ described above.
+
+
+
+
+
+
+
+
+
+\bibliographystyle{plain}
+\bibliography{root}
+
+
+\end{document}
+