more
authorChengsong
Fri, 27 May 2022 18:27:39 +0100
changeset 519 856d025dbc15
parent 518 ff7945a988a3
child 520 26584b9d47f4
more
ChengsongTanPhdThesis/Chapters/Chapter1.tex
ChengsongTanPhdThesis/Chapters/Chapter2.tex
ChengsongTanPhdThesis/Chapters/Chapter3.tex
ChengsongTanPhdThesis/main.tex
--- a/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Thu May 26 20:51:40 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter1.tex	Fri May 27 18:27:39 2022 +0100
@@ -33,8 +33,10 @@
 
 \newcommand\myequiv{\mathrel{\stackrel{\makebox[0pt]{\mbox{\normalfont\tiny equiv}}}{=}}}
 
+\def\decode{\textit{decode}}
+\def\internalise{\textit{internalise}}
 \def\lexer{\mathit{lexer}}
-\def\mkeps{\mathit{mkeps}}
+\def\mkeps{\textit{mkeps}}
 \newcommand{\rder}[2]{#2 \backslash #1}
 
 \def\AZERO{\textit{AZERO}}
@@ -552,949 +554,8 @@
  eliminating the exponential behaviours.
  
 
-  
-
- 
-
-
-
-%----------------------------------------------------------------------------------------
-
-\section{Contribution}
-
-
-
-This work addresses the vulnerability of super-linear and
-buggy regex implementations by the combination
-of Brzozowski's derivatives and interactive theorem proving. 
-We give an 
-improved version of  Sulzmann and Lu's bit-coded algorithm using 
-derivatives, which come with a formal guarantee in terms of correctness and 
-running time as an Isabelle/HOL proof.
-Then we improve the algorithm with an even stronger version of 
-simplification, and prove a time bound linear to input and
-cubic to regular expression size using a technique by
-Antimirov.
-
- 
-The main contribution of this thesis is a proven correct lexing algorithm
-with formalized time bounds.
-To our best knowledge, no lexing libraries using Brzozowski derivatives
-have a provable time guarantee, 
-and claims about running time are usually speculative and backed by thin empirical
-evidence.
-%TODO: give references
-For example, Sulzmann and Lu had proposed an algorithm  in which they
-claim a linear running time.
-But that was falsified by our experiments and the running time 
-is actually $\Omega(2^n)$ in the worst case.
-A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
-%TODO: give references
-lexer, which calculates POSIX matches and is based on derivatives.
-They formalized the correctness of the lexer, but not the complexity.
-In the performance evaluation section, they simply analyzed the run time
-of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
-and concluded that the algorithm is quadratic in terms of input length.
-When we tried out their extracted OCaml code with our example $(a+aa)^*$,
-the time it took to lex only 40 $a$'s was 5 minutes.
-
-We  believe our results of a proof of performance on general
-inputs rather than specific examples a novel contribution.\\
-
-
-\subsection{Related Work}
-We are aware
-of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
-Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
-of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
-in Coq is given by Coquand and Siles \parencite{Coquand2012}.
-Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
- 
- %We propose Brzozowski's derivatives as a solution to this problem.
-% about Lexing Using Brzozowski derivatives
- \section{Preliminaries}
-
-Suppose we have an alphabet $\Sigma$, the strings  whose characters
-are from $\Sigma$
-can be expressed as $\Sigma^*$.
-
-We use patterns to define a set of strings concisely. Regular expressions
-are one of such patterns systems:
-The basic regular expressions  are defined inductively
- by the following grammar:
-\[			r ::=   \ZERO \mid  \ONE
-			 \mid  c  
-			 \mid  r_1 \cdot r_2
-			 \mid  r_1 + r_2   
-			 \mid r^*         
-\]
-
-The language or set of strings defined by regular expressions are defined as
-%TODO: FILL in the other defs
-\begin{center}
-\begin{tabular}{lcl}
-$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
-$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
-\end{tabular}
-\end{center}
-Which are also called the "language interpretation".
-
-
-
-The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
-where the operation transforms the regex to a new one containing
-strings without the head character $c$.
-
-Formally, we define first such a transformation on any string set, which
-we call semantic derivative:
-\begin{center}
-$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
-\end{center}
-Mathematically, it can be expressed as the 
-
-If the $\textit{StringSet}$ happen to have some structure, for example,
-if it is regular, then we have that it
-
-% Derivatives of a
-%regular expression, written $r \backslash c$, give a simple solution
-%to the problem of matching a string $s$ with a regular
-%expression $r$: if the derivative of $r$ w.r.t.\ (in
-%succession) all the characters of the string matches the empty string,
-%then $r$ matches $s$ (and {\em vice versa}).  
-
-The the derivative of regular expression, denoted as
-$r \backslash c$, is a function that takes parameters
-$r$ and $c$, and returns another regular expression $r'$,
-which is computed by the following recursive function:
-
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
-\noindent
-
-The $\nullable$ function tests whether the empty string $""$ 
-is in the language of $r$:
-
-
-\begin{center}
-		\begin{tabular}{lcl}
-			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
-			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
-			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
-			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
-			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
-			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
-		\end{tabular}
-\end{center}
-\noindent
-The empty set does not contain any string and
-therefore not the empty string, the empty string 
-regular expression contains the empty string
-by definition, the character regular expression
-is the singleton that contains character only,
-and therefore does not contain the empty string,
-the alternative regular expression(or "or" expression)
-might have one of its children regular expressions
-being nullable and any one of its children being nullable
-would suffice. The sequence regular expression
-would require both children to have the empty string
-to compose an empty string and the Kleene star
-operation naturally introduced the empty string.
-
-We can give the meaning of regular expressions derivatives
-by language interpretation:
-
-
- 
+  \section{Motivation}
   
-\begin{center}
-\begin{tabular}{lcl}
-		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
-		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
-		$d \backslash c$     & $\dn$ & 
-		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
-$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
-$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
-	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
-	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
-	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
-\end{tabular}
-\end{center}
-\noindent
-\noindent
-The function derivative, written $\backslash c$, 
-defines how a regular expression evolves into
-a new regular expression after all the string it contains
-is chopped off a certain head character $c$.
-The most involved cases are the sequence 
-and star case.
-The sequence case says that if the first regular expression
-contains an empty string then the second component of the sequence
-might be chosen as the target regular expression to be chopped
-off its head character.
-The star regular expression's derivative unwraps the iteration of
-regular expression and attaches the star regular expression
-to the sequence's second element to make sure a copy is retained
-for possible more iterations in later phases of lexing.
-
-
-The main property of the derivative operation
-that enables us to reason about the correctness of
-an algorithm using derivatives is 
-
-\begin{center}
-$c\!::\!s \in L(r)$ holds
-if and only if $s \in L(r\backslash c)$.
-\end{center}
-
-\noindent
-We can generalise the derivative operation shown above for single characters
-to strings as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
-$r \backslash [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-and then define Brzozowski's  regular-expression matching algorithm as:
-
-\[
-match\;s\;r \;\dn\; nullable(r\backslash s)
-\]
-
-\noindent
-Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
-this algorithm presented graphically is as follows:
-
-\begin{equation}\label{graph:*}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
-\end{tikzcd}
-\end{equation}
-
-\noindent
-where we start with  a regular expression  $r_0$, build successive
-derivatives until we exhaust the string and then use \textit{nullable}
-to test whether the result can match the empty string. It can  be
-relatively  easily shown that this matcher is correct  (that is given
-an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
-
-Beautiful and simple definition.
-
-If we implement the above algorithm naively, however,
-the algorithm can be excruciatingly slow. 
-
-
-\begin{figure}
-\centering
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=10000,
-    ytick={0,1000,...,10000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={JavaScript},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
-           of the form $\underbrace{aa..a}_{n}$.}
-\end{tabular}    
-\caption{EightThousandNodes} \label{fig:EightThousandNodes}
-\end{figure}
-
-
-(8000 node data to be added here)
-For example, when starting with the regular
-expression $(a + aa)^*$ and building a few successive derivatives (around 10)
-w.r.t.~the character $a$, one obtains a derivative regular expression
-with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
-The reason why $(a + aa) ^*$ explodes so drastically is that without
-pruning, the algorithm will keep records of all possible ways of matching:
-\begin{center}
-$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
-\end{center}
-
-\noindent
-Each of the above alternative branches correspond to the match 
-$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
-These different ways of matching will grow exponentially with the string length,
-and without simplifications that throw away some of these very similar matchings,
-it is no surprise that these expressions grow so quickly.
-Operations like
-$\backslash$ and $\nullable$ need to traverse such trees and
-consequently the bigger the size of the derivative the slower the
-algorithm. 
-
-Brzozowski was quick in finding that during this process a lot useless
-$\ONE$s and $\ZERO$s are generated and therefore not optimal.
-He also introduced some "similarity rules" such
-as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
-different but language-equivalent sub-regexes to further decrease the size
-of the intermediate regexes. 
-
-More simplifications are possible, such as deleting duplicates
-and opening up nested alternatives to trigger even more simplifications.
-And suppose we apply simplification after each derivative step, and compose
-these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
-\textit{simp}(a \backslash c)$. Then we can build
-a matcher without having  cumbersome regular expressions.
-
-
-If we want the size of derivatives in the algorithm to
-stay even lower, we would need more aggressive simplifications.
-Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
-deleting duplicates whenever possible. For example, the parentheses in
-$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
-\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
-example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
-$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
-to achieve a very tight size bound, namely,
- the same size bound as that of the \emph{partial derivatives}. 
-
-Building derivatives and then simplify them.
-So far so good. But what if we want to 
-do lexing instead of just a YES/NO answer?
-This requires us to go back again to the world 
-without simplification first for a moment.
-Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
-elegant(arguably as beautiful as the original
-derivatives definition) solution for this.
-
-\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
-
-
-They first defined the datatypes for storing the 
-lexing information called a \emph{value} or
-sometimes also \emph{lexical value}.  These values and regular
-expressions correspond to each other as illustrated in the following
-table:
-
-\begin{center}
-	\begin{tabular}{c@{\hspace{20mm}}c}
-		\begin{tabular}{@{}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
-			$r$ & $::=$  & $\ZERO$\\
-			& $\mid$ & $\ONE$   \\
-			& $\mid$ & $c$          \\
-			& $\mid$ & $r_1 \cdot r_2$\\
-			& $\mid$ & $r_1 + r_2$   \\
-			\\
-			& $\mid$ & $r^*$         \\
-		\end{tabular}
-		&
-		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
-			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
-			$v$ & $::=$  & \\
-			&        & $\Empty$   \\
-			& $\mid$ & $\Char(c)$          \\
-			& $\mid$ & $\Seq\,v_1\, v_2$\\
-			& $\mid$ & $\Left(v)$   \\
-			& $\mid$ & $\Right(v)$  \\
-			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
-		\end{tabular}
-	\end{tabular}
-\end{center}
-
-\noindent
-
-Building on top of Sulzmann and Lu's attempt to formalize the 
-notion of POSIX lexing rules \parencite{Sulzmann2014}, 
-Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
-POSIX matching as a ternary relation recursively defined in a
-natural deduction style.
-With the formally-specified rules for what a POSIX matching is,
-they proved in Isabelle/HOL that the algorithm gives correct results.
-
-But having a correct result is still not enough, 
-we want at least some degree of $\mathbf{efficiency}$.
-
-
-
-One regular expression can have multiple lexical values. For example
-for the regular expression $(a+b)^*$, it has a infinite list of
-values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
-$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
-$\ldots$, and vice versa.
-Even for the regular expression matching a certain string, there could 
-still be more than one value corresponding to it.
-Take the example where $r= (a^*\cdot a^*)^*$ and the string 
-$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
-The number of different ways of matching 
-without allowing any value under a star to be flattened
-to an empty string can be given by the following formula:
-\begin{equation}
-	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
-\end{equation}
-and a closed form formula can be calculated to be
-\begin{equation}
-	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
-\end{equation}
-which is clearly in exponential order.
-
-A lexer aimed at getting all the possible values has an exponential
-worst case runtime. Therefore it is impractical to try to generate
-all possible matches in a run. In practice, we are usually 
-interested about POSIX values, which by intuition always
-\begin{itemize}
-\item
-match the leftmost regular expression when multiple options of matching
-are available  
-\item 
-always match a subpart as much as possible before proceeding
-to the next token.
-\end{itemize}
-
-
- For example, the above example has the POSIX value
-$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
-The output of an algorithm we want would be a POSIX matching
-encoded as a value.
-The reason why we are interested in $\POSIX$ values is that they can
-be practically used in the lexing phase of a compiler front end.
-For instance, when lexing a code snippet 
-$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
-as an identifier rather than a keyword.
-
-The contribution of Sulzmann and Lu is an extension of Brzozowski's
-algorithm by a second phase (the first phase being building successive
-derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
-is generated in case the regular expression matches  the string. 
-Pictorially, the Sulzmann and Lu algorithm is as follows:
-
-\begin{ceqn}
-\begin{equation}\label{graph:2}
-\begin{tikzcd}
-r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
-v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
-\end{tikzcd}
-\end{equation}
-\end{ceqn}
-
-
-\noindent
-For convenience, we shall employ the following notations: the regular
-expression we start with is $r_0$, and the given string $s$ is composed
-of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
-left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
-to the characters $c_0$, $c_1$  until we exhaust the string and obtain
-the derivative $r_n$. We test whether this derivative is
-$\textit{nullable}$ or not. If not, we know the string does not match
-$r$ and no value needs to be generated. If yes, we start building the
-values incrementally by \emph{injecting} back the characters into the
-earlier values $v_n, \ldots, v_0$. This is the second phase of the
-algorithm from the right to left. For the first value $v_n$, we call the
-function $\textit{mkeps}$, which builds a POSIX lexical value
-for how the empty string has been matched by the (nullable) regular
-expression $r_n$. This function is defined as
-
-	\begin{center}
-		\begin{tabular}{lcl}
-			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
-			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
-			& \textit{if} $\nullable(r_{1})$\\ 
-			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
-			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
-			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
-			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
-		\end{tabular}
-	\end{center}
-
-
-\noindent 
-After the $\mkeps$-call, we inject back the characters one by one in order to build
-the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
-($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
-After injecting back $n$ characters, we get the lexical value for how $r_0$
-matches $s$. The POSIX value is maintained throught out the process.
-For this Sulzmann and Lu defined a function that reverses
-the ``chopping off'' of characters during the derivative phase. The
-corresponding function is called \emph{injection}, written
-$\textit{inj}$; it takes three arguments: the first one is a regular
-expression ${r_{i-1}}$, before the character is chopped off, the second
-is a character ${c_{i-1}}$, the character we want to inject and the
-third argument is the value ${v_i}$, into which one wants to inject the
-character (it corresponds to the regular expression after the character
-has been chopped off). The result of this function is a new value. The
-definition of $\textit{inj}$ is as follows: 
-
-\begin{center}
-\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
-  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
-  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
-  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
-  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
-\end{tabular}
-\end{center}
-
-\noindent This definition is by recursion on the ``shape'' of regular
-expressions and values. 
-The clauses basically do one thing--identifying the ``holes'' on 
-value to inject the character back into.
-For instance, in the last clause for injecting back to a value
-that would turn into a new star value that corresponds to a star,
-we know it must be a sequence value. And we know that the first 
-value of that sequence corresponds to the child regex of the star
-with the first character being chopped off--an iteration of the star
-that had just been unfolded. This value is followed by the already
-matched star iterations we collected before. So we inject the character 
-back to the first value and form a new value with this new iteration
-being added to the previous list of iterations, all under the $Stars$
-top level.
-
-We have mentioned before that derivatives without simplification 
-can get clumsy, and this is true for values as well--they reflect
-the regular expressions size by definition.
-
-One can introduce simplification on the regex and values, but have to
-be careful in not breaking the correctness as the injection 
-function heavily relies on the structure of the regexes and values
-being correct and match each other.
-It can be achieved by recording some extra rectification functions
-during the derivatives step, and applying these rectifications in 
-each run during the injection phase.
-And we can prove that the POSIX value of how
-regular expressions match strings will not be affected---although is much harder
-to establish. 
-Some initial results in this regard have been
-obtained in \cite{AusafDyckhoffUrban2016}. 
-
-
-
-%Brzozowski, after giving the derivatives and simplification,
-%did not explore lexing with simplification or he may well be 
-%stuck on an efficient simplificaiton with a proof.
-%He went on to explore the use of derivatives together with 
-%automaton, and did not try lexing using derivatives.
-
-We want to get rid of complex and fragile rectification of values.
-Can we not create those intermediate values $v_1,\ldots v_n$,
-and get the lexing information that should be already there while
-doing derivatives in one pass, without a second phase of injection?
-In the meantime, can we make sure that simplifications
-are easily handled without breaking the correctness of the algorithm?
-
-Sulzmann and Lu solved this problem by
-introducing additional informtaion to the 
-regular expressions called \emph{bitcodes}.
-
-\subsection*{Bit-coded Algorithm}
-Bits and bitcodes (lists of bits) are defined as:
-
-\begin{center}
-		$b ::=   1 \mid  0 \qquad
-bs ::= [] \mid b::bs    
-$
-\end{center}
-
-\noindent
-The $1$ and $0$ are not in bold in order to avoid 
-confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
-bit-lists) can be used to encode values (or potentially incomplete values) in a
-compact form. This can be straightforwardly seen in the following
-coding function from values to bitcodes: 
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
-  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
-  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
-  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
-  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
-  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
-                                                 code(\Stars\,vs)$
-\end{tabular}    
-\end{center} 
-
-\noindent
-Here $\textit{code}$ encodes a value into a bitcodes by converting
-$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
-star iteration by $1$. The border where a local star terminates
-is marked by $0$. This coding is lossy, as it throws away the information about
-characters, and also does not encode the ``boundary'' between two
-sequence values. Moreover, with only the bitcode we cannot even tell
-whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
-reason for choosing this compact way of storing information is that the
-relatively small size of bits can be easily manipulated and ``moved
-around'' in a regular expression. In order to recover values, we will 
-need the corresponding regular expression as an extra information. This
-means the decoding function is defined as:
-
-
-%\begin{definition}[Bitdecoding of Values]\mbox{}
-\begin{center}
-\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
-  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
-  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
-  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
-       (\Left\,v, bs_1)$\\
-  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
-     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
-       (\Right\,v, bs_1)$\\                           
-  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
-        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
-  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
-  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
-  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
-         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
-  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
-  
-  $\textit{decode}\,bs\,r$ & $\dn$ &
-     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
-  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
-       \textit{else}\;\textit{None}$                       
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-Sulzmann and Lu's integrated the bitcodes into regular expressions to
-create annotated regular expressions \cite{Sulzmann2014}.
-\emph{Annotated regular expressions} are defined by the following
-grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{a}$ & $::=$  & $\ZERO$\\
-                  & $\mid$ & $_{bs}\ONE$\\
-                  & $\mid$ & $_{bs}{\bf c}$\\
-                  & $\mid$ & $_{bs}\sum\,as$\\
-                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
-                  & $\mid$ & $_{bs}a^*$
-\end{tabular}    
-\end{center}  
-%(in \textit{ALTS})
-
-\noindent
-where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
-expressions and $as$ for a list of annotated regular expressions.
-The alternative constructor($\sum$) has been generalized to 
-accept a list of annotated regular expressions rather than just 2.
-We will show that these bitcodes encode information about
-the (POSIX) value that should be generated by the Sulzmann and Lu
-algorithm.
-
-
-To do lexing using annotated regular expressions, we shall first
-transform the usual (un-annotated) regular expressions into annotated
-regular expressions. This operation is called \emph{internalisation} and
-defined as follows:
-
-%\begin{definition}
-\begin{center}
-\begin{tabular}{lcl}
-  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
-  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
-  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
-  $(r_1 + r_2)^\uparrow$ & $\dn$ &
-  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
-  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
-  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
-         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
-  $(r^*)^\uparrow$ & $\dn$ &
-         $_{[]}(r^\uparrow)^*$\\
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-We use up arrows here to indicate that the basic un-annotated regular
-expressions are ``lifted up'' into something slightly more complex. In the
-fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
-attach bits to the front of an annotated regular expression. Its
-definition is as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
-  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
-     $_{bs @ bs'}\ONE$\\
-  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
-     $_{bs@bs'}{\bf c}$\\
-  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
-     $_{bs@bs'}\sum\textit{as}$\\
-  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
-     $_{bs@bs'}a_1 \cdot a_2$\\
-  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
-     $_{bs @ bs'}a^*$
-\end{tabular}    
-\end{center}  
-
-\noindent
-After internalising the regular expression, we perform successive
-derivative operations on the annotated regular expressions. This
-derivative operation is the same as what we had previously for the
-basic regular expressions, except that we beed to take care of
-the bitcodes:
-
-
-\iffalse
- %\begin{definition}{bder}
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
-  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
-  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
-  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
-  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
-      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
-       (\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
-  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
-  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
-  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
-					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
-  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
-  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
-      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
-       (_{bs}\textit{STAR}\,[]\,r)$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-\fi
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
-  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
-        $\textit{if}\;c=d\; \;\textit{then}\;
-         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
-  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
-  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
-  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a_1$\\
-					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
-					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
-  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
-  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
-      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
-       (_{[]}r^*))$
-\end{tabular}    
-\end{center}    
-
-%\end{definition}
-\noindent
-For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
-we need to unfold it into a sequence,
-and attach an additional bit $0$ to the front of $r \backslash c$
-to indicate one more star iteration. Also the sequence clause
-is more subtle---when $a_1$ is $\textit{bnullable}$ (here
-\textit{bnullable} is exactly the same as $\textit{nullable}$, except
-that it is for annotated regular expressions, therefore we omit the
-definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
-$a_1$ matches the string prior to character $c$ (more on this later),
-then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
-\backslash c)$ will collapse the regular expression $a_1$(as it has
-already been fully matched) and store the parsing information at the
-head of the regular expression $a_2 \backslash c$ by fusing to it. The
-bitsequence $\textit{bs}$, which was initially attached to the
-first element of the sequence $a_1 \cdot a_2$, has
-now been elevated to the top-level of $\sum$, as this information will be
-needed whichever way the sequence is matched---no matter whether $c$ belongs
-to $a_1$ or $ a_2$. After building these derivatives and maintaining all
-the lexing information, we complete the lexing by collecting the
-bitcodes using a generalised version of the $\textit{mkeps}$ function
-for annotated regular expressions, called $\textit{bmkeps}$:
-
-
-%\begin{definition}[\textit{bmkeps}]\mbox{}
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
-  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
-     $\textit{if}\;\textit{bnullable}\,a$\\
-  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
-  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
-  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
-     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
-  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
-     $bs \,@\, [0]$
-\end{tabular}    
-\end{center}    
-%\end{definition}
-
-\noindent
-This function completes the value information by travelling along the
-path of the regular expression that corresponds to a POSIX value and
-collecting all the bitcodes, and using $S$ to indicate the end of star
-iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
-decode them, we get the value we expect. The corresponding lexing
-algorithm looks as follows:
-
-\begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-In this definition $\_\backslash s$ is the  generalisation  of the derivative
-operation from characters to strings (just like the derivatives for un-annotated
-regular expressions).
-
-Now we introduce the simplifications, which is why we introduce the 
-bitcodes in the first place.
-
-\subsection*{Simplification Rules}
-
-This section introduces aggressive (in terms of size) simplification rules
-on annotated regular expressions
-to keep derivatives small. Such simplifications are promising
-as we have
-generated test data that show
-that a good tight bound can be achieved. We could only
-partially cover the search space as there are infinitely many regular
-expressions and strings. 
-
-One modification we introduced is to allow a list of annotated regular
-expressions in the $\sum$ constructor. This allows us to not just
-delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
-also unnecessary ``copies'' of regular expressions (very similar to
-simplifying $r + r$ to just $r$, but in a more general setting). Another
-modification is that we use simplification rules inspired by Antimirov's
-work on partial derivatives. They maintain the idea that only the first
-``copy'' of a regular expression in an alternative contributes to the
-calculation of a POSIX value. All subsequent copies can be pruned away from
-the regular expression. A recursive definition of our  simplification function 
-that looks somewhat similar to our Scala code is given below:
-%\comment{Use $\ZERO$, $\ONE$ and so on. 
-%Is it $ALTS$ or $ALTS$?}\\
-
-\begin{center}
-  \begin{tabular}{@{}lcl@{}}
-   
-  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
-   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
-   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
-   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
-
-  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
-  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
-   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
-   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
-
-   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
-\end{tabular}    
-\end{center}    
-
-\noindent
-The simplification does a pattern matching on the regular expression.
-When it detected that the regular expression is an alternative or
-sequence, it will try to simplify its child regular expressions
-recursively and then see if one of the children turns into $\ZERO$ or
-$\ONE$, which might trigger further simplification at the current level.
-The most involved part is the $\sum$ clause, where we use two
-auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
-alternatives and reduce as many duplicates as possible. Function
-$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
-when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
-Its recursive definition is given below:
-
- \begin{center}
-  \begin{tabular}{@{}lcl@{}}
-  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
-     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
-  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
-    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
-\end{tabular}    
-\end{center}  
-
-\noindent
-Here $\textit{flatten}$ behaves like the traditional functional programming flatten
-function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
-removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
-
-Having defined the $\simp$ function,
-we can use the previous notation of  natural
-extension from derivative w.r.t.~character to derivative
-w.r.t.~string:%\comment{simp in  the [] case?}
-
-\begin{center}
-\begin{tabular}{lcl}
-$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
-$r \backslash_{simp} [\,] $ & $\dn$ & $r$
-\end{tabular}
-\end{center}
-
-\noindent
-to obtain an optimised version of the algorithm:
-
- \begin{center}
-\begin{tabular}{lcl}
-  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
-      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
-  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
-  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
-  & & $\;\;\textit{else}\;\textit{None}$
-\end{tabular}
-\end{center}
-
-\noindent
-This algorithm keeps the regular expression size small, for example,
-with this simplification our previous $(a + aa)^*$ example's 8000 nodes
-will be reduced to just 6 and stays constant, no matter how long the
-input string is.
-
-
- 
 Derivatives give a simple solution
 to the problem of matching a string $s$ with a regular
 expression $r$: if the derivative of $r$ w.r.t.\ (in
@@ -1566,8 +627,9 @@
 \emph{incremental parsing method} (that is the algorithm to be formalised
 in this paper):
 
-%motivation part
-\begin{quote}\it
+
+  
+  \begin{quote}\it
   ``Correctness Claim: We further claim that the incremental parsing
   method [..] in combination with the simplification steps [..]
   yields POSIX parse trees. We have tested this claim
@@ -1575,7 +637,97 @@
   have to work out all proof details.'' \cite[Page 14]{Sulzmann2014}
 \end{quote}  
 
+Ausaf and Urban were able to back this correctness claim with
+a formal proof.
 
+But as they stated,
+  \begin{quote}\it
+The next step would be to implement a more aggressive simplification procedure on annotated regular expressions and then prove the corresponding algorithm generates the same values as blexer. Alas due to time constraints we are unable to do so here.
+\end{quote}  
+
+This thesis implements the aggressive simplifications envisioned
+by Ausaf and Urban,
+and gives a formal proof of the correctness with those simplifications.
+
+
+ 
+
+
+
+%----------------------------------------------------------------------------------------
+
+\section{Contribution}
+
+
+
+This work addresses the vulnerability of super-linear and
+buggy regex implementations by the combination
+of Brzozowski's derivatives and interactive theorem proving. 
+We give an 
+improved version of  Sulzmann and Lu's bit-coded algorithm using 
+derivatives, which come with a formal guarantee in terms of correctness and 
+running time as an Isabelle/HOL proof.
+Then we improve the algorithm with an even stronger version of 
+simplification, and prove a time bound linear to input and
+cubic to regular expression size using a technique by
+Antimirov.
+
+ 
+The main contribution of this thesis is a proven correct lexing algorithm
+with formalized time bounds.
+To our best knowledge, no lexing libraries using Brzozowski derivatives
+have a provable time guarantee, 
+and claims about running time are usually speculative and backed by thin empirical
+evidence.
+%TODO: give references
+For example, Sulzmann and Lu had proposed an algorithm  in which they
+claim a linear running time.
+But that was falsified by our experiments and the running time 
+is actually $\Omega(2^n)$ in the worst case.
+A similar claim about a theoretical runtime of $O(n^2)$ is made for the Verbatim
+%TODO: give references
+lexer, which calculates POSIX matches and is based on derivatives.
+They formalized the correctness of the lexer, but not the complexity.
+In the performance evaluation section, they simply analyzed the run time
+of matching $a$ with the string $\underbrace{a \ldots a}_{\text{n a's}}$
+and concluded that the algorithm is quadratic in terms of input length.
+When we tried out their extracted OCaml code with our example $(a+aa)^*$,
+the time it took to lex only 40 $a$'s was 5 minutes.
+
+We  believe our results of a proof of performance on general
+inputs rather than specific examples a novel contribution.\\
+
+
+\subsection{Related Work}
+We are aware
+of a mechanised correctness proof of Brzozowski's derivative-based matcher in HOL4 by
+Owens and Slind~\parencite{Owens2008}. Another one in Isabelle/HOL is part
+of the work by Krauss and Nipkow \parencite{Krauss2011}.  And another one
+in Coq is given by Coquand and Siles \parencite{Coquand2012}.
+Also Ribeiro and Du Bois give one in Agda \parencite{RibeiroAgda2017}.
+ 
+ %We propose Brzozowski's derivatives as a solution to this problem.
+% about Lexing Using Brzozowski derivatives
+
+
+\section{Structure of the thesis}
+In chapter 2 \ref{Chapter2} we will introduce the concepts
+and notations we 
+use for describing the lexing algorithm by Sulzmann and Lu,
+and then give the algorithm and its variant, and discuss
+why more aggressive simplifications are needed. 
+Then we illustrate in Chapter 3\ref{Chapter3}
+how the algorithm without bitcodes falls short for such aggressive 
+simplifications and therefore introduce our version of the
+ bitcoded algorithm and 
+its correctness proof .  
+In Chapter 4 \ref{Chapter4} we give the second guarantee
+of our bitcoded algorithm, that is a finite bound on the size of any 
+regex's derivatives.
+In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
+in Chapter 4 to a polynomial one, and demonstrate how one can extend the
+algorithm to include constructs such as bounded repetitions and negations.
+ 
 
 
 
--- a/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Thu May 26 20:51:40 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter2.tex	Fri May 27 18:27:39 2022 +0100
@@ -1,8 +1,896 @@
 % Chapter Template
 
-\chapter{Chapter Title Here} % Main chapter title
+\chapter{Regular Expressions and Sulzmanna and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title
+
+\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
+%and notations we 
+%use for describing the lexing algorithm by Sulzmann and Lu,
+%and then give the algorithm and its variant, and discuss
+%why more aggressive simplifications are needed. 
+
+
+\section{Preliminaries}
+
+Suppose we have an alphabet $\Sigma$, the strings  whose characters
+are from $\Sigma$
+can be expressed as $\Sigma^*$.
+
+We use patterns to define a set of strings concisely. Regular expressions
+are one of such patterns systems:
+The basic regular expressions  are defined inductively
+ by the following grammar:
+\[			r ::=   \ZERO \mid  \ONE
+			 \mid  c  
+			 \mid  r_1 \cdot r_2
+			 \mid  r_1 + r_2   
+			 \mid r^*         
+\]
+
+The language or set of strings defined by regular expressions are defined as
+%TODO: FILL in the other defs
+\begin{center}
+\begin{tabular}{lcl}
+$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
+$L \; (r_1 \cdot r_2)$ & $\dn$ & $ L \; (r_1) \cap L \; (r_2)$\\
+\end{tabular}
+\end{center}
+Which are also called the "language interpretation".
+
+
+
+The Brzozowski derivative w.r.t character $c$ is an operation on the regex,
+where the operation transforms the regex to a new one containing
+strings without the head character $c$.
+
+Formally, we define first such a transformation on any string set, which
+we call semantic derivative:
+\begin{center}
+$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
+\end{center}
+Mathematically, it can be expressed as the 
+
+If the $\textit{StringSet}$ happen to have some structure, for example,
+if it is regular, then we have that it
+
+% Derivatives of a
+%regular expression, written $r \backslash c$, give a simple solution
+%to the problem of matching a string $s$ with a regular
+%expression $r$: if the derivative of $r$ w.r.t.\ (in
+%succession) all the characters of the string matches the empty string,
+%then $r$ matches $s$ (and {\em vice versa}).  
+
+The the derivative of regular expression, denoted as
+$r \backslash c$, is a function that takes parameters
+$r$ and $c$, and returns another regular expression $r'$,
+which is computed by the following recursive function:
+
+\begin{center}
+\begin{tabular}{lcl}
+		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
+		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+		$d \backslash c$     & $\dn$ & 
+		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
+	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+\end{tabular}
+\end{center}
+\noindent
+\noindent
+
+The $\nullable$ function tests whether the empty string $""$ 
+is in the language of $r$:
+
+
+\begin{center}
+		\begin{tabular}{lcl}
+			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
+			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
+			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
+			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
+			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
+			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
+		\end{tabular}
+\end{center}
+\noindent
+The empty set does not contain any string and
+therefore not the empty string, the empty string 
+regular expression contains the empty string
+by definition, the character regular expression
+is the singleton that contains character only,
+and therefore does not contain the empty string,
+the alternative regular expression(or "or" expression)
+might have one of its children regular expressions
+being nullable and any one of its children being nullable
+would suffice. The sequence regular expression
+would require both children to have the empty string
+to compose an empty string and the Kleene star
+operation naturally introduced the empty string.
+
+We can give the meaning of regular expressions derivatives
+by language interpretation:
+
+
+ 
+  
+\begin{center}
+\begin{tabular}{lcl}
+		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
+		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
+		$d \backslash c$     & $\dn$ & 
+		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
+$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
+$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
+	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
+	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
+	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
+\end{tabular}
+\end{center}
+\noindent
+\noindent
+The function derivative, written $\backslash c$, 
+defines how a regular expression evolves into
+a new regular expression after all the string it contains
+is chopped off a certain head character $c$.
+The most involved cases are the sequence 
+and star case.
+The sequence case says that if the first regular expression
+contains an empty string then the second component of the sequence
+might be chosen as the target regular expression to be chopped
+off its head character.
+The star regular expression's derivative unwraps the iteration of
+regular expression and attaches the star regular expression
+to the sequence's second element to make sure a copy is retained
+for possible more iterations in later phases of lexing.
+
+
+The main property of the derivative operation
+that enables us to reason about the correctness of
+an algorithm using derivatives is 
+
+\begin{center}
+$c\!::\!s \in L(r)$ holds
+if and only if $s \in L(r\backslash c)$.
+\end{center}
+
+\noindent
+We can generalise the derivative operation shown above for single characters
+to strings as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
+$r \backslash [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+and then define Brzozowski's  regular-expression matching algorithm as:
+
+\[
+match\;s\;r \;\dn\; nullable(r\backslash s)
+\]
+
+\noindent
+Assuming the a string is given as a sequence of characters, say $c_0c_1..c_n$, 
+this algorithm presented graphically is as follows:
+
+\begin{equation}\label{graph:*}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
+\end{tikzcd}
+\end{equation}
+
+\noindent
+where we start with  a regular expression  $r_0$, build successive
+derivatives until we exhaust the string and then use \textit{nullable}
+to test whether the result can match the empty string. It can  be
+relatively  easily shown that this matcher is correct  (that is given
+an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$).
+
+Beautiful and simple definition.
+
+If we implement the above algorithm naively, however,
+the algorithm can be excruciatingly slow. 
+
+
+\begin{figure}
+\centering
+\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
+\begin{tikzpicture}
+\begin{axis}[
+    xlabel={$n$},
+    x label style={at={(1.05,-0.05)}},
+    ylabel={time in secs},
+    enlargelimits=false,
+    xtick={0,5,...,30},
+    xmax=33,
+    ymax=10000,
+    ytick={0,1000,...,10000},
+    scaled ticks=false,
+    axis lines=left,
+    width=5cm,
+    height=4cm, 
+    legend entries={JavaScript},  
+    legend pos=north west,
+    legend cell align=left]
+\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
+\end{axis}
+\end{tikzpicture}\\
+\multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings 
+           of the form $\underbrace{aa..a}_{n}$.}
+\end{tabular}    
+\caption{EightThousandNodes} \label{fig:EightThousandNodes}
+\end{figure}
+
+
+(8000 node data to be added here)
+For example, when starting with the regular
+expression $(a + aa)^*$ and building a few successive derivatives (around 10)
+w.r.t.~the character $a$, one obtains a derivative regular expression
+with more than 8000 nodes (when viewed as a tree)\ref{EightThousandNodes}.
+The reason why $(a + aa) ^*$ explodes so drastically is that without
+pruning, the algorithm will keep records of all possible ways of matching:
+\begin{center}
+$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
+\end{center}
+
+\noindent
+Each of the above alternative branches correspond to the match 
+$aa $, $a \quad a$ and $a \quad a \cdot (a)$(incomplete).
+These different ways of matching will grow exponentially with the string length,
+and without simplifications that throw away some of these very similar matchings,
+it is no surprise that these expressions grow so quickly.
+Operations like
+$\backslash$ and $\nullable$ need to traverse such trees and
+consequently the bigger the size of the derivative the slower the
+algorithm. 
+
+Brzozowski was quick in finding that during this process a lot useless
+$\ONE$s and $\ZERO$s are generated and therefore not optimal.
+He also introduced some "similarity rules" such
+as $P+(Q+R) = (P+Q)+R$ to merge syntactically 
+different but language-equivalent sub-regexes to further decrease the size
+of the intermediate regexes. 
+
+More simplifications are possible, such as deleting duplicates
+and opening up nested alternatives to trigger even more simplifications.
+And suppose we apply simplification after each derivative step, and compose
+these two operations together as an atomic one: $a \backslash_{simp}\,c \dn
+\textit{simp}(a \backslash c)$. Then we can build
+a matcher without having  cumbersome regular expressions.
+
+
+If we want the size of derivatives in the algorithm to
+stay even lower, we would need more aggressive simplifications.
+Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
+deleting duplicates whenever possible. For example, the parentheses in
+$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
+\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
+example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
+$a^*+a+\ONE$. Adding these more aggressive simplification rules help us
+to achieve a very tight size bound, namely,
+ the same size bound as that of the \emph{partial derivatives}. 
+
+Building derivatives and then simplify them.
+So far so good. But what if we want to 
+do lexing instead of just a YES/NO answer?
+This requires us to go back again to the world 
+without simplification first for a moment.
+Sulzmann and Lu~\cite{Sulzmann2014} first came up with a nice and 
+elegant(arguably as beautiful as the original
+derivatives definition) solution for this.
+
+\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}
+
+
+They first defined the datatypes for storing the 
+lexing information called a \emph{value} or
+sometimes also \emph{lexical value}.  These values and regular
+expressions correspond to each other as illustrated in the following
+table:
+
+\begin{center}
+	\begin{tabular}{c@{\hspace{20mm}}c}
+		\begin{tabular}{@{}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
+			$r$ & $::=$  & $\ZERO$\\
+			& $\mid$ & $\ONE$   \\
+			& $\mid$ & $c$          \\
+			& $\mid$ & $r_1 \cdot r_2$\\
+			& $\mid$ & $r_1 + r_2$   \\
+			\\
+			& $\mid$ & $r^*$         \\
+		\end{tabular}
+		&
+		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
+			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
+			$v$ & $::=$  & \\
+			&        & $\Empty$   \\
+			& $\mid$ & $\Char(c)$          \\
+			& $\mid$ & $\Seq\,v_1\, v_2$\\
+			& $\mid$ & $\Left(v)$   \\
+			& $\mid$ & $\Right(v)$  \\
+			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
+		\end{tabular}
+	\end{tabular}
+\end{center}
+
+\noindent
+
+Building on top of Sulzmann and Lu's attempt to formalize the 
+notion of POSIX lexing rules \parencite{Sulzmann2014}, 
+Ausaf and Urban\parencite{AusafDyckhoffUrban2016} modelled
+POSIX matching as a ternary relation recursively defined in a
+natural deduction style.
+With the formally-specified rules for what a POSIX matching is,
+they proved in Isabelle/HOL that the algorithm gives correct results.
+
+But having a correct result is still not enough, 
+we want at least some degree of $\mathbf{efficiency}$.
+
+
+
+One regular expression can have multiple lexical values. For example
+for the regular expression $(a+b)^*$, it has a infinite list of
+values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(Char(a))]$,
+$\Stars\,[\Right(Char(b))]$, $\Stars\,[\Left(Char(a),\,\Right(Char(b))]$,
+$\ldots$, and vice versa.
+Even for the regular expression matching a certain string, there could 
+still be more than one value corresponding to it.
+Take the example where $r= (a^*\cdot a^*)^*$ and the string 
+$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
+The number of different ways of matching 
+without allowing any value under a star to be flattened
+to an empty string can be given by the following formula:
+\begin{equation}
+	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
+\end{equation}
+and a closed form formula can be calculated to be
+\begin{equation}
+	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
+\end{equation}
+which is clearly in exponential order.
+
+A lexer aimed at getting all the possible values has an exponential
+worst case runtime. Therefore it is impractical to try to generate
+all possible matches in a run. In practice, we are usually 
+interested about POSIX values, which by intuition always
+\begin{itemize}
+\item
+match the leftmost regular expression when multiple options of matching
+are available  
+\item 
+always match a subpart as much as possible before proceeding
+to the next token.
+\end{itemize}
+
+
+ For example, the above example has the POSIX value
+$ \Stars\,[\Seq(Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], Stars\,[])]$.
+The output of an algorithm we want would be a POSIX matching
+encoded as a value.
+The reason why we are interested in $\POSIX$ values is that they can
+be practically used in the lexing phase of a compiler front end.
+For instance, when lexing a code snippet 
+$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized
+as an identifier rather than a keyword.
+
+The contribution of Sulzmann and Lu is an extension of Brzozowski's
+algorithm by a second phase (the first phase being building successive
+derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
+is generated in case the regular expression matches  the string. 
+Pictorially, the Sulzmann and Lu algorithm is as follows:
+
+\begin{ceqn}
+\begin{equation}\label{graph:2}
+\begin{tikzcd}
+r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
+v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
+\end{tikzcd}
+\end{equation}
+\end{ceqn}
+
+
+\noindent
+For convenience, we shall employ the following notations: the regular
+expression we start with is $r_0$, and the given string $s$ is composed
+of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
+left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
+to the characters $c_0$, $c_1$  until we exhaust the string and obtain
+the derivative $r_n$. We test whether this derivative is
+$\textit{nullable}$ or not. If not, we know the string does not match
+$r$ and no value needs to be generated. If yes, we start building the
+values incrementally by \emph{injecting} back the characters into the
+earlier values $v_n, \ldots, v_0$. This is the second phase of the
+algorithm from the right to left. For the first value $v_n$, we call the
+function $\textit{mkeps}$, which builds a POSIX lexical value
+for how the empty string has been matched by the (nullable) regular
+expression $r_n$. This function is defined as
+
+	\begin{center}
+		\begin{tabular}{lcl}
+			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
+			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
+			& \textit{if} $\nullable(r_{1})$\\ 
+			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
+			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
+			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
+			$mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
+		\end{tabular}
+	\end{center}
+
+
+\noindent 
+After the $\mkeps$-call, we inject back the characters one by one in order to build
+the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
+($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$.
+After injecting back $n$ characters, we get the lexical value for how $r_0$
+matches $s$. The POSIX value is maintained throught out the process.
+For this Sulzmann and Lu defined a function that reverses
+the ``chopping off'' of characters during the derivative phase. The
+corresponding function is called \emph{injection}, written
+$\textit{inj}$; it takes three arguments: the first one is a regular
+expression ${r_{i-1}}$, before the character is chopped off, the second
+is a character ${c_{i-1}}$, the character we want to inject and the
+third argument is the value ${v_i}$, into which one wants to inject the
+character (it corresponds to the regular expression after the character
+has been chopped off). The result of this function is a new value. The
+definition of $\textit{inj}$ is as follows: 
 
-\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+\begin{center}
+\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
+  $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
+  $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
+  $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
+  $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
+\end{tabular}
+\end{center}
+
+\noindent This definition is by recursion on the ``shape'' of regular
+expressions and values. 
+The clauses basically do one thing--identifying the ``holes'' on 
+value to inject the character back into.
+For instance, in the last clause for injecting back to a value
+that would turn into a new star value that corresponds to a star,
+we know it must be a sequence value. And we know that the first 
+value of that sequence corresponds to the child regex of the star
+with the first character being chopped off--an iteration of the star
+that had just been unfolded. This value is followed by the already
+matched star iterations we collected before. So we inject the character 
+back to the first value and form a new value with this new iteration
+being added to the previous list of iterations, all under the $Stars$
+top level.
+
+We have mentioned before that derivatives without simplification 
+can get clumsy, and this is true for values as well--they reflect
+the regular expressions size by definition.
+
+One can introduce simplification on the regex and values, but have to
+be careful in not breaking the correctness as the injection 
+function heavily relies on the structure of the regexes and values
+being correct and match each other.
+It can be achieved by recording some extra rectification functions
+during the derivatives step, and applying these rectifications in 
+each run during the injection phase.
+And we can prove that the POSIX value of how
+regular expressions match strings will not be affected---although is much harder
+to establish. 
+Some initial results in this regard have been
+obtained in \cite{AusafDyckhoffUrban2016}. 
+
+
+
+%Brzozowski, after giving the derivatives and simplification,
+%did not explore lexing with simplification or he may well be 
+%stuck on an efficient simplificaiton with a proof.
+%He went on to explore the use of derivatives together with 
+%automaton, and did not try lexing using derivatives.
+
+We want to get rid of complex and fragile rectification of values.
+Can we not create those intermediate values $v_1,\ldots v_n$,
+and get the lexing information that should be already there while
+doing derivatives in one pass, without a second phase of injection?
+In the meantime, can we make sure that simplifications
+are easily handled without breaking the correctness of the algorithm?
+
+Sulzmann and Lu solved this problem by
+introducing additional informtaion to the 
+regular expressions called \emph{bitcodes}.
+
+\subsection*{Bit-coded Algorithm}
+Bits and bitcodes (lists of bits) are defined as:
+
+\begin{center}
+		$b ::=   1 \mid  0 \qquad
+bs ::= [] \mid b::bs    
+$
+\end{center}
+
+\noindent
+The $1$ and $0$ are not in bold in order to avoid 
+confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
+bit-lists) can be used to encode values (or potentially incomplete values) in a
+compact form. This can be straightforwardly seen in the following
+coding function from values to bitcodes: 
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
+  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
+  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
+  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
+  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
+  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
+  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
+                                                 code(\Stars\,vs)$
+\end{tabular}    
+\end{center} 
+
+\noindent
+Here $\textit{code}$ encodes a value into a bitcodes by converting
+$\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty
+star iteration by $1$. The border where a local star terminates
+is marked by $0$. This coding is lossy, as it throws away the information about
+characters, and also does not encode the ``boundary'' between two
+sequence values. Moreover, with only the bitcode we cannot even tell
+whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The
+reason for choosing this compact way of storing information is that the
+relatively small size of bits can be easily manipulated and ``moved
+around'' in a regular expression. In order to recover values, we will 
+need the corresponding regular expression as an extra information. This
+means the decoding function is defined as:
+
+
+%\begin{definition}[Bitdecoding of Values]\mbox{}
+\begin{center}
+\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
+  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
+  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
+  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
+       (\Left\,v, bs_1)$\\
+  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
+     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
+       (\Right\,v, bs_1)$\\                           
+  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
+        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
+  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
+  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
+  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
+  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
+         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
+  & &   \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\
+  
+  $\textit{decode}\,bs\,r$ & $\dn$ &
+     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
+  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
+       \textit{else}\;\textit{None}$                       
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+Sulzmann and Lu's integrated the bitcodes into regular expressions to
+create annotated regular expressions \cite{Sulzmann2014}.
+\emph{Annotated regular expressions} are defined by the following
+grammar:%\comment{ALTS should have  an $as$ in  the definitions, not  just $a_1$ and $a_2$}
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{a}$ & $::=$  & $\ZERO$\\
+                  & $\mid$ & $_{bs}\ONE$\\
+                  & $\mid$ & $_{bs}{\bf c}$\\
+                  & $\mid$ & $_{bs}\sum\,as$\\
+                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
+                  & $\mid$ & $_{bs}a^*$
+\end{tabular}    
+\end{center}  
+%(in \textit{ALTS})
+
+\noindent
+where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
+expressions and $as$ for a list of annotated regular expressions.
+The alternative constructor($\sum$) has been generalized to 
+accept a list of annotated regular expressions rather than just 2.
+We will show that these bitcodes encode information about
+the (POSIX) value that should be generated by the Sulzmann and Lu
+algorithm.
+
+
+To do lexing using annotated regular expressions, we shall first
+transform the usual (un-annotated) regular expressions into annotated
+regular expressions. This operation is called \emph{internalisation} and
+defined as follows:
+
+%\begin{definition}
+\begin{center}
+\begin{tabular}{lcl}
+  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
+  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
+  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
+  $(r_1 + r_2)^\uparrow$ & $\dn$ &
+  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
+  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
+  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
+         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
+  $(r^*)^\uparrow$ & $\dn$ &
+         $_{[]}(r^\uparrow)^*$\\
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\noindent
+We use up arrows here to indicate that the basic un-annotated regular
+expressions are ``lifted up'' into something slightly more complex. In the
+fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
+attach bits to the front of an annotated regular expression. Its
+definition is as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
+  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
+     $_{bs @ bs'}\ONE$\\
+  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
+     $_{bs@bs'}{\bf c}$\\
+  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
+     $_{bs@bs'}\sum\textit{as}$\\
+  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
+     $_{bs@bs'}a_1 \cdot a_2$\\
+  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
+     $_{bs @ bs'}a^*$
+\end{tabular}    
+\end{center}  
+
+\noindent
+After internalising the regular expression, we perform successive
+derivative operations on the annotated regular expressions. This
+derivative operation is the same as what we had previously for the
+basic regular expressions, except that we beed to take care of
+the bitcodes:
+
+
+\iffalse
+ %\begin{definition}{bder}
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\  
+  $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ &
+  $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\
+  $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
+  & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\
+  $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ &
+      $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\,
+       (\textit{STAR}\,[]\,r)$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\  
+  $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\  
+  $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\
+  $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\
+					       & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\
+  & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\
+  $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ &
+      $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\,
+       (_{bs}\textit{STAR}\,[]\,r)$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+\fi
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
+  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
+        $\textit{if}\;c=d\; \;\textit{then}\;
+         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
+  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
+  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
+  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a_1$\\
+					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
+					       & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
+  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
+  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
+      $_{bs}(\textit{fuse}\, [0] \; r\,\backslash c)\cdot
+       (_{[]}r^*))$
+\end{tabular}    
+\end{center}    
+
+%\end{definition}
+\noindent
+For instance, when we do derivative of  $_{bs}a^*$ with respect to c,
+we need to unfold it into a sequence,
+and attach an additional bit $0$ to the front of $r \backslash c$
+to indicate one more star iteration. Also the sequence clause
+is more subtle---when $a_1$ is $\textit{bnullable}$ (here
+\textit{bnullable} is exactly the same as $\textit{nullable}$, except
+that it is for annotated regular expressions, therefore we omit the
+definition). Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
+$a_1$ matches the string prior to character $c$ (more on this later),
+then the right branch of alternative, which is $\textit{fuse} \; \bmkeps \;  a_1 (a_2
+\backslash c)$ will collapse the regular expression $a_1$(as it has
+already been fully matched) and store the parsing information at the
+head of the regular expression $a_2 \backslash c$ by fusing to it. The
+bitsequence $\textit{bs}$, which was initially attached to the
+first element of the sequence $a_1 \cdot a_2$, has
+now been elevated to the top-level of $\sum$, as this information will be
+needed whichever way the sequence is matched---no matter whether $c$ belongs
+to $a_1$ or $ a_2$. After building these derivatives and maintaining all
+the lexing information, we complete the lexing by collecting the
+bitcodes using a generalised version of the $\textit{mkeps}$ function
+for annotated regular expressions, called $\textit{bmkeps}$:
+
+
+%\begin{definition}[\textit{bmkeps}]\mbox{}
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
+  $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ &
+     $\textit{if}\;\textit{bnullable}\,a$\\
+  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
+  & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
+  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
+     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
+  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
+     $bs \,@\, [0]$
+\end{tabular}    
+\end{center}    
+%\end{definition}
+
+\noindent
+This function completes the value information by travelling along the
+path of the regular expression that corresponds to a POSIX value and
+collecting all the bitcodes, and using $S$ to indicate the end of star
+iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
+decode them, we get the value we expect. The corresponding lexing
+algorithm looks as follows:
+
+\begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer}\;r\,s$ & $\dn$ &
+      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
+  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+In this definition $\_\backslash s$ is the  generalisation  of the derivative
+operation from characters to strings (just like the derivatives for un-annotated
+regular expressions).
+
+Now we introduce the simplifications, which is why we introduce the 
+bitcodes in the first place.
+
+\subsection*{Simplification Rules}
+
+This section introduces aggressive (in terms of size) simplification rules
+on annotated regular expressions
+to keep derivatives small. Such simplifications are promising
+as we have
+generated test data that show
+that a good tight bound can be achieved. We could only
+partially cover the search space as there are infinitely many regular
+expressions and strings. 
+
+One modification we introduced is to allow a list of annotated regular
+expressions in the $\sum$ constructor. This allows us to not just
+delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
+also unnecessary ``copies'' of regular expressions (very similar to
+simplifying $r + r$ to just $r$, but in a more general setting). Another
+modification is that we use simplification rules inspired by Antimirov's
+work on partial derivatives. They maintain the idea that only the first
+``copy'' of a regular expression in an alternative contributes to the
+calculation of a POSIX value. All subsequent copies can be pruned away from
+the regular expression. A recursive definition of our  simplification function 
+that looks somewhat similar to our Scala code is given below:
+%\comment{Use $\ZERO$, $\ONE$ and so on. 
+%Is it $ALTS$ or $ALTS$?}\\
+
+\begin{center}
+  \begin{tabular}{@{}lcl@{}}
+   
+  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
+   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
+   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
+   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
+
+  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
+  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
+   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse bs a}$ \\
+   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 
+
+   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
+\end{tabular}    
+\end{center}    
+
+\noindent
+The simplification does a pattern matching on the regular expression.
+When it detected that the regular expression is an alternative or
+sequence, it will try to simplify its child regular expressions
+recursively and then see if one of the children turns into $\ZERO$ or
+$\ONE$, which might trigger further simplification at the current level.
+The most involved part is the $\sum$ clause, where we use two
+auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
+alternatives and reduce as many duplicates as possible. Function
+$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
+when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
+Its recursive definition is given below:
+
+ \begin{center}
+  \begin{tabular}{@{}lcl@{}}
+  $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \;
+     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\
+  $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
+    $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
+\end{tabular}    
+\end{center}  
+
+\noindent
+Here $\textit{flatten}$ behaves like the traditional functional programming flatten
+function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
+removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
+
+Having defined the $\simp$ function,
+we can use the previous notation of  natural
+extension from derivative w.r.t.~character to derivative
+w.r.t.~string:%\comment{simp in  the [] case?}
+
+\begin{center}
+\begin{tabular}{lcl}
+$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
+$r \backslash_{simp} [\,] $ & $\dn$ & $r$
+\end{tabular}
+\end{center}
+
+\noindent
+to obtain an optimised version of the algorithm:
+
+ \begin{center}
+\begin{tabular}{lcl}
+  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
+      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
+  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
+  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
+  & & $\;\;\textit{else}\;\textit{None}$
+\end{tabular}
+\end{center}
+
+\noindent
+This algorithm keeps the regular expression size small, for example,
+with this simplification our previous $(a + aa)^*$ example's 8000 nodes
+will be reduced to just 6 and stays constant, no matter how long the
+input string is.
+
+
 
 
 
@@ -17,749 +905,91 @@
 
 
 
-%-----------------------------------
-%	SECTION ?
-%-----------------------------------
-\section{preparatory properties for getting a finiteness bound}
-Before we get to the proof that says the intermediate result of our lexer will
-remain finitely bounded, which is an important efficiency/liveness guarantee,
-we shall first develop a few preparatory properties and definitions to 
-make the process of proving that a breeze.
-
-We define rewriting relations for $\rrexp$s, which allows us to do the 
-same trick as we did for the correctness proof,
-but this time we will have stronger equalities established.
-\subsection{"hrewrite" relation}
-List of 1-step rewrite rules for regular expressions simplification without bitcodes:
-\begin{center}
-\begin{tabular}{lcl}
-$r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$
-\end{tabular}
-\end{center}
+%----------------------------------------------------------------------------------------
+%	SECTION  correctness proof
+%----------------------------------------------------------------------------------------
+\section{Correctness of Bit-coded Algorithm (Without Simplification)
+We now give the proof the correctness of the algorithm with bit-codes.
 
-And we define an "grewrite" relation that works on lists:
-\begin{center}
-\begin{tabular}{lcl}
-$ \ZERO :: rs$ & $\rightarrow_g$ & $rs$
-\end{tabular}
-\end{center}
-
-
-
-With these relations we prove
+Ausaf and Urban cleverly defined an auxiliary function called $\flex$,
+defined as
+\[
+\flex \; r \; f \; [] \; v \; = \; f\; v
+\flex \; r \; f \; c :: s \; v =  \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v
+\]
+which accumulates the characters that needs to be injected back, 
+and does the injection in a stack-like manner (last taken derivative first injected).
+$\flex$ is connected to the $\lexer$:
 \begin{lemma}
-$rs \rightarrow rs'  \implies \RALTS{rs} \rightarrow \RALTS{rs'}$
+$\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$
 \end{lemma}
-which enables us to prove "closed forms" equalities such as
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-But the most involved part of the above lemma is proving the following:
-\begin{lemma}
-$\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ 
-\end{lemma}
-which is needed in proving things like 
-\begin{lemma}
-$r \rightarrow_f r'  \implies \rsimp{r} \rightarrow \rsimp{r'}$
+$\flex$ provides us a bridge between $\lexer$ and $\blexer$.
+What is even better about $\flex$ is that it allows us to 
+directly operate on the value $\mkeps (r\backslash v)$,
+which is pivotal in the definition of  $\lexer $ and $\blexer$, but not visible as an argument.
+When the value created by $\mkeps$ becomes available, one can 
+prove some stepwise properties of lexing nicely:
+\begin{lemma}\label{flexStepwise}
+$\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
 \end{lemma}
 
-Fortunately this is provable by induction on the list $rs$
-
-
-
-%-----------------------------------
-%	SECTION 2
-%-----------------------------------
-
-\section{Finiteness Property}
-In this section let us describe our argument for why the size of the simplified
-derivatives with the aggressive simplification function is finitely bounded.
- Suppose
-we have a size function for bitcoded regular expressions, written
-$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
-
-\begin{center}
-\begin{tabular}{ccc}
-$\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\
-\end{tabular}
-\end{center}
-(TODO: COMPLETE this defn and for $rs$)
-
-The size is based on a recursive function on the structure of the regex,
-not the bitcodes.
-Therefore we may as well talk about size of an annotated regular expression 
-in their un-annotated form:
-\begin{center}
-$\asize(a) = \size(\erase(a))$. 
-\end{center}
-(TODO: turn equals to roughly equals)
-
-But there is a minor nuisance here:
-the erase function actually messes with the structure of the regular expression:
-\begin{center}
-\begin{tabular}{ccc}
-$\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\
-\end{tabular}
-\end{center}
-(TODO: complete this definition with singleton r in alts)
-
-An alternative regular expression with an empty list of children
- is turned into an $\ZERO$ during the
-$\erase$ function, thereby changing the size and structure of the regex.
-These will likely be fixable if we really want to use plain $\rexp$s for dealing
-with size, but we choose a more straightforward (or stupid) method by 
-defining a new datatype that is similar to plain $\rexp$s but can take
-non-binary arguments for its alternative constructor,
- which we call $\rrexp$ to denote
-the difference between it and plain regular expressions. 
-\[			\rrexp ::=   \RZERO \mid  \RONE
-			 \mid  \RCHAR{c}  
-			 \mid  \RSEQ{r_1}{r_2}
-			 \mid  \RALTS{rs}
-			 \mid \RSTAR{r}        
+And for $\blexer$ we have something with stepwise properties like $\flex$ as well,
+called $\retrieve$:
+\[
+\retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs}
 \]
-For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
-but keep everything else intact.
-It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
-(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
-$\ALTS$).
-We denote the operation of erasing the bits and turning an annotated regular expression 
-into an $\rrexp{}$ as $\rerase{}$.
-\begin{center}
-\begin{tabular}{lcl}
-$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
-$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
-$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
-\end{tabular}
-\end{center}
-%TODO: FINISH definition of rerase
-Similarly we could define the derivative  and simplification on 
-$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
-except that now they can operate on alternatives taking multiple arguments.
+$\retrieve$ takes bit-codes from annotated regular expressions
+guided by a value.
+$\retrieve$ is connected to the $\blexer$ in the following way:
+\begin{lemma}\label{blexer_retrieve}
+$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
+\end{lemma}
+If you take derivative of an annotated regular expression, 
+you can $\retrieve$ the same bit-codes as before the derivative took place,
+provided that you use the corresponding value:
 
-\begin{center}
-\begin{tabular}{lcr}
-$\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
-(other clauses omitted)
-\end{tabular}
-\end{center}
-
-Now that $\rrexp$s do not have bitcodes on them, we can do the 
-duplicate removal without  an equivalence relation:
-\begin{center}
-\begin{tabular}{lcl}
-$\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
-           			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
-\end{tabular}
-\end{center}
-%TODO: definition of rsimp (maybe only the alternative clause)
-
-
-The reason why these definitions  mirror precisely the corresponding operations
-on annotated regualar expressions is that we can calculate sizes more easily:
-
-\begin{lemma}
-$\rsize{\rerase a} = \asize a$
+\begin{lemma}\label{retrieveStepwise}
+$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
 \end{lemma}
-This lemma says that the size of an r-erased regex is the same as the annotated regex.
-this does not hold for plain $\rexp$s due to their ways of how alternatives are represented.
-\begin{lemma}
-$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
-\end{lemma}
-Putting these two together we get a property that allows us to estimate the 
-size of an annotated regular expression derivative using its un-annotated counterpart:
-\begin{lemma}
-$\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
+The other good thing about $\retrieve$ is that it can be connected to $\flex$:
+%centralLemma1
+\begin{lemma}\label{flex_retrieve}
+$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
 \end{lemma}
-Unless stated otherwise in this chapter all $\textit{rexp}$s without
- bitcodes are seen as $\rrexp$s.
- We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
- as the former suits people's intuitive way of stating a binary alternative
- regular expression.
-
-
-\begin{theorem}
-For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$
-\end{theorem}
-
-\noindent
 \begin{proof}
-We prove this by induction on $r$. The base cases for $\AZERO$,
-$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
-for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
-hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
-$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
-%
-\begin{center}
-\begin{tabular}{lcll}
-& & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
-& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
-    [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
-& $\leq$ &
-    $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
-    [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
-& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
-             \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
-      \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
-& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
-\end{tabular}
-\end{center}
-
-
-\noindent
-where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
-The reason why we could write the derivatives of a sequence this way is described in section 2.
-The term (2) is used to control (1). 
-That is because one can obtain an overall
-smaller regex list
-by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
-Section 3 is dedicated to its proof.
-In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
-bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
-than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
-for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
-We reason similarly for  $\STAR$.\medskip
+By induction on $s$. The induction tactic is reverse induction on strings.
+$v$ is allowed to be arbitrary.
+The crucial point is to rewrite 
+\[
+\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
+\]
+as
+\[
+\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\;  \mkeps (r \backslash s@[c]))
+\].
+This enables us to equate 
+\[
+\retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) 
+\] 
+with 
+\[
+\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c])))
+\],
+which in turn can be rewritten as
+\[
+\flex \; r \; \textit{id} \; s@[c] \;  (\mkeps (r\backslash s@[c]))
+\].
 \end{proof}
 
-What guarantee does this bound give us?
-
-Whatever the regex is, it will not grow indefinitely.
-Take our previous example $(a + aa)^*$ as an example:
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax= 40,
-    ytick={0,10,...,40},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={$(a + aa)^*$},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data};
-\end{axis}
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-We are able to limit the size of the regex $(a + aa)^*$'s derivatives
- with our simplification
-rules very effectively.
-
-
-As the graphs of  some more randomly generated regexes show, the size of 
-the derivative might grow quickly at the start of the input,
- but after a sufficiently long  input string, the trend will stop.
-
-
-%a few sample regular experessions' derivatives
-%size change
-%TODO: giving graphs showing a few regexes' size changes 
-%w;r;t the input characters number
-%a*, aa*, aaa*, .....
-%randomly generated regexes
-\begin{center}
-\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={number of $a$'s},
-    x label style={at={(1.05,-0.05)}},
-    ylabel={regex size},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex1},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex2},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data};
-\end{axis}
-\end{tikzpicture}
-  &
-\begin{tikzpicture}
-\begin{axis}[
-    xlabel={$n$},
-    x label style={at={(1.05,-0.05)}},
-    %ylabel={time in secs},
-    enlargelimits=false,
-    xtick={0,5,...,30},
-    xmax=33,
-    ymax=1000,
-    ytick={0,100,...,1000},
-    scaled ticks=false,
-    axis lines=left,
-    width=5cm,
-    height=4cm, 
-    legend entries={regex3},  
-    legend pos=north west,
-    legend cell align=left]
-\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
-\end{axis}
-\end{tikzpicture}\\
-\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
-\end{tabular}    
-\end{center}  
-
-
-
-
-
-\noindent
-Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
-far from the actual bound we can expect. 
-In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
-is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
-Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
-inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
-$f(x) = x * 2^x$.
-This means the bound we have will surge up at least
-tower-exponentially with a linear increase of the depth.
-For a regex of depth $n$, the bound
-would be approximately $4^n$.
-%TODO: change this to tower exponential!
-
-We can do better than this, but this does not improve
-the finiteness property we are proving. If we are interested in a polynomial bound,
-one would hope to obtain a similar tight bound as for partial
-derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
- $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
-partial derivatives). 
-To obtain the exact same bound would mean
-we need to introduce simplifications, such as
- $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
-which exist for partial derivatives. 
-
-However, if we introduce them in our
-setting we would lose the POSIX property of our calculated values. 
-A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$.
-If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer 
-would give back $\Left(\Seq(\Char(a), \Left(\Char(b \cdot c))))$ instead of
-what we want: $\Seq(\Right(ab), \Right(c))$. Unless we can store the structural information
-in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
-occurs, and apply them in the right order once we get 
-a result of the "aggressively simplified" regex, it would be impossible to still get a $\POSIX$ value.
-This is unlike the simplification we had before, where the rewriting rules 
-such  as $\ONE \cdot r \rightsquigarrow r$, under which our lexer will give the same value.
-We will discuss better
-bounds in the last section of this chapter.\\[-6.5mm]
-
-
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ??
-%----------------------------------------------------------------------------------------
-
-\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
-To embark on getting the "closed forms" of regexes, we first
-define a few auxiliary definitions to make expressing them smoothly.
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
-$\sflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-The intuition behind $\sflataux{\_}$ is to break up nested regexes 
-of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
-into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
-It will return the singleton list $[r]$ otherwise.
-
-$\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
-the output type a regular expression, not a list:
- \begin{center} 
- \begin{tabular}{ccc}
- $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
-$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
-$\sflat r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-$\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
- first element of the list of children of
-an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
-$r_1 \cdot r_2 \backslash s$.
-
-With $\sflat{\_}$ and $\sflataux{\_}$ we make 
-precise what  "closed forms" we have for the sequence derivatives and their simplifications,
-in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
-and $\bderssimp{(r_1\cdot r_2)}{s}$,
-if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
-and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
-the substring of $s$?
-First let's look at a series of derivatives steps on a sequence 
-regular expression,  (assuming) that each time the first
-component of the sequence is always nullable):
-\begin{center}
-
-$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
-$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
- \ldots$
-
-\end{center}
-%TODO: cite indian paper
-Indianpaper have  come up with a slightly more formal way of putting the above process:
-\begin{center}
-$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
-\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
-+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
-\end{center}
-where  $\delta(b, r)$ will produce $r$
-if $b$ evaluates to true, 
-and $\ZERO$ otherwise.
-
- But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
- equivalence. To make this intuition useful 
- for a formal proof, we need something
-stronger than language equivalence.
-With the help of $\sflat{\_}$ we can state the equation in Indianpaper
-more rigorously:
-\begin{lemma}
-$\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
-\end{lemma}
-
-The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
-
-\begin{center}
-\begin{tabular}{lcl}
-$\vsuf{[]}{\_} $ & $=$ &  $[]$\\
-$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
-                                     && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
-\end{tabular}
-\end{center}                   
-It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
-and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
-the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
-In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
-the entire result of  $(r_1 \cdot r_2) \backslash s$, 
-it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
-
-With this we can also add simplifications to both sides and get
-\begin{lemma}
-$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-Together with the idempotency property of $\rsimp$,
-%TODO: state the idempotency property of rsimp
-\begin{lemma}
-$\rsimp(r) = \rsimp (\rsimp(r))$
-\end{lemma}
-it is possible to convert the above lemma to obtain a "closed form"
-for  derivatives nested with simplification:
-\begin{lemma}
-$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
-\end{lemma}
-And now the reason we have (1) in section 1 is clear:
-$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
-is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
-    as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
-
-%----------------------------------------------------------------------------------------
-%	SECTION 3
-%----------------------------------------------------------------------------------------
+With the above lemma we can now link $\flex$ and $\blexer$.
 
-\section{interaction between $\distinctWith$ and $\flts$}
-Note that it is not immediately obvious that 
-\begin{center}
-$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
-\end{center}
-The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
-duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
-
-
-%----------------------------------------------------------------------------------------
-%	SECTION ALTS CLOSED FORM
-%----------------------------------------------------------------------------------------
-\section{A Closed Form for \textit{ALTS}}
-Now we prove that  $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$.
-
-
-There are a few key steps, one of these steps is
-$rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {})))
-= rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$
-
-One might want to prove this by something a simple statement like: 
-$map (rder x) (rdistinct rs rset) = rdistinct (map (rder x) rs) (rder x) ` rset)$.
-
-For this to hold we want the $\textit{distinct}$ function to pick up
-the elements before and after derivatives correctly:
-$r \in rset \equiv (rder x r) \in (rder x rset)$.
-which essentially requires that the function $\backslash$ is an injective mapping.
-
-Unfortunately the function $\backslash c$ is not an injective mapping.
-
-\subsection{function $\backslash c$ is not injective (1-to-1)}
-\begin{center}
-The derivative $w.r.t$ character $c$ is not one-to-one.
-Formally,
-	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
-\end{center}
-This property is trivially true for the
-character regex example:
-\begin{center}
-	$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
-\end{center}
-But apart from the cases where the derivative
-output is $\ZERO$, are there non-trivial results
-of derivatives which contain strings?
-The answer is yes.
-For example,
-\begin{center}
-	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
-	where $a$ is not nullable.\\
-	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
-	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
-\end{center}
-We start with two syntactically different regexes,
-and end up with the same derivative result.
-This is not surprising as we have such 
-equality as below in the style of Arden's lemma:\\
-\begin{center}
-	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
-\end{center}
-
-%----------------------------------------------------------------------------------------
-%	SECTION 4
-%----------------------------------------------------------------------------------------
-\section{A Bound for the Star Regular Expression}
-We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
-the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
-the property of the $\distinct$ function.
-Now we try to get a bound on $r^* \backslash s$ as well.
-Again, we first look at how a star's derivatives evolve, if they grow maximally: 
-\begin{center}
-
-$r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
-r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
-(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
-\quad \ldots$
-
-\end{center}
-When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
-$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
-the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
-of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
-count the possible size explosions of $r \backslash c$ themselves.
-
-Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
-$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
-into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
-and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
-For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
-%TODO: definitions of  and \hflataux \hflat
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
-$\hflataux r$ & $=$ & $ [r]$
-\end{tabular}
-\end{center}
-
- \begin{center}  
- \begin{tabular}{ccc}
- $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
-$\hflat r$ & $=$ & $ r$
-\end{tabular}
-\end{center}s
-Again these definitions are tailor-made for dealing with alternatives that have
-originated from a star's derivatives, so we don't attempt to open up all possible 
-regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
-elements.
-We give a predicate for such "star-created" regular expressions:
-\begin{center}
-\begin{tabular}{lcr}
-         &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
- $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
- \end{tabular}
- \end{center}
- 
- These definitions allows us the flexibility to talk about 
- regular expressions in their most convenient format,
- for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
- instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
- These definitions help express that certain classes of syntatically 
- distinct regular expressions are actually the same under simplification.
- This is not entirely true for annotated regular expressions: 
- %TODO: bsimp bders \neq bderssimp
- \begin{center}
- $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
- \end{center}
- For bit-codes, the order in which simplification is applied
- might cause a difference in the location they are placed.
- If we want something like
- \begin{center}
- $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
- \end{center}
- Some "canonicalization" procedure is required,
- which either pushes all the common bitcodes to nodes
-  as senior as possible:
-  \begin{center}
-  $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
-  \end{center}
- or does the reverse. However bitcodes are not of interest if we are talking about
- the $\llbracket r \rrbracket$ size of a regex.
- Therefore for the ease and simplicity of producing a
- proof for a size bound, we are happy to restrict ourselves to 
- unannotated regular expressions, and obtain such equalities as
- \begin{lemma}
- $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$
- \end{lemma}
- 
- \begin{proof}
- By using the rewriting relation $\rightsquigarrow$
- \end{proof}
- %TODO: rsimp sflat
-And from this we obtain a proof that a star's derivative will be the same
-as if it had all its nested alternatives created during deriving being flattened out:
- For example,
- \begin{lemma}
- $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
- \end{lemma}
- \begin{proof}
- By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$.
- \end{proof}
-% The simplification of a flattened out regular expression, provided it comes
-%from the derivative of a star, is the same as the one nested.
- 
- 
- 
- 
- 
- 
- 
- 
- 
-One might wonder the actual bound rather than the loose bound we gave
-for the convenience of an easier proof.
-How much can the regex $r^* \backslash s$ grow? 
-As  earlier graphs have shown,
-%TODO: reference that graph where size grows quickly
- they can grow at a maximum speed
-  exponential $w.r.t$ the number of characters, 
-but will eventually level off when the string $s$ is long enough.
-If they grow to a size exponential $w.r.t$ the original regex, our algorithm
-would still be slow.
-And unfortunately, we have concrete examples
-where such regexes grew exponentially large before levelling off:
-$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
- size that is  exponential on the number $n$ 
-under our current simplification rules:
-%TODO: graph of a regex whose size increases exponentially.
-\begin{center}
-\begin{tikzpicture}
-    \begin{axis}[
-        height=0.5\textwidth,
-        width=\textwidth,
-        xlabel=number of a's,
-        xtick={0,...,9},
-        ylabel=maximum size,
-        ymode=log,
-       log basis y={2}
-]
-        \addplot[mark=*,blue] table {re-chengsong.data};
-    \end{axis}
-\end{tikzpicture}
-\end{center}
-
-For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
-to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
-(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion.
-The exponential size is triggered by that the regex
-$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
-inside the $(\ldots) ^*$ having exponentially many
-different derivatives, despite those difference being minor.
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
-will therefore contain the following terms (after flattening out all nested 
-alternatives):
-\begin{center}
-$(\oplus_{i = 1]{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\
-$(1 \leq m' \leq m )$
-\end{center}
-These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix).
- With each new input character taking the derivative against the intermediate result, more and more such distinct
- terms will accumulate, 
-until the length reaches $L.C.M.(1, \ldots, n)$.
-$\textit{distinctBy}$ will not be able to de-duplicate any two of these terms 
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
-
-$(\oplus_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
- where $m' \neq m''$ \\
- as they are slightly different.
- This means that with our current simplification methods,
- we will not be able to control the derivative so that
- $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$
- as there are already exponentially many terms.
- These terms are similar in the sense that the head of those terms
- are all consisted of sub-terms of the form: 
- $(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
- For  $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
- $n * (n + 1) / 2$ such terms. 
- For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives
- can be described by 6 terms:
- $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, 
- $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
-The total number of different "head terms",  $n * (n + 1) / 2$,
- is proportional to the number of characters in the regex 
-$(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
-This suggests a slightly different notion of size, which we call the 
-alphabetic width:
-%TODO:
-(TODO: Alphabetic width def.)
-
- 
-Antimirov\parencite{Antimirov95} has proven that 
-$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$.
-where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms
-created by doing derivatives of $r$ against all possible strings.
-If we can make sure that at any moment in our lexing algorithm our 
-intermediate result hold at most one copy of each of the 
-subterms then we can get the same bound as Antimirov's.
-This leads to the algorithm in the next section.
-
-%----------------------------------------------------------------------------------------
-%	SECTION 5
-%----------------------------------------------------------------------------------------
-\section{A Stronger Version of Simplification Inspired by Antimirov}
-%TODO: search for isabelle proofs of algorithms that check equivalence 
-
-
-
-
-
+\begin{lemma}\label{flex_blexer}
+$\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s)  = \blexer \; r \; s$
+\end{lemma}
+\begin{proof}
+Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}.
+\end{proof}
+Finally 
 
 
 
--- a/ChengsongTanPhdThesis/Chapters/Chapter3.tex	Thu May 26 20:51:40 2022 +0100
+++ b/ChengsongTanPhdThesis/Chapters/Chapter3.tex	Fri May 27 18:27:39 2022 +0100
@@ -1,16 +1,26 @@
 % Chapter Template
 
-\chapter{Common Identities In Simplification-Related Functions} % Main chapter title
+% Main chapter title
 
-\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+\label{Chapter3} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
+%Then we illustrate how the algorithm without bitcodes falls short for such aggressive 
+%simplifications and therefore introduce our version of the bitcoded algorithm and 
+%its correctness proof in 
+%Chapter 3\ref{Chapter3}. 
+
 
 
 
 %----------------------------------------------------------------------------------------
-%	SECTION 1
+%	SECTION common identities
 %----------------------------------------------------------------------------------------
+\section{Common Identities In Simplification-Related Functions} 
+Need to prove these before starting on the big correctness proof.
 
-\section{Idempotency of $\simp$}
+%-----------------------------------
+%	SUBSECTION 
+%-----------------------------------
+\subsection{Idempotency of $\simp$}
 
 \begin{equation}
 	\simp \;r = \simp\; \simp \; r 
@@ -23,7 +33,7 @@
 The proof is by structural induction on $r$.
 
 %-----------------------------------
-%	SUBSECTION 1
+%	SUBSECTION 
 %-----------------------------------
 \subsection{Syntactic Equivalence Under $\simp$}
 We prove that minor differences can be annhilated
@@ -35,17 +45,19 @@
 \end{center}
 
 
-%-----------------------------------
-%	SUBSECTION 2
-%-----------------------------------
+
 
-\subsection{Subsection 2}
-Morbi rutrum odio eget arcu adipiscing sodales. Aenean et purus a est pulvinar pellentesque. Cras in elit neque, quis varius elit. Phasellus fringilla, nibh eu tempus venenatis, dolor elit posuere quam, quis adipiscing urna leo nec orci. Sed nec nulla auctor odio aliquet consequat. Ut nec nulla in ante ullamcorper aliquam at sed dolor. Phasellus fermentum magna in augue gravida cursus. Cras sed pretium lorem. Pellentesque eget ornare odio. Proin accumsan, massa viverra cursus pharetra, ipsum nisi lobortis velit, a malesuada dolor lorem eu neque.
+
+
 
 %----------------------------------------------------------------------------------------
-%	SECTION 2
+%	SECTION corretness proof
 %----------------------------------------------------------------------------------------
-
-\section{Main Section 2}
-
-Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.
+\section{Correctness of Bit-coded Algorithm with Simplification}
+The non-trivial part of proving the correctness of the algorithm with simplification
+compared with not having simplification is that we can no longer use the argument 
+in \cref{flex_retrieve}.
+The function \retrieve needs the structure of the annotated regular expression to 
+agree with the structure of the value, but simplification will always mess with the 
+structure:
+%TODO: after simp does not agree with each other: (a + 0) --> a v.s. Left(Char(a))
\ No newline at end of file
--- a/ChengsongTanPhdThesis/main.tex	Thu May 26 20:51:40 2022 +0100
+++ b/ChengsongTanPhdThesis/main.tex	Fri May 27 18:27:39 2022 +0100
@@ -59,6 +59,7 @@
 \usepackage{amsmath}
 \usepackage{amsthm}
 \usepackage{amssymb}
+\usepackage{cleveref}
 %\usepackage{mathtools}
 \usepackage[noend]{algpseudocode}
 \usepackage{enumitem}
@@ -196,8 +197,15 @@
 %	ABSTRACT PAGE
 %----------------------------------------------------------------------------------------
 
-\begin{abstract}
+%\begin{abstract}
+
+ 
+%\end{abstract}
+
+
+
 \addchaptertocentry{\abstractname} % Add the abstract to the table of contents
+\addchap{Abstract} 
 This work is about regular expressions and derivatives. It combines functional algorithms and their formal verification in the Isabelle/HOL theorem prover.
 
 Theoretical results say that regular expression matching
@@ -227,8 +235,7 @@
 work with these two guarantees together.
 
 
- 
-\end{abstract}
+
 
 %----------------------------------------------------------------------------------------
 %	ACKNOWLEDGEMENTS
@@ -360,8 +367,8 @@
 \include{Chapters/Chapter1}
 \include{Chapters/Chapter2} 
 \include{Chapters/Chapter3}
-%\include{Chapters/Chapter4} 
-%\include{Chapters/Chapter5} 
+\include{Chapters/Chapter4} 
+\include{Chapters/Chapter5} 
 
 %----------------------------------------------------------------------------------------
 %	THESIS CONTENT - APPENDICES