% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
|
6 %and notations we |
|
7 %use for describing the lexing algorithm by Sulzmann and Lu, |
|
8 %and then give the algorithm and its variant, and discuss |
|
9 %why more aggressive simplifications are needed. |
|
10 |
|
11 |
|
12 \section{Preliminaries} |
|
13 |
|
Suppose we have an alphabet $\Sigma$; the set of all strings whose characters
are drawn from $\Sigma$
is written $\Sigma^*$.
|
17 |
|
We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system:
|
20 The basic regular expressions are defined inductively |
|
21 by the following grammar: |
|
22 \[ r ::= \ZERO \mid \ONE |
|
23 \mid c |
|
24 \mid r_1 \cdot r_2 |
|
25 \mid r_1 + r_2 |
|
26 \mid r^* |
|
27 \] |
|
28 |
|
The language, that is the set of strings, denoted by a regular expression is defined as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\emptyset$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $\{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $(L \; (r))^*$
\end{tabular}
\end{center}
This mapping is often called the ``language interpretation'' of regular expressions.
|
38 |
|
39 |
|
40 |
|
The Brzozowski derivative w.r.t.~a character $c$ is an operation on regular
expressions: it transforms a regular expression into a new one whose language
contains exactly the original strings with the leading character $c$ removed.

Formally, we first define such a transformation on an arbitrary set of
strings $A$, which we call the \emph{semantic derivative}:
\begin{center}
$\Der \; c\; \textit{A} = \{s \mid c :: s \in A\}$
\end{center}
If the set $A$ happens to have some structure, for example if it is regular
(that is, definable by a regular expression $r$), then the semantic derivative
$\Der \; c\; A$ is again regular and can be computed syntactically on $r$ itself.

|
55 % Derivatives of a |
|
56 %regular expression, written $r \backslash c$, give a simple solution |
|
57 %to the problem of matching a string $s$ with a regular |
|
58 %expression $r$: if the derivative of $r$ w.r.t.\ (in |
|
59 %succession) all the characters of the string matches the empty string, |
|
60 %then $r$ matches $s$ (and {\em vice versa}). |
|
61 |
|
The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$ and returns another regular expression $r'$,
computed by the following recursive function:
|
66 |
|
67 \begin{center} |
|
68 \begin{tabular}{lcl} |
|
69 $\ZERO \backslash c$ & $\dn$ & $\ZERO$\\ |
|
70 $\ONE \backslash c$ & $\dn$ & $\ZERO$\\ |
|
71 $d \backslash c$ & $\dn$ & |
|
72 $\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\ |
|
73 $(r_1 + r_2)\backslash c$ & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\ |
|
74 $(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\ |
|
75 & & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\ |
|
76 & & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\ |
|
77 $(r^*)\backslash c$ & $\dn$ & $(r\backslash c) \cdot r^*$\\ |
|
78 \end{tabular} |
|
79 \end{center} |
|
\noindent
The $\nullable$ function tests whether the empty string
is in the language of $r$:
|
85 |
|
86 |
|
87 \begin{center} |
|
88 \begin{tabular}{lcl} |
|
89 $\nullable(\ZERO)$ & $\dn$ & $\mathit{false}$ \\ |
|
90 $\nullable(\ONE)$ & $\dn$ & $\mathit{true}$ \\ |
|
91 $\nullable(c)$ & $\dn$ & $\mathit{false}$ \\ |
|
92 $\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\ |
|
93 $\nullable(r_1\cdot r_2)$ & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\ |
|
94 $\nullable(r^*)$ & $\dn$ & $\mathit{true}$ \\ |
|
95 \end{tabular} |
|
96 \end{center} |
|
97 \noindent |
|
The empty set does not contain any string and
therefore not the empty string; the empty string
regular expression contains the empty string
by definition; the character regular expression
is the singleton set containing just that one-character string
and therefore does not contain the empty string.
For the alternative regular expression (or ``or'' expression)
it suffices that one of its children is nullable.
The sequence regular expression
requires both children to contain the empty string
in order to compose the empty string, and the Kleene star
operation naturally introduces the empty string.
|
111 |
|
We can give the meaning of the derivative operation in terms of the
language interpretation: the language of the derivative $r\backslash c$
is exactly the semantic derivative of the language of $r$, that is

\begin{center}
$L \; (r\backslash c) = \Der \; c \; (L \; (r))$.
\end{center}

\noindent
|
In other words, the derivative operation, written $\backslash c$,
describes how a regular expression evolves into
a new regular expression when the head character $c$ is
chopped off all the strings in its language.
The most involved cases are the sequence
and star cases.
The sequence case says that if the first regular expression
contains the empty string, then the second component of the sequence
may also be chosen as the regular expression from which the head character
is chopped off.
The derivative of a star regular expression unwraps one iteration of
the regular expression and attaches the star regular expression
as the second element of a sequence, so that a copy is retained
for possibly more iterations in later phases of lexing.
|
147 |
|
148 |
|
149 The main property of the derivative operation |
|
150 that enables us to reason about the correctness of |
|
151 an algorithm using derivatives is |
|
152 |
|
153 \begin{center} |
|
154 $c\!::\!s \in L(r)$ holds |
|
155 if and only if $s \in L(r\backslash c)$. |
|
156 \end{center} |
|
157 |
|
158 \noindent |
|
159 We can generalise the derivative operation shown above for single characters |
|
160 to strings as follows: |
|
161 |
|
162 \begin{center} |
|
163 \begin{tabular}{lcl} |
|
164 $r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\ |
|
165 $r \backslash [\,] $ & $\dn$ & $r$ |
|
166 \end{tabular} |
|
167 \end{center} |
|
168 |
|
169 \noindent |
|
170 and then define Brzozowski's regular-expression matching algorithm as: |
|
171 |
|
172 \[ |
|
173 match\;s\;r \;\dn\; nullable(r\backslash s) |
|
174 \] |
|
175 |
|
176 \noindent |
|
Assuming the string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
the algorithm can be presented graphically as follows:
|
179 |
|
180 \begin{equation}\label{graph:*} |
|
181 \begin{tikzcd} |
|
182 r_0 \arrow[r, "\backslash c_0"] & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed] & r_n \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO} |
|
183 \end{tikzcd} |
|
184 \end{equation} |
|
185 |
|
186 \noindent |
|
187 where we start with a regular expression $r_0$, build successive |
|
188 derivatives until we exhaust the string and then use \textit{nullable} |
|
189 to test whether the result can match the empty string. It can be |
|
190 relatively easily shown that this matcher is correct (that is given |
|
191 an $s = c_0...c_{n-1}$ and an $r_0$, it generates YES if and only if $s \in L(r_0)$). |
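
For concreteness, the matcher can be written out in a few lines of Scala;
the following is a minimal sketch where the datatype and function names are
our own choices for illustration (not necessarily those of any particular
implementation):

\begin{verbatim}
// A naive Scala sketch of Brzozowski's matcher as described above.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// derivative w.r.t. a whole string, character by character
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matcher(r: Rexp, s: String): Boolean = nullable(ders(s.toList, r))
\end{verbatim}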
|
192 |
|
This is a beautifully simple definition.
|
194 |
|
195 If we implement the above algorithm naively, however, |
|
196 the algorithm can be excruciatingly slow. |
|
197 |
|
198 |
|
199 \begin{figure} |
|
200 \centering |
|
201 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
202 \begin{tikzpicture} |
|
203 \begin{axis}[ |
|
204 xlabel={$n$}, |
|
205 x label style={at={(1.05,-0.05)}}, |
|
206 ylabel={time in secs}, |
|
207 enlargelimits=false, |
|
208 xtick={0,5,...,30}, |
|
209 xmax=33, |
|
210 ymax=10000, |
|
211 ytick={0,1000,...,10000}, |
|
212 scaled ticks=false, |
|
213 axis lines=left, |
|
214 width=5cm, |
|
215 height=4cm, |
|
216 legend entries={JavaScript}, |
|
217 legend pos=north west, |
|
218 legend cell align=left] |
|
219 \addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data}; |
|
220 \end{axis} |
|
221 \end{tikzpicture}\\ |
|
222 \multicolumn{3}{c}{Graphs: Runtime for matching $(a^*)^*\,b$ with strings |
|
223 of the form $\underbrace{aa..a}_{n}$.} |
|
224 \end{tabular} |
|
225 \caption{EightThousandNodes} \label{fig:EightThousandNodes} |
|
226 \end{figure} |
|
227 |
|
228 |
|
|
230 For example, when starting with the regular |
|
231 expression $(a + aa)^*$ and building a few successive derivatives (around 10) |
|
232 w.r.t.~the character $a$, one obtains a derivative regular expression |
|
with more than 8000 nodes (when viewed as a tree), see Figure~\ref{fig:EightThousandNodes}.
|
234 The reason why $(a + aa) ^*$ explodes so drastically is that without |
|
235 pruning, the algorithm will keep records of all possible ways of matching: |
|
236 \begin{center} |
|
237 $(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$ |
|
238 \end{center} |
|
239 |
|
240 \noindent |
|
Each of the above alternative branches corresponds to one way of matching:
$aa$, $a \quad a$, and $a \quad a \cdot (a)$ (the last one being an incomplete match).
|
243 These different ways of matching will grow exponentially with the string length, |
|
244 and without simplifications that throw away some of these very similar matchings, |
|
245 it is no surprise that these expressions grow so quickly. |
|
246 Operations like |
|
247 $\backslash$ and $\nullable$ need to traverse such trees and |
|
248 consequently the bigger the size of the derivative the slower the |
|
249 algorithm. |
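
The blow-up can be observed concretely by measuring the size of successive
derivatives, for instance with the following small experiment on top of the
matcher sketch above (only an illustration; the exact numbers depend on the
chosen representation):

\begin{verbatim}
// node count of a regular expression, viewed as a tree
def rsize(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALT(r1, r2) => 1 + rsize(r1) + rsize(r2)
  case SEQ(r1, r2) => 1 + rsize(r1) + rsize(r2)
  case STAR(r1) => 1 + rsize(r1)
}

// sizes of the derivatives of (a + aa)* w.r.t. "a", "aa", "aaa", ...
val r = STAR(ALT(CHAR('a'), SEQ(CHAR('a'), CHAR('a'))))
(1 to 12).foreach(n =>
  println(s"$n: " + rsize(ders(List.fill(n)('a'), r))))
\end{verbatim}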
|
250 |
|
Brzozowski was quick to observe that during this process a lot of useless
$\ONE$s and $\ZERO$s are generated, which makes the derivatives far from optimal.
He therefore introduced ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically
different but language-equivalent sub-regexes and thereby further decrease the size
of the intermediate regexes.
|
257 |
|
More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step and compose
the two operations into a single atomic one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher that avoids cumbersome regular expressions.
|
264 |
|
265 |
|
266 If we want the size of derivatives in the algorithm to |
|
267 stay even lower, we would need more aggressive simplifications. |
|
268 Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as |
|
269 deleting duplicates whenever possible. For example, the parentheses in |
|
270 $(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b |
|
271 \cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another |
|
272 example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just |
|
273 $a^*+a+\ONE$. Adding these more aggressive simplification rules help us |
|
274 to achieve a very tight size bound, namely, |
|
275 the same size bound as that of the \emph{partial derivatives}. |
|
276 |
|
Building derivatives and then simplifying them:
so far so good. But what if we want to
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back, for the moment, to the world
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and
elegant (arguably as beautiful as the original
derivatives definition) solution for this.
|
285 |
|
286 \subsection*{Values and the Lexing Algorithm by Sulzmann and Lu} |
|
287 |
|
288 |
|
289 They first defined the datatypes for storing the |
|
290 lexing information called a \emph{value} or |
|
291 sometimes also \emph{lexical value}. These values and regular |
|
292 expressions correspond to each other as illustrated in the following |
|
293 table: |
|
294 |
|
295 \begin{center} |
|
296 \begin{tabular}{c@{\hspace{20mm}}c} |
|
297 \begin{tabular}{@{}rrl@{}} |
|
298 \multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\ |
|
299 $r$ & $::=$ & $\ZERO$\\ |
|
300 & $\mid$ & $\ONE$ \\ |
|
301 & $\mid$ & $c$ \\ |
|
302 & $\mid$ & $r_1 \cdot r_2$\\ |
|
303 & $\mid$ & $r_1 + r_2$ \\ |
|
304 \\ |
|
305 & $\mid$ & $r^*$ \\ |
|
306 \end{tabular} |
|
307 & |
|
308 \begin{tabular}{@{\hspace{0mm}}rrl@{}} |
|
309 \multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\ |
|
310 $v$ & $::=$ & \\ |
|
311 & & $\Empty$ \\ |
|
312 & $\mid$ & $\Char(c)$ \\ |
|
313 & $\mid$ & $\Seq\,v_1\, v_2$\\ |
|
314 & $\mid$ & $\Left(v)$ \\ |
|
315 & $\mid$ & $\Right(v)$ \\ |
|
316 & $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\ |
|
317 \end{tabular} |
|
318 \end{tabular} |
|
319 \end{center} |
|
320 |
|
321 \noindent |
|
322 |
|
Building on top of Sulzmann and Lu's attempt to formalise the
notion of POSIX lexing rules \parencite{Sulzmann2014},
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX match is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But having correct results is still not enough;
we also want at least some degree of \emph{efficiency}.
|
333 |
|
334 |
|
335 |
|
One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite list of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
$\ldots$, and vice versa.
Even for a regular expression matching one particular string, there can
still be more than one value corresponding to that match.
Take the example where $r= (a^*\cdot a^*)^*$ and the string
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching,
without allowing any value under a star to be flattened
to an empty string, can be given by the following formula:
|
348 \begin{equation} |
|
349 C_n = (n+1)+n C_1+\ldots + 2 C_{n-1} |
|
350 \end{equation} |
|
351 and a closed form formula can be calculated to be |
|
352 \begin{equation} |
|
353 C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}} |
|
354 \end{equation} |
|
which clearly grows exponentially in $n$.
|
356 |
|
A lexer aimed at generating all possible values has an exponential
worst-case runtime. Therefore it is impractical to try to generate
all possible matches in a single run. In practice, we are usually
interested in POSIX values, which intuitively always
|
361 \begin{itemize} |
|
362 \item |
|
363 match the leftmost regular expression when multiple options of matching |
|
364 are available |
|
365 \item |
|
366 always match a subpart as much as possible before proceeding |
|
367 to the next token. |
|
368 \end{itemize} |
|
369 |
|
370 |
|
For example, the $(a^*\cdot a^*)^*$ example above has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}],\, \Stars\,[])]$.
|
373 The output of an algorithm we want would be a POSIX matching |
|
374 encoded as a value. |
|
375 The reason why we are interested in $\POSIX$ values is that they can |
|
376 be practically used in the lexing phase of a compiler front end. |
|
377 For instance, when lexing a code snippet |
|
378 $\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognized |
|
379 as an identifier rather than a keyword. |
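
To see why a simple matcher is not enough for this task, consider the
following toy instance of the keyword-versus-identifier situation, built on
the matcher sketch from earlier (the concrete regular expressions here are
made up purely for illustration; a real lexer would use character ranges and
many more rules):

\begin{verbatim}
val IF = SEQ(CHAR('i'), CHAR('f'))                  // the keyword "if"
val AL = ALT(CHAR('i'), ALT(CHAR('f'), CHAR('o')))  // letters i, f, o
val ID = SEQ(AL, STAR(AL))                          // identifiers over i, f, o

matcher(ALT(IF, ID), "iffoo")  // true -- but which alternative matched?
\end{verbatim}

The boolean answer alone cannot tell us which alternative was used; this is
exactly the information a lexical value records.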
|
380 |
|
381 The contribution of Sulzmann and Lu is an extension of Brzozowski's |
|
382 algorithm by a second phase (the first phase being building successive |
|
383 derivatives---see \eqref{graph:*}). In this second phase, a POSIX value |
|
384 is generated in case the regular expression matches the string. |
|
385 Pictorially, the Sulzmann and Lu algorithm is as follows: |
|
386 |
|
387 \begin{ceqn} |
|
388 \begin{equation}\label{graph:2} |
|
389 \begin{tikzcd} |
|
390 r_0 \arrow[r, "\backslash c_0"] \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\ |
|
391 v_0 & v_1 \arrow[l,"inj_{r_0} c_0"] & v_2 \arrow[l, "inj_{r_1} c_1"] & v_n \arrow[l, dashed] |
|
392 \end{tikzcd} |
|
393 \end{equation} |
|
394 \end{ceqn} |
|
395 |
|
396 |
|
397 \noindent |
|
398 For convenience, we shall employ the following notations: the regular |
|
399 expression we start with is $r_0$, and the given string $s$ is composed |
|
400 of characters $c_0 c_1 \ldots c_{n-1}$. In the first phase from the |
|
401 left to right, we build the derivatives $r_1$, $r_2$, \ldots according |
|
402 to the characters $c_0$, $c_1$ until we exhaust the string and obtain |
|
403 the derivative $r_n$. We test whether this derivative is |
|
404 $\textit{nullable}$ or not. If not, we know the string does not match |
|
405 $r$ and no value needs to be generated. If yes, we start building the |
|
406 values incrementally by \emph{injecting} back the characters into the |
|
407 earlier values $v_n, \ldots, v_0$. This is the second phase of the |
|
408 algorithm from the right to left. For the first value $v_n$, we call the |
|
409 function $\textit{mkeps}$, which builds a POSIX lexical value |
|
410 for how the empty string has been matched by the (nullable) regular |
|
411 expression $r_n$. This function is defined as |
|
412 |
|
413 \begin{center} |
|
414 \begin{tabular}{lcl} |
|
415 $\mkeps(\ONE)$ & $\dn$ & $\Empty$ \\ |
|
416 $\mkeps(r_{1}+r_{2})$ & $\dn$ |
|
417 & \textit{if} $\nullable(r_{1})$\\ |
|
418 & & \textit{then} $\Left(\mkeps(r_{1}))$\\ |
|
419 & & \textit{else} $\Right(\mkeps(r_{2}))$\\ |
|
420 $\mkeps(r_1\cdot r_2)$ & $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\ |
|
$\mkeps(r^*)$ & $\dn$ & $\Stars\,[]$
|
422 \end{tabular} |
|
423 \end{center} |
|
424 |
|
425 |
|
426 \noindent |
|
427 After the $\mkeps$-call, we inject back the characters one by one in order to build |
|
428 the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$ |
|
429 ($s_i = c_i \ldots c_{n-1}$ ) from the previous lexical value $v_{i+1}$. |
|
430 After injecting back $n$ characters, we get the lexical value for how $r_0$ |
|
matches $s$. The POSIX value is maintained throughout the process.
|
432 For this Sulzmann and Lu defined a function that reverses |
|
433 the ``chopping off'' of characters during the derivative phase. The |
|
434 corresponding function is called \emph{injection}, written |
|
435 $\textit{inj}$; it takes three arguments: the first one is a regular |
|
436 expression ${r_{i-1}}$, before the character is chopped off, the second |
|
437 is a character ${c_{i-1}}$, the character we want to inject and the |
|
438 third argument is the value ${v_i}$, into which one wants to inject the |
|
439 character (it corresponds to the regular expression after the character |
|
440 has been chopped off). The result of this function is a new value. The |
|
441 definition of $\textit{inj}$ is as follows: |
|
442 |
|
443 \begin{center} |
|
444 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l} |
|
$\textit{inj}\,(c)\,c\,\Empty$ & $\dn$ & $\Char\,c$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
$\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$ & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
$\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$ & $\Seq(\mkeps(r_1),\textit{inj}\,r_2\,c\,v)$\\
$\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$ & $\dn$ & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
|
452 \end{tabular} |
|
453 \end{center} |
|
454 |
|
455 \noindent This definition is by recursion on the ``shape'' of regular |
|
456 expressions and values. |
|
The clauses basically do one thing: they identify the ``hole'' in the
value into which the character is injected back.
For instance, consider the last clause, which injects back into a value
whose result will be a star value.
We know the given value must be a sequence value, whose first
component corresponds to the child regex of the star
with the first character chopped off, that is, the iteration of the star
that has just been unfolded. This component is followed by the already
matched star iterations collected before. So we inject the character
back into the first component and form a new value with this new iteration
added to the previous list of iterations, all under the $\Stars$
constructor at the top level.
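
The following Scala sketch, continuing the earlier matcher sketch, spells out
values, $\mkeps$, $\textit{inj}$ and the resulting two-phase lexer. The names
are again our own; in particular the constructors \texttt{Lft} and \texttt{Rgt}
are chosen to avoid clashing with Scala's built-in \texttt{Left}/\texttt{Right}:

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Lft(v: Val) extends Val
case class Rgt(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val

def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) => if (nullable(r1)) Lft(mkeps(r1)) else Rgt(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}

def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Lft(v1)) => Lft(inj(r1, c, v1))
  case (ALT(_, r2), Rgt(v2)) => Rgt(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Lft(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Rgt(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// two phases: derivatives going forward, injections going backward
def lexer(r: Rexp, s: List[Char]): Option[Val] = s match {
  case Nil => if (nullable(r)) Some(mkeps(r)) else None
  case c :: cs => lexer(der(c, r), cs).map(v => inj(r, c, v))
}

// e.g. lexer(ALT(IF, ID), "iffoo".toList) yields a Rgt(...) value,
// i.e. the identifier branch, as the POSIX rules require.
\end{verbatim}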
|
469 |
|
We have mentioned before that derivatives without simplification
can get clumsy, and this is true for values as well: by definition they mirror
the size of the regular expressions.

One can introduce simplification on the regexes and values, but one has to
be careful not to break correctness, as the injection
function heavily relies on the structure of the regexes and values
being correct and matching each other.
This can be achieved by recording some extra rectification functions
during the derivative steps and applying these rectifications
during the injection phase.
One can then prove that the POSIX value for how a
regular expression matches a string is not affected, although this is much harder
to establish.
|
484 Some initial results in this regard have been |
|
485 obtained in \cite{AusafDyckhoffUrban2016}. |
|
486 |
|
487 |
|
488 |
|
489 %Brzozowski, after giving the derivatives and simplification, |
|
490 %did not explore lexing with simplification or he may well be |
|
491 %stuck on an efficient simplificaiton with a proof. |
|
492 %He went on to explore the use of derivatives together with |
|
493 %automaton, and did not try lexing using derivatives. |
|
494 |
|
We want to get rid of this complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$ altogether,
and obtain the lexing information, which is already there, in a single pass
while building the derivatives, without a second injection phase?
At the same time, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
adding additional information to the
regular expressions, called \emph{bitcodes}.
|
505 |
|
506 \subsection*{Bit-coded Algorithm} |
|
507 Bits and bitcodes (lists of bits) are defined as: |
|
508 |
|
509 \begin{center} |
|
510 $b ::= 1 \mid 0 \qquad |
|
511 bs ::= [] \mid b::bs |
|
512 $ |
|
513 \end{center} |
|
514 |
|
515 \noindent |
|
516 The $1$ and $0$ are not in bold in order to avoid |
|
517 confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or |
|
518 bit-lists) can be used to encode values (or potentially incomplete values) in a |
|
519 compact form. This can be straightforwardly seen in the following |
|
520 coding function from values to bitcodes: |
|
521 |
|
522 \begin{center} |
|
523 \begin{tabular}{lcl} |
|
524 $\textit{code}(\Empty)$ & $\dn$ & $[]$\\ |
|
525 $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\ |
|
526 $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\ |
|
527 $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\ |
|
528 $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\ |
|
529 $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\ |
|
530 $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\; |
|
531 code(\Stars\,vs)$ |
|
532 \end{tabular} |
|
533 \end{center} |
|
534 |
|
535 \noindent |
|
Here $\textit{code}$ encodes a value into a bitcode by converting
|
537 $\Left$ into $0$, $\Right$ into $1$, and marks the start of a non-empty |
|
538 star iteration by $1$. The border where a local star terminates |
|
539 is marked by $0$. This coding is lossy, as it throws away the information about |
|
540 characters, and also does not encode the ``boundary'' between two |
|
541 sequence values. Moreover, with only the bitcode we cannot even tell |
|
542 whether the $1$s and $0$s are for $\Left/\Right$ or $\Stars$. The |
|
543 reason for choosing this compact way of storing information is that the |
|
544 relatively small size of bits can be easily manipulated and ``moved |
|
545 around'' in a regular expression. In order to recover values, we will |
|
need the corresponding regular expression as extra information. This
|
547 means the decoding function is defined as: |
|
548 |
|
549 |
|
550 %\begin{definition}[Bitdecoding of Values]\mbox{} |
|
551 \begin{center} |
|
552 \begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}} |
|
553 $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\ |
|
554 $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\ |
|
555 $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
556 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\; |
|
557 (\Left\,v, bs_1)$\\ |
|
558 $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ & |
|
559 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\; |
|
560 (\Right\,v, bs_1)$\\ |
|
561 $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ & |
|
562 $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\ |
|
563 & & $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\ |
|
564 & & \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\ |
|
565 $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\ |
|
566 $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & |
|
567 $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
568 & & $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\ |
|
569 & & \hspace{35mm}$\textit{in}\;(\Stars\,v\!::\!vs, bs_2)$\bigskip\\ |
|
570 |
|
571 $\textit{decode}\,bs\,r$ & $\dn$ & |
|
572 $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\ |
|
573 & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\; |
|
574 \textit{else}\;\textit{None}$ |
|
575 \end{tabular} |
|
576 \end{center} |
|
577 %\end{definition} |
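
For concreteness, here is a sketch of $\textit{code}$ and $\textit{decode}$ in
Scala, continuing the earlier sketches (again only an illustration of the
definitions above, not the authors' code):

\begin{verbatim}
type Bits = List[Int]   // bitcodes as lists of 0s and 1s

def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Lft(v1) => 0 :: code(v1)
  case Rgt(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}

def decodeAux(r: Rexp, bs: Bits): (Val, Bits) = (r, bs) match {
  case (ONE, _) => (Empty, bs)
  case (CHAR(c), _) => (Chr(c), bs)
  case (ALT(r1, _), 0 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest); (Lft(v), bs1)
  case (ALT(_, r2), 1 :: rest) =>
    val (v, bs1) = decodeAux(r2, rest); (Rgt(v), bs1)
  case (SEQ(r1, r2), _) =>
    val (v1, bs1) = decodeAux(r1, bs)
    val (v2, bs2) = decodeAux(r2, bs1)
    (Sequ(v1, v2), bs2)
  case (STAR(_), 0 :: rest) => (Stars(Nil), rest)
  case (STAR(r1), 1 :: rest) =>
    val (v, bs1) = decodeAux(r1, rest)
    decodeAux(STAR(r1), bs1) match {
      case (Stars(vs), bs2) => (Stars(v :: vs), bs2)
    }
}

def decode(r: Rexp, bs: Bits): Option[Val] =
  decodeAux(r, bs) match {
    case (v, Nil) => Some(v)
    case _ => None
  }
\end{verbatim}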
|
578 |
|
Sulzmann and Lu integrated the bitcodes into regular expressions to
|
580 create annotated regular expressions \cite{Sulzmann2014}. |
|
581 \emph{Annotated regular expressions} are defined by the following |
|
582 grammar:%\comment{ALTS should have an $as$ in the definitions, not just $a_1$ and $a_2$} |
|
583 |
|
584 \begin{center} |
|
585 \begin{tabular}{lcl} |
|
586 $\textit{a}$ & $::=$ & $\ZERO$\\ |
|
587 & $\mid$ & $_{bs}\ONE$\\ |
|
588 & $\mid$ & $_{bs}{\bf c}$\\ |
|
589 & $\mid$ & $_{bs}\sum\,as$\\ |
|
590 & $\mid$ & $_{bs}a_1\cdot a_2$\\ |
|
591 & $\mid$ & $_{bs}a^*$ |
|
592 \end{tabular} |
|
593 \end{center} |
|
594 %(in \textit{ALTS}) |
|
595 |
|
596 \noindent |
|
597 where $bs$ stands for bitcodes, $a$ for $\mathbf{a}$nnotated regular |
|
598 expressions and $as$ for a list of annotated regular expressions. |
|
The alternative constructor ($\sum$) has been generalised to
accept a list of annotated regular expressions rather than just two.
|
601 We will show that these bitcodes encode information about |
|
602 the (POSIX) value that should be generated by the Sulzmann and Lu |
|
603 algorithm. |
|
604 |
|
605 |
|
606 To do lexing using annotated regular expressions, we shall first |
|
607 transform the usual (un-annotated) regular expressions into annotated |
|
608 regular expressions. This operation is called \emph{internalisation} and |
|
609 defined as follows: |
|
610 |
|
611 %\begin{definition} |
|
612 \begin{center} |
|
613 \begin{tabular}{lcl} |
|
614 $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\ |
|
615 $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\ |
|
616 $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\ |
|
617 $(r_1 + r_2)^\uparrow$ & $\dn$ & |
|
618 $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\, |
|
619 \textit{fuse}\,[1]\,r_2^\uparrow]$\\ |
|
620 $(r_1\cdot r_2)^\uparrow$ & $\dn$ & |
|
621 $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\ |
|
622 $(r^*)^\uparrow$ & $\dn$ & |
|
623 $_{[]}(r^\uparrow)^*$\\ |
|
624 \end{tabular} |
|
625 \end{center} |
|
626 %\end{definition} |
|
627 |
|
628 \noindent |
|
629 We use up arrows here to indicate that the basic un-annotated regular |
|
630 expressions are ``lifted up'' into something slightly more complex. In the |
|
631 fourth clause, $\textit{fuse}$ is an auxiliary function that helps to |
|
632 attach bits to the front of an annotated regular expression. Its |
|
633 definition is as follows: |
|
634 |
|
635 \begin{center} |
|
636 \begin{tabular}{lcl} |
|
637 $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\ |
|
638 $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ & |
|
639 $_{bs @ bs'}\ONE$\\ |
|
640 $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ & |
|
641 $_{bs@bs'}{\bf c}$\\ |
|
642 $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ & |
|
643 $_{bs@bs'}\sum\textit{as}$\\ |
|
644 $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ & |
|
645 $_{bs@bs'}a_1 \cdot a_2$\\ |
|
646 $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ & |
|
647 $_{bs @ bs'}a^*$ |
|
648 \end{tabular} |
|
649 \end{center} |
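
A Scala sketch of annotated regular expressions together with
$\textit{fuse}$ and internalisation might look as follows (our own
illustrative rendering of the definitions above):

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// attach bits to the front of an annotated regular expression
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bs1) => AONE(bs ::: bs1)
  case ACHAR(bs1, c) => ACHAR(bs ::: bs1, c)
  case AALTS(bs1, as) => AALTS(bs ::: bs1, as)
  case ASEQ(bs1, a1, a2) => ASEQ(bs ::: bs1, a1, a2)
  case ASTAR(bs1, a1) => ASTAR(bs ::: bs1, a1)
}

// lift an ordinary regular expression into an annotated one
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}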
|
650 |
|
651 \noindent |
|
652 After internalising the regular expression, we perform successive |
|
653 derivative operations on the annotated regular expressions. This |
|
654 derivative operation is the same as what we had previously for the |
|
basic regular expressions, except that we need to take care of
|
656 the bitcodes: |
|
657 |
|
658 |
|
659 \iffalse |
|
660 %\begin{definition}{bder} |
|
661 \begin{center} |
|
662 \begin{tabular}{@{}lcl@{}} |
|
663 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
664 $(\textit{ONE}\;bs)\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
665 $(\textit{CHAR}\;bs\,d)\,\backslash c$ & $\dn$ & |
|
666 $\textit{if}\;c=d\; \;\textit{then}\; |
|
667 \textit{ONE}\;bs\;\textit{else}\;\textit{ZERO}$\\ |
|
668 $(\textit{ALTS}\;bs\,as)\,\backslash c$ & $\dn$ & |
|
669 $\textit{ALTS}\;bs\,(map (\backslash c) as)$\\ |
|
670 $(\textit{SEQ}\;bs\,a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
671 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
672 & &$\textit{then}\;\textit{ALTS}\,bs\,List((\textit{SEQ}\,[]\,(a_1\,\backslash c)\,a_2),$\\ |
|
673 & &$\phantom{\textit{then}\;\textit{ALTS}\,bs\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
674 & &$\textit{else}\;\textit{SEQ}\,bs\,(a_1\,\backslash c)\,a_2$\\ |
|
675 $(\textit{STAR}\,bs\,a)\,\backslash c$ & $\dn$ & |
|
676 $\textit{SEQ}\;bs\,(\textit{fuse}\, [\Z] (r\,\backslash c))\, |
|
677 (\textit{STAR}\,[]\,r)$ |
|
678 \end{tabular} |
|
679 \end{center} |
|
680 %\end{definition} |
|
681 |
|
682 \begin{center} |
|
683 \begin{tabular}{@{}lcl@{}} |
|
684 $(\textit{ZERO})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
685 $(_{bs}\textit{ONE})\,\backslash c$ & $\dn$ & $\textit{ZERO}$\\ |
|
686 $(_{bs}\textit{CHAR}\;d)\,\backslash c$ & $\dn$ & |
|
687 $\textit{if}\;c=d\; \;\textit{then}\; |
|
688 _{bs}\textit{ONE}\;\textit{else}\;\textit{ZERO}$\\ |
|
689 $(_{bs}\textit{ALTS}\;\textit{as})\,\backslash c$ & $\dn$ & |
|
690 $_{bs}\textit{ALTS}\;(\textit{as}.\textit{map}(\backslash c))$\\ |
|
691 $(_{bs}\textit{SEQ}\;a_1\,a_2)\,\backslash c$ & $\dn$ & |
|
692 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
693 & &$\textit{then}\;_{bs}\textit{ALTS}\,List((_{[]}\textit{SEQ}\,(a_1\,\backslash c)\,a_2),$\\ |
|
694 & &$\phantom{\textit{then}\;_{bs}\textit{ALTS}\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c)))$\\ |
|
695 & &$\textit{else}\;_{bs}\textit{SEQ}\,(a_1\,\backslash c)\,a_2$\\ |
|
696 $(_{bs}\textit{STAR}\,a)\,\backslash c$ & $\dn$ & |
|
697 $_{bs}\textit{SEQ}\;(\textit{fuse}\, [0] \; r\,\backslash c )\, |
|
698 (_{bs}\textit{STAR}\,[]\,r)$ |
|
699 \end{tabular} |
|
700 \end{center} |
|
701 %\end{definition} |
|
702 \fi |
|
703 |
|
704 \begin{center} |
|
705 \begin{tabular}{@{}lcl@{}} |
|
706 $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
707 $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\ |
|
708 $(_{bs}{\bf d})\,\backslash c$ & $\dn$ & |
|
709 $\textit{if}\;c=d\; \;\textit{then}\; |
|
710 _{bs}\ONE\;\textit{else}\;\ZERO$\\ |
|
711 $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ & |
|
712 $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\ |
|
713 $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ & |
|
714 $\textit{if}\;\textit{bnullable}\,a_1$\\ |
|
715 & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\ |
|
716 & &$\phantom{\textit{then},\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\ |
|
717 & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\ |
|
$(_{bs}a^*)\,\backslash c$ & $\dn$ &
$_{bs}(\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
(_{[]}a^*)$
|
721 \end{tabular} |
|
722 \end{center} |
|
723 |
|
724 %\end{definition} |
|
725 \noindent |
|
For instance, when we take the derivative of $_{bs}a^*$ with respect to $c$,
we need to unfold it into a sequence,
and attach an additional bit $0$ to the front of $a \backslash c$
to indicate one more star iteration. Also the sequence clause
|
is more subtle. When $a_1$ is $\textit{bnullable}$ (here
\textit{bnullable} is exactly the same as $\textit{nullable}$, except
that it applies to annotated regular expressions; we therefore omit its
definition), the derivative produces an alternative.
Assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to character $c$ (more on this later);
then the right branch of this alternative, which is $\textit{fuse} \; (\textit{bmkeps} \; a_1)\, (a_2
\backslash c)$, collapses the regular expression $a_1$ (as it has
already been fully matched) and stores the parsing information at the
head of the regular expression $a_2 \backslash c$ by fusing it to the front. The
bitsequence $\textit{bs}$, which was initially attached to the
|
740 first element of the sequence $a_1 \cdot a_2$, has |
|
741 now been elevated to the top-level of $\sum$, as this information will be |
|
742 needed whichever way the sequence is matched---no matter whether $c$ belongs |
|
743 to $a_1$ or $ a_2$. After building these derivatives and maintaining all |
|
744 the lexing information, we complete the lexing by collecting the |
|
745 bitcodes using a generalised version of the $\textit{mkeps}$ function |
|
746 for annotated regular expressions, called $\textit{bmkeps}$: |
|
747 |
|
748 |
|
749 %\begin{definition}[\textit{bmkeps}]\mbox{} |
|
750 \begin{center} |
|
751 \begin{tabular}{lcl} |
|
752 $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\ |
|
753 $\textit{bmkeps}\,(_{bs}\sum a::\textit{as})$ & $\dn$ & |
|
754 $\textit{if}\;\textit{bnullable}\,a$\\ |
|
755 & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\ |
|
756 & &$\textit{else}\;bs\,@\,\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\ |
|
757 $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ & |
|
758 $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\ |
|
759 $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ & |
|
760 $bs \,@\, [0]$ |
|
761 \end{tabular} |
|
762 \end{center} |
|
763 %\end{definition} |
|
764 |
|
765 \noindent |
|
766 This function completes the value information by travelling along the |
|
767 path of the regular expression that corresponds to a POSIX value and |
|
collecting all the bitcodes, using $0$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
|
770 decode them, we get the value we expect. The corresponding lexing |
|
771 algorithm looks as follows: |
|
772 |
|
773 \begin{center} |
|
774 \begin{tabular}{lcl} |
|
775 $\textit{blexer}\;r\,s$ & $\dn$ & |
|
776 $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\ |
|
777 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
778 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
779 & & $\;\;\textit{else}\;\textit{None}$ |
|
780 \end{tabular} |
|
781 \end{center} |
|
782 |
|
783 \noindent |
|
784 In this definition $\_\backslash s$ is the generalisation of the derivative |
|
785 operation from characters to strings (just like the derivatives for un-annotated |
|
786 regular expressions). |
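
The following Scala sketch puts these pieces together
($\textit{bnullable}$, $\textit{bmkeps}$, the bit-coded derivative and
$\textit{blexer}$), continuing the annotated-regular-expression sketch above;
it is only meant to mirror the mathematical definitions, not any particular
implementation:

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// collect the bits along the nullable path
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, as) => bs ::: bmkeps(as.find(bnullable).get)
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}
\end{verbatim}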
|
787 |
|
788 Now we introduce the simplifications, which is why we introduce the |
|
789 bitcodes in the first place. |
|
790 |
|
791 \subsection*{Simplification Rules} |
|
792 |
|
793 This section introduces aggressive (in terms of size) simplification rules |
|
794 on annotated regular expressions |
|
to keep derivatives small. Such simplifications are promising,
as the test data we have
generated indicate
that a good tight size bound can be achieved. We could only
partially cover the search space, as there are infinitely many regular
expressions and strings.
|
801 |
|
802 One modification we introduced is to allow a list of annotated regular |
|
803 expressions in the $\sum$ constructor. This allows us to not just |
|
804 delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but |
|
805 also unnecessary ``copies'' of regular expressions (very similar to |
|
806 simplifying $r + r$ to just $r$, but in a more general setting). Another |
|
807 modification is that we use simplification rules inspired by Antimirov's |
|
808 work on partial derivatives. They maintain the idea that only the first |
|
809 ``copy'' of a regular expression in an alternative contributes to the |
|
810 calculation of a POSIX value. All subsequent copies can be pruned away from |
|
811 the regular expression. A recursive definition of our simplification function |
|
812 that looks somewhat similar to our Scala code is given below: |
|
813 %\comment{Use $\ZERO$, $\ONE$ and so on. |
|
814 %Is it $ALTS$ or $ALTS$?}\\ |
|
815 |
|
816 \begin{center} |
|
817 \begin{tabular}{@{}lcl@{}} |
|
818 |
|
819 $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp} \; a_2) \; \textit{match} $ \\ |
|
820 &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow \ZERO$ \\ |
|
821 &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow \ZERO$ \\ |
|
822 &&$\quad\textit{case} \; (\ONE, a_2') \Rightarrow \textit{fuse} \; bs \; a_2'$ \\ |
|
823 &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow \textit{fuse} \; bs \; a_1'$ \\ |
|
824 &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ |
|
825 |
|
826 $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\ |
|
827 &&$\quad\textit{case} \; [] \Rightarrow \ZERO$ \\ |
|
828 &&$\quad\textit{case} \; a :: [] \Rightarrow \textit{fuse bs a}$ \\ |
|
829 &&$\quad\textit{case} \; as' \Rightarrow _{bs}\sum \textit{as'}$\\ |
|
830 |
|
831 $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$ |
|
832 \end{tabular} |
|
833 \end{center} |
|
834 |
|
835 \noindent |
|
The simplification function pattern-matches on the regular expression.
When it detects that the regular expression is an alternative or a
sequence, it first simplifies its child regular expressions
recursively and then checks whether one of the children has turned into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and remove as many duplicates as possible. Function
$\textit{distinct}$ keeps only the first occurring copy of a duplicate and removes all
later ones. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:
|
847 |
|
848 \begin{center} |
|
849 \begin{tabular}{@{}lcl@{}} |
|
850 $\textit{flatten} \; (_{bs}\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $(\textit{map} \; |
|
851 (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; as' $ \\ |
|
852 $\textit{flatten} \; \ZERO :: as'$ & $\dn$ & $ \textit{flatten} \; \textit{as'} $ \\ |
|
853 $\textit{flatten} \; a :: as'$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) |
|
854 \end{tabular} |
|
855 \end{center} |
|
856 |
|
857 \noindent |
|
858 Here $\textit{flatten}$ behaves like the traditional functional programming flatten |
|
859 function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it |
|
860 removes parentheses, for example changing $a+(b+c)$ into $a+b+c$. |
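
A corresponding Scala sketch of $\textit{simp}$ and $\textit{flatten}$ is
given below. It mirrors the clauses above, except that in the clause
eliminating a leading $\ONE$ we also keep the bits of that $\ONE$, and we omit
the clause for a trailing $\ONE$, whose bit bookkeeping is more delicate; how
these details are handled in the actual implementation may differ:

\begin{verbatim}
// open up nested alternatives, dropping ZEROs along the way
def flatten(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flatten(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flatten(rest)
  case a :: rest => a :: flatten(rest)
}

def simp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (simp(a1), simp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s) // drop the ONE, keep bits
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flatten(as.map(simp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case _ => a
}
\end{verbatim}

Note that \texttt{List.distinct} keeps exactly the first occurrence of each
duplicate, matching the behaviour of $\textit{distinct}$ described above.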
|
861 |
|
862 Having defined the $\simp$ function, |
|
863 we can use the previous notation of natural |
|
864 extension from derivative w.r.t.~character to derivative |
|
865 w.r.t.~string:%\comment{simp in the [] case?} |
|
866 |
|
867 \begin{center} |
|
868 \begin{tabular}{lcl} |
|
869 $r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\ |
|
870 $r \backslash_{simp} [\,] $ & $\dn$ & $r$ |
|
871 \end{tabular} |
|
872 \end{center} |
|
873 |
|
874 \noindent |
|
875 to obtain an optimised version of the algorithm: |
|
876 |
|
877 \begin{center} |
|
878 \begin{tabular}{lcl} |
|
879 $\textit{blexer\_simp}\;r\,s$ & $\dn$ & |
|
880 $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\ |
|
881 & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\ |
|
882 & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\ |
|
883 & & $\;\;\textit{else}\;\textit{None}$ |
|
884 \end{tabular} |
|
885 \end{center} |
|
886 |
|
887 \noindent |
|
This algorithm keeps the regular expression size small: for example,
with this simplification the derivative of our previous $(a + aa)^*$ example,
which previously grew to more than 8000 nodes,
is reduced to just 6 nodes and stays constant, no matter how long the
input string is.
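
With $\textit{simp}$ in place, the optimised lexer is a small change to the
earlier sketch; we also add a size function that can be used to reproduce
experiments like the $(a+aa)^*$ one (again, an illustration only):

\begin{verbatim}
// simplify after every derivative step
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, simp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(r, bmkeps(a)) else None
}

// node count of an annotated regular expression
def asize(a: ARexp): Int = a match {
  case AZERO | AONE(_) | ACHAR(_, _) => 1
  case AALTS(_, as) => 1 + as.map(asize).sum
  case ASEQ(_, a1, a2) => 1 + asize(a1) + asize(a2)
  case ASTAR(_, a1) => 1 + asize(a1)
}
\end{verbatim}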
|
892 |
|
893 |
6 |
894 |
7 |
895 |
8 |
896 |
9 |
897 |
10 |
898 |
15 To be completed. |
903 To be completed. |
16 |
904 |
17 |
905 |
18 |
906 |
19 |
907 |
20 %----------------------------------- |
908 %---------------------------------------------------------------------------------------- |
21 % SECTION ? |
909 % SECTION correctness proof |
22 %----------------------------------- |
910 %---------------------------------------------------------------------------------------- |
23 \section{preparatory properties for getting a finiteness bound} |
911 \section{Correctness of Bit-coded Algorithm (Without Simplification) |
24 Before we get to the proof that says the intermediate result of our lexer will |
912 We now give the proof the correctness of the algorithm with bit-codes. |
25 remain finitely bounded, which is an important efficiency/liveness guarantee, |
913 |
26 we shall first develop a few preparatory properties and definitions to |
914 Ausaf and Urban cleverly defined an auxiliary function called $\flex$, |
27 make the process of proving that a breeze. |
915 defined as |
28 |
916 \[ |
29 We define rewriting relations for $\rrexp$s, which allows us to do the |
917 \flex \; r \; f \; [] \; v \; = \; f\; v |
30 same trick as we did for the correctness proof, |
918 \flex \; r \; f \; c :: s \; v = \flex r \; \lambda v. \, f (\inj \; r\; c\; v)\; s \; v |
31 but this time we will have stronger equalities established. |
919 \] |
32 \subsection{"hrewrite" relation} |
920 which accumulates the characters that needs to be injected back, |
33 List of 1-step rewrite rules for regular expressions simplification without bitcodes: |
921 and does the injection in a stack-like manner (last taken derivative first injected). |
34 \begin{center} |
922 $\flex$ is connected to the $\lexer$: |
35 \begin{tabular}{lcl} |
|
36 $r_1 \cdot \ZERO$ & $\rightarrow$ & $\ZERO$ |
|
37 \end{tabular} |
|
38 \end{center} |
|
39 |
|
40 And we define an "grewrite" relation that works on lists: |
|
41 \begin{center} |
|
42 \begin{tabular}{lcl} |
|
43 $ \ZERO :: rs$ & $\rightarrow_g$ & $rs$ |
|
44 \end{tabular} |
|
45 \end{center} |
|
46 |
|
47 |
|
48 |
|
49 With these relations we prove |
|
50 \begin{lemma} |
923 \begin{lemma} |
51 $rs \rightarrow rs' \implies \RALTS{rs} \rightarrow \RALTS{rs'}$ |
924 $\flex \; r \; \textit{id}\; s \; \mkeps (r\backslash s) = \lexer \; r \; s$ |
52 \end{lemma} |
925 \end{lemma} |
53 which enables us to prove "closed forms" equalities such as |
926 $\flex$ provides us a bridge between $\lexer$ and $\blexer$. |
54 \begin{lemma} |
927 What is even better about $\flex$ is that it allows us to |
55 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
928 directly operate on the value $\mkeps (r\backslash v)$, |
|
929 which is pivotal in the definition of $\lexer $ and $\blexer$, but not visible as an argument. |
|
930 When the value created by $\mkeps$ becomes available, one can |
|
931 prove some stepwise properties of lexing nicely: |
|
932 \begin{lemma}\label{flexStepwise} |
|
933 $\textit{flex} \; r \; f \; s@[c] \; v= \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $ |
56 \end{lemma} |
934 \end{lemma} |
57 |
935 |
58 But the most involved part of the above lemma is proving the following: |
936 And for $\blexer$ we have something with stepwise properties like $\flex$ as well, |
59 \begin{lemma} |
937 called $\retrieve$: |
60 $\bsimp{\RALTS{rs @ \RALTS{rs_1} @ rs'}} = \bsimp{\RALTS{rs @rs_1 @ rs'}} $ |
938 \[ |
|
939 \retrieve \; \ACHAR \, \textit{bs} \, c \; \Char(c) = \textit{bs} |
|
940 \] |
|
941 $\retrieve$ takes bit-codes from annotated regular expressions |
|
942 guided by a value. |
|
943 $\retrieve$ is connected to the $\blexer$ in the following way: |
|
944 \begin{lemma}\label{blexer_retrieve} |
|
945 $\blexer \; r \; s = \decode \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$ |
61 \end{lemma} |
946 \end{lemma} |
62 which is needed in proving things like |
947 If you take derivative of an annotated regular expression, |
63 \begin{lemma} |
948 you can $\retrieve$ the same bit-codes as before the derivative took place, |
64 $r \rightarrow_f r' \implies \rsimp{r} \rightarrow \rsimp{r'}$ |
949 provided that you use the corresponding value: |
|
950 |
|
951 \begin{lemma}\label{retrieveStepwise} |
|
952 $\retrieve \; (r \backslash c) \; v= \retrieve \; r \; (\inj \; r\; c\; v)$ |
65 \end{lemma} |
953 \end{lemma} |
66 |
954 The other good thing about $\retrieve$ is that it can be connected to $\flex$: |
67 Fortunately this is provable by induction on the list $rs$ |
955 %centralLemma1 |
68 |
956 \begin{lemma}\label{flex_retrieve} |
69 |
957 $\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$ |
70 |
958 \end{lemma} |
71 %----------------------------------- |
959 \begin{proof} |
72 % SECTION 2 |
960 By induction on $s$. The induction tactic is reverse induction on strings. |
73 %----------------------------------- |
961 $v$ is allowed to be arbitrary. |
74 |
962 The crucial point is to rewrite |
75 \section{Finiteness Property} |
963 \[ |
76 In this section let us describe our argument for why the size of the simplified |
964 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
77 derivatives with the aggressive simplification function is finitely bounded. |
|
78 Suppose |
|
79 we have a size function for bitcoded regular expressions, written |
|
80 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
|
81 |
|
82 \begin{center} |
|
83 \begin{tabular}{ccc} |
|
84 $\llbracket \ACHAR{bs}{c} \rrbracket $ & $\dn$ & $1$\\ |
|
85 \end{tabular} |
|
86 \end{center} |
|
87 (TODO: COMPLETE this defn and for $rs$) |
|
88 |
|
89 The size is based on a recursive function on the structure of the regex, |
|
90 not the bitcodes. |
|
91 Therefore we may as well talk about size of an annotated regular expression |
|
92 in their un-annotated form: |
|
93 \begin{center} |
|
94 $\asize(a) = \size(\erase(a))$. |
|
95 \end{center} |
|
96 (TODO: turn equals to roughly equals) |
|
97 |
|
98 But there is a minor nuisance here: |
|
99 the erase function actually messes with the structure of the regular expression: |
|
100 \begin{center} |
|
101 \begin{tabular}{ccc} |
|
102 $\erase(\AALTS{bs}{[]} )$ & $\dn$ & $\ZERO$\\ |
|
103 \end{tabular} |
|
104 \end{center} |
|
105 (TODO: complete this definition with singleton r in alts) |
|
106 |
|
107 An alternative regular expression with an empty list of children |
|
108 is turned into an $\ZERO$ during the |
|
109 $\erase$ function, thereby changing the size and structure of the regex. |
|
110 These will likely be fixable if we really want to use plain $\rexp$s for dealing |
|
111 with size, but we choose a more straightforward (or stupid) method by |
|
112 defining a new datatype that is similar to plain $\rexp$s but can take |
|
113 non-binary arguments for its alternative constructor, |
|
114 which we call $\rrexp$ to denote |
|
115 the difference between it and plain regular expressions. |
|
116 \[ \rrexp ::= \RZERO \mid \RONE |
|
117 \mid \RCHAR{c} |
|
118 \mid \RSEQ{r_1}{r_2} |
|
119 \mid \RALTS{rs} |
|
120 \mid \RSTAR{r} |
|
121 \] |
965 \] |
122 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
966 as |
123 but keep everything else intact. |
967 \[ |
124 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved |
968 \retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; \mkeps (r \backslash s@[c])) |
125 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
969 \]. |
126 $\ALTS$). |
970 This enables us to equate |
127 We denote the operation of erasing the bits and turning an annotated regular expression |
971 \[ |
128 into an $\rrexp{}$ as $\rerase{}$. |
972 \retrieve \; (r \backslash s@[c]) \; \mkeps (r \backslash s@[c]) |
129 \begin{center} |
973 \] |
130 \begin{tabular}{lcl} |
974 with |
131 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\ |
975 \[ |
132 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
976 \flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps (r\backslash s@[c]))) |
133 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ |
977 \], |
134 \end{tabular} |
978 which in turn can be rewritten as |
135 \end{center} |
979 \[ |
136 %TODO: FINISH definition of rerase |
980 \flex \; r \; \textit{id} \; s@[c] \; (\mkeps (r\backslash s@[c])) |
137 Similarly we could define the derivative and simplification on |
981 \]. |
138 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
982 \end{proof} |
139 except that now they can operate on alternatives taking multiple arguments. |
983 |
140 |
984 With the above lemma we can now link $\flex$ and $\blexer$. |
141 \begin{center} |
985 |
142 \begin{tabular}{lcr} |
986 \begin{lemma}\label{flex_blexer} |
143 $\RALTS{rs} \backslash c$ & $=$ & $\RALTS{map\; (\_ \backslash c) \;rs}$\\ |
987 $\textit{flex} \; r \; \textit{id} \; s \; \mkeps(r \backslash s) = \blexer \; r \; s$ |
144 (other clauses omitted) |
|
145 \end{tabular} |
|
146 \end{center} |
|
147 |
|
148 Now that $\rrexp$s do not have bitcodes on them, we can do the |
|
149 duplicate removal without an equivalence relation: |
|
150 \begin{center} |
|
151 \begin{tabular}{lcl} |
|
152 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
153 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
|
154 \end{tabular} |
|
155 \end{center} |
|
156 %TODO: definition of rsimp (maybe only the alternative clause) |
|
157 |
|
158 |
|
159 The reason why these definitions mirror precisely the corresponding operations |
|
160 on annotated regualar expressions is that we can calculate sizes more easily: |
|
161 |
|
162 \begin{lemma} |
|
163 $\rsize{\rerase a} = \asize a$ |
|
164 \end{lemma} |
988 \end{lemma} |
165 This lemma says that the size of an r-erased regex is the same as the annotated regex. |
|
166 this does not hold for plain $\rexp$s due to their ways of how alternatives are represented. |
|
167 \begin{lemma} |
|
168 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
|
169 \end{lemma} |
|
170 Putting these two together we get a property that allows us to estimate the |
|
171 size of an annotated regular expression derivative using its un-annotated counterpart: |
|
172 \begin{lemma} |
|
173 $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ |
|
174 \end{lemma} |
|
175 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
|
176 bitcodes are seen as $\rrexp$s. |
|
177 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
|
178 as the former suits people's intuitive way of stating a binary alternative |
|
179 regular expression. |
|
180 |
|
181 |
|
182 \begin{theorem} |
|
183 For any regex $r$, $\exists N_r. \forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N_r$ |
|
184 \end{theorem} |
|
185 |
|
186 \noindent |
|
187 \begin{proof} |
989 \begin{proof} |
188 We prove this by induction on $r$. The base cases for $\AZERO$, |
990 Using two of the above lemmas: \ref{flex_retrieve} and \ref{blexer_retrieve}. |
189 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
|
190 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
|
191 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
|
192 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
|
193 % |
|
194 \begin{center} |
|
195 \begin{tabular}{lcll} |
|
196 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
|
197 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
|
198 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
|
199 & $\leq$ & |
|
200 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
|
201 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
|
202 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
|
203 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
|
204 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
|
205 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
|
206 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
|
207 \end{tabular} |
|
208 \end{center} |
|
209 |
|
210 |
|
211 \noindent |
|
212 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
|
213 The reason why we could write the derivatives of a sequence this way is described in section 2. |
|
214 The term (2) is used to control (1). |
|
215 That is because one can obtain an overall |
|
216 smaller regex list |
|
217 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
|
218 Section 3 is dedicated to its proof. |
|
219 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
|
220 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
|
221 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
|
222 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
|
223 We reason similarly for $\STAR$.\medskip |
|
224 \end{proof} |
225 |
What guarantee does this bound give us?
It tells us that whatever the regular expression is, the size of its simplified derivatives can never grow beyond the bound $N_r$, no matter how long the input string becomes.
Take our previous regular expression $(a + aa)^*$ as an example:
|
230 \begin{center} |
|
231 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
232 \begin{tikzpicture} |
|
233 \begin{axis}[ |
|
234 xlabel={number of $a$'s}, |
|
235 x label style={at={(1.05,-0.05)}}, |
|
236 ylabel={regex size}, |
|
237 enlargelimits=false, |
|
238 xtick={0,5,...,30}, |
|
239 xmax=33, |
|
240 ymax= 40, |
|
241 ytick={0,10,...,40}, |
|
242 scaled ticks=false, |
|
243 axis lines=left, |
|
244 width=5cm, |
|
245 height=4cm, |
|
246 legend entries={$(a + aa)^*$}, |
|
247 legend pos=north west, |
|
248 legend cell align=left] |
|
249 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
250 \end{axis} |
|
251 \end{tikzpicture} |
|
252 \end{tabular} |
|
253 \end{center} |
|
254 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
255 with our simplification |
|
256 rules very effectively. |
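To make this concrete, below is a small Scala sketch, meant for a Scala REPL or worksheet, that measures the sizes of simplified derivatives of $(a + aa)^*$ for increasing numbers of $a$'s. It is only an illustration: the names (\texttt{der}, \texttt{flts}, \texttt{simp}, \texttt{dersSimp}, \texttt{size}) are ours, the regular expressions carry no bitcodes, and plain structural equality stands in for $\distinctWith$.
\begin{verbatim}
sealed trait Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALTS(rs: List[Rexp]) extends Rexp   // list-based alternatives
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALTS(rs)    => rs.exists(nullable)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO        => ZERO
  case ONE         => ZERO
  case CHAR(d)     => if (c == d) ONE else ZERO
  case ALTS(rs)    => ALTS(rs.map(der(c, _)))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALTS(List(SEQ(der(c, r1), r2), der(c, r2)))
    else SEQ(der(c, r1), r2)
  case STAR(r1)    => SEQ(der(c, r1), STAR(r1))
}

// flatten nested alternatives and drop ZEROs (the role of flts)
def flts(rs: List[Rexp]): List[Rexp] = rs.flatMap {
  case ZERO      => Nil
  case ALTS(rs1) => rs1
  case r         => List(r)
}

// simplification; .distinct (structural equality) stands in for distinctWith
def simp(r: Rexp): Rexp = r match {
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _)  => ZERO
    case (_, ZERO)  => ZERO
    case (ONE, r2s) => r2s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case ALTS(rs) => flts(rs.map(simp)).distinct match {
    case Nil       => ZERO
    case r1 :: Nil => r1
    case rs1       => ALTS(rs1)
  }
  case _ => r
}

def dersSimp(r: Rexp, s: List[Char]): Rexp =
  s.foldLeft(r)((acc, c) => simp(der(c, acc)))

def size(r: Rexp): Int = r match {
  case ZERO | ONE | CHAR(_) => 1
  case ALTS(rs)    => 1 + rs.map(size).sum
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case STAR(r1)    => 1 + size(r1)
}

val aaaStar = STAR(ALTS(List(CHAR('a'), SEQ(CHAR('a'), CHAR('a')))))
for (n <- 0 to 30)
  println(s"$n a's: size ${size(dersSimp(aaaStar, List.fill(n)('a')))}")
\end{verbatim}
The printed sizes stabilise after a few characters, matching the plateau in the graph above; the exact numbers depend on how aggressive the chosen \texttt{simp} is, so they need not coincide with the plotted data.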
|
257 |
|
258 |
|
As the graphs of some more randomly generated regular expressions show, the size of
the derivative might grow quickly for the first few characters of the input,
but the growth levels off once the input string is sufficiently long.
|
262 |
|
263 |
|
264 %a few sample regular experessions' derivatives |
|
265 %size change |
|
266 %TODO: giving graphs showing a few regexes' size changes |
|
267 %w;r;t the input characters number |
|
268 %a*, aa*, aaa*, ..... |
|
269 %randomly generated regexes |
|
270 \begin{center} |
|
271 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
272 \begin{tikzpicture} |
|
273 \begin{axis}[ |
|
274 xlabel={number of $a$'s}, |
|
275 x label style={at={(1.05,-0.05)}}, |
|
276 ylabel={regex size}, |
|
277 enlargelimits=false, |
|
278 xtick={0,5,...,30}, |
|
279 xmax=33, |
|
280 ymax=1000, |
|
281 ytick={0,100,...,1000}, |
|
282 scaled ticks=false, |
|
283 axis lines=left, |
|
284 width=5cm, |
|
285 height=4cm, |
|
286 legend entries={regex1}, |
|
287 legend pos=north west, |
|
288 legend cell align=left] |
|
289 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
290 \end{axis} |
|
291 \end{tikzpicture} |
|
292 & |
|
293 \begin{tikzpicture} |
|
294 \begin{axis}[ |
|
295 xlabel={$n$}, |
|
296 x label style={at={(1.05,-0.05)}}, |
|
297 %ylabel={time in secs}, |
|
298 enlargelimits=false, |
|
299 xtick={0,5,...,30}, |
|
300 xmax=33, |
|
301 ymax=1000, |
|
302 ytick={0,100,...,1000}, |
|
303 scaled ticks=false, |
|
304 axis lines=left, |
|
305 width=5cm, |
|
306 height=4cm, |
|
307 legend entries={regex2}, |
|
308 legend pos=north west, |
|
309 legend cell align=left] |
|
310 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
311 \end{axis} |
|
312 \end{tikzpicture} |
|
313 & |
|
314 \begin{tikzpicture} |
|
315 \begin{axis}[ |
|
316 xlabel={$n$}, |
|
317 x label style={at={(1.05,-0.05)}}, |
|
318 %ylabel={time in secs}, |
|
319 enlargelimits=false, |
|
320 xtick={0,5,...,30}, |
|
321 xmax=33, |
|
322 ymax=1000, |
|
323 ytick={0,100,...,1000}, |
|
324 scaled ticks=false, |
|
325 axis lines=left, |
|
326 width=5cm, |
|
327 height=4cm, |
|
328 legend entries={regex3}, |
|
329 legend pos=north west, |
|
330 legend cell align=left] |
|
331 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
332 \end{axis} |
|
333 \end{tikzpicture}\\ |
|
\multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes w.r.t.\ input string length.}
|
335 \end{tabular} |
|
336 \end{center} |
|
337 |
|
338 |
|
339 |
|
340 |
|
341 |
|
342 \noindent |
|
Clearly, the bound we give in this finiteness argument (Step (5)) is very loose and
far from the actual bound one can expect.
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$.
Given that $l_{N_2}$ is roughly of the order $4^{N_2}$, the bound for $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the bound for $\llbracket \bderssimp{r_2}{s} \rrbracket$ by (at least) the function
$f(x) = x * 4^x$.
This means the bound we obtain surges up at least
tower-exponentially with a linear increase of the depth:
for a regular expression of depth $n$, the bound
is approximately a tower of exponentials of height $n$, rather than a single exponential such as $4^n$.
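Spelling out one step of this estimate, and writing $N_{r_1 \cdot r_2}$ for the bound obtained for the sequence, Step (5) together with the rough estimate $l_{N} \approx 4^{N}$ gives
\begin{center}
$N_{r_1 \cdot r_2} \;\leq\; N_1 + \llbracket r_2 \rrbracket + 2 + l_{N_2} * N_2 \;\approx\; N_1 + \llbracket r_2 \rrbracket + 2 + 4^{N_2} * N_2$,
\end{center}
so each additional layer of sequence nesting applies (at least) the map $x \mapsto x * 4^{x}$ to the bound obtained one level below, which is what produces a tower of exponentials in the depth.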
|
355 |
|
We could do better than this, but a tighter bound does not improve
the finiteness property we are proving. If one is interested in a polynomial bound,
one would hope to obtain a bound similar to the tight one for partial
derivatives introduced by Antimirov \cite{Antimirov95}. After all, the idea with
|
360 $\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in |
|
361 partial derivatives). |
|
362 To obtain the exact same bound would mean |
|
363 we need to introduce simplifications, such as |
|
364 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$, |
|
365 which exist for partial derivatives. |
|
366 |
|
367 However, if we introduce them in our |
|
368 setting we would lose the POSIX property of our calculated values. |
|
A simple example for this would be the regex $(a + a\cdot b)\cdot(b\cdot c + c)$ matched against the string $abc$.
If we split this regex up into $a\cdot(b\cdot c + c) + a\cdot b \cdot (b\cdot c + c)$, the lexer
would give back $\Left(\Seq(\Char(a), \Left(\Seq(\Char(b), \Char(c)))))$ instead of
what we want: $\Seq(\Right(\Seq(\Char(a), \Char(b))), \Right(\Char(c)))$. Unless we can store the structural information
in all the places where a transformation of the form $(r_1 + r_2)\cdot r \rightarrow r_1 \cdot r + r_2 \cdot r$
occurs, and replay this information in the right order once we obtain
a result from the ``aggressively simplified'' regex, it would be impossible to still get a $\POSIX$ value.
This is unlike the simplifications we had before, such as the rewriting rule
$\ONE \cdot r \rightsquigarrow r$, under which our lexer still gives back the same value.
|
378 We will discuss better |
|
379 bounds in the last section of this chapter.\\[-6.5mm] |
|
380 |
|
381 |
|
382 |
|
383 |
|
384 %---------------------------------------------------------------------------------------- |
|
385 % SECTION ?? |
|
386 %---------------------------------------------------------------------------------------- |
|
387 |
|
\section{``Closed Forms'' of Regular Expressions' Derivatives w.r.t.\ Strings}
|
To work out the ``closed forms'' of regular expressions' derivatives, we first
introduce a few auxiliary definitions that make expressing them easier.
|
391 |
|
392 \begin{center} |
|
393 \begin{tabular}{ccc} |
|
394 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
395 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
|
396 $\sflataux r$ & $=$ & $ [r]$ |
|
397 \end{tabular} |
|
398 \end{center} |
|
The intuition behind $\sflataux{\_}$ is to break up nested regular expressions
of the left-associated shape $(\ldots((r_1 + r_2) + r_3) + \ldots )$
into a more ``balanced'' list of alternatives: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
On any other regular expression it returns the singleton list $[r]$.
|
403 |
|
404 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
405 the output type a regular expression, not a list: |
|
406 \begin{center} |
|
407 \begin{tabular}{ccc} |
|
$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\AALTS{ }{\sflataux{r} @ rs}$\\
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
$\sflat r$ & $=$ & $ r$
|
411 \end{tabular} |
|
412 \end{center} |
|
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive in the
first element of the list of children of
an alternative regex ($\AALTS{ }{rs}$), and are purposefully built for dealing with the derivative
$(r_1 \cdot r_2) \backslash s$.
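As an illustration, $\sflataux{\_}$ and $\sflat{\_}$ can be sketched in Scala on the list-based \texttt{ALTS} representation used in the earlier size-measuring sketch (the function names are again ours and carry no bitcodes):
\begin{verbatim}
// only the head of an alternative list is opened up, mirroring the
// left-nested shape of (r1 . r2) \ s
def sflatAux(r: Rexp): List[Rexp] = r match {
  case ALTS(r1 :: rs) => sflatAux(r1) ::: rs
  case ALTS(Nil)      => Nil
  case _              => List(r)
}

// same, but the result is kept as a regular expression
def sflat(r: Rexp): Rexp = r match {
  case ALTS(r1 :: rs) => ALTS(sflatAux(r1) ::: rs)
  case _              => r
}
\end{verbatim}
For example, \texttt{sflat(ALTS(List(ALTS(List(r1, r2)), r3)))} evaluates to \texttt{ALTS(List(r1, r2, r3))} for any \texttt{r1}, \texttt{r2} and \texttt{r3}.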
|
417 |
|
With $\sflat{\_}$ and $\sflataux{\_}$ we make
precise what ``closed forms'' we have for the sequence derivatives and their simplifications;
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
and $\bderssimp{(r_1\cdot r_2)}{s}$,
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
the substrings of $s$?
|
First let's look at a series of derivative steps on a sequence
regular expression, assuming that at each step the first
component of the sequence is nullable:
|
428 \begin{center} |
|
429 |
|
430 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \quad \longrightarrow_{\backslash c'''} \quad
|
432 \ldots$ |
|
433 |
|
434 \end{center} |
|
435 %TODO: cite indian paper |
|
436 Indianpaper have come up with a slightly more formal way of putting the above process: |
|
437 \begin{center} |
|
$(r_1 \cdot r_2) \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv ((r_1 \backslash (c_1 :: c_2:: \ldots c_n)) \cdot r_2) +
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
|
441 \end{center} |
|
442 where $\delta(b, r)$ will produce $r$ |
|
443 if $b$ evaluates to true, |
|
444 and $\ZERO$ otherwise. |
|
445 |
|
446 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
|
447 equivalence. To make this intuition useful |
|
448 for a formal proof, we need something |
|
449 stronger than language equivalence. |
|
450 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
451 more rigorously: |
|
452 \begin{lemma} |
|
453 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
454 \end{lemma} |
|
455 |
|
456 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
|
457 |
|
458 \begin{center} |
|
459 \begin{tabular}{lcl} |
|
460 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
461 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
|
462 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
|
463 \end{tabular} |
|
464 \end{center} |
|
It takes into account which prefixes $s'$ of $s$ make $r_1 \backslash s'$ nullable,
and keeps a list of the corresponding suffixes $s''$ with $s' @ s'' = s$. The list is sorted in
the order in which $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
In essence, $\vsuf{\_}{\_}$ is doing a ``virtual derivative'' of $r_1 \cdot r_2$, but instead of producing
the entire result of $(r_1 \cdot r_2) \backslash s$,
it only stores those $s''$ for which $r_2 \backslash s''$ is a term occurring in $(r_1\cdot r_2)\backslash s$.
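A Scala rendering of $\vsuf{\_}{\_}$, reusing \texttt{nullable} and \texttt{der} from the sketch given earlier in this chapter (again only an illustration of the definition above, not the Isabelle version):
\begin{verbatim}
// collects those suffixes c::cs of s whose preceding prefix leaves r1
// nullable; the accumulated derivative of r1 is threaded through
def vsuf(s: List[Char], r1: Rexp): List[List[Char]] = s match {
  case Nil => Nil
  case c :: cs =>
    if (nullable(r1)) vsuf(cs, der(c, r1)) :+ (c :: cs)
    else vsuf(cs, der(c, r1))
}
\end{verbatim}
For instance, \texttt{vsuf(List('a','b'), STAR(CHAR('a')))} returns \texttt{List(List('b'), List('a','b'))}, since both the empty prefix and the prefix \texttt{"a"} leave $a^*$ nullable.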
|
471 |
|
472 With this we can also add simplifications to both sides and get |
|
473 \begin{lemma} |
|
474 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
475 \end{lemma} |
|
476 Together with the idempotency property of $\rsimp$, |
|
477 %TODO: state the idempotency property of rsimp |
|
478 \begin{lemma} |
|
479 $\rsimp(r) = \rsimp (\rsimp(r))$ |
|
480 \end{lemma} |
|
it is possible to convert the above lemma to obtain a ``closed form''
|
482 for derivatives nested with simplification: |
|
483 \begin{lemma} |
|
484 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
485 \end{lemma} |
|
And now the reason we have (1) in Section 1 is clear:
$\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$
is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r_1, s))))}}}$,
as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}(r_1, s)$.
|
490 |
|
491 %---------------------------------------------------------------------------------------- |
|
492 % SECTION 3 |
|
493 %---------------------------------------------------------------------------------------- |
|
494 |
|
\section{Interaction between $\distinctWith$ and $\flts$}
|
Note that it is not immediately obvious that
\begin{center}
$\llbracket \distinctWith \; (\flts \; \textit{rs}) \; \approx \; \phi \rrbracket \leq \llbracket \distinctWith \; \textit{rs} \; \approx \; \phi \rrbracket $.
\end{center}
The intuition is that every duplicate that $\distinctWith$ removes from $\textit{rs}$ on the right-hand side
is also removed on the left-hand side, where the flattening may expose even more duplicates.
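To fix intuitions, here is a Scala sketch of a $\distinctWith$-style function (a generic de-duplication with an explicit equivalence and accumulator; the name and signature are ours, not the Isabelle ones):
\begin{verbatim}
// keeps the first representative of every equivalence class (w.r.t. eq)
// that is not already covered by the accumulator acc
def distinctWith[A](xs: List[A], eq: (A, A) => Boolean,
                    acc: List[A]): List[A] = xs match {
  case Nil => Nil
  case x :: rest =>
    if (acc.exists(eq(x, _))) distinctWith(rest, eq, acc)
    else x :: distinctWith(rest, eq, x :: acc)
}
\end{verbatim}
With structural equality as \texttt{eq} and an empty accumulator this behaves like the \texttt{.distinct} used in the earlier sketch; applying it after \texttt{flts} can only lead to a smaller overall result, which is what the inequality above asserts.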
|
502 |
|
503 |
|
504 %---------------------------------------------------------------------------------------- |
|
505 % SECTION ALTS CLOSED FORM |
|
506 %---------------------------------------------------------------------------------------- |
|
507 \section{A Closed Form for \textit{ALTS}} |
|
508 Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
509 |
|
510 |
|
511 There are a few key steps, one of these steps is |
|
512 $rsimp (rsimp\_ALTs (map (rder x) (rdistinct (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs)) {}))) |
|
513 = rsimp (rsimp\_ALTs (rdistinct (map (rder x) (rflts (map (rsimp \circ (\lambda r. rders\_simp r xs)) rs))) {}))$ |
|
514 |
|
One might want to prove this by a simple statement like:
$map \; (rder \; x) \; (rdistinct \; rs \; rset) = rdistinct \; (map \; (rder \; x) \; rs) \; ((rder \; x) \; ` \; rset)$.

For this to hold we would need the $\textit{distinct}$ function to pick up
the elements before and after taking derivatives consistently:
$r \in rset \equiv (rder \; x \; r) \in ((rder \; x) \; ` \; rset)$,
which essentially requires that the function $\backslash$ is an injective mapping.
|
522 |
|
523 Unfortunately the function $\backslash c$ is not an injective mapping. |
|
524 |
|
\subsection{Function $\backslash c$ is not injective (1-to-1)}
|
526 \begin{center} |
|
The derivative w.r.t.\ a character $c$ is not one-to-one.
Formally,
$\exists r_1 \;r_2. \; r_1 \neq r_2 \;\textit{and}\; r_1 \backslash c = r_2 \backslash c$
|
530 \end{center} |
|
531 This property is trivially true for the |
|
532 character regex example: |
|
533 \begin{center} |
|
534 $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
535 \end{center} |
|
But apart from the cases where the derivative
output is $\ZERO$, are there non-trivial results
of derivatives, i.e.\ ones that still match some strings?
The answer is yes.
For example,
|
541 \begin{center} |
|
Let $r_1 = a^*\cdot b$ \quad and \quad $r_2 = (a\cdot a^*)\cdot b + b$,\\
where $a$ is not nullable.\\
$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$\\
$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot b + b \backslash c$
|
546 \end{center} |
|
We start with two syntactically different regexes
and end up with the same derivative result.
This is not surprising, as we have the following
equality in the style of Arden's lemma:\\
|
551 \begin{center} |
|
552 $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
553 \end{center} |
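Instantiating $a$, $b$ and $c$ with the character regular expressions \texttt{CHAR('a')}, \texttt{CHAR('b')} and the character \texttt{'a'}, the earlier Scala sketch can be used to confirm this concrete instance of non-injectivity:
\begin{verbatim}
val a = CHAR('a'); val b = CHAR('b')
val r1 = SEQ(STAR(a), b)                         // a* . b
val r2 = ALTS(List(SEQ(SEQ(a, STAR(a)), b), b))  // (a . a*) . b + b
// both (un-simplified) derivatives are structurally identical
assert(der('a', r1) == der('a', r2))
\end{verbatim}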
|
554 |
|
555 %---------------------------------------------------------------------------------------- |
|
556 % SECTION 4 |
|
557 %---------------------------------------------------------------------------------------- |
|
558 \section{A Bound for the Star Regular Expression} |
|
559 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
|
560 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
|
561 the property of the $\distinct$ function. |
|
562 Now we try to get a bound on $r^* \backslash s$ as well. |
|
563 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
|
564 \begin{center} |
|
565 |
|
566 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
|
567 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
|
(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''}
|
569 \quad \ldots$ |
|
570 |
|
571 \end{center} |
|
572 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
573 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
574 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
|
575 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
|
576 count the possible size explosions of $r \backslash c$ themselves. |
|
577 |
|
Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
into $\RALTS{[r \backslash cc'c'' \cdot r^*,\; r \backslash c'' \cdot r^*,\; r \backslash c'c'' \cdot r^*,\; r \backslash c'' \cdot r^*]}$
and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
|
582 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$: |
|
583 %TODO: definitions of and \hflataux \hflat |
|
584 \begin{center} |
|
585 \begin{tabular}{ccc} |
|
586 $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
587 $\hflataux r$ & $=$ & $ [r]$ |
|
588 \end{tabular} |
|
589 \end{center} |
|
590 |
|
591 \begin{center} |
|
592 \begin{tabular}{ccc} |
|
593 $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\ |
|
594 $\hflat r$ & $=$ & $ r$ |
|
595 \end{tabular} |
|
\end{center}
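In the Scala sketch from before (where a binary alternative $r_1 + r_2$ is represented as a two-element \texttt{ALTS}), $\hflataux{\_}$ and $\hflat{\_}$ can be rendered as follows; in contrast to \texttt{sflatAux}, both children of a binary alternative are opened up:
\begin{verbatim}
def hflatAux(r: Rexp): List[Rexp] = r match {
  case ALTS(List(r1, r2)) => hflatAux(r1) ::: hflatAux(r2)
  case _                  => List(r)
}

def hflat(r: Rexp): Rexp = r match {
  case ALTS(List(r1, r2)) => ALTS(hflatAux(r1) ::: hflatAux(r2))
  case _                  => r
}
\end{verbatim}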
|
597 Again these definitions are tailor-made for dealing with alternatives that have |
|
598 originated from a star's derivatives, so we don't attempt to open up all possible |
|
599 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
|
600 elements. |
|
We give a predicate for such ``star-created'' regular expressions:
|
602 \begin{center} |
|
603 \begin{tabular}{lcr} |
|
604 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
|
605 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
|
606 \end{tabular} |
|
607 \end{center} |
|
608 |
|
These definitions allow us the flexibility to talk about
regular expressions in their most convenient format,
for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $
instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$.
They also help express that certain classes of syntactically
distinct regular expressions are actually the same under simplification.
|
This is not entirely true for annotated regular expressions; a counterexample is
%TODO: bsimp bders \neq bderssimp
\begin{center}
$(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
\end{center}
For bit-codes, the order in which simplification is applied
might cause a difference in where they end up being placed.
If we want something like
\begin{center}
$\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
\end{center}
then some ``canonicalization'' procedure is required,
which either pushes all the common bitcodes to nodes
as high up in the tree as possible:
|
629 \begin{center} |
|
630 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
631 \end{center} |
|
or does the reverse. However, bitcodes are not of interest if we are only talking about
the size $\llbracket r \rrbracket$ of a regex.
Therefore, for the ease and simplicity of producing a
proof for a size bound, we are happy to restrict ourselves to
unannotated regular expressions and obtain equalities such as
|
637 \begin{lemma} |
|
638 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
|
639 \end{lemma} |
|
640 |
|
641 \begin{proof} |
|
By using the rewriting relation $\rightsquigarrow$.
|
643 \end{proof} |
|
644 %TODO: rsimp sflat |
|
And from this we obtain a proof that a star's derivative will be the same
as if all the nested alternatives created during the derivation had been flattened out.
For example,
|
648 \begin{lemma} |
|
649 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
|
650 \end{lemma} |
|
651 \begin{proof} |
|
By structural induction on $r$, where the induction rules are those of $\createdByStar{\_}$.
|
653 \end{proof} |
|
654 % The simplification of a flattened out regular expression, provided it comes |
|
655 %from the derivative of a star, is the same as the one nested. |
|
656 |
|
657 |
|
658 |
|
659 |
|
660 |
|
661 |
|
662 |
|
663 |
|
664 |
|
One might wonder what the actual bound is, rather than the loose bound we gave
for the convenience of an easier proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
they can grow at a speed that is at worst
exponential w.r.t.\ the number of characters,
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential w.r.t.\ the original regex, our algorithm
would still be slow.
|
And unfortunately, we have concrete examples
where such regexes grow exponentially large before levelling off:
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ will already have a maximum
size that is exponential in the number $n$
under our current simplification rules:
|
681 %TODO: graph of a regex whose size increases exponentially. |
|
682 \begin{center} |
|
683 \begin{tikzpicture} |
|
684 \begin{axis}[ |
|
685 height=0.5\textwidth, |
|
686 width=\textwidth, |
|
687 xlabel=number of a's, |
|
688 xtick={0,...,9}, |
|
689 ylabel=maximum size, |
|
690 ymode=log, |
|
691 log basis y={2} |
|
692 ] |
|
693 \addplot[mark=*,blue] table {re-chengsong.data}; |
|
694 \end{axis} |
|
695 \end{tikzpicture} |
|
696 \end{center} |
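The data behind such a graph can be generated, for instance, with the earlier Scala sketch (this is only an illustrative reconstruction, not the code actually used for the measurements above):
\begin{verbatim}
// (a* + (aa)* + ... + (a^n)*)*  and the maximum size its simplified
// derivatives reach on inputs consisting only of a's
def aPow(i: Int): Rexp =
  List.fill(i)(CHAR('a'): Rexp).reduceRight(SEQ(_, _))

def bigRegex(n: Int): Rexp =
  STAR(ALTS((1 to n).toList.map(i => STAR(aPow(i)))))

def maxDerSize(r: Rexp, steps: Int): Int =
  (1 to steps).scanLeft(r)((acc, _) => simp(der('a', acc))).map(size).max

for (n <- 1 to 6)
  println(s"n = $n: max size ${maxDerSize(bigRegex(n), 60)}")
\end{verbatim}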
|
697 |
|
For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ in the discussion below.
The exponential size is triggered by the regex
$\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, despite those differences being minor.
|
705 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
|
706 will therefore contain the following terms (after flattening out all nested |
|
707 alternatives): |
|
708 \begin{center} |
|
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
|
710 $(1 \leq m' \leq m )$ |
|
711 \end{center} |
|
These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (this will be explained in the appendix).
|
713 With each new input character taking the derivative against the intermediate result, more and more such distinct |
|
714 terms will accumulate, |
|
715 until the length reaches $L.C.M.(1, \ldots, n)$. |
|
$\textit{distinctBy}$ will not be able to de-duplicate any two such terms
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
and\\
$(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
with $m' \neq m''$,
as they are syntactically (if only slightly) different.
|
722 This means that with our current simplification methods, |
|
723 we will not be able to control the derivative so that |
|
724 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
|
725 as there are already exponentially many terms. |
|
These terms are similar in the sense that their heads
all consist of sub-terms of the form
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $.
|
729 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
|
730 $n * (n + 1) / 2$ such terms. |
|
731 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
732 can be described by 6 terms: |
|
733 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
734 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
735 The total number of different "head terms", $n * (n + 1) / 2$, |
|
736 is proportional to the number of characters in the regex |
|
737 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
|
This suggests a slightly different notion of size, which we call the
alphabetic width. The alphabetic width $\textit{awidth}(r)$ counts the number of
occurrences of characters in $r$:
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth}(\ZERO)$ & $\dn$ & $0$\\
$\textit{awidth}(\ONE)$ & $\dn$ & $0$\\
$\textit{awidth}(c)$ & $\dn$ & $1$\\
$\textit{awidth}(r_1 + r_2)$ & $\dn$ & $\textit{awidth}(r_1) + \textit{awidth}(r_2)$\\
$\textit{awidth}(r_1 \cdot r_2)$ & $\dn$ & $\textit{awidth}(r_1) + \textit{awidth}(r_2)$\\
$\textit{awidth}(r^*)$ & $\dn$ & $\textit{awidth}(r)$
\end{tabular}
\end{center}
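On the Rexp sketch from earlier in the chapter this is simply (again only an illustration):
\begin{verbatim}
def awidth(r: Rexp): Int = r match {
  case ZERO | ONE  => 0
  case CHAR(_)     => 1
  case ALTS(rs)    => rs.map(awidth).sum
  case SEQ(r1, r2) => awidth(r1) + awidth(r2)
  case STAR(r1)    => awidth(r1)
}
\end{verbatim}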
|
742 |
|
743 |
|
Antimirov\parencite{Antimirov95} has proven that
$| \textit{PDER}_{UNIV}(r) | \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all possible subterms
created by taking partial derivatives of $r$ against all possible strings.
If we can make sure that at any moment in our lexing algorithm our
intermediate result holds at most one copy of each of these
subterms, then we can get the same bound as Antimirov's.
This leads to the algorithm in the next section.
|
752 |
|
753 %---------------------------------------------------------------------------------------- |
|
754 % SECTION 5 |
|
755 %---------------------------------------------------------------------------------------- |
|
756 \section{A Stronger Version of Simplification Inspired by Antimirov} |
|
757 %TODO: search for isabelle proofs of algorithms that check equivalence |
|
758 |
|
759 |
|
760 |
|
761 |
|
762 |
|
763 |
993 |
764 |
994 |
765 |
995 |
766 |
996 |