
% Chapter Template

\chapter{Regular Expressions and Sulzmann and Lu's Lexing Algorithm Without Bitcodes} % Main chapter title

\label{Chapter2} % In chapter 2 \ref{Chapter2} we will introduce the concepts
%and notations we 
%use for describing the lexing algorithm by Sulzmann and Lu,
%and then give the algorithm and its variant, and discuss
%why more aggressive simplifications are needed. 


\section{Preliminaries}

Suppose we have an alphabet $\Sigma$. The set of all finite strings
whose characters are drawn from $\Sigma$ is written $\Sigma^*$.

We use patterns to define sets of strings concisely. Regular expressions
are one such pattern system.
The basic regular expressions  are defined inductively
 by the following grammar:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*         
\]

The language, or set of strings, denoted by a regular expression is
defined recursively as
\begin{center}
\begin{tabular}{lcl}
$L \; (\ZERO)$ & $\dn$ & $\{\}$\\
$L \; (\ONE)$ & $\dn$ & $\{[]\}$\\
$L \; (c)$ & $\dn$ & $\{[c]\}$\\
$L \; (r_1 + r_2)$ & $\dn$ & $ L \; (r_1) \cup L \; ( r_2)$\\
$L \; (r_1 \cdot r_2)$ & $\dn$ & $ \{s_1 @ s_2 \mid s_1 \in L \; (r_1) \wedge s_2 \in L \; (r_2)\}$\\
$L \; (r^*)$ & $\dn$ & $(L \; (r))^*$\\
\end{tabular}
\end{center}
This is also called the ``language interpretation'' of regular expressions.



The Brzozowski derivative w.r.t.\ a character $c$ is an operation on
regular expressions: it transforms a regular expression into one that
denotes the same strings, but with the head character $c$ chopped off.

Formally, we first define such a transformation on arbitrary string
sets, which we call the \emph{semantic derivative}:
\begin{center}
$\Der \; c\; A = \{s \mid c :: s \in A\}$
\end{center}
If the set $A$ happens to have some structure, for example if it is
regular (that is, denoted by some regular expression $r$), then its
semantic derivative is also regular and can be computed syntactically
from $r$, as we shall see next.

% Derivatives of a
%regular expression, written $r \backslash c$, give a simple solution
%to the problem of matching a string $s$ with a regular
%expression $r$: if the derivative of $r$ w.r.t.\ (in
%succession) all the characters of the string matches the empty string,
%then $r$ matches $s$ (and {\em vice versa}).  

The derivative of a regular expression, written
$r \backslash c$, is a function that takes a regular expression
$r$ and a character $c$, and returns another regular expression $r'$.
It is computed by the following recursive function:

\begin{center}
\begin{tabular}{lcl}
		$\ZERO \backslash c$ & $\dn$ & $\ZERO$\\  
		$\ONE \backslash c$  & $\dn$ & $\ZERO$\\
		$d \backslash c$     & $\dn$ & 
		$\mathit{if} \;c = d\;\mathit{then}\;\ONE\;\mathit{else}\;\ZERO$\\
$(r_1 + r_2)\backslash c$     & $\dn$ & $r_1 \backslash c \,+\, r_2 \backslash c$\\
$(r_1 \cdot r_2)\backslash c$ & $\dn$ & $\mathit{if} \, nullable(r_1)$\\
	&   & $\mathit{then}\;(r_1\backslash c) \cdot r_2 \,+\, r_2\backslash c$\\
	&   & $\mathit{else}\;(r_1\backslash c) \cdot r_2$\\
	$(r^*)\backslash c$           & $\dn$ & $(r\backslash c) \cdot r^*$\\
\end{tabular}
\end{center}
\noindent
The $\nullable$ function tests whether the empty string $""$ 
is in the language of $r$:


\begin{center}
		\begin{tabular}{lcl}
			$\nullable(\ZERO)$     & $\dn$ & $\mathit{false}$ \\  
			$\nullable(\ONE)$      & $\dn$ & $\mathit{true}$ \\
			$\nullable(c)$ 	       & $\dn$ & $\mathit{false}$ \\
			$\nullable(r_1 + r_2)$ & $\dn$ & $\nullable(r_1) \vee \nullable(r_2)$ \\
			$\nullable(r_1\cdot r_2)$  & $\dn$ & $\nullable(r_1) \wedge \nullable(r_2)$ \\
			$\nullable(r^*)$       & $\dn$ & $\mathit{true}$ \\
		\end{tabular}
\end{center}
\noindent
The empty set does not contain any string and therefore not the empty
string. The empty-string regular expression contains the empty string
by definition. The character regular expression is the singleton
containing only that character and therefore does not contain the
empty string. An alternative regular expression (an ``or'' expression)
is nullable if either of its children is nullable. A sequence regular
expression requires both children to contain the empty string in order
to compose the empty string. The Kleene star always contains the empty
string, since it permits zero iterations.
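
To make these definitions concrete, here is a sketch of them in Scala
(the datatype and the concrete names \texttt{ZERO}, \texttt{ONE},
\texttt{CHAR}, \texttt{ALT}, \texttt{SEQ} and \texttt{STAR} are our
own choices, not part of the mathematical definitions):

\begin{verbatim}
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// nullable(r): can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der(c, r): the Brzozowski derivative of r w.r.t. the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}
\end{verbatim}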

In terms of the language interpretation, the derivative operation
precisely mirrors the semantic derivative:
\begin{center}
$L \; (r\backslash c) = \Der \; c \; (L \; (r))$
\end{center}
\noindent
The function $\backslash c$ defines how a regular expression evolves
into a new regular expression after the head character $c$ has been
chopped off all the strings it denotes.
The most involved cases are those for sequence and star.
The sequence case says that if the first component of the sequence can
match the empty string, then the head character may also be chopped
off the second component.
The derivative of a star unwraps one iteration: it takes the
derivative of the body and attaches the original star as the second
element of a sequence, so that a copy is retained for possibly more
iterations in later phases of lexing.


The main property of the derivative operation
that enables us to reason about the correctness of
an algorithm using derivatives is 

\begin{center}
$c\!::\!s \in L(r)$ holds
if and only if $s \in L(r\backslash c)$.
\end{center}

\noindent
We can generalise the derivative operation shown above for single characters
to strings as follows:

\begin{center}
\begin{tabular}{lcl}
$r \backslash (c\!::\!s) $ & $\dn$ & $(r \backslash c) \backslash s$ \\
$r \backslash [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
and then define Brzozowski's  regular-expression matching algorithm as:

\[
match\;s\;r \;\dn\; nullable(r\backslash s)
\]

\noindent
Assuming that a string is given as a sequence of characters, say $c_0c_1 \ldots c_{n-1}$,
this algorithm can be presented graphically as follows:

\begin{equation}\label{graph:*}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  & r_1 \arrow[r, "\backslash c_1"] & r_2 \arrow[r, dashed]  & r_n  \arrow[r,"\textit{nullable}?"] & \;\textrm{YES}/\textrm{NO}
\end{tikzcd}
\end{equation}

\noindent
where we start with  a regular expression  $r_0$, build successive
derivatives until we exhaust the string and then use \textit{nullable}
to test whether the result can match the empty string. It can  be
relatively easily shown that this matcher is correct (that is, given
a string $s = c_0 \ldots c_{n-1}$ and a regular expression $r_0$, it generates YES if and only if $s \in L(r_0)$).
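
In Scala, the extension to strings and the matcher itself can be
sketched as follows, building on the datatype given earlier (the names
\texttt{ders} and \texttt{matcher} are our own):

\begin{verbatim}
// ders(s, r): derivative w.r.t. a string, character by character
def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

// match by testing nullability after all derivatives are taken
def matcher(r: Rexp, s: String): Boolean =
  nullable(ders(s.toList, r))

// example: matcher(SEQ(STAR(CHAR('a')), CHAR('b')), "aab") gives true
\end{verbatim}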

This is a beautifully simple definition. If we implement the above
algorithm naively, however, it can be excruciatingly slow.


\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    ymax=10000,
    ytick={0,1000,...,10000},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={JavaScript},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {EightThousandNodes.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{Graph: runtime for matching $(a^*)^*\,b$ with strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\caption{Runtime explosion of the naive matching algorithm} \label{fig:EightThousandNodes}
\end{figure}


For example, when starting with the regular
expression $(a + aa)^*$ and building a few successive derivatives (around 10)
w.r.t.~the character $a$, one obtains a derivative regular expression
with more than 8000 nodes (when viewed as a tree); see
Figure~\ref{fig:EightThousandNodes}.
The reason why $(a + aa) ^*$ explodes so drastically is that without
pruning, the algorithm will keep records of all possible ways of matching:
\begin{center}
$(a + aa) ^* \backslash (aa) = (\ZERO + \ONE \ONE)\cdot(a + aa)^* + (\ONE + \ONE a) \cdot (a + aa)^*$
\end{center}

\noindent
Each of the above alternative branches corresponds to a way of matching:
$aa$, $a \quad a$ and $a \quad a \cdot (a)$ (the last one incomplete).
These different ways of matching will grow exponentially with the string length,
and without simplifications that throw away some of these very similar matchings,
it is no surprise that these expressions grow so quickly.
Operations like
$\backslash$ and $\nullable$ need to traverse such trees and
consequently the bigger the size of the derivative the slower the
algorithm. 

Brzozowski was quick to observe that during this process a lot of
useless $\ONE$s and $\ZERO$s are generated, making the result far from
optimal. He therefore introduced ``similarity rules'', such
as $P+(Q+R) = (P+Q)+R$, to merge syntactically 
different but language-equivalent sub-regexes and thereby decrease the
size of the intermediate regexes.

More simplifications are possible, such as deleting duplicates
and opening up nested alternatives to trigger even more simplifications.
Suppose we apply simplification after each derivative step and compose
the two operations into a single one: $a \backslash_{simp}\,c \dn
\textit{simp}(a \backslash c)$. Then we can build
a matcher whose intermediate regular expressions do not become
overly cumbersome.


If we want the size of derivatives in the algorithm to
stay even lower, we would need more aggressive simplifications.
Essentially we need to delete useless $\ZERO$s and $\ONE$s, as well as
deleting duplicates whenever possible. For example, the parentheses in
$(a+b) \cdot c + b\cdot c$ can be opened up to get $a\cdot c + b \cdot c + b
\cdot c$, and then simplified to just $a \cdot c + b \cdot c$. Another
example is simplifying $(a^*+a) + (a^*+ \ONE) + (a +\ONE)$ to just
$a^*+a+\ONE$. Adding these more aggressive simplification rules helps us
to achieve a very tight size bound, namely,
 the same size bound as that of the \emph{partial derivatives}. 
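
As an illustration, a naive version of such a simplification, composed
with the derivative, might look as follows in Scala (this sketch only
removes $\ZERO$s, $\ONE$s and directly adjacent duplicates; it does
not yet open up nested alternatives):

\begin{verbatim}
// simp(r): remove useless ZEROs, ONEs and simple duplicates
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, r2s) => r2s
    case (r1s, ZERO) => r1s
    case (r1s, r2s) => if (r1s == r2s) r1s else ALT(r1s, r2s)
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO
    case (_, ZERO) => ZERO
    case (ONE, r2s) => r2s
    case (r1s, ONE) => r1s
    case (r1s, r2s) => SEQ(r1s, r2s)
  }
  case r1 => r1
}

// the composed operation r \simp c and its extension to strings
def derSimp(c: Char, r: Rexp): Rexp = simp(der(c, r))

def dersSimp(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => dersSimp(cs, derSimp(c, r))
}
\end{verbatim}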

So far we have been building derivatives and then simplifying them,
and this works well for matching. But what if we want to 
do lexing instead of just obtaining a YES/NO answer?
This requires us to go back for a moment to the world 
without simplification.
Sulzmann and Lu~\cite{Sulzmann2014} came up with a nice and 
elegant (arguably as beautiful as the original
derivatives definition) solution for this.

\subsection*{Values and the Lexing Algorithm by Sulzmann and Lu}


Sulzmann and Lu first defined the datatype for storing the 
lexing information, called a \emph{value} or
sometimes also a \emph{lexical value}. Values and regular
expressions correspond to each other as illustrated in the following
table:

\begin{center}
	\begin{tabular}{c@{\hspace{20mm}}c}
		\begin{tabular}{@{}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Regular Expressions}}\medskip\\
			$r$ & $::=$  & $\ZERO$\\
			& $\mid$ & $\ONE$   \\
			& $\mid$ & $c$          \\
			& $\mid$ & $r_1 \cdot r_2$\\
			& $\mid$ & $r_1 + r_2$   \\
			\\
			& $\mid$ & $r^*$         \\
		\end{tabular}
		&
		\begin{tabular}{@{\hspace{0mm}}rrl@{}}
			\multicolumn{3}{@{}l}{\textbf{Values}}\medskip\\
			$v$ & $::=$  & \\
			&        & $\Empty$   \\
			& $\mid$ & $\Char(c)$          \\
			& $\mid$ & $\Seq\,v_1\, v_2$\\
			& $\mid$ & $\Left(v)$   \\
			& $\mid$ & $\Right(v)$  \\
			& $\mid$ & $\Stars\,[v_1,\ldots\,v_n]$ \\
		\end{tabular}
	\end{tabular}
\end{center}
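
In Scala, the values can be sketched by the following datatype (we use
the names \texttt{Chr} and \texttt{Sequ} to avoid clashes with the
regular-expression constructors from before):

\begin{verbatim}
abstract class Val
case object Empty extends Val
case class Chr(c: Char) extends Val
case class Sequ(v1: Val, v2: Val) extends Val
case class Left(v: Val) extends Val
case class Right(v: Val) extends Val
case class Stars(vs: List[Val]) extends Val
\end{verbatim}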

\noindent
Building on top of Sulzmann and Lu's attempt to formalise the 
notion of POSIX lexing rules \parencite{Sulzmann2014}, 
Ausaf and Urban~\parencite{AusafDyckhoffUrban2016} modelled
POSIX matching as a ternary relation recursively defined in a
natural deduction style.
With the formally specified rules for what a POSIX matching is,
they proved in Isabelle/HOL that the algorithm gives correct results.

But a correct result alone is not enough; we also want at least some
degree of \emph{efficiency}.



One regular expression can have multiple lexical values. For example,
the regular expression $(a+b)^*$ has an infinite set of
values corresponding to it: $\Stars\,[]$, $\Stars\,[\Left(\Char(a))]$,
$\Stars\,[\Right(\Char(b))]$, $\Stars\,[\Left(\Char(a)),\,\Right(\Char(b))]$,
and so on.
Even a regular expression matching one particular string can still
have more than one value for it.
Take the example where $r= (a^*\cdot a^*)^*$ and the string 
$s=\underbrace{aa\ldots a}_\text{n \textit{a}s}$.
The number of different ways of matching, 
if no value under a star is allowed to be flattened
to the empty string, is given by the following recurrence:
\begin{equation}
	C_n = (n+1)+n C_1+\ldots + 2 C_{n-1}
\end{equation}
and a closed form formula can be calculated to be
\begin{equation}
	C_n =\frac{(2+\sqrt{2})^n - (2-\sqrt{2})^n}{4\sqrt{2}}
\end{equation}
which clearly grows exponentially.

A lexer aimed at producing all possible values therefore has an
exponential worst-case runtime, and it is impractical to generate
all possible matches in one run. In practice, we are usually 
interested in the POSIX value, which intuitively always
\begin{itemize}
\item
matches the leftmost regular expression when multiple options for
matching are available, and
\item 
matches a subpart as much as possible before proceeding
to the next token.
\end{itemize}


For example, the above example has the POSIX value
$\Stars\,[\Seq(\Stars\,[\underbrace{\Char(a),\ldots,\Char(a)}_\text{n iterations}], \Stars\,[])]$.
The output we want from a lexing algorithm is such a POSIX match
encoded as a value.
The reason why we are interested in POSIX values is that they
correspond to the conventions used in the lexing phase of compiler
front ends.
For instance, when lexing a code snippet 
$\textit{iffoo} = 3$ with the regular expression $\textit{keyword} + \textit{identifier}$, we want $\textit{iffoo}$ to be recognised
as an identifier rather than a keyword.

The contribution of Sulzmann and Lu is an extension of Brzozowski's
algorithm by a second phase (the first phase being building successive
derivatives---see \eqref{graph:*}). In this second phase, a POSIX value 
is generated in case the regular expression matches  the string. 
Pictorially, the Sulzmann and Lu algorithm is as follows:

\begin{ceqn}
\begin{equation}\label{graph:2}
\begin{tikzcd}
r_0 \arrow[r, "\backslash c_0"]  \arrow[d] & r_1 \arrow[r, "\backslash c_1"] \arrow[d] & r_2 \arrow[r, dashed] \arrow[d] & r_n \arrow[d, "mkeps" description] \\
v_0           & v_1 \arrow[l,"inj_{r_0} c_0"]                & v_2 \arrow[l, "inj_{r_1} c_1"]              & v_n \arrow[l, dashed]         
\end{tikzcd}
\end{equation}
\end{ceqn}


\noindent
For convenience, we shall employ the following notations: the regular
expression we start with is $r_0$, and the given string $s$ is composed
of characters $c_0 c_1 \ldots c_{n-1}$. In  the first phase from the
left to right, we build the derivatives $r_1$, $r_2$, \ldots  according
to the characters $c_0$, $c_1$  until we exhaust the string and obtain
the derivative $r_n$. We test whether this derivative is
$\textit{nullable}$ or not. If not, we know the string does not match
$r$ and no value needs to be generated. If yes, we start building the
values incrementally by \emph{injecting} back the characters into the
earlier values $v_n, \ldots, v_0$. This is the second phase of the
algorithm from the right to left. For the first value $v_n$, we call the
function $\textit{mkeps}$, which builds a POSIX lexical value
for how the empty string has been matched by the (nullable) regular
expression $r_n$. This function is defined as

	\begin{center}
		\begin{tabular}{lcl}
			$\mkeps(\ONE)$ 		& $\dn$ & $\Empty$ \\
			$\mkeps(r_{1}+r_{2})$	& $\dn$ 
			& \textit{if} $\nullable(r_{1})$\\ 
			& & \textit{then} $\Left(\mkeps(r_{1}))$\\ 
			& & \textit{else} $\Right(\mkeps(r_{2}))$\\
			$\mkeps(r_1\cdot r_2)$ 	& $\dn$ & $\Seq\,(\mkeps\,r_1)\,(\mkeps\,r_2)$\\
			$\mkeps(r^*)$	        & $\dn$ & $\Stars\,[]$
		\end{tabular}
	\end{center}
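
A Scala rendering of $\mkeps$ might look as follows (like the
mathematical definition, it is a partial function: $\ZERO$ and
character regular expressions are not nullable and therefore have no
clause):

\begin{verbatim}
// mkeps(r): the POSIX value for how a nullable r matches ""
// (prefer the left alternative whenever it is nullable)
def mkeps(r: Rexp): Val = r match {
  case ONE => Empty
  case ALT(r1, r2) =>
    if (nullable(r1)) Left(mkeps(r1)) else Right(mkeps(r2))
  case SEQ(r1, r2) => Sequ(mkeps(r1), mkeps(r2))
  case STAR(_) => Stars(Nil)
}
\end{verbatim}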


\noindent 
After the $\mkeps$-call, we inject back the characters one by one in order to build
the lexical value $v_i$ for how the regex $r_i$ matches the string $s_i$
($s_i = c_i \ldots c_{n-1}$) from the previous lexical value $v_{i+1}$.
After injecting back $n$ characters, we get the lexical value for how $r_0$
matches $s$. POSIXness is maintained throughout the process.
For this Sulzmann and Lu defined a function that reverses
the ``chopping off'' of characters during the derivative phase. The
corresponding function is called \emph{injection}, written
$\textit{inj}$; it takes three arguments: the first one is a regular
expression ${r_{i-1}}$, before the character is chopped off, the second
is a character ${c_{i-1}}$, the character we want to inject and the
third argument is the value ${v_i}$, into which one wants to inject the
character (it corresponds to the regular expression after the character
has been chopped off). The result of this function is a new value. The
definition of $\textit{inj}$ is as follows: 

\begin{center}
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
  $\textit{inj}\,(c)\,c\,\Empty$            & $\dn$ & $\Char\,c$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
  $\textit{inj}\,(r_1 + r_2)\,c\,\Right(v)$ & $\dn$ & $\Right(\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Seq(v_1,v_2)$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(\Seq(v_1,v_2))$ & $\dn$  & $\Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
  $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Right(v)$ & $\dn$  & $\Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
  $\textit{inj}\,(r^*)\,c\,\Seq(v,\Stars\,vs)$         & $\dn$  & $\Stars((\textit{inj}\,r\,c\,v)\,::\,vs)$\\
\end{tabular}
\end{center}

\noindent This definition is by recursion on the ``shape'' of regular
expressions and values. 
The clauses essentially do one thing: they identify the ``hole'' in
the value into which the character has to be injected back.
For instance, in the last clause, injecting back into a value that
corresponds to a star, we know the value must be a sequence whose
first component corresponds to the body of the star with the head
character chopped off, that is, the star iteration that has just been
unfolded. This first component is followed by the list of star
iterations collected so far. So we inject the character back into the
first component and form a new $\Stars$ value with this completed
iteration added to the front of the previous list of iterations.
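
In Scala, $\textit{inj}$ and the two-phase lexer of \eqref{graph:2}
can be sketched as follows (the name \texttt{lex} and the error case
for failed matches are our own choices):

\begin{verbatim}
// inj(r, c, v): reverse one derivative step by injecting c into v
def inj(r: Rexp, c: Char, v: Val): Val = (r, v) match {
  case (CHAR(d), Empty) => Chr(d)
  case (ALT(r1, _), Left(v1)) => Left(inj(r1, c, v1))
  case (ALT(_, r2), Right(v2)) => Right(inj(r2, c, v2))
  case (SEQ(r1, _), Sequ(v1, v2)) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, _), Left(Sequ(v1, v2))) => Sequ(inj(r1, c, v1), v2)
  case (SEQ(r1, r2), Right(v2)) => Sequ(mkeps(r1), inj(r2, c, v2))
  case (STAR(r1), Sequ(v1, Stars(vs))) => Stars(inj(r1, c, v1) :: vs)
}

// derivatives forward, mkeps at the end, then injections backward
def lex(r: Rexp, s: List[Char]): Val = s match {
  case Nil => if (nullable(r)) mkeps(r) else sys.error("not matched")
  case c :: cs => inj(r, c, lex(der(c, r), cs))
}
\end{verbatim}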

We mentioned before that derivatives without simplification 
can get clumsy, and this is true for values as well: by definition
their size mirrors that of the regular expressions.

One can introduce simplification on regular expressions and values,
but has to be careful not to break correctness, as the injection
function relies heavily on the regular expressions and values having
matching structure.
Simplification can be accommodated by recording some extra
rectification functions during the derivative steps, and applying
these rectifications during the injection phase.
One can then prove that the POSIX value of how regular expressions
match strings is not affected, although this is much harder
to establish. 
Some initial results in this regard have been
obtained in \cite{AusafDyckhoffUrban2016}. 



%Brzozowski, after giving the derivatives and simplification,
%did not explore lexing with simplification or he may well be 
%stuck on an efficient simplificaiton with a proof.
%He went on to explore the use of derivatives together with 
%automaton, and did not try lexing using derivatives.

We want to get rid of the complex and fragile rectification of values.
Can we avoid creating the intermediate values $v_1,\ldots, v_n$
altogether and gather the lexing information, which should already be
there, while doing derivatives in one pass, without a second phase of
injection? In the meantime, can we make sure that simplifications
are easily handled without breaking the correctness of the algorithm?

Sulzmann and Lu solved this problem by
introducing additional information into the 
regular expressions called \emph{bitcodes}.

\subsection*{Bit-coded Algorithm}
Bits and bitcodes (lists of bits) are defined as:

\begin{center}
		$b ::=   1 \mid  0 \qquad
bs ::= [] \mid b::bs    
$
\end{center}

\noindent
The $1$ and $0$ are not in bold in order to avoid 
confusion with the regular expressions $\ZERO$ and $\ONE$. Bitcodes (or
bit-lists) can be used to encode values (or potentially incomplete values) in a
compact form. This can be straightforwardly seen in the following
coding function from values to bitcodes: 

\begin{center}
\begin{tabular}{lcl}
  $\textit{code}(\Empty)$ & $\dn$ & $[]$\\
  $\textit{code}(\Char\,c)$ & $\dn$ & $[]$\\
  $\textit{code}(\Left\,v)$ & $\dn$ & $0 :: code(v)$\\
  $\textit{code}(\Right\,v)$ & $\dn$ & $1 :: code(v)$\\
  $\textit{code}(\Seq\,v_1\,v_2)$ & $\dn$ & $code(v_1) \,@\, code(v_2)$\\
  $\textit{code}(\Stars\,[])$ & $\dn$ & $[0]$\\
  $\textit{code}(\Stars\,(v\!::\!vs))$ & $\dn$ & $1 :: code(v) \;@\;
                                                 code(\Stars\,vs)$
\end{tabular}    
\end{center} 
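
A sketch of the coding function in Scala, with bitcodes represented as
lists of integers:

\begin{verbatim}
type Bits = List[Int]

// code(v): 0 for Left, 1 for Right, 1 per star iteration,
// 0 to close a star; characters contribute nothing
def code(v: Val): Bits = v match {
  case Empty => Nil
  case Chr(_) => Nil
  case Left(v1) => 0 :: code(v1)
  case Right(v1) => 1 :: code(v1)
  case Sequ(v1, v2) => code(v1) ::: code(v2)
  case Stars(Nil) => List(0)
  case Stars(v1 :: vs) => 1 :: code(v1) ::: code(Stars(vs))
}
\end{verbatim}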

\noindent
The function $\textit{code}$ encodes a value as a bitcode by converting
$\Left$ into $0$, $\Right$ into $1$, and marking the start of each
non-empty star iteration with $1$; the border where a star terminates
is marked with $0$. This coding is lossy: it throws away the
information about characters, and it does not encode the ``boundary''
between two sequence values. Moreover, from the bitcode alone we
cannot even tell whether the $1$s and $0$s stand for $\Left/\Right$ or
for $\Stars$. The reason for choosing this compact way of storing
information is that bits can easily be manipulated and ``moved
around'' within a regular expression. In order to recover values, we
need the corresponding regular expression as extra information. This
means the decoding function is defined as:


%\begin{definition}[Bitdecoding of Values]\mbox{}
\begin{center}
\begin{tabular}{@{}l@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
  $\textit{decode}'\,bs\,(\ONE)$ & $\dn$ & $(\Empty, bs)$\\
  $\textit{decode}'\,bs\,(c)$ & $\dn$ & $(\Char\,c, bs)$\\
  $\textit{decode}'\,(0\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}\;
       (\Left\,v, bs_1)$\\
  $\textit{decode}'\,(1\!::\!bs)\;(r_1 + r_2)$ & $\dn$ &
     $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r_2\;\textit{in}\;
       (\Right\,v, bs_1)$\\                           
  $\textit{decode}'\,bs\;(r_1\cdot r_2)$ & $\dn$ &
        $\textit{let}\,(v_1, bs_1) = \textit{decode}'\,bs\,r_1\;\textit{in}$\\
  & &   $\textit{let}\,(v_2, bs_2) = \textit{decode}'\,bs_1\,r_2$\\
  & &   \hspace{35mm}$\textit{in}\;(\Seq\,v_1\,v_2, bs_2)$\\
  $\textit{decode}'\,(0\!::\!bs)\,(r^*)$ & $\dn$ & $(\Stars\,[], bs)$\\
  $\textit{decode}'\,(1\!::\!bs)\,(r^*)$ & $\dn$ & 
         $\textit{let}\,(v, bs_1) = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & &   $\textit{let}\,(\Stars\,vs, bs_2) = \textit{decode}'\,bs_1\,r^*$\\
  & &   \hspace{35mm}$\textit{in}\;(\Stars\,(v\!::\!vs), bs_2)$\bigskip\\
  
  $\textit{decode}\,bs\,r$ & $\dn$ &
     $\textit{let}\,(v, bs') = \textit{decode}'\,bs\,r\;\textit{in}$\\
  & & $\textit{if}\;bs' = []\;\textit{then}\;\textit{Some}\,v\;
       \textit{else}\;\textit{None}$                       
\end{tabular}    
\end{center}    
%\end{definition}
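
The decoding functions might be rendered in Scala as follows (a
sketch; \texttt{decodeAux} plays the role of $\textit{decode}'$ and
returns the decoded value together with the leftover bits):

\begin{verbatim}
def decodeAux(bs: Bits, r: Rexp): (Val, Bits) = (bs, r) match {
  case (bs1, ONE) => (Empty, bs1)
  case (bs1, CHAR(c)) => (Chr(c), bs1)
  case (0 :: bs1, ALT(r1, _)) =>
    val (v, bs2) = decodeAux(bs1, r1); (Left(v), bs2)
  case (1 :: bs1, ALT(_, r2)) =>
    val (v, bs2) = decodeAux(bs1, r2); (Right(v), bs2)
  case (bs1, SEQ(r1, r2)) =>
    val (v1, bs2) = decodeAux(bs1, r1)
    val (v2, bs3) = decodeAux(bs2, r2)
    (Sequ(v1, v2), bs3)
  case (0 :: bs1, STAR(_)) => (Stars(Nil), bs1)
  case (1 :: bs1, STAR(r1)) =>
    val (v, bs2) = decodeAux(bs1, r1)
    decodeAux(bs2, STAR(r1)) match {
      case (Stars(vs), bs3) => (Stars(v :: vs), bs3)
      case _ => sys.error("malformed bitcode")
    }
}

def decode(bs: Bits, r: Rexp): Option[Val] = {
  val (v, rest) = decodeAux(bs, r)
  if (rest.isEmpty) Some(v) else None
}
\end{verbatim}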

Sulzmann and Lu integrated the bitcodes into regular expressions to
create annotated regular expressions \cite{Sulzmann2014}.
\emph{Annotated regular expressions} are defined by the following
grammar:

\begin{center}
\begin{tabular}{lcl}
  $\textit{a}$ & $::=$  & $\ZERO$\\
                  & $\mid$ & $_{bs}\ONE$\\
                  & $\mid$ & $_{bs}{\bf c}$\\
                  & $\mid$ & $_{bs}\sum\,as$\\
                  & $\mid$ & $_{bs}a_1\cdot a_2$\\
                  & $\mid$ & $_{bs}a^*$
\end{tabular}    
\end{center}  
%(in \textit{ALTS})

\noindent
where $bs$ stands for bitcodes, $a$  for $\mathbf{a}$nnotated regular
expressions and $as$ for a list of annotated regular expressions.
The alternative constructor($\sum$) has been generalized to 
accept a list of annotated regular expressions rather than just 2.
We will show that these bitcodes encode information about
the (POSIX) value that should be generated by the Sulzmann and Lu
algorithm.


To do lexing using annotated regular expressions, we shall first
transform the usual (un-annotated) regular expressions into annotated
regular expressions. This operation is called \emph{internalisation} and
defined as follows:

%\begin{definition}
\begin{center}
\begin{tabular}{lcl}
  $(\ZERO)^\uparrow$ & $\dn$ & $\ZERO$\\
  $(\ONE)^\uparrow$ & $\dn$ & $_{[]}\ONE$\\
  $(c)^\uparrow$ & $\dn$ & $_{[]}{\bf c}$\\
  $(r_1 + r_2)^\uparrow$ & $\dn$ &
  $_{[]}\sum[\textit{fuse}\,[0]\,r_1^\uparrow,\,
  \textit{fuse}\,[1]\,r_2^\uparrow]$\\
  $(r_1\cdot r_2)^\uparrow$ & $\dn$ &
         $_{[]}r_1^\uparrow \cdot r_2^\uparrow$\\
  $(r^*)^\uparrow$ & $\dn$ &
         $_{[]}(r^\uparrow)^*$\\
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
We use up arrows here to indicate that the basic un-annotated regular
expressions are ``lifted up'' into something slightly more complex. In the
fourth clause, $\textit{fuse}$ is an auxiliary function that helps to
attach bits to the front of an annotated regular expression. Its
definition is as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{fuse}\;bs \; \ZERO$ & $\dn$ & $\ZERO$\\
  $\textit{fuse}\;bs\; _{bs'}\ONE$ & $\dn$ &
     $_{bs @ bs'}\ONE$\\
  $\textit{fuse}\;bs\;_{bs'}{\bf c}$ & $\dn$ &
     $_{bs@bs'}{\bf c}$\\
  $\textit{fuse}\;bs\,_{bs'}\sum\textit{as}$ & $\dn$ &
     $_{bs@bs'}\sum\textit{as}$\\
  $\textit{fuse}\;bs\; _{bs'}a_1\cdot a_2$ & $\dn$ &
     $_{bs@bs'}a_1 \cdot a_2$\\
  $\textit{fuse}\;bs\,_{bs'}a^*$ & $\dn$ &
     $_{bs @ bs'}a^*$
\end{tabular}    
\end{center}  
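
In Scala, annotated regular expressions together with $\textit{fuse}$
and internalisation might be sketched as follows (the constructor
names prefixed with \texttt{A} are our own choices):

\begin{verbatim}
abstract class ARexp
case object AZERO extends ARexp
case class AONE(bs: Bits) extends ARexp
case class ACHAR(bs: Bits, c: Char) extends ARexp
case class AALTS(bs: Bits, as: List[ARexp]) extends ARexp
case class ASEQ(bs: Bits, a1: ARexp, a2: ARexp) extends ARexp
case class ASTAR(bs: Bits, a: ARexp) extends ARexp

// fuse(bs, a): attach bits to the front of a
def fuse(bs: Bits, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(bsq) => AONE(bs ::: bsq)
  case ACHAR(bsq, c) => ACHAR(bs ::: bsq, c)
  case AALTS(bsq, as) => AALTS(bs ::: bsq, as)
  case ASEQ(bsq, a1, a2) => ASEQ(bs ::: bsq, a1, a2)
  case ASTAR(bsq, a1) => ASTAR(bs ::: bsq, a1)
}

// internalise(r): lift a plain regular expression into an
// annotated one, marking the two branches of + with 0 and 1
def internalise(r: Rexp): ARexp = r match {
  case ZERO => AZERO
  case ONE => AONE(Nil)
  case CHAR(c) => ACHAR(Nil, c)
  case ALT(r1, r2) =>
    AALTS(Nil, List(fuse(List(0), internalise(r1)),
                    fuse(List(1), internalise(r2))))
  case SEQ(r1, r2) => ASEQ(Nil, internalise(r1), internalise(r2))
  case STAR(r1) => ASTAR(Nil, internalise(r1))
}
\end{verbatim}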

\noindent
After internalising the regular expression, we perform successive
derivative operations on the annotated regular expression. This
derivative operation is the same as what we had previously for the
basic regular expressions, except that we need to take care of
the bitcodes:



\begin{center}
  \begin{tabular}{@{}lcl@{}}
  $(\ZERO)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}\ONE)\,\backslash c$ & $\dn$ & $\ZERO$\\  
  $(_{bs}{\bf d})\,\backslash c$ & $\dn$ &
        $\textit{if}\;c=d\; \;\textit{then}\;
         _{bs}\ONE\;\textit{else}\;\ZERO$\\  
  $(_{bs}\sum \;\textit{as})\,\backslash c$ & $\dn$ &
  $_{bs}\sum\;(\textit{as.map}(\backslash c))$\\
  $(_{bs}\;a_1\cdot a_2)\,\backslash c$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a_1$\\
					       & &$\textit{then}\;_{bs}\sum\,[(_{[]}\,(a_1\,\backslash c)\cdot\,a_2),$\\
					       & &$\phantom{\textit{then}\;_{bs}\sum\,}(\textit{fuse}\,(\textit{bmkeps}\,a_1)\,(a_2\,\backslash c))]$\\
  & &$\textit{else}\;_{bs}\,(a_1\,\backslash c)\cdot a_2$\\
  $(_{bs}a^*)\,\backslash c$ & $\dn$ &
      $_{bs}((\textit{fuse}\, [0] \; (a\,\backslash c))\cdot
       (_{[]}a^*))$
\end{tabular}    
\end{center}    

%\end{definition}
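
A Scala sketch of this derivative operation is given below; the
functions \texttt{bnullable} and \texttt{bmkeps} it relies on are
defined shortly:

\begin{verbatim}
def bder(c: Char, a: ARexp): ARexp = a match {
  case AZERO => AZERO
  case AONE(_) => AZERO
  case ACHAR(bs, d) => if (c == d) AONE(bs) else AZERO
  case AALTS(bs, as) => AALTS(bs, as.map(bder(c, _)))
  case ASEQ(bs, a1, a2) =>
    if (bnullable(a1))
      AALTS(bs, List(ASEQ(Nil, bder(c, a1), a2),
                     fuse(bmkeps(a1), bder(c, a2))))
    else ASEQ(bs, bder(c, a1), a2)
  case ASTAR(bs, a1) =>
    ASEQ(bs, fuse(List(0), bder(c, a1)), ASTAR(Nil, a1))
}
\end{verbatim}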
\noindent
For instance, when we take the derivative of $_{bs}a^*$ with respect
to $c$, we need to unfold it into a sequence and attach an additional
bit $0$ to the front of $a \backslash c$, indicating one more star
iteration. The sequence clause is more subtle: when $a_1$ is
$\textit{bnullable}$ (here \textit{bnullable} is exactly the same as
$\textit{nullable}$, except that it applies to annotated regular
expressions, so we omit its definition), the right branch of the
alternative, $\textit{fuse} \; (\bmkeps \; a_1) \; (a_2 \backslash c)$,
collapses the regular expression $a_1$ (as it has already been fully
matched) and stores the corresponding lexing information at the head
of the regular expression $a_2 \backslash c$ by fusing it on. Here we
assume that $\textit{bmkeps}$ correctly extracts the bitcode for how
$a_1$ matches the string prior to the character $c$ (more on this
later). The bitsequence $\textit{bs}$, which was initially attached to
the first element of the sequence $a_1 \cdot a_2$, has now been
elevated to the top level of $\sum$, as this information will be
needed whichever way the sequence is matched, no matter whether $c$
belongs to $a_1$ or $a_2$. After building these derivatives and
maintaining all the lexing information, we complete the lexing by
collecting the bitcodes using a generalised version of the
$\textit{mkeps}$ function for annotated regular expressions, called
$\textit{bmkeps}$:


%\begin{definition}[\textit{bmkeps}]\mbox{}
\begin{center}
\begin{tabular}{lcl}
  $\textit{bmkeps}\,(_{bs}\ONE)$ & $\dn$ & $bs$\\
  $\textit{bmkeps}\,(_{bs}\sum (a::\textit{as}))$ & $\dn$ &
     $\textit{if}\;\textit{bnullable}\,a$\\
  & &$\textit{then}\;bs\,@\,\textit{bmkeps}\,a$\\
  & &$\textit{else}\;\textit{bmkeps}\,(_{bs}\sum \textit{as})$\\
  $\textit{bmkeps}\,(_{bs} a_1 \cdot a_2)$ & $\dn$ &
     $bs \,@\,\textit{bmkeps}\,a_1\,@\, \textit{bmkeps}\,a_2$\\
  $\textit{bmkeps}\,(_{bs}a^*)$ & $\dn$ &
     $bs \,@\, [0]$
\end{tabular}    
\end{center}    
%\end{definition}

\noindent
This function completes the value information by travelling along the
path of the regular expression that corresponds to a POSIX value and
collecting all the bitcodes, using $[0]$ to indicate the end of star
iterations. If we take the bitcodes produced by $\textit{bmkeps}$ and
decode them, we get the value we expect. The corresponding lexing
algorithm looks as follows:

\begin{center}
\begin{tabular}{lcl}
  $\textit{blexer}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
In this definition $\_\backslash s$ is the  generalisation  of the derivative
operation from characters to strings (just like the derivatives for un-annotated
regular expressions).
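
In Scala, $\textit{bnullable}$, $\textit{bmkeps}$ and
$\textit{blexer}$ might be sketched as follows:

\begin{verbatim}
def bnullable(a: ARexp): Boolean = a match {
  case AZERO => false
  case AONE(_) => true
  case ACHAR(_, _) => false
  case AALTS(_, as) => as.exists(bnullable)
  case ASEQ(_, a1, a2) => bnullable(a1) && bnullable(a2)
  case ASTAR(_, _) => true
}

// bmkeps: partial, only defined for bnullable expressions
def bmkeps(a: ARexp): Bits = a match {
  case AONE(bs) => bs
  case AALTS(bs, a1 :: as) =>
    if (bnullable(a1)) bs ::: bmkeps(a1) else bmkeps(AALTS(bs, as))
  case ASEQ(bs, a1, a2) => bs ::: bmkeps(a1) ::: bmkeps(a2)
  case ASTAR(bs, _) => bs ::: List(0)
}

def bders(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bders(cs, bder(c, a))
}

def blexer(r: Rexp, s: String): Option[Val] = {
  val a = bders(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}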

Now we introduce the simplifications, which are the reason why we
introduced the bitcodes in the first place.

\subsection*{Simplification Rules}

This section introduces aggressive (in terms of size reduction)
simplification rules on annotated regular expressions
in order to keep derivatives small. Such simplifications are promising,
as our generated test data show that a good tight size bound can be
achieved. Of course we could only partially cover the search space,
as there are infinitely many regular expressions and strings.

One modification we introduced is to allow a list of annotated regular
expressions in the $\sum$ constructor. This allows us to not just
delete unnecessary $\ZERO$s and $\ONE$s from regular expressions, but
also unnecessary ``copies'' of regular expressions (very similar to
simplifying $r + r$ to just $r$, but in a more general setting). Another
modification is that we use simplification rules inspired by Antimirov's
work on partial derivatives. They maintain the idea that only the first
``copy'' of a regular expression in an alternative contributes to the
calculation of a POSIX value. All subsequent copies can be pruned away from
the regular expression. A recursive definition of our  simplification function 
that looks somewhat similar to our Scala code is given below:
%\comment{Use $\ZERO$, $\ONE$ and so on. 
%Is it $ALTS$ or $ALTS$?}\\

\begin{center}
  \begin{tabular}{@{}lcl@{}}
   
  $\textit{simp} \; (_{bs}a_1\cdot a_2)$ & $\dn$ & $ (\textit{simp} \; a_1, \textit{simp}  \; a_2) \; \textit{match} $ \\
   &&$\quad\textit{case} \; (\ZERO, \_) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; (\_, \ZERO) \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \;  (\ONE, a_2') \Rightarrow  \textit{fuse} \; bs \;  a_2'$ \\
   &&$\quad\textit{case} \; (a_1', \ONE) \Rightarrow  \textit{fuse} \; bs \;  a_1'$ \\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\

  $\textit{simp} \; (_{bs}\sum \textit{as})$ & $\dn$ & $\textit{distinct}( \textit{flatten} ( \textit{as.map(simp)})) \; \textit{match} $ \\
  &&$\quad\textit{case} \; [] \Rightarrow  \ZERO$ \\
   &&$\quad\textit{case} \; a :: [] \Rightarrow  \textit{fuse}\;bs\;a$ \\
   &&$\quad\textit{case} \;  as' \Rightarrow _{bs}\sum \textit{as'}$\\ 

   $\textit{simp} \; a$ & $\dn$ & $\textit{a} \qquad \textit{otherwise}$   
\end{tabular}    
\end{center}    

\noindent
The simplification does a pattern matching on the regular expression.
When it detected that the regular expression is an alternative or
sequence, it will try to simplify its child regular expressions
recursively and then see if one of the children turns into $\ZERO$ or
$\ONE$, which might trigger further simplification at the current level.
The most involved part is the $\sum$ clause, where we use two
auxiliary functions $\textit{flatten}$ and $\textit{distinct}$ to open up nested
alternatives and reduce as many duplicates as possible. Function
$\textit{distinct}$  keeps the first occurring copy only and removes all later ones
when detected duplicates. Function $\textit{flatten}$ opens up nested $\sum$s.
Its recursive definition is given below:

 \begin{center}
  \begin{tabular}{@{}lcl@{}}
  $\textit{flatten} \; ((_{bs}\sum \textit{as}) :: \textit{as'})$ & $\dn$ & $(\textit{map} \;
     (\textit{fuse}\;bs)\; \textit{as}) \; @ \; \textit{flatten} \; \textit{as'} $ \\
  $\textit{flatten} \; (\ZERO :: \textit{as'})$ & $\dn$ & $ \textit{flatten} \;  \textit{as'} $ \\
    $\textit{flatten} \; (a :: \textit{as'})$ & $\dn$ & $a :: \textit{flatten} \; \textit{as'}$ \quad(otherwise) 
\end{tabular}    
\end{center}  

\noindent
Here $\textit{flatten}$ behaves like the traditional functional programming flatten
function, except that it also removes $\ZERO$s. Or in terms of regular expressions, it
removes parentheses, for example changing $a+(b+c)$ into $a+b+c$.
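
A Scala sketch of the simplification is given below (we call the
functions \texttt{bsimp} and \texttt{flats}; note that in the
$\ONE$-case we also fuse the bits sitting on the $\ONE$ itself, and we
omit the clause where $\ONE$ appears as the second component of a
sequence, since naively fusing its bits to the front would put them in
the wrong place):

\begin{verbatim}
// flats: open up nested alternatives and drop ZEROs
def flats(as: List[ARexp]): List[ARexp] = as match {
  case Nil => Nil
  case AZERO :: rest => flats(rest)
  case AALTS(bs, as1) :: rest => as1.map(fuse(bs, _)) ::: flats(rest)
  case a :: rest => a :: flats(rest)
}

def bsimp(a: ARexp): ARexp = a match {
  case ASEQ(bs, a1, a2) => (bsimp(a1), bsimp(a2)) match {
    case (AZERO, _) => AZERO
    case (_, AZERO) => AZERO
    case (AONE(bs1), a2s) => fuse(bs ::: bs1, a2s)
    case (a1s, a2s) => ASEQ(bs, a1s, a2s)
  }
  case AALTS(bs, as) => flats(as.map(bsimp)).distinct match {
    case Nil => AZERO
    case a1 :: Nil => fuse(bs, a1)
    case as1 => AALTS(bs, as1)
  }
  case a1 => a1
}
\end{verbatim}

Here the built-in \texttt{distinct} on Scala lists keeps only the
first occurrence of each duplicate, exactly the behaviour required
above.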

Having defined the $\simp$ function,
we can use the previous notation of  natural
extension from derivative w.r.t.~character to derivative
w.r.t.~string:%\comment{simp in  the [] case?}

\begin{center}
\begin{tabular}{lcl}
$r \backslash_{simp} (c\!::\!s) $ & $\dn$ & $(r \backslash_{simp}\, c) \backslash_{simp}\, s$ \\
$r \backslash_{simp} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}

\noindent
to obtain an optimised version of the algorithm:

 \begin{center}
\begin{tabular}{lcl}
  $\textit{blexer\_simp}\;r\,s$ & $\dn$ &
      $\textit{let}\;a = (r^\uparrow)\backslash_{simp}\, s\;\textit{in}$\\                
  & & $\;\;\textit{if}\; \textit{bnullable}(a)$\\
  & & $\;\;\textit{then}\;\textit{decode}\,(\textit{bmkeps}\,a)\,r$\\
  & & $\;\;\textit{else}\;\textit{None}$
\end{tabular}
\end{center}

\noindent
This algorithm keeps the regular expression size small, for example,
with this simplification our previous $(a + aa)^*$ example's 8000 nodes
will be reduced to just 6 and stays constant, no matter how long the
input string is.
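
In Scala, the optimised lexer differs from \texttt{blexer} only in
that it simplifies after each derivative step (a sketch):

\begin{verbatim}
def bdersSimp(s: List[Char], a: ARexp): ARexp = s match {
  case Nil => a
  case c :: cs => bdersSimp(cs, bsimp(bder(c, a)))
}

def blexerSimp(r: Rexp, s: String): Option[Val] = {
  val a = bdersSimp(s.toList, internalise(r))
  if (bnullable(a)) decode(bmkeps(a), r) else None
}
\end{verbatim}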







%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
\section{Specifications of Certain Functions to be Used}
\label{retrieveDef}
Here we give the definitions of some functions that we will use later.
The function $\retrieve$ extracts the bitcode from an annotated
regular expression, guided by a value:
\begin{center}
\begin{tabular}{lcl}
$\retrieve \; (_{bs}\ONE) \; \Empty$ & $\dn$ & $bs$\\
$\retrieve \; (_{bs}{\bf c}) \; (\Char\,c)$ & $\dn$ & $bs$\\
$\retrieve \; (_{bs}\sum (a::\textit{as})) \; (\Left\,v)$ & $\dn$ & $bs \,@\, \retrieve \; a \; v$\\
$\retrieve \; (_{bs}\sum (a::\textit{as})) \; (\Right\,v)$ & $\dn$ & $bs \,@\, \retrieve \; (_{[]}\sum \textit{as}) \; v$\\
$\retrieve \; (_{bs}a_1 \cdot a_2) \; (\Seq\,v_1\,v_2)$ & $\dn$ & $bs \,@\, \retrieve \; a_1 \; v_1 \,@\, \retrieve \; a_2 \; v_2$\\
$\retrieve \; (_{bs}a^*) \; (\Stars\,[])$ & $\dn$ & $bs \,@\, [0]$\\
$\retrieve \; (_{bs}a^*) \; (\Stars\,(v::vs))$ & $\dn$ & $bs \,@\, [1] \,@\, \retrieve \; a \; v \,@\, \retrieve \; (_{[]}a^*) \; (\Stars\,vs)$
\end{tabular}
\end{center}


%----------------------------------------------------------------------------------------
%	SECTION  correctness proof
%----------------------------------------------------------------------------------------
\section{Correctness of Bit-coded Algorithm (Without Simplification)}
We now give the proof of the correctness of the bit-coded algorithm.

Ausaf and Urban cleverly defined an auxiliary function called $\flex$:
\begin{center}
\begin{tabular}{lcl}
$\flex \; r \; f \; [\,] \; v$ & $\dn$ & $f\; v$\\
$\flex \; r \; f \; (c :: s) \; v$ & $\dn$ & $\flex \; (r\backslash c) \; (\lambda v.\, f\, (\inj \; r\; c\; v))\; s \; v$
\end{tabular}
\end{center}
\noindent
It accumulates the characters that need to be injected back, 
and performs the injections in a stack-like manner (the last derivative
taken is the first to be injected).
$\flex$ is connected to $\lexer$ as follows:
\begin{lemma}
$\flex \; r \; \textit{id}\; s \; (\mkeps \, (r\backslash s)) = \lexer \; r \; s$
\end{lemma}
$\flex$ provides a bridge between $\lexer$ and $\blexer$.
What is even better about $\flex$ is that it allows us to 
operate directly on the value $\mkeps \, (r\backslash s)$,
which is pivotal in the definitions of $\lexer$ and $\blexer$, but not
visible as an argument.
When the value created by $\mkeps$ becomes available, one can 
prove some stepwise properties of lexing nicely:
\begin{lemma}\label{flexStepwise}
$\textit{flex} \; r \; f \; (s@[c]) \; v = \flex \; r \; f\; s \; (\inj \; (r\backslash s) \; c \; v) $
\end{lemma}
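
For concreteness, here is a Scala sketch of $\flex$, building on the
plain \texttt{der} and \texttt{inj} from earlier (the pending
injections are accumulated in the closure \texttt{f}):

\begin{verbatim}
def flex(r: Rexp, f: Val => Val, s: List[Char], v: Val): Val =
  s match {
    case Nil => f(v)
    case c :: cs => flex(der(c, r), v1 => f(inj(r, c, v1)), cs, v)
  }
\end{verbatim}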

For $\blexer$ we have a function with stepwise properties like $\flex$
as well, namely $\retrieve$ (defined in Section~\ref{retrieveDef}).
$\retrieve$ extracts bitcodes from annotated regular expressions,
guided by a value.
$\retrieve$ is connected to $\blexer$ in the following way:
\begin{lemma}\label{blexer_retrieve}
$\blexer \; r \; s = \decode  \; (\retrieve \; (\internalise \; r) \; (\mkeps \; (r \backslash s) )) \; r$
\end{lemma}
If one takes the derivative of an annotated regular expression, 
one can $\retrieve$ the same bitcodes as before the derivative took
place, provided that one uses the corresponding injected value:

\begin{lemma}\label{retrieveStepwise}
$\retrieve \; (r \backslash c)  \;  v= \retrieve \; r \; (\inj \; r\; c\; v)$
\end{lemma}
The other good thing about $\retrieve$ is that it can be connected to $\flex$:
%centralLemma1
\begin{lemma}\label{flex_retrieve}
$\flex \; r \; \textit{id}\; s\; v = \decode \; (\retrieve \; (r\backslash s )\; v) \; r$
\end{lemma}
\begin{proof}
By reverse induction on the string $s$, where $v$ is kept arbitrary.
The crucial point is to rewrite 
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \, (r \backslash (s@[c])))
\]
as
\[
\retrieve \; (r \backslash s) \; (\inj \; (r \backslash s) \; c\; (\mkeps \, (r \backslash (s@[c]))))
\]
using Lemma~\ref{retrieveStepwise}. This enables us to equate 
\[
\retrieve \; (r \backslash (s@[c])) \; (\mkeps \, (r \backslash (s@[c])))
\] 
with 
\[
\flex \; r \; \textit{id} \; s \; (\inj \; (r\backslash s) \; c\; (\mkeps \, (r\backslash (s@[c]))))
\]
which in turn can be rewritten as
\[
\flex \; r \; \textit{id} \; (s@[c]) \; (\mkeps \, (r\backslash (s@[c])))
\]
by Lemma~\ref{flexStepwise}.
\end{proof}

With the above lemma we can now link $\flex$ and $\blexer$.

\begin{lemma}\label{flex_blexer}
$\textit{flex} \; r \; \textit{id} \; s \; (\mkeps\,(r \backslash s))  = \blexer \; r \; s$
\end{lemma}
\begin{proof}
By Lemmas~\ref{flex_retrieve} and~\ref{blexer_retrieve}.
\end{proof}
Finally, combining this with the lemma connecting $\flex$ and $\lexer$
above yields the correctness of the bit-coded algorithm:
\begin{theorem}
$\lexer \; r \; s = \blexer \; r \; s$
\end{theorem}