lexing: ChengsongTanPhdThesis/Chapters/Cubic.tex@dd9dde2d902b

% Chapter Template

%We also present the idempotency property proof
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
%This reinforces our claim that the fixpoint construction
%originally required by Sulzmann and Lu can be removed in $\blexersimp$.
%Last but not least, we present our efforts and challenges we met
%in further improving the algorithm by data
%structures such as zippers.
%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
%TODO: search for isabelle proofs of algorithms that check equivalence 

\chapter{A Better Size Bound for Derivatives} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
\lstset{style=myScalastyle}


This chapter is a ``work-in-progress''
chapter which records
extensions to our $\blexersimp$.
We make a conjecture that the finite
size bound from the previous chapter can be 
improved to a cubic bound.
We implemented our conjecture in Scala.
We intend to formalise this part in Isabelle/HOL at a
later stage.
%we have not been able to finish due to time constraints of the PhD.
Nevertheless, we outline the ideas we intend to use for the proof.

\section{A Stronger Version of Simplification}

We present further improvements
for our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
similar components in two regular expressions at the same 
alternative level,
even if these regular expressions are not exactly the same.
We call the lexer that uses this stronger simplification function
$\blexerStrong$.
%Unfortunately we did not have time to 
%work out the proofs, like in
%the previous chapters.
We conjecture that both
\begin{center}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{center}
and 
\begin{center}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{center}
hold.
%but a formalisation
%is still future work.
We give an informal justification 
why the correctness and cubic size bound proofs
can be achieved
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Animirov's partial derivatives.

In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
For example, the regular expression 
\[
	aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
\]
contains three terms, 
expressing three possibilities for how it can match some input.
The first and the third terms are identical, which means we can eliminate
the latter as it will not contribute to a POSIX value.
In $\bsimps$, the $\distinctBy$ function takes care of 
such instances.
The criteria $\distinctBy$ uses for removing a duplicate
$a_2$ in the list
\begin{center}
	$rs_a@[a_1]@rs_b@[a_2]@rs_c$
\end{center}
is that 
the two erased regular expressions are equal
\begin{center}
	$\rerase{a_1} = \rerase{a_2}$.
\end{center}
This is characterised as the $LD$ 
rewrite rule in figure \ref{rrewriteRules}.
The problem, however, is that identical components
in two slightly different regular expressions cannot be removed
by the $LD$ rule.
Consider the stronger simplification
\begin{equation}
	\label{eqn:partialDedup}
	(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
\end{equation}
where the $(\underline{a}+c+e)\cdot r_1$ is deleted in the right alternative
$a+c+e$.
This is permissible because we have $(a+\ldots)\cdot r_1$ in the left
alternative.
The difficulty is that such  ``buried''
alternatives-sequences are not easily recognised.
But simplification like this actually
cannot be omitted, if we want to have a better bound.
For example, the size of derivatives can still
blow up even with our $\textit{bsimp}$ 
function: 
consider again the example
$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,
and set $n$ to a relatively small number like $n=5$, then we get the following
exponential growth:
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    ]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}
\caption{Size of derivatives of $\blexersimp$ from chapter 5 for matching 
	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ 
	with strings 
	of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
\end{figure}
\noindent
One possible approach would be to apply the rewriting
rule
\[
	(a+b+d) \cdot r_1  \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
\]
\noindent
which pushes the sequence into the alternatives
in our $\simp$ function.
This would then make the simplification shown in \eqref{eqn:partialDedup} possible.
Translating this rule into our $\textit{bsimp}$ function would simply
involve adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
						       && $\ldots$\\
   &&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum (
   \map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
	\end{tabular}
\end{center}
\noindent
Unfortunately,
if we introduce this clause in our
setting we would lose the POSIX property of our calculated values. 
For example given the regular expression 
\begin{center}
	$(a + ab)(bc + c)$
\end{center}
and the string $ab$,
then our algorithm generates the following
correct POSIX value
\begin{center}
	$\Seq \; (\Right \; ab) \; (\Right \; c)$.
\end{center}
Essentially it matches the string with the longer Right-alternative
in the first sequence (and
then the 'rest' with the character regular expression $c$ from the second sequence).
If we add the simplification above, however, 
then we would obtain the following value
\begin{center}
	$\Left \; (\Seq \; a \; (\Left \; bc))$
\end{center}
where the $\Left$-alternatives get priority. 
This violates the POSIX rules.
The reason for getting this undesired value
is that the new rule splits this regular expression up into 
a topmost alternative
\begin{center}
	$a\cdot(b c + c) + ab \cdot (bc + c)$,
\end{center}
which is a regular expression with a 
quite different meaning: the original regular expression
is a sequence, but the simplified regular expression is an alternative.
With an alternative the maximal munch rule no longer works.


A method to reconcile this problem is to do the 
transformation in \eqref{eqn:partialDedup} ``non-invasively'',
meaning that we traverse the list of regular expressions
%\begin{center}
%	$rs_a@[a]@rs_c$
%\end{center}
inside alternatives
\begin{center}
	$\sum ( rs_a@[a]@rs_c)$
\end{center}
using  a function similar to $\distinctBy$,
but this time 
we allow the following more general rewrite rule:
\begin{equation}
\label{eqn:cubicRule}
%\mbox{
%\begin{mathpar}	
	\inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c 
			\stackrel{s}{\rightsquigarrow }
		rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
%\end{mathpar}
%\caption{The rule capturing the pruning simplification needed to achieve
%a cubic bound}
\end{equation}
\noindent
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
where $\textit{prune} \;a \; acc$ traverses $a$
without altering the structure of $a$, but removing components in $a$
that have appeared in the accumulator $acc$.
For example
\begin{center}
	$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
\end{center}
should be equal to 
\begin{center}
	$(r_g+r_h)r_d$
\end{center}
because $r_gr_d$ and 
$r_hr_d$ are the only terms
that do not appeared in the accumulator list 
\begin{center}
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
\end{center}
We implemented the
function $\textit{prune}$ in Scala (see figure \ref{fig:pruneFunc})
The function $\textit{prune}$ 
is a stronger version of $\textit{distinctBy}$.
It does not just walk through a list looking for exact duplicates,
but prunes sub-expressions recursively.
It manages proper contexts by the helper functions
$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$.
\begin{figure}%[H]
\begin{lstlisting}[numbers=left]
    def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
      case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match
      {
        //all components have been removed, meaning this is effectively a duplicate
        //flats will take care of removing this AZERO 
        case Nil => AZERO
        case r::Nil => fuse(bs, r)
        case rs1 => AALTS(bs, rs1)
      }
      case ASEQ(bs, r1, r2) => 
        //remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1
        prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
          //after pruning, returns 0
          case AZERO => AZERO
          //after pruning, got r1'.r2, where r1' is equal to 1
          case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
          //assemble the pruned head r1p with r2
          case r1p => ASEQ(bs, r1p, r2)
        }
        //this does the duplicate component removal task
      case r => if(acc(erase(r))) AZERO else r
    }
\end{lstlisting}
\caption{The function $\textit{prune}$ is called recursively in the 
alternative case (line 2) and in the sequence case (line 12).
In the alternative case we keep all the accumulators the same, but
in the sequence case we are making necessary changes to the accumulators 
to allow correct de-duplication.}\label{fig:pruneFunc}
\end{figure}
\noindent

\begin{figure}
\begin{lstlisting}[numbers=left]
    def atMostEmpty(r: Rexp) : Boolean = r match {
      case ZERO => true
      case ONE => true
      case STAR(r) => atMostEmpty(r)
      case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
      case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
      case CHAR(_) => false
    }

    def isOne(r: Rexp) : Boolean = r match {
      case ONE => true
      case SEQ(r1, r2) => isOne(r1) && isOne(r2)
      case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))
      case STAR(r0) => atMostEmpty(r0)
      case CHAR(c) => false
      case ZERO => false
    }

    def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = 
      if (r == tail)
        ONE
      else {
        r match {
          case SEQ(r1, r2) => 
            if(r2 == tail) r1 else ZERO
          case r => ZERO
        }
      }


\end{lstlisting}
\caption{The helper functions of $\textit{prune}$: 
  $\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$}
\end{figure}
\noindent
Suppose we feed 
\begin{center}
	$r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$
\end{center}
and 
\begin{center}
	$acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$
\end{center}
as the input into $\textit{prune}$.
The end result will be
\[
	b\cdot(g\cdot(a\cdot(d\cdot e)))
\]
where the underlined components in $r$ are eliminated.
Looking more closely, at the topmost call 
\[
	\textit{prune} \quad (\ONE+
	(f+b)\cdot g)\cdot (a\cdot(d\cdot e)) \quad
	\{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}
\]
The sequence clause will be called,
where a sub-call
\[
	\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}
\]
is made. The terms in the new accumulator $\{\ONE,\; f\cdot g \}$ come from
the two calls to $\textit{removeSeqTail}$:
\[
	\textit{removeSeqTail} \quad\;\; a \cdot(d\cdot e) \quad\;\; a \cdot(d\cdot e) 
\]
and
\[
	\textit{removeSeqTail} \quad \;\; 
	f\cdot(g\cdot (a \cdot(d\cdot e)))\quad  \;\; a \cdot(d\cdot e).
\]

The idea behind $\textit{removeSeqTail}$ is that
when pruning recursively, we need to ``zoom in''
to sub-expressions, and this ``zoom in'' needs to be performed
on the
accumulators as well, otherwise the deletion will not work.
The sub-call 
$\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$
is simpler, which will trigger the alternative clause, causing
a pruning on each element in $(\ONE+(f+b)\cdot g)$,
leaving us with $b\cdot g$ only.

Our new lexer with stronger simplification
uses $\textit{prune}$ by making it 
the core component of the deduplicating function
called $\textit{distinctWith}$.
$\textit{DistinctWith}$ ensures that all verbose
parts of a regular expression are pruned away.

\begin{figure}[H]
\begin{lstlisting}[numbers=left]
    def turnIntoTerms(r: Rexp): List[Rexp] = r match {
      case SEQ(r1, r2)  => 
        turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2))
          case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
          case ZERO => Nil
          case _ => r :: Nil
    }

    def distinctWith(rs: List[ARexp], 
      pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
      acc: Set[Rexp] = Set()) : List[ARexp] = 
        rs match{
          case Nil => Nil
          case r :: rs => 
            if(acc(erase(r)))
              distinctWith(rs, pruneFunction, acc)
            else {
              val pruned_r = pruneFunction(r, acc)
              pruned_r :: 
              distinctWith(rs, 
                pruneFunction, 
                turnIntoTerms(erase(pruned_r)) ++: acc
                )
            }
        }

\end{lstlisting}
\caption{A Stronger Version of $\textit{distinctBy}$ XXX}
\end{figure}
\noindent
Once a regular expression has been pruned,
all its components will be added to the accumulator
to remove any future regular expressions' duplicate components.

The function $\textit{bsimpStrong}$
is very much the same as $\textit{bsimp}$, just with
$\textit{distinctBy}$ replaced 
by $\textit{distinctWith}$.
\begin{figure}[H]
\begin{lstlisting}

    def bsimpStrong(r: ARexp): ARexp =
    {
      r match {
        case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
        case (AZERO, _) => AZERO
        case (_, AZERO) => AZERO
        case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
        case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
        case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
        }
        case AALTS(bs1, rs) => {
          distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
            case Nil => AZERO
            case s :: Nil => fuse(bs1, s)
            case rs => AALTS(bs1, rs)
          }
        }
        case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
        case r => r
      }
    }
    def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
        case Nil => r
        case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
      }

\end{lstlisting}
\caption{The function
$\textit{bsimpStrong}$: a stronger version of $\textit{bsimp}$}
\end{figure}
\noindent
The benefits of using 
$\textit{prune}$ refining the finiteness bound
to a cubic bound has not been formalised yet.
Therefore we choose to use Scala code rather than an Isabelle-style formal 
definition like we did for $\simp$, as the definitions might change
to suit our proof needs.
In the rest of the chapter we will use this convention consistently.

%The function $\textit{prune}$ is used in $\distinctWith$.
%$\distinctWith$ is a stronger version of $\distinctBy$
%which not only removes duplicates as $\distinctBy$ would
%do, but also uses the $\textit{pruneFunction}$
%argument to prune away verbose components in a regular expression.\\
%\begin{figure}[H]
%\begin{lstlisting}
%   //a stronger version of simp
%    def bsimpStrong(r: ARexp): ARexp =
%    {
%      r match {
%        case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
%          //normal clauses same as simp
%        case (AZERO, _) => AZERO
%        case (_, AZERO) => AZERO
%        case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
%        //bs2 can be discarded
%        case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
%        case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
%        }
%        case AALTS(bs1, rs) => {
%          //distinctBy(flat_res, erase)
%          distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
%            case Nil => AZERO
%            case s :: Nil => fuse(bs1, s)
%            case rs => AALTS(bs1, rs)
%          }
%        }
%        //stars that can be treated as 1
%        case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
%        case r => r
%      }
%    }
%    def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
%        case Nil => r
%        case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
%      }
%\end{lstlisting}
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
%\end{figure}
%\noindent
%$\distinctWith$, is in turn used in $\bsimpStrong$:
%\begin{figure}[H]
%\begin{lstlisting}
%      //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)
%      def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
%        case Nil => r
%        case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
%      }
%\end{lstlisting}
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
%\end{figure}
%\noindent
We conjecture that the above Scala function $\bdersStrongs$,
written $\bdersStrong{\_}{\_}$ as an infix notation, 
satisfies the following property:
\begin{conjecture}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{conjecture}
\noindent
The stronger version of $\blexersimp$'s
code in Scala looks like: 
\begin{figure}[H]
\begin{lstlisting}
      def strongBlexer(r: Rexp, s: String) : Option[Val] = {
        Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)
      }
      def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {
        case Nil => {
          if (bnullable(r)) {
            mkepsBC(r)
          }
          else
            throw new Exception("Not matched")
        }
        case c::cs => {
          strong_blex_simp(strongBsimp(bder(c, r)), cs)
        }
      }
\end{lstlisting}
\end{figure}
\noindent
We call this lexer $\blexerStrong$.
This version is able to reduce the
size of the derivatives which 
otherwise 
triggered exponential behaviour in
$\blexersimp$.
Consider again the runtime for matching 
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
of the form $\protect\underbrace{aa..a}_{n}$.
They produce the folloiwng graphs ($\blexerStrong$ on the left-hand-side and
$\blexersimp$ on the right-hand-side).
\begin{figure}[H]
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    width = 7cm,
    height = 5cm,
    ]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    width = 7cm,
    height = 5cm,
    ]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{l}{}
\end{tabular}    
\caption{}\label{fig:aaaaaStarStar}
\end{figure}
\noindent
We hope the correctness is preserved.
%We would like to preserve the correctness like the one 
%we had for $\blexersimp$:
The proof idea is to preserve the key lemma in chapter \ref{Bitcoded2}
such as in equation \eqref{eqn:cubicRule}.
\begin{conjecture}\label{cubicConjecture}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{conjecture}
\noindent
The idea is to maintain key lemmas in
chapter \ref{Bitcoded2} like
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
with the new rewriting rule 
shown in figure \eqref{eqn:cubicRule} .

In the next sub-section,
we will describe why we 
believe a cubic size bound can be achieved with
the stronger simplification.
For this we give a short introduction to the
partial derivatives,
which were invented by Antimirov \cite{Antimirov95},
and then link them with the result of the function
$\bdersStrongs$.
 
\subsection{Antimirov's partial derivatives}
Partial derivatives were first introduced by 
Antimirov \cite{Antimirov95}.
They are very similar
to Brzozowski derivatives,
but split children of alternative regular expressions into 
multiple independent terms. This means the output of
partial derivatives is a 
set of regular expressions, defined as follows
\begin{center}  
	\begin{tabular}{lcl@{\hspace{-5mm}}l}
		$\partial_x \; (r_1 \cdot r_2)$ & 
		$\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup
		\partial_x \; r_2 \;$ & $ \textit{if} \; \; \nullable\; r_1$\\
				      & & $(\partial_x \; r_1)\cdot r_2 \quad\quad$ & $ 
		      \textit{otherwise}$\\ 
		$\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\
		$\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; 
		\textit{then} \;
		\{ \ONE\} \;\;\textit{else} \; \varnothing$\\
		$\partial_x(r_1+r_2)$ & $=$ & $\partial_x(r_1) \cup \partial_x(r_2)$\\
		$\partial_x(\ONE)$ & $=$ & $\varnothing$\\
		$\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\
	\end{tabular}
\end{center}
\noindent
The $\cdot$ in the example 
$(\partial_x \; r_1) \cdot r_2 $ 
is a shorthand notation for the cartesian product 
$(\partial_x \; r_1) \times \{ r_2\}$.
%Each element in the set generated by a partial derivative
%corresponds to a (potentially partial) match
%TODO: define derivatives w.r.t string s
Rather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ together
using the $\sum$ constructor, Antimirov put them into
a set.  This means many subterms will be de-duplicated
because they are sets.
For example, to compute what the derivative of the regular expression 
$x^*(xx + y)^*$
w.r.t. $x$ is, one can compute a partial derivative
and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.

The partial derivative w.r.t. a string is defined recursively:
\[
	\partial_{c::cs} r \dn \bigcup_{r'\in (\partial_c r)}
	\partial_{cs} r'
\]
Suppose an alphabet $\Sigma$, we use $\Sigma^*$ for the set of all possible strings
from the alphabet. 
The set of all possible partial derivatives is then defined
as the union of derivatives w.r.t all the strings: 
\begin{center}
	\begin{tabular}{lcl}
		$\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$
	\end{tabular}
\end{center}
\noindent
Consider now again our pathological case where we apply the more
aggressive 
simplification
\begin{center}
	$((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$
\end{center}
let use abbreviate theis regular expression with $r$,
then we have that
\begin{center}
$\textit{PDER}_{\Sigma^*} \; r = 
\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ 
	(\underbrace{a \ldots a}_{\text{j a's}}\cdot
(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$, 
\end{center}
The union on the right-hand-side has $n * (n + 1) / 2$ terms.
This leads us to believe that the maximum number of terms needed
in our derivative is also only $n * (n + 1) / 2$.
Therefore
we conjecture that $\bsimpStrong$ is also able to achieve this
upper limit in general
\begin{conjecture}\label{bsimpStrongInclusionPder}
	Using a suitable transformation $f$, we have that
	\begin{center}
		$\forall s.\; f \; (r \bdersStrong  \; s) \subseteq
		 \textit{PDER}_{\Sigma^*} \; r$
	\end{center}
	holds.
\end{conjecture}
\noindent
The reason is that our \eqref{eqn:cubicRule} will keep only one copy of each term,
where the function $\textit{prune}$ takes care of maintaining
a set like structure similar to partial derivatives.
%We might need to adjust $\textit{prune}$
%slightly to make sure all duplicate terms are eliminated,
%which should be doable.

Antimirov had proven that the sum of all the partial derivative 
terms' sizes is bounded by the cubic of the size of that regular
expression:
\begin{property}\label{pderBound}
	$\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \leq O(\llbracket r \rrbracket^3)$
\end{property}
\noindent
This property was formalised by Wu et al. \cite{Wu2014}, and the 
details can be found in the Archive of Formal Froofs\footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html}.
Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound}
would provide us with a cubic bound for our $\blexerStrong$ algorithm: 
\begin{conjecture}\label{strongCubic}
	$\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$
\end{conjecture}
\noindent
We leave this as future work.


%To get all the "atomic" components of a regular expression's possible derivatives,
%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
%whatever character is available at the head of the string inside the language of a
%regular expression, and gives back the character and the derivative regular expression
%as a pair (which he called "monomial"):
% \begin{center}  
% \begin{tabular}{ccc}
% $\lf(\ONE)$ & $=$ & $\phi$\\
%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
%\end{tabular}
%\end{center}
%%TODO: completion
%
%There is a slight difference in the last three clauses compared
%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular 
%expression $r$ with every element inside $\textit{rset}$ to create a set of 
%sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates 
%on a set of monomials (which Antimirov called "linear form") and a regular 
%expression, and returns a linear form:
% \begin{center}  
% \begin{tabular}{ccc}
% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
% $l \bigodot (\ONE)$ & $=$ & $l$\\
% $\phi \bigodot t$ & $=$ & $\phi$\\
% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
%  $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
%\end{tabular}
%\end{center}
%%TODO: completion
%
% Some degree of simplification is applied when doing $\bigodot$, for example,
% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
% and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
%  $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
%  and so on.
%  
%  With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with 
%  an iterative procedure:
%   \begin{center}  
% \begin{tabular}{llll}
%$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
% 		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
%		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
%$\partial_{UNIV}(r)$ & $=$ & $\PD$ &		     
%\end{tabular}
%\end{center}
%
%
% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,




%----------------------------------------------------------------------------------------
%	SECTION 2
%----------------------------------------------------------------------------------------


%The closed form for them looks like:
%%\begin{center}
%%	\begin{tabular}{llrclll}
%%		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\
%%		$\textit{rsimp}$ & $($ & $
%%		\sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
%%			 & & & & $\textit{nupdates} \;$ & 
%%			 $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
%%			 & & & & $)$ &\\
%%			 & &  $)$ & & &\\
%%			 & $)$ & & & &\\
%%	\end{tabular}
%%\end{center}
%\begin{center}
%	\begin{tabular}{llrcllrllll}
%		$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\
%			      &&&&$\textit{rsimp}$ & $($ & $
%			      \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
%					  &&&& & & & & $\;\; \textit{nupdates} \;$ & 
%			 		  $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
%					  &&&& & & & & $)$ &\\
%					  &&&& & &  $)$ & & &\\
%					  &&&& & $)$ & & & &\\
%	\end{tabular}
%\end{center}
%The $\textit{optermsimp}$ function with the argument $r$ 
%chooses from two options: $\ZERO$ or 
%We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$
%and $\starupdates$:
%\begin{center}
%	\begin{tabular}{lcl}
%		$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
%		$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
%						     & & $\textit{if} \; 
%						     (\rnullable \; (\rders \; r \; s))$ \\
%						     & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
%						     \starupdate \; c \; r \; Ss)$ \\
%						     & & $\textit{else} \;\; (s @ [c]) :: (
%						     \starupdate \; c \; r \; Ss)$
%	\end{tabular}
%\end{center}
%\noindent
%As a generalisation from characters to strings,
%$\starupdates$ takes a string instead of a character
%as the first input argument, and is otherwise the same
%as $\starupdate$.
%\begin{center}
%	\begin{tabular}{lcl}
%		$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
%		$\starupdates \; (c :: cs) \; r \; Ss$ &  $=$ &  $\starupdates \; cs \; r \; (
%		\starupdate \; c \; r \; Ss)$
%	\end{tabular}
%\end{center}
%\noindent



%\section{Zippers}
%Zipper is a data structure designed to operate on 
%and navigate between local parts of a tree.
%It was first formally described by Huet \cite{HuetZipper}.
%Typical applications of zippers involve text editor buffers
%and proof system databases.
%In our setting, the idea is to compactify the representation
%of derivatives with zippers, thereby making our algorithm faster.
%Some initial results 
%We first give a brief introduction to what zippers are,
%and other works
%that apply zippers to derivatives
%When dealing with large trees, it would be a waste to 
%traverse the entire tree if
%the operation only
%involves a small fraction of it.
%The idea is to put the focus on that subtree, turning other parts
%of the tree into a context
%
%
%One observation about our derivative-based lexing algorithm is that
%the derivative operation sometimes traverses the entire regular expression
%unnecessarily:


%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

%\section{Adding Support for the Negation Construct, and its Correctness Proof}
%We now add support for the negation regular expression:
%\[			r ::=   \ZERO \mid  \ONE
%			 \mid  c  
%			 \mid  r_1 \cdot r_2
%			 \mid  r_1 + r_2   
%			 \mid r^*    
%			 \mid \sim r
%\]
%The $\textit{nullable}$ function's clause for it would be 
%\[
%\textit{nullable}(~r) = \neg \nullable(r)
%\]
%The derivative would be
%\[
%~r \backslash c = ~ (r \backslash c)
%\]
% 
%The most tricky part of lexing for the $~r$ regular expression
% is creating a value for it.
% For other regular expressions, the value aligns with the 
% structure of the regular expression:
% \[
% \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
% \]
%But for the $~r$ regular expression, $s$ is a member of it if and only if
%$s$ does not belong to $L(r)$. 
%That means when there
%is a match for the not regular expression, it is not possible to generate how the string $s$ matched
%with $r$. 
%What we can do is preserve the information of how $s$ was not matched by $r$,
%and there are a number of options to do this.
%
%We could give a partial value when there is a partial match for the regular expression inside
%the $\mathbf{not}$ construct.
%For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
%A value for it could be 
% \[
% \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
% \]
% The above example demonstrates what value to construct 
% when the string $s$ is at most a real prefix
% of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
% in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
% constructor.
% 
% Another option would be to either store the string $s$ that resulted in 
% a mis-match for $r$ or a dummy value as a placeholder:
% \[
% \vdash \textit{Not}(abcd) : ~( r_1 )
% \]
%or
% \[
% \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
% \] 
% We choose to implement this as it is most straightforward:
% \[
% \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
% \]
% 
%
%\begin{center}
%	\begin{tabular}{lcl}
%		$\ntset \; r \; (n+1) \; c::cs $ & $\dn$ & $\nupdates \;
%		cs \; r \; [\Some \; ([c], n)]$\\
%		$\ntset \; r\; 0 \; \_$ &  $\dn$ &  $\None$\\
%		$\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\
%	\end{tabular}
%\end{center}
author	Chengsong
	Fri, 30 Dec 2022 01:52:32 +0000 (2022-12-30)
changeset 638	dd9dde2d902b
parent 630	d50a309a0645
child 639	80cc6dc4c98b
permissions	-rwxr-xr-x