ChengsongTanPhdThesis/Chapters/Cubic.tex
author Chengsong
Fri, 02 Sep 2022 00:14:19 +0100
changeset 592 7f4c353c0f6b
parent 591 b2d0de6aee18
child 594 62f8fa03863e
permissions -rwxr-xr-x
more

% Chapter Template

\chapter{A Better Bound and Other Extensions} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
\lstset{style=myScalastyle}


This chapter is a ``miscellaneous''
chapter which records various
extensions to our $\blexersimp$'s formalisations.\\
Firstly we present further improvements
made to our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
similar components in two regular expressions at the same 
alternative level,
even if these regular expressions are not exactly the same.
We call the lexer that uses this stronger simplification function
$\blexerStrong$.
We conjecture that both
\begin{center}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{center}
and
\begin{center}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{center}
hold, but formalising
them is still work in progress.
We give reasons why the correctness and cubic size bound proofs
can be achieved,
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Animirov's partial derivatives.\\
Secondly, we extend our $\blexersimp$
to support bounded repetitions ($r^{\{n\}}$).
We update our formalisation of 
the correctness and finiteness properties to
include this new construct. With bounded repetitions
we are able to out-compete other verified lexers such as
Verbatim++ on regular expressions which involve a lot of
repetitions. We also present the idempotency property proof
of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
This reinforces our claim that the fixpoint construction
originally required by Sulzmann and Lu can be removed in $\blexersimp$.
\\
Last but not least, we present our efforts and challenges we met
in further improving the algorithm by data
structures such as zippers.



%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence 
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
For example, the regular expression 
\[
	aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
\]
contains three terms, 
expressing three possibilities it will match future input.
The first and the third terms are identical, which means we can eliminate
the latter as we know it will not be picked up by $\bmkeps$.
In $\bsimps$, the $\distinctBy$ function takes care of this.
The criteria $\distinctBy$ uses for removing a duplicate
$a_2$ in the list
\begin{center}
	$rs_a@[a_1]@rs_b@[a_2]@rs_c$
\end{center}
is that 
\begin{center}
	$\rerase{a_1} = \rerase{a_2}$.
\end{center}
It can be characterised as the $LD$ 
rewrite rule in \ref{rrewriteRules}.\\
The problem , however, is that identical components
in two slightly different regular expressions cannot be removed:
\begin{figure}[H]
\[
	(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
\]
\caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup}
\end{figure}
\noindent
A simplification like this actually
cannot be omitted,
as without it the size could blow up even with our $\simp$ function:
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    ]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}
\caption{Runtime of $\blexersimp$ for matching 
	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ 
	with strings 
	of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
\end{figure}
\noindent
We would like to apply the rewriting at some stage
\begin{figure}[H]
\[
	(a+b+d) \cdot r_1  \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
\]
\caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp}
\end{figure}
\noindent
in our $\simp$ function,
so that it makes the simplification in \ref{partialDedup} possible.
For example,
can a function like the below one be used?
%TODO: simp' that spills things

Unfortunately,
if we introduce them in our
setting we would lose the POSIX property of our calculated values. 
For example given the regular expression 
\begin{center}
	$(a + ab)(bc + c)$
\end{center}
and the string
\begin{center}
	$ab$,
\end{center}
then our algorithm generates the following
correct POSIX value
\begin{center}
	$\Seq \; (\Right \; ab) \; (\Right \; c)$.
\end{center}
Essentially it matches the string with the longer Right-alternative
in the first sequence (and
then the 'rest' with the character regular expression $c$ from the second sequence).
If we add the simplification above, then we obtain the following value
\begin{center}
	$\Left \; (\Seq \; a \; (\Left \; bc))$
\end{center}
where the $\Left$-alternatives get priority. 
However this violates the POSIX rules.
The reason for getting this undesired value
is that the new rule splits this regular expression up into 
\begin{center}
	$a\cdot(b c + c) + ab \cdot (bc + c)$,
\end{center}
which becomes a regular expression with a 
totally different structure--the original
was a sequence, and now it becomes an alternative.
With an alternative the maximum munch rule no longer works.\\
A method to reconcile this is to do the 
transformation in \ref{desiredSimp} ``non-invasively'',
meaning that we traverse the list of regular expressions
\begin{center}
	$rs_a@[a]@rs_c$
\end{center}
in the alternative
\begin{center}
	$\sum ( rs_a@[a]@rs_c)$
\end{center}
using  a function similar to $\distinctBy$,
but this time 
we allow a more general list rewrite:
\begin{mathpar}\label{cubicRule}
		\inferrule{\vspace{0mm} }{rs_a@[a]@rs_c 
			\stackrel{s}{\rightsquigarrow }
		rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
\end{mathpar}
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
where $\textit{prune} \;a \; acc$ traverses $a$
without altering the structure of $a$, removing components in $a$
that have appeared in the accumulator $acc$.
For example
\begin{center}
	$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
\end{center}
should be equal to 
\begin{center}
	$(r_g+r_h)r_d$
\end{center}
because $r_gr_d$ and 
$r_hr_d$ are the only terms
that have not appeared in the accumulator list 
\begin{center}
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
\end{center}
We implemented 
function $\textit{prune}$ in Scala,
and incorporated into our lexer,
by replacing the $\simp$ function 
with a stronger version called $\bsimpStrong$
that prunes regular expressions.
We call this lexer $\blexerStrong$.
$\blexerStrong$ is able to drastically reduce the
internal data structure size which could
trigger exponential behaviours in
$\blexersimp$.
\begin{figure}[H]
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    ]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    myplotstyle,
    xlabel={input length},
    ylabel={size},
    ]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
	   of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
\end{figure}
\begin{figure}[H]
\begin{lstlisting}
    def atMostEmpty(r: Rexp) : Boolean = r match {
      case ZERO => true
      case ONE => true
      case STAR(r) => atMostEmpty(r)
      case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
      case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
      case CHAR(_) => false
    }


    def isOne(r: Rexp) : Boolean = r match {
      case ONE => true
      case SEQ(r1, r2) => isOne(r1) && isOne(r2)
      case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne)
      case STAR(r0) => atMostEmpty(r0)
      case CHAR(c) => false
      case ZERO => false
    }

    //r = r' ~ tail' : If tail' matches tail => returns r'
    def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match {
      case SEQ(r1, r2) => 
        if(r2 == tail) 
          r1
        else
          ZERO
      case r => ZERO
    }

    def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
      case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match
      {
        //all components have been removed, meaning this is effectively a duplicate
        //flats will take care of removing this AZERO 
        case Nil => AZERO
        case r::Nil => fuse(bs, r)
        case rs1 => AALTS(bs, rs1)
      }
      case ASEQ(bs, r1, r2) => 
        //remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1
        prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
          //after pruning, returns 0
          case AZERO => AZERO
          //after pruning, got r1'.r2, where r1' is equal to 1
          case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
          //assemble the pruned head r1p with r2
          case r1p => ASEQ(bs, r1p, r2)
        }
        //this does the duplicate component removal task
      case r => if(acc(erase(r))) AZERO else r
    }
\end{lstlisting}
\caption{pruning function together with its helper functions}
\end{figure}
\noindent
The benefits of using 
$\textit{prune}$ such as refining the finiteness bound
to a cubic bound has not been formalised yet.
Therefore we choose to use Scala code rather than an Isabelle-style formal 
definition like we did for $\simp$, as the definitions might change
to suit proof needs.
In the rest of the chapter we will use this convention consistently.
\begin{figure}[H]
\begin{lstlisting}
  def distinctWith(rs: List[ARexp], 
    pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
    acc: Set[Rexp] = Set()) : List[ARexp] = 
      rs match{
        case Nil => Nil
        case r :: rs => 
          if(acc(erase(r)))
            distinctWith(rs, pruneFunction, acc)
          else {
            val pruned_r = pruneFunction(r, acc)
            pruned_r :: 
            distinctWith(rs, 
              pruneFunction, 
              turnIntoTerms(erase(pruned_r)) ++: acc
              )
          }
      }
\end{lstlisting}
\caption{A Stronger Version of $\textit{distinctBy}$}
\end{figure}
\noindent
The function $\textit{prune}$ is used in $\distinctWith$.
$\distinctWith$ is a stronger version of $\distinctBy$
which not only removes duplicates as $\distinctBy$ would
do, but also uses the $\textit{pruneFunction}$
argument to prune away verbose components in a regular expression.\\
\begin{figure}[H]
\begin{lstlisting}
   //a stronger version of simp
    def bsimpStrong(r: ARexp): ARexp =
    {
      r match {
        case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
          //normal clauses same as simp
        case (AZERO, _) => AZERO
        case (_, AZERO) => AZERO
        case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
        //bs2 can be discarded
        case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
        case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
        }
        case AALTS(bs1, rs) => {
          //distinctBy(flat_res, erase)
          distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
            case Nil => AZERO
            case s :: Nil => fuse(bs1, s)
            case rs => AALTS(bs1, rs)
          }
        }
        //stars that can be treated as 1
        case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
        case r => r
      }
    }
\end{lstlisting}
\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
\end{figure}
\noindent
$\distinctWith$, is in turn used in $\bsimpStrong$:
\begin{figure}[H]
\begin{lstlisting}
      //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)
      def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
        case Nil => r
        case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
      }
\end{lstlisting}
\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
\end{figure}
\noindent
We conjecture that the above Scala function $\bdersStrongs$,
written $\bdersStrong{\_}{\_}$ as an infix notation, 
satisfies the following property:
\begin{conjecture}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{conjecture}
The stronger version of $\blexersimp$'s
code in Scala looks like: 
\begin{figure}[H]
\begin{lstlisting}
      def strongBlexer(r: Rexp, s: String) : Option[Val] = {
        Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)
      }
      def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {
        case Nil => {
          if (bnullable(r)) {
            mkepsBC(r)
          }
          else
            throw new Exception("Not matched")
        }
        case c::cs => {
          strong_blex_simp(strongBsimp(bder(c, r)), cs)
        }
      }
\end{lstlisting}
\end{figure}
\noindent
We would like to preserve the correctness like the one 
we had for $\blexersimp$:
\begin{conjecture}\label{cubicConjecture}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{conjecture}
\noindent
We introduce the new rule \ref{cubicRule}
and augment our proofs in chapter \ref{Bitcoded2}.
The idea is to maintain the properties like
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ etc.

In the next section,
we will describe why we 
believe a cubic bound can be achieved.
We give an introduction to the
partial derivatives,
which was invented by Antimirov \cite{Antimirov95},
and then link it with the result of the function
$\bdersStrongs$.
 
\section{Antimirov's partial derivatives}
This suggests a link towrads "partial derivatives"
 introduced  The idea behind Antimirov's partial derivatives
is to do derivatives in a similar way as suggested by Brzozowski, 
but maintain a set of regular expressions instead of a single one:

%TODO: antimirov proposition 3.1, needs completion
 \begin{center}  
 \begin{tabular}{ccc}
 $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
$\partial_x(\ONE)$ & $=$ & $\phi$
\end{tabular}
\end{center}

Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
using the alternatives constructor, Antimirov cleverly chose to put them into
a set instead.  This breaks the terms in a derivative regular expression up, 
allowing us to understand what are the "atomic" components of it.
For example, To compute what regular expression $x^*(xx + y)^*$'s 
derivative against $x$ is made of, one can do a partial derivative
of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
To get all the "atomic" components of a regular expression's possible derivatives,
there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
whatever character is available at the head of the string inside the language of a
regular expression, and gives back the character and the derivative regular expression
as a pair (which he called "monomial"):
 \begin{center}  
 \begin{tabular}{ccc}
 $\lf(\ONE)$ & $=$ & $\phi$\\
$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
 $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
\end{tabular}
\end{center}
%TODO: completion

There is a slight difference in the last three clauses compared
with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular 
expression $r$ with every element inside $\textit{rset}$ to create a set of 
sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates 
on a set of monomials (which Antimirov called "linear form") and a regular 
expression, and returns a linear form:
 \begin{center}  
 \begin{tabular}{ccc}
 $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
 $l \bigodot (\ONE)$ & $=$ & $l$\\
 $\phi \bigodot t$ & $=$ & $\phi$\\
 $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
 $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
  $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
 $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
 $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
\end{tabular}
\end{center}
%TODO: completion

 Some degree of simplification is applied when doing $\bigodot$, for example,
 $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
 and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
  $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
  and so on.
  
  With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with 
  an iterative procedure:
   \begin{center}  
 \begin{tabular}{llll}
$\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
 		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
$\partial_{UNIV}(r)$ & $=$ & $\PD$ &		     
\end{tabular}
\end{center}


 $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,




 \section{The NTIMES Constructor, and the Size Bound and Correctness Proof for it}
The NTIMES construct has the following closed form:
\begin{verbatim}
"rders_simp (RNTIMES r0 (Suc n)) (c#s) = 
rsimp ( RALTS ( (map (optermsimp r0 ) (nupdates s r0 [Some ([c], n)]) ) ))"
\end{verbatim}
%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------

\section{Adding Support for the Negation Construct, and its Correctness Proof}
We now add support for the negation regular expression:
\[			r ::=   \ZERO \mid  \ONE
			 \mid  c  
			 \mid  r_1 \cdot r_2
			 \mid  r_1 + r_2   
			 \mid r^*    
			 \mid \sim r
\]
The $\textit{nullable}$ function's clause for it would be 
\[
\textit{nullable}(~r) = \neg \nullable(r)
\]
The derivative would be
\[
~r \backslash c = ~ (r \backslash c)
\]
 
The most tricky part of lexing for the $~r$ regular expression
 is creating a value for it.
 For other regular expressions, the value aligns with the 
 structure of the regular expression:
 \[
 \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
 \]
But for the $~r$ regular expression, $s$ is a member of it if and only if
$s$ does not belong to $L(r)$. 
That means when there
is a match for the not regular expression, it is not possible to generate how the string $s$ matched
with $r$. 
What we can do is preserve the information of how $s$ was not matched by $r$,
and there are a number of options to do this.

We could give a partial value when there is a partial match for the regular expression inside
the $\mathbf{not}$ construct.
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
A value for it could be 
 \[
 \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
 \]
 The above example demonstrates what value to construct 
 when the string $s$ is at most a real prefix
 of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
 in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
 constructor.
 
 Another option would be to either store the string $s$ that resulted in 
 a mis-match for $r$ or a dummy value as a placeholder:
 \[
 \vdash \textit{Not}(abcd) : ~( r_1 )
 \]
or
 \[
 \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
 \] 
 We choose to implement this as it is most straightforward:
 \[
 \mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
 \]
 
%----------------------------------------------------------------------------------------
%	SECTION 2
%----------------------------------------------------------------------------------------

\section{Bounded Repetitions}