% Chapter Template+ −
+ −
%We also present the idempotency property proof+ −
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.+ −
%This reinforces our claim that the fixpoint construction+ −
%originally required by Sulzmann and Lu can be removed in $\blexersimp$.+ −
%Last but not least, we present our efforts and challenges we met+ −
%in further improving the algorithm by data+ −
%structures such as zippers.+ −
%----------------------------------------------------------------------------------------+ −
% SECTION strongsimp+ −
%----------------------------------------------------------------------------------------+ −
%TODO: search for isabelle proofs of algorithms that check equivalence + −
+ −
\chapter{A Better Size Bound for Derivatives} % Main chapter title+ −
+ −
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound+ −
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the+ −
%algorithm to include constructs such as bounded repetitions and negations.+ −
\lstset{style=myScalastyle}+ −
+ −
+ −
This chapter is a ``work-in-progress''+ −
chapter which records+ −
extensions to our $\blexersimp$.+ −
We make a conjecture that the finite+ −
size bound from the previous chapter can be + −
improved to a cubic bound.+ −
We implemented our conjecture in Scala.+ −
We have not formalised this part in Isabelle/HOL. + −
%we have not been able to finish due to time constraints of the PhD.+ −
Nevertheless, we outline the ideas we intend to use for the proof.+ −
+ −
\section{A Stronger Version of Simplification}+ −
+ −
Let us first present further improvements+ −
for our lexer algorithm $\blexersimp$.+ −
We devise a stronger simplification algorithm,+ −
called $\bsimpStrong$, which can prune away+ −
similar components in two regular expressions at the same + −
alternative level,+ −
even if these regular expressions are not exactly the same.+ −
We call the lexer that uses this stronger simplification function+ −
$\blexerStrong$.+ −
%Unfortunately we did not have time to + −
%work out the proofs, like in+ −
%the previous chapters.+ −
We conjecture that both+ −
\begin{center}+ −
$\blexerStrong \;r \; s = \blexer\; r\;s$+ −
\end{center}+ −
and + −
\begin{center}+ −
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$+ −
\end{center}+ −
hold.+ −
%but a formalisation+ −
%is still future work.+ −
We give an informal justification + −
why the correctness and cubic size bound proofs+ −
can be achieved+ −
by exploring the connection between the internal+ −
data structure of our $\blexerStrong$ and+ −
Animirov's partial derivatives.+ −
+ −
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.+ −
For example, the regular expression + −
\[+ −
aa \cdot a^*+ a \cdot a^* + aa\cdot a^*+ −
\]+ −
contains three terms, + −
expressing three possibilities for how it can match some more+ −
input of the form $a \ldots a$.+ −
The first and the third terms are identical, which means we can eliminate+ −
the latter as it will not contribute to a POSIX value.+ −
In $\bsimps$, the $\distinctBy$ function takes care of + −
such instances.+ −
The criteria $\distinctBy$ uses for removing a duplicate+ −
$a_2$ in the list+ −
\begin{center}+ −
$rs_a@[a_1]@rs_b@[a_2]@rs_c$+ −
\end{center}+ −
is that + −
the two erased regular expressions are equal+ −
\begin{center}+ −
$\rerase{a_1} = \rerase{a_2}$.+ −
\end{center}+ −
This is characterised as the $LD$ + −
rewrite rule in figure \ref{rrewriteRules}.+ −
The problem, however, is that identical components+ −
in two slightly different regular expressions cannot be removed+ −
by the $LD$ rule.+ −
Consider the stronger simplification+ −
\begin{equation}+ −
\label{eqn:partialDedup}+ −
(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1+ −
\end{equation}+ −
where the $(\underline{a}+c+e)\cdot r_1$ is deleted in the right alternative+ −
$a+c+e$.+ −
This is permissible because we have $(a+\ldots)\cdot r_1$ in the left+ −
alternative.+ −
The difficulty is that such ``buried''+ −
alternatives-sequences are not easily recognised.+ −
But simplification like this actually+ −
cannot be omitted, if we want to have a better bound.+ −
For example, the size of derivatives can still+ −
blow up even with our $\textit{bsimp}$ + −
function: + −
consider again the example+ −
$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,+ −
and set $n$ to a relatively small number like $n=5$, then we get the following+ −
exponential growth:+ −
\begin{figure}[H]+ −
\centering+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
%xlabel={$n$},+ −
myplotstyle,+ −
xlabel={input length},+ −
ylabel={size},+ −
]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
\caption{Size of derivatives of $\blexersimp$ from chapter 5 for matching + −
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ + −
with strings + −
of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}+ −
\end{figure}+ −
\noindent+ −
One possible approach would be to apply the rewriting+ −
rule+ −
\[+ −
(a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1+ −
\]+ −
\noindent+ −
which pushes the sequence into the alternatives+ −
in our $\simp$ function.+ −
This would then make the simplification shown in \eqref{eqn:partialDedup} possible.+ −
Translating this rule into our $\textit{bsimp}$ function would simply+ −
involve adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:+ −
\begin{center}+ −
\begin{tabular}{@{}lcl@{}}+ −
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\+ −
&& $\ldots$\\+ −
&&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum (+ −
\map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\+ −
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
Unfortunately,+ −
if we introduce this clause in our+ −
setting we would lose the POSIX property of our calculated values. + −
For example given the regular expression + −
\begin{center}+ −
$(a + ab)(bc + c)$+ −
\end{center}+ −
and the string $ab$,+ −
then our algorithm generates the following+ −
correct POSIX value+ −
\begin{center}+ −
$\Seq \; (\Right \; ab) \; (\Right \; c)$.+ −
\end{center}+ −
Essentially it matches the string with the longer Right-alternative+ −
in the first sequence (and+ −
then the 'rest' with the character regular expression $c$ from the second sequence).+ −
If we add the simplification above, however, + −
then we would obtain the following value+ −
\begin{center}+ −
$\Left \; (\Seq \; a \; (\Left \; bc))$+ −
\end{center}+ −
where the $\Left$-alternatives get priority. + −
This violates the POSIX rules.+ −
The reason for getting this undesired value+ −
is that the new rule splits this regular expression up into + −
a topmost alternative+ −
\begin{center}+ −
$a\cdot(b c + c) + ab \cdot (bc + c)$,+ −
\end{center}+ −
which is a regular expression with a + −
quite different meaning: the original regular expression+ −
is a sequence, but the simplified regular expression is an alternative.+ −
With an alternative the maximal munch rule no longer works.+ −
+ −
+ −
A method to reconcile this problem is to do the + −
transformation in \eqref{eqn:partialDedup} ``non-invasively'',+ −
meaning that we traverse the list of regular expressions+ −
%\begin{center}+ −
% $rs_a@[a]@rs_c$+ −
%\end{center}+ −
inside alternatives+ −
\begin{center}+ −
$\sum ( rs_a@[a]@rs_c)$+ −
\end{center}+ −
using a function similar to $\distinctBy$,+ −
but this time + −
we allow the following more general rewrite rule:+ −
\begin{equation}+ −
\label{eqn:cubicRule}+ −
%\mbox{+ −
%\begin{mathpar} + −
\inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c + −
\stackrel{s}{\rightsquigarrow }+ −
rs_a@[\textit{prune} \; a \; rs_a]@rs_c }+ −
%\end{mathpar}+ −
%\caption{The rule capturing the pruning simplification needed to achieve+ −
%a cubic bound}+ −
\end{equation}+ −
\noindent+ −
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)+ −
where $\textit{prune} \;a \; acc$ traverses $a$+ −
without altering the structure of $a$, but removing components in $a$+ −
that have appeared in the accumulator $acc$.+ −
For example+ −
\begin{center}+ −
$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $+ −
\end{center}+ −
should be equal to + −
\begin{center}+ −
$(r_g+r_h)r_d$+ −
\end{center}+ −
because $r_gr_d$ and + −
$r_hr_d$ are the only terms+ −
that do not appeared in the accumulator list + −
\begin{center}+ −
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.+ −
\end{center}+ −
We implemented the+ −
function $\textit{prune}$ in Scala (see figure \ref{fig:pruneFunc})+ −
The function $\textit{prune}$ + −
is a stronger version of $\textit{distinctBy}$.+ −
It does not just walk through a list looking for exact duplicates,+ −
but prunes sub-expressions recursively.+ −
It manages proper contexts by the helper functions+ −
$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$.+ −
\begin{figure}%[H]+ −
\begin{lstlisting}[numbers=left]+ −
def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{+ −
case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match+ −
{+ −
//all components have been removed, meaning this is effectively a duplicate+ −
//flats will take care of removing this AZERO + −
case Nil => AZERO+ −
case r::Nil => fuse(bs, r)+ −
case rs1 => AALTS(bs, rs1)+ −
}+ −
case ASEQ(bs, r1, r2) => + −
//remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1+ −
prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {+ −
//after pruning, returns 0+ −
case AZERO => AZERO+ −
//after pruning, got r1'.r2, where r1' is equal to 1+ −
case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)+ −
//assemble the pruned head r1p with r2+ −
case r1p => ASEQ(bs, r1p, r2)+ −
}+ −
//this does the duplicate component removal task+ −
case r => if(acc(erase(r))) AZERO else r+ −
}+ −
\end{lstlisting}+ −
\caption{The function $\textit{prune}$ is called recursively in the + −
alternative case (line 2) and in the sequence case (line 12).+ −
In the alternative case we keep all the accumulators the same, but+ −
in the sequence case we are making necessary changes to the accumulators + −
to allow correct de-duplication.}\label{fig:pruneFunc}+ −
\end{figure}+ −
\noindent+ −
+ −
\begin{figure}+ −
\begin{lstlisting}[numbers=left]+ −
def atMostEmpty(r: Rexp) : Boolean = r match {+ −
case ZERO => true+ −
case ONE => true+ −
case STAR(r) => atMostEmpty(r)+ −
case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)+ −
case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)+ −
case CHAR(_) => false+ −
}+ −
+ −
def isOne(r: Rexp) : Boolean = r match {+ −
case ONE => true+ −
case SEQ(r1, r2) => isOne(r1) && isOne(r2)+ −
case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))+ −
case STAR(r0) => atMostEmpty(r0)+ −
case CHAR(c) => false+ −
case ZERO => false+ −
}+ −
+ −
def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = + −
if (r == tail)+ −
ONE+ −
else {+ −
r match {+ −
case SEQ(r1, r2) => + −
if(r2 == tail) r1 else ZERO+ −
case r => ZERO+ −
}+ −
}+ −
+ −
+ −
\end{lstlisting}+ −
\caption{The helper functions of $\textit{prune}$: + −
$\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$. $\textit{atMostEmpty}+ −
$ is a function that takes a regular expression and returns true only in case that it+ −
contains nothing more than the empty string in its language. $\textit{isOne}$ tests whether+ −
a regular expression is equivalent to $\ONE$. $\textit{removeSeqTail}$ is a function that+ −
takes away the tail of a sequence regular expression.}+ −
\end{figure}+ −
\noindent+ −
Suppose we feed + −
\begin{center}+ −
$r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$+ −
\end{center}+ −
and + −
\begin{center}+ −
$acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$+ −
\end{center}+ −
as the input into $\textit{prune}$.+ −
The end result will be+ −
\[+ −
b\cdot(g\cdot(a\cdot(d\cdot e)))+ −
\]+ −
where the underlined components in $r$ are eliminated.+ −
Looking more closely, at the topmost call + −
\[+ −
\textit{prune} \quad (\ONE++ −
(f+b)\cdot g)\cdot (a\cdot(d\cdot e)) \quad+ −
\{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}+ −
\]+ −
The sequence clause will be called,+ −
where a sub-call+ −
\[+ −
\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}+ −
\]+ −
is made. The terms in the new accumulator $\{\ONE,\; f\cdot g \}$ come from+ −
the two calls to $\textit{removeSeqTail}$:+ −
\[+ −
\textit{removeSeqTail} \quad\;\; a \cdot(d\cdot e) \quad\;\; a \cdot(d\cdot e) + −
\]+ −
and+ −
\[+ −
\textit{removeSeqTail} \quad \;\; + −
f\cdot(g\cdot (a \cdot(d\cdot e)))\quad \;\; a \cdot(d\cdot e).+ −
\]+ −
+ −
The idea behind $\textit{removeSeqTail}$ is that+ −
when pruning recursively, we need to ``zoom in''+ −
to sub-expressions, and this ``zoom in'' needs to be performed+ −
on the+ −
accumulators as well, otherwise the deletion will not work.+ −
The sub-call + −
$\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$+ −
is simpler, which will trigger the alternative clause, causing+ −
a pruning on each element in $(\ONE+(f+b)\cdot g)$,+ −
leaving us with $b\cdot g$ only.+ −
+ −
Our new lexer with stronger simplification+ −
uses $\textit{prune}$ by making it + −
the core component of the deduplicating function+ −
called $\textit{distinctWith}$.+ −
$\textit{DistinctWith}$ ensures that all verbose+ −
parts of a regular expression are pruned away.+ −
+ −
\begin{figure}[H]+ −
\begin{lstlisting}[numbers=left]+ −
def turnIntoTerms(r: Rexp): List[Rexp] = r match {+ −
case SEQ(r1, r2) => + −
turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2))+ −
case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)+ −
case ZERO => Nil+ −
case _ => r :: Nil+ −
}+ −
+ −
def distinctWith(rs: List[ARexp], + −
pruneFunction: (ARexp, Set[Rexp]) => ARexp, + −
acc: Set[Rexp] = Set()) : List[ARexp] = + −
rs match{+ −
case Nil => Nil+ −
case r :: rs => + −
if(acc(erase(r)))+ −
distinctWith(rs, pruneFunction, acc)+ −
else {+ −
val pruned_r = pruneFunction(r, acc)+ −
pruned_r :: + −
distinctWith(rs, + −
pruneFunction, + −
turnIntoTerms(erase(pruned_r)) ++: acc+ −
)+ −
}+ −
}+ −
+ −
\end{lstlisting}+ −
\caption{A Stronger Version of $\textit{distinctBy}$. This function allows ``partial de-duplication''+ −
of a regular expression. When part of a regular expression has appeared before in the accumulator, we + −
can remove that verbose component.}+ −
\end{figure}+ −
\noindent+ −
Once a regular expression has been pruned,+ −
all its components will be added to the accumulator+ −
to remove any future regular expressions' duplicate components.+ −
+ −
The function $\textit{bsimpStrong}$+ −
is very much the same as $\textit{bsimp}$, just with+ −
$\textit{distinctBy}$ replaced + −
by $\textit{distinctWith}$.+ −
\begin{figure}[H]+ −
\begin{lstlisting}+ −
+ −
def bsimpStrong(r: ARexp): ARexp =+ −
{+ −
r match {+ −
case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {+ −
case (AZERO, _) => AZERO+ −
case (_, AZERO) => AZERO+ −
case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)+ −
case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil+ −
case (r1s, r2s) => ASEQ(bs1, r1s, r2s)+ −
}+ −
case AALTS(bs1, rs) => {+ −
distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {+ −
case Nil => AZERO+ −
case s :: Nil => fuse(bs1, s)+ −
case rs => AALTS(bs1, rs)+ −
}+ −
}+ −
case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)+ −
case r => r+ −
}+ −
}+ −
def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {+ −
case Nil => r+ −
case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))+ −
}+ −
+ −
\end{lstlisting}+ −
\caption{The function+ −
$\textit{bsimpStrong}$: a stronger version of $\textit{bsimp}$}+ −
\end{figure}+ −
\noindent+ −
The benefits of using + −
$\textit{prune}$ refining the finiteness bound+ −
to a cubic bound has not been formalised yet.+ −
Therefore we choose to use Scala code rather than an Isabelle-style formal + −
definition like we did for $\textit{bsimp}$, as the definitions might change+ −
to suit our proof needs.+ −
%In the rest of the chapter we will use this convention consistently.+ −
+ −
%The function $\textit{prune}$ is used in $\distinctWith$.+ −
%$\distinctWith$ is a stronger version of $\distinctBy$+ −
%which not only removes duplicates as $\distinctBy$ would+ −
%do, but also uses the $\textit{pruneFunction}$+ −
%argument to prune away verbose components in a regular expression.\\+ −
%\begin{figure}[H]+ −
%\begin{lstlisting}+ −
% //a stronger version of simp+ −
% def bsimpStrong(r: ARexp): ARexp =+ −
% {+ −
% r match {+ −
% case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {+ −
% //normal clauses same as simp+ −
% case (AZERO, _) => AZERO+ −
% case (_, AZERO) => AZERO+ −
% case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)+ −
% //bs2 can be discarded+ −
% case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil+ −
% case (r1s, r2s) => ASEQ(bs1, r1s, r2s)+ −
% }+ −
% case AALTS(bs1, rs) => {+ −
% //distinctBy(flat_res, erase)+ −
% distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {+ −
% case Nil => AZERO+ −
% case s :: Nil => fuse(bs1, s)+ −
% case rs => AALTS(bs1, rs)+ −
% }+ −
% }+ −
% //stars that can be treated as 1+ −
% case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)+ −
% case r => r+ −
% }+ −
% }+ −
% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {+ −
% case Nil => r+ −
% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))+ −
% }+ −
%\end{lstlisting}+ −
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}+ −
%\end{figure}+ −
%\noindent+ −
%$\distinctWith$, is in turn used in $\bsimpStrong$:+ −
%\begin{figure}[H]+ −
%\begin{lstlisting}+ −
% //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)+ −
% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {+ −
% case Nil => r+ −
% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))+ −
% }+ −
%\end{lstlisting}+ −
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}+ −
%\end{figure}+ −
%\noindent+ −
We conjecture that the above Scala function $\bdersStrongs$,+ −
written $\bdersStrong{\_}{\_}$ as an infix notation, + −
satisfies the following property:+ −
\begin{conjecture}+ −
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$+ −
\end{conjecture}+ −
\noindent+ −
The stronger version of $\blexersimp$'s+ −
code in Scala looks like: + −
\begin{figure}[H]+ −
\begin{lstlisting}+ −
def strongBlexer(r: Rexp, s: String) : Option[Val] = {+ −
Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)+ −
}+ −
def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {+ −
case Nil => {+ −
if (bnullable(r)) {+ −
mkepsBC(r)+ −
}+ −
else+ −
throw new Exception("Not matched")+ −
}+ −
case c::cs => {+ −
strong_blex_simp(strongBsimp(bder(c, r)), cs)+ −
}+ −
}+ −
\end{lstlisting}+ −
\end{figure}+ −
\noindent+ −
We call this lexer $\blexerStrong$.+ −
This version is able to reduce the+ −
size of the derivatives which + −
otherwise + −
triggered exponential behaviour in+ −
$\blexersimp$.+ −
Consider again the runtime for matching + −
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings + −
of the form $\protect\underbrace{aa..a}_{n}$.+ −
They produce the folloiwng graphs ($\blexerStrong$ on the left-hand-side and+ −
$\blexersimp$ on the right-hand-side).+ −
\begin{figure}[H]+ −
\centering+ −
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
%xlabel={$n$},+ −
myplotstyle,+ −
xlabel={input length},+ −
ylabel={size},+ −
width = 7cm,+ −
height = 5cm,+ −
]+ −
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};+ −
\end{axis}+ −
\end{tikzpicture}+ −
&+ −
\begin{tikzpicture}+ −
\begin{axis}[+ −
%xlabel={$n$},+ −
myplotstyle,+ −
xlabel={input length},+ −
ylabel={size},+ −
width = 7cm,+ −
height = 5cm,+ −
]+ −
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};+ −
\end{axis}+ −
\end{tikzpicture}\\+ −
\multicolumn{2}{l}{}+ −
\end{tabular} + −
\caption{}\label{fig:aaaaaStarStar}+ −
\end{figure}+ −
\noindent+ −
We hope the correctness is preserved.+ −
%We would like to preserve the correctness like the one + −
%we had for $\blexersimp$:+ −
The proof idea is to preserve the key lemma in chapter \ref{Bitcoded2}+ −
such as in equation \eqref{eqn:cubicRule}.+ −
\begin{conjecture}\label{cubicConjecture}+ −
$\blexerStrong \;r \; s = \blexer\; r\;s$+ −
\end{conjecture}+ −
\noindent+ −
The idea is to maintain key lemmas in+ −
chapter \ref{Bitcoded2} like+ −
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$+ −
with the new rewriting rule + −
shown in figure \eqref{eqn:cubicRule} .+ −
+ −
In the next sub-section,+ −
we will describe why we + −
believe a cubic size bound can be achieved with+ −
the stronger simplification.+ −
For this we give a short introduction to the+ −
partial derivatives,+ −
which were invented by Antimirov \cite{Antimirov95},+ −
and then link them with the result of the function+ −
$\bdersStrongs$.+ −
+ −
\subsection{Antimirov's partial derivatives}+ −
Partial derivatives were first introduced by + −
Antimirov \cite{Antimirov95}.+ −
They are very similar+ −
to Brzozowski derivatives,+ −
but split children of alternative regular expressions into + −
multiple independent terms. This means the output of+ −
partial derivatives is a + −
set of regular expressions, defined as follows+ −
\begin{center} + −
\begin{tabular}{lcl@{\hspace{-5mm}}l}+ −
$\partial_x \; (r_1 \cdot r_2)$ & + −
$\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup+ −
\partial_x \; r_2 \;$ & $ \textit{if} \; \; \nullable\; r_1$\\+ −
& & $(\partial_x \; r_1)\cdot r_2 \quad\quad$ & $ + −
\textit{otherwise}$\\ + −
$\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\+ −
$\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; + −
\textit{then} \;+ −
\{ \ONE\} \;\;\textit{else} \; \varnothing$\\+ −
$\partial_x(r_1+r_2)$ & $=$ & $\partial_x(r_1) \cup \partial_x(r_2)$\\+ −
$\partial_x(\ONE)$ & $=$ & $\varnothing$\\+ −
$\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
The $\cdot$ in the example + −
$(\partial_x \; r_1) \cdot r_2 $ + −
is a shorthand notation for the cartesian product + −
$(\partial_x \; r_1) \times \{ r_2\}$.+ −
%Each element in the set generated by a partial derivative+ −
%corresponds to a (potentially partial) match+ −
%TODO: define derivatives w.r.t string s+ −
Rather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ together+ −
using the $\sum$ constructor, Antimirov put them into+ −
a set. This means many subterms will be de-duplicated+ −
because they are sets.+ −
For example, to compute what the derivative of the regular expression + −
$x^*(xx + y)^*$+ −
w.r.t. $x$ is, one can compute a partial derivative+ −
and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$+ −
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.+ −
+ −
The partial derivative w.r.t. a string is defined recursively:+ −
\[+ −
\partial_{c::cs} r \dn \bigcup_{r'\in (\partial_c r)}+ −
\partial_{cs} r'+ −
\]+ −
Suppose an alphabet $\Sigma$, we use $\Sigma^*$ for the set of all possible strings+ −
from the alphabet. + −
The set of all possible partial derivatives is then defined+ −
as the union of derivatives w.r.t all the strings: + −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$+ −
\end{tabular}+ −
\end{center}+ −
\noindent+ −
Consider now again our pathological case where we apply the more+ −
aggressive + −
simplification+ −
\begin{center}+ −
$((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$+ −
\end{center}+ −
let use abbreviate theis regular expression with $r$,+ −
then we have that+ −
\begin{center}+ −
$\textit{PDER}_{\Sigma^*} \; r = + −
\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ + −
(\underbrace{a \ldots a}_{\text{j a's}}\cdot+ −
(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$, + −
\end{center}+ −
The union on the right-hand-side has $n * (n + 1) / 2$ terms.+ −
This leads us to believe that the maximum number of terms needed+ −
in our derivative is also only $n * (n + 1) / 2$.+ −
Therefore+ −
we conjecture that $\bsimpStrong$ is also able to achieve this+ −
upper limit in general.+ −
\begin{conjecture}\label{bsimpStrongInclusionPder}+ −
Using a suitable transformation $f$, we have that+ −
\begin{center}+ −
$\forall s.\; f \; (r \bdersStrong \; s) \subseteq+ −
\textit{PDER}_{\Sigma^*} \; r$+ −
\end{center}+ −
holds.+ −
\end{conjecture}+ −
\noindent+ −
The reason is that our \eqref{eqn:cubicRule} will keep only one copy of each term,+ −
where the function $\textit{prune}$ takes care of maintaining+ −
a set like structure similar to partial derivatives.+ −
%We might need to adjust $\textit{prune}$+ −
%slightly to make sure all duplicate terms are eliminated,+ −
%which should be doable.+ −
+ −
Antimirov had proven that the sum of all the partial derivative + −
terms' sizes is bounded by the cubic of the size of that regular+ −
expression:+ −
\begin{property}\label{pderBound}+ −
$\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \leq O(\llbracket r \rrbracket^3)$+ −
\end{property}+ −
\noindent+ −
This property was formalised by Wu et al. \cite{Wu2014}, and the + −
details can be found in the Archive of Formal Froofs\footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html}.+ −
Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound}+ −
would provide us with a cubic bound for our $\blexerStrong$ algorithm: + −
\begin{conjecture}\label{strongCubic}+ −
$\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$+ −
\end{conjecture}+ −
\noindent+ −
We leave this as future work.+ −
+ −
+ −
%To get all the "atomic" components of a regular expression's possible derivatives,+ −
%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes+ −
%whatever character is available at the head of the string inside the language of a+ −
%regular expression, and gives back the character and the derivative regular expression+ −
%as a pair (which he called "monomial"):+ −
% \begin{center} + −
% \begin{tabular}{ccc}+ −
% $\lf(\ONE)$ & $=$ & $\phi$\\+ −
%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\+ −
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\+ −
% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\+ −
%\end{tabular}+ −
%\end{center}+ −
%%TODO: completion+ −
%+ −
%There is a slight difference in the last three clauses compared+ −
%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular + −
%expression $r$ with every element inside $\textit{rset}$ to create a set of + −
%sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates + −
%on a set of monomials (which Antimirov called "linear form") and a regular + −
%expression, and returns a linear form:+ −
% \begin{center} + −
% \begin{tabular}{ccc}+ −
% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\+ −
% $l \bigodot (\ONE)$ & $=$ & $l$\\+ −
% $\phi \bigodot t$ & $=$ & $\phi$\\+ −
% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\+ −
% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\+ −
% $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\+ −
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\+ −
% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\+ −
%\end{tabular}+ −
%\end{center}+ −
%%TODO: completion+ −
%+ −
% Some degree of simplification is applied when doing $\bigodot$, for example,+ −
% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,+ −
% and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and+ −
% $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,+ −
% and so on.+ −
% + −
% With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with + −
% an iterative procedure:+ −
% \begin{center} + −
% \begin{tabular}{llll}+ −
%$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\+ −
% & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\+ −
% & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\+ −
%$\partial_{UNIV}(r)$ & $=$ & $\PD$ & + −
%\end{tabular}+ −
%\end{center}+ −
%+ −
%+ −
% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,+ −
+ −
+ −
+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION 2+ −
%----------------------------------------------------------------------------------------+ −
+ −
+ −
%The closed form for them looks like:+ −
%%\begin{center}+ −
%% \begin{tabular}{llrclll}+ −
%% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\+ −
%% $\textit{rsimp}$ & $($ & $+ −
%% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\+ −
%% & & & & $\textit{nupdates} \;$ & + −
%% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\+ −
%% & & & & $)$ &\\+ −
%% & & $)$ & & &\\+ −
%% & $)$ & & & &\\+ −
%% \end{tabular}+ −
%%\end{center}+ −
%\begin{center}+ −
% \begin{tabular}{llrcllrllll}+ −
% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\+ −
% &&&&$\textit{rsimp}$ & $($ & $+ −
% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\+ −
% &&&& & & & & $\;\; \textit{nupdates} \;$ & + −
% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\+ −
% &&&& & & & & $)$ &\\+ −
% &&&& & & $)$ & & &\\+ −
% &&&& & $)$ & & & &\\+ −
% \end{tabular}+ −
%\end{center}+ −
%The $\textit{optermsimp}$ function with the argument $r$ + −
%chooses from two options: $\ZERO$ or + −
%We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$+ −
%and $\starupdates$:+ −
%\begin{center}+ −
% \begin{tabular}{lcl}+ −
% $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\+ −
% $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\+ −
% & & $\textit{if} \; + −
% (\rnullable \; (\rders \; r \; s))$ \\+ −
% & & $\textit{then} \;\; (s @ [c]) :: [c] :: (+ −
% \starupdate \; c \; r \; Ss)$ \\+ −
% & & $\textit{else} \;\; (s @ [c]) :: (+ −
% \starupdate \; c \; r \; Ss)$+ −
% \end{tabular}+ −
%\end{center}+ −
%\noindent+ −
%As a generalisation from characters to strings,+ −
%$\starupdates$ takes a string instead of a character+ −
%as the first input argument, and is otherwise the same+ −
%as $\starupdate$.+ −
%\begin{center}+ −
% \begin{tabular}{lcl}+ −
% $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\+ −
% $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (+ −
% \starupdate \; c \; r \; Ss)$+ −
% \end{tabular}+ −
%\end{center}+ −
%\noindent+ −
+ −
+ −
+ −
%\section{Zippers}+ −
%Zipper is a data structure designed to operate on + −
%and navigate between local parts of a tree.+ −
%It was first formally described by Huet \cite{HuetZipper}.+ −
%Typical applications of zippers involve text editor buffers+ −
%and proof system databases.+ −
%In our setting, the idea is to compactify the representation+ −
%of derivatives with zippers, thereby making our algorithm faster.+ −
%Some initial results + −
%We first give a brief introduction to what zippers are,+ −
%and other works+ −
%that apply zippers to derivatives+ −
%When dealing with large trees, it would be a waste to + −
%traverse the entire tree if+ −
%the operation only+ −
%involves a small fraction of it.+ −
%The idea is to put the focus on that subtree, turning other parts+ −
%of the tree into a context+ −
%+ −
%+ −
%One observation about our derivative-based lexing algorithm is that+ −
%the derivative operation sometimes traverses the entire regular expression+ −
%unnecessarily:+ −
+ −
+ −
%----------------------------------------------------------------------------------------+ −
% SECTION 1+ −
%----------------------------------------------------------------------------------------+ −
+ −
%\section{Adding Support for the Negation Construct, and its Correctness Proof}+ −
%We now add support for the negation regular expression:+ −
%\[ r ::= \ZERO \mid \ONE+ −
% \mid c + −
% \mid r_1 \cdot r_2+ −
% \mid r_1 + r_2 + −
% \mid r^* + −
% \mid \sim r+ −
%\]+ −
%The $\textit{nullable}$ function's clause for it would be + −
%\[+ −
%\textit{nullable}(~r) = \neg \nullable(r)+ −
%\]+ −
%The derivative would be+ −
%\[+ −
%~r \backslash c = ~ (r \backslash c)+ −
%\]+ −
% + −
%The most tricky part of lexing for the $~r$ regular expression+ −
% is creating a value for it.+ −
% For other regular expressions, the value aligns with the + −
% structure of the regular expression:+ −
% \[+ −
% \vdash \Seq(\Char(a), \Char(b)) : a \cdot b+ −
% \]+ −
%But for the $~r$ regular expression, $s$ is a member of it if and only if+ −
%$s$ does not belong to $L(r)$. + −
%That means when there+ −
%is a match for the not regular expression, it is not possible to generate how the string $s$ matched+ −
%with $r$. + −
%What we can do is preserve the information of how $s$ was not matched by $r$,+ −
%and there are a number of options to do this.+ −
%+ −
%We could give a partial value when there is a partial match for the regular expression inside+ −
%the $\mathbf{not}$ construct.+ −
%For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,+ −
%A value for it could be + −
% \[+ −
% \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)+ −
% \]+ −
% The above example demonstrates what value to construct + −
% when the string $s$ is at most a real prefix+ −
% of the strings in $L(r)$. When $s$ instead is not a prefix of any strings+ −
% in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$+ −
% constructor.+ −
% + −
% Another option would be to either store the string $s$ that resulted in + −
% a mis-match for $r$ or a dummy value as a placeholder:+ −
% \[+ −
% \vdash \textit{Not}(abcd) : ~( r_1 )+ −
% \]+ −
%or+ −
% \[+ −
% \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )+ −
% \] + −
% We choose to implement this as it is most straightforward:+ −
% \[+ −
% \mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})+ −
% \]+ −
% + −
%+ −
%\begin{center}+ −
% \begin{tabular}{lcl}+ −
% $\ntset \; r \; (n+1) \; c::cs $ & $\dn$ & $\nupdates \;+ −
% cs \; r \; [\Some \; ([c], n)]$\\+ −
% $\ntset \; r\; 0 \; \_$ & $\dn$ & $\None$\\+ −
% $\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\+ −
% \end{tabular}+ −
%\end{center}+ −
+ −