% Chapter Template%We also present the idempotency property proof%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.%This reinforces our claim that the fixpoint construction%originally required by Sulzmann and Lu can be removed in $\blexersimp$.%Last but not least, we present our efforts and challenges we met%in further improving the algorithm by data%structures such as zippers.%----------------------------------------------------------------------------------------% SECTION strongsimp%----------------------------------------------------------------------------------------%TODO: search for isabelle proofs of algorithms that check equivalence \chapter{A Better Size Bound for Derivatives} % Main chapter title\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound%in Chapter 4 to a polynomial one, and demonstrate how one can extend the%algorithm to include constructs such as bounded repetitions and negations.\lstset{style=myScalastyle}This chapter is a ``work-in-progress''chapter which recordsextensions to our $\blexersimp$.We make a conjecture that the finitesize bound from the previous chapter can be improved to a cubic bound.We implemented our conjecture in Scala.We intend to formalise this part in Isabelle/HOL at alater stage.%we have not been able to finish due to time constraints of the PhD.Nevertheless, we outline the ideas we intend to use for the proof.\section{A Stronger Version of Simplification}We present further improvementsfor our lexer algorithm $\blexersimp$.We devise a stronger simplification algorithm,called $\bsimpStrong$, which can prune awaysimilar components in two regular expressions at the same alternative level,even if these regular expressions are not exactly the same.We call the lexer that uses this stronger simplification function$\blexerStrong$.%Unfortunately we did not have time to %work out the proofs, like in%the previous chapters.We conjecture that both\begin{center} $\blexerStrong \;r \; s = \blexer\; r\;s$\end{center}and \begin{center} $\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$\end{center}hold.%but a formalisation%is still future work.We give an informal justification why the correctness and cubic size bound proofscan be achievedby exploring the connection between the internaldata structure of our $\blexerStrong$ andAnimirov's partial derivatives.In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.For example, the regular expression \[ aa \cdot a^*+ a \cdot a^* + aa\cdot a^*\]contains three terms, expressing three possibilities for how it can match some input.The first and the third terms are identical, which means we can eliminatethe latter as it will not contribute to a POSIX value.In $\bsimps$, the $\distinctBy$ function takes care of such instances.The criteria $\distinctBy$ uses for removing a duplicate$a_2$ in the list\begin{center} $rs_a@[a_1]@rs_b@[a_2]@rs_c$\end{center}is that the two erased regular expressions are equal\begin{center} $\rerase{a_1} = \rerase{a_2}$.\end{center}This is characterised as the $LD$ rewrite rule in figure \ref{rrewriteRules}.The problem, however, is that identical componentsin two slightly different regular expressions cannot be removedby the $LD$ rule.Consider the stronger simplification\begin{equation} \label{eqn:partialDedup} (a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1\end{equation}where the $(\underline{a}+c+e)\cdot r_1$ is deleted in the right alternative$a+c+e$.This is permissible because we have $(a+\ldots)\cdot r_1$ in the leftalternative.The difficulty is that such ``buried''alternatives-sequences are not easily recognised.But simplification like this actuallycannot be omitted, if we want to have a better bound.For example, the size of derivatives can stillblow up even with our $\textit{bsimp}$ function: consider again the example$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,and set $n$ to a relatively small number like $n=5$, then we get the followingexponential growth:\begin{figure}[H]\centering\begin{tikzpicture}\begin{axis}[ %xlabel={$n$}, myplotstyle, xlabel={input length}, ylabel={size}, ]\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};\end{axis}\end{tikzpicture}\caption{Size of derivatives of $\blexersimp$ from chapter 5 for matching $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}\end{figure}\noindentOne possible approach would be to apply the rewritingrule\[ (a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1\]\noindentwhich pushes the sequence into the alternativesin our $\simp$ function.This would then make the simplification shown in \eqref{eqn:partialDedup} possible.Translating this rule into our $\textit{bsimp}$ function would simplyinvolve adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:\begin{center} \begin{tabular}{@{}lcl@{}} $\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\ && $\ldots$\\ &&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum ( \map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ \end{tabular}\end{center}\noindentUnfortunately,if we introduce this clause in oursetting we would lose the POSIX property of our calculated values. For example given the regular expression \begin{center} $(a + ab)(bc + c)$\end{center}and the string $ab$,then our algorithm generates the followingcorrect POSIX value\begin{center} $\Seq \; (\Right \; ab) \; (\Right \; c)$.\end{center}Essentially it matches the string with the longer Right-alternativein the first sequence (andthen the 'rest' with the character regular expression $c$ from the second sequence).If we add the simplification above, however, then we would obtain the following value\begin{center} $\Left \; (\Seq \; a \; (\Left \; bc))$\end{center}where the $\Left$-alternatives get priority. This violates the POSIX rules.The reason for getting this undesired valueis that the new rule splits this regular expression up into a topmost alternative\begin{center} $a\cdot(b c + c) + ab \cdot (bc + c)$,\end{center}which is a regular expression with a quite different meaning: the original regular expressionis a sequence, but the simplified regular expression is an alternative.With an alternative the maximal munch rule no longer works.A method to reconcile this problem is to do the transformation in \eqref{eqn:partialDedup} ``non-invasively'',meaning that we traverse the list of regular expressions%\begin{center}% $rs_a@[a]@rs_c$%\end{center}inside alternatives\begin{center} $\sum ( rs_a@[a]@rs_c)$\end{center}using a function similar to $\distinctBy$,but this time we allow the following more general rewrite rule:\begin{equation}\label{eqn:cubicRule}%\mbox{%\begin{mathpar} \inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c \stackrel{s}{\rightsquigarrow } rs_a@[\textit{prune} \; a \; rs_a]@rs_c }%\end{mathpar}%\caption{The rule capturing the pruning simplification needed to achieve%a cubic bound}\end{equation}\noindent%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)where $\textit{prune} \;a \; acc$ traverses $a$without altering the structure of $a$, but removing components in $a$that have appeared in the accumulator $acc$.For example\begin{center} $\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $\end{center}should be equal to \begin{center} $(r_g+r_h)r_d$\end{center}because $r_gr_d$ and $r_hr_d$ are the only termsthat do not appeared in the accumulator list \begin{center}$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.\end{center}We implemented thefunction $\textit{prune}$ in Scala (see figure \ref{fig:pruneFunc})The function $\textit{prune}$ is a stronger version of $\textit{distinctBy}$.It does not just walk through a list looking for exact duplicates,but prunes sub-expressions recursively.It manages proper contexts by the helper functions$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$.\begin{figure}%[H]\begin{lstlisting}[numbers=left] def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{ case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match { //all components have been removed, meaning this is effectively a duplicate //flats will take care of removing this AZERO case Nil => AZERO case r::Nil => fuse(bs, r) case rs1 => AALTS(bs, rs1) } case ASEQ(bs, r1, r2) => //remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1 prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match { //after pruning, returns 0 case AZERO => AZERO //after pruning, got r1'.r2, where r1' is equal to 1 case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2) //assemble the pruned head r1p with r2 case r1p => ASEQ(bs, r1p, r2) } //this does the duplicate component removal task case r => if(acc(erase(r))) AZERO else r }\end{lstlisting}\caption{The function $\textit{prune}$ is called recursively in the alternative case (line 2) and in the sequence case (line 12).In the alternative case we keep all the accumulators the same, butin the sequence case we are making necessary changes to the accumulators to allow correct de-duplication.}\label{fig:pruneFunc}\end{figure}\noindent\begin{figure}\begin{lstlisting}[numbers=left] def atMostEmpty(r: Rexp) : Boolean = r match { case ZERO => true case ONE => true case STAR(r) => atMostEmpty(r) case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) case CHAR(_) => false } def isOne(r: Rexp) : Boolean = r match { case ONE => true case SEQ(r1, r2) => isOne(r1) && isOne(r2) case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2)) case STAR(r0) => atMostEmpty(r0) case CHAR(c) => false case ZERO => false } def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = if (r == tail) ONE else { r match { case SEQ(r1, r2) => if(r2 == tail) r1 else ZERO case r => ZERO } }\end{lstlisting}\caption{The helper functions of $\textit{prune}$: $\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$}\end{figure}\noindentSuppose we feed \begin{center} $r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$\end{center}and \begin{center} $acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$\end{center}as the input into $\textit{prune}$.The end result will be\[ b\cdot(g\cdot(a\cdot(d\cdot e)))\]where the underlined components in $r$ are eliminated.Looking more closely, at the topmost call \[ \textit{prune} \quad (\ONE+ (f+b)\cdot g)\cdot (a\cdot(d\cdot e)) \quad \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}\]The sequence clause will be called,where a sub-call\[ \textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}\]is made. The terms in the new accumulator $\{\ONE,\; f\cdot g \}$ come fromthe two calls to $\textit{removeSeqTail}$:\[ \textit{removeSeqTail} \quad\;\; a \cdot(d\cdot e) \quad\;\; a \cdot(d\cdot e) \]and\[ \textit{removeSeqTail} \quad \;\; f\cdot(g\cdot (a \cdot(d\cdot e)))\quad \;\; a \cdot(d\cdot e).\]The idea behind $\textit{removeSeqTail}$ is thatwhen pruning recursively, we need to ``zoom in''to sub-expressions, and this ``zoom in'' needs to be performedon theaccumulators as well, otherwise the deletion will not work.The sub-call $\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$is simpler, which will trigger the alternative clause, causinga pruning on each element in $(\ONE+(f+b)\cdot g)$,leaving us with $b\cdot g$ only.Our new lexer with stronger simplificationuses $\textit{prune}$ by making it the core component of the deduplicating functioncalled $\textit{distinctWith}$.$\textit{DistinctWith}$ ensures that all verboseparts of a regular expression are pruned away.\begin{figure}[H]\begin{lstlisting}[numbers=left] def turnIntoTerms(r: Rexp): List[Rexp] = r match { case SEQ(r1, r2) => turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2)) case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2) case ZERO => Nil case _ => r :: Nil } def distinctWith(rs: List[ARexp], pruneFunction: (ARexp, Set[Rexp]) => ARexp, acc: Set[Rexp] = Set()) : List[ARexp] = rs match{ case Nil => Nil case r :: rs => if(acc(erase(r))) distinctWith(rs, pruneFunction, acc) else { val pruned_r = pruneFunction(r, acc) pruned_r :: distinctWith(rs, pruneFunction, turnIntoTerms(erase(pruned_r)) ++: acc ) } }\end{lstlisting}\caption{A Stronger Version of $\textit{distinctBy}$ XXX}\end{figure}\noindentOnce a regular expression has been pruned,all its components will be added to the accumulatorto remove any future regular expressions' duplicate components.The function $\textit{bsimpStrong}$is very much the same as $\textit{bsimp}$, just with$\textit{distinctBy}$ replaced by $\textit{distinctWith}$.\begin{figure}[H]\begin{lstlisting} def bsimpStrong(r: ARexp): ARexp = { r match { case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match { case (AZERO, _) => AZERO case (_, AZERO) => AZERO case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil case (r1s, r2s) => ASEQ(bs1, r1s, r2s) } case AALTS(bs1, rs) => { distinctWith(flats(rs.map(bsimpStrong(_))), prune) match { case Nil => AZERO case s :: Nil => fuse(bs1, s) case rs => AALTS(bs1, rs) } } case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs) case r => r } } def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { case Nil => r case c::s => bdersStrong(s, bsimpStrong(bder(c, r))) }\end{lstlisting}\caption{The function$\textit{bsimpStrong}$: a stronger version of $\textit{bsimp}$}\end{figure}\noindentThe benefits of using $\textit{prune}$ refining the finiteness boundto a cubic bound has not been formalised yet.Therefore we choose to use Scala code rather than an Isabelle-style formal definition like we did for $\simp$, as the definitions might changeto suit our proof needs.In the rest of the chapter we will use this convention consistently.%The function $\textit{prune}$ is used in $\distinctWith$.%$\distinctWith$ is a stronger version of $\distinctBy$%which not only removes duplicates as $\distinctBy$ would%do, but also uses the $\textit{pruneFunction}$%argument to prune away verbose components in a regular expression.\\%\begin{figure}[H]%\begin{lstlisting}% //a stronger version of simp% def bsimpStrong(r: ARexp): ARexp =% {% r match {% case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {% //normal clauses same as simp% case (AZERO, _) => AZERO% case (_, AZERO) => AZERO% case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)% //bs2 can be discarded% case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil% case (r1s, r2s) => ASEQ(bs1, r1s, r2s)% }% case AALTS(bs1, rs) => {% //distinctBy(flat_res, erase)% distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {% case Nil => AZERO% case s :: Nil => fuse(bs1, s)% case rs => AALTS(bs1, rs)% }% }% //stars that can be treated as 1% case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)% case r => r% }% }% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {% case Nil => r% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))% }%\end{lstlisting}%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}%\end{figure}%\noindent%$\distinctWith$, is in turn used in $\bsimpStrong$:%\begin{figure}[H]%\begin{lstlisting}% //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {% case Nil => r% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))% }%\end{lstlisting}%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}%\end{figure}%\noindentWe conjecture that the above Scala function $\bdersStrongs$,written $\bdersStrong{\_}{\_}$ as an infix notation, satisfies the following property:\begin{conjecture} $\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$\end{conjecture}\noindentThe stronger version of $\blexersimp$'scode in Scala looks like: \begin{figure}[H]\begin{lstlisting} def strongBlexer(r: Rexp, s: String) : Option[Val] = { Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None) } def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match { case Nil => { if (bnullable(r)) { mkepsBC(r) } else throw new Exception("Not matched") } case c::cs => { strong_blex_simp(strongBsimp(bder(c, r)), cs) } }\end{lstlisting}\end{figure}\noindentWe call this lexer $\blexerStrong$.This version is able to reduce thesize of the derivatives which otherwise triggered exponential behaviour in$\blexersimp$.Consider again the runtime for matching $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings of the form $\protect\underbrace{aa..a}_{n}$.They produce the folloiwng graphs ($\blexerStrong$ on the left-hand-side and$\blexersimp$ on the right-hand-side).\begin{figure}[H]\centering\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}\begin{tikzpicture}\begin{axis}[ %xlabel={$n$}, myplotstyle, xlabel={input length}, ylabel={size}, width = 7cm, height = 5cm, ]\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};\end{axis}\end{tikzpicture} &\begin{tikzpicture}\begin{axis}[ %xlabel={$n$}, myplotstyle, xlabel={input length}, ylabel={size}, width = 7cm, height = 5cm, ]\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};\end{axis}\end{tikzpicture}\\\multicolumn{2}{l}{}\end{tabular} \caption{}\label{fig:aaaaaStarStar}\end{figure}\noindentWe hope the correctness is preserved.%We would like to preserve the correctness like the one %we had for $\blexersimp$:The proof idea is to preserve the key lemma in chapter \ref{Bitcoded2}such as in equation \eqref{eqn:cubicRule}.\begin{conjecture}\label{cubicConjecture} $\blexerStrong \;r \; s = \blexer\; r\;s$\end{conjecture}\noindentThe idea is to maintain key lemmas inchapter \ref{Bitcoded2} like$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$with the new rewriting rule shown in figure \eqref{eqn:cubicRule} .In the next sub-section,we will describe why we believe a cubic size bound can be achieved withthe stronger simplification.For this we give a short introduction to thepartial derivatives,which were invented by Antimirov \cite{Antimirov95},and then link them with the result of the function$\bdersStrongs$.\subsection{Antimirov's partial derivatives}Partial derivatives were first introduced by Antimirov \cite{Antimirov95}.They are very similarto Brzozowski derivatives,but split children of alternative regular expressions into multiple independent terms. This means the output ofpartial derivatives is a set of regular expressions, defined as follows\begin{center} \begin{tabular}{lcl@{\hspace{-5mm}}l} $\partial_x \; (r_1 \cdot r_2)$ & $\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup \partial_x \; r_2 \;$ & $ \textit{if} \; \; \nullable\; r_1$\\ & & $(\partial_x \; r_1)\cdot r_2 \quad\quad$ & $ \textit{otherwise}$\\ $\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\ $\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; \textit{then} \; \{ \ONE\} \;\;\textit{else} \; \varnothing$\\ $\partial_x(r_1+r_2)$ & $=$ & $\partial_x(r_1) \cup \partial_x(r_2)$\\ $\partial_x(\ONE)$ & $=$ & $\varnothing$\\ $\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\ \end{tabular}\end{center}\noindentThe $\cdot$ in the example $(\partial_x \; r_1) \cdot r_2 $ is a shorthand notation for the cartesian product $(\partial_x \; r_1) \times \{ r_2\}$.%Each element in the set generated by a partial derivative%corresponds to a (potentially partial) match%TODO: define derivatives w.r.t string sRather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ togetherusing the $\sum$ constructor, Antimirov put them intoa set. This means many subterms will be de-duplicatedbecause they are sets.For example, to compute what the derivative of the regular expression $x^*(xx + y)^*$w.r.t. $x$ is, one can compute a partial derivativeand get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.The partial derivative w.r.t. a string is defined recursively:\[ \partial_{c::cs} r \dn \bigcup_{r'\in (\partial_c r)} \partial_{cs} r'\]Suppose an alphabet $\Sigma$, we use $\Sigma^*$ for the set of all possible stringsfrom the alphabet. The set of all possible partial derivatives is then definedas the union of derivatives w.r.t all the strings: \begin{center} \begin{tabular}{lcl} $\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$ \end{tabular}\end{center}\noindentConsider now again our pathological case where we apply the moreaggressive simplification\begin{center} $((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$\end{center}let use abbreviate theis regular expression with $r$,then we have that\begin{center}$\textit{PDER}_{\Sigma^*} \; r = \bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ (\underbrace{a \ldots a}_{\text{j a's}}\cdot(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$, \end{center}The union on the right-hand-side has $n * (n + 1) / 2$ terms.This leads us to believe that the maximum number of terms neededin our derivative is also only $n * (n + 1) / 2$.Thereforewe conjecture that $\bsimpStrong$ is also able to achieve thisupper limit in general\begin{conjecture}\label{bsimpStrongInclusionPder} Using a suitable transformation $f$, we have that \begin{center} $\forall s.\; f \; (r \bdersStrong \; s) \subseteq \textit{PDER}_{\Sigma^*} \; r$ \end{center} holds.\end{conjecture}\noindentThe reason is that our \eqref{eqn:cubicRule} will keep only one copy of each term,where the function $\textit{prune}$ takes care of maintaininga set like structure similar to partial derivatives.%We might need to adjust $\textit{prune}$%slightly to make sure all duplicate terms are eliminated,%which should be doable.Antimirov had proven that the sum of all the partial derivative terms' sizes is bounded by the cubic of the size of that regularexpression:\begin{property}\label{pderBound} $\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \leq O(\llbracket r \rrbracket^3)$\end{property}\noindentThis property was formalised by Wu et al. \cite{Wu2014}, and the details can be found in the Archive of Formal Froofs\footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html}.Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound}would provide us with a cubic bound for our $\blexerStrong$ algorithm: \begin{conjecture}\label{strongCubic} $\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$\end{conjecture}\noindentWe leave this as future work.%To get all the "atomic" components of a regular expression's possible derivatives,%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes%whatever character is available at the head of the string inside the language of a%regular expression, and gives back the character and the derivative regular expression%as a pair (which he called "monomial"):% \begin{center} % \begin{tabular}{ccc}% $\lf(\ONE)$ & $=$ & $\phi$\\%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\%\end{tabular}%\end{center}%%TODO: completion%%There is a slight difference in the last three clauses compared%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular %expression $r$ with every element inside $\textit{rset}$ to create a set of %sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates %on a set of monomials (which Antimirov called "linear form") and a regular %expression, and returns a linear form:% \begin{center} % \begin{tabular}{ccc}% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\% $l \bigodot (\ONE)$ & $=$ & $l$\\% $\phi \bigodot t$ & $=$ & $\phi$\\% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\% $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\%\end{tabular}%\end{center}%%TODO: completion%% Some degree of simplification is applied when doing $\bigodot$, for example,% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,% and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and% $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,% and so on.% % With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with % an iterative procedure:% \begin{center} % \begin{tabular}{llll}%$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\% & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\% & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\%$\partial_{UNIV}(r)$ & $=$ & $\PD$ & %\end{tabular}%\end{center}%%% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,%----------------------------------------------------------------------------------------% SECTION 2%----------------------------------------------------------------------------------------%The closed form for them looks like:%%\begin{center}%% \begin{tabular}{llrclll}%% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\%% $\textit{rsimp}$ & $($ & $%% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\%% & & & & $\textit{nupdates} \;$ & %% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\%% & & & & $)$ &\\%% & & $)$ & & &\\%% & $)$ & & & &\\%% \end{tabular}%%\end{center}%\begin{center}% \begin{tabular}{llrcllrllll}% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\% &&&&$\textit{rsimp}$ & $($ & $% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\% &&&& & & & & $\;\; \textit{nupdates} \;$ & % $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\% &&&& & & & & $)$ &\\% &&&& & & $)$ & & &\\% &&&& & $)$ & & & &\\% \end{tabular}%\end{center}%The $\textit{optermsimp}$ function with the argument $r$ %chooses from two options: $\ZERO$ or %We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$%and $\starupdates$:%\begin{center}% \begin{tabular}{lcl}% $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\% $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\% & & $\textit{if} \; % (\rnullable \; (\rders \; r \; s))$ \\% & & $\textit{then} \;\; (s @ [c]) :: [c] :: (% \starupdate \; c \; r \; Ss)$ \\% & & $\textit{else} \;\; (s @ [c]) :: (% \starupdate \; c \; r \; Ss)$% \end{tabular}%\end{center}%\noindent%As a generalisation from characters to strings,%$\starupdates$ takes a string instead of a character%as the first input argument, and is otherwise the same%as $\starupdate$.%\begin{center}% \begin{tabular}{lcl}% $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\% $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (% \starupdate \; c \; r \; Ss)$% \end{tabular}%\end{center}%\noindent%\section{Zippers}%Zipper is a data structure designed to operate on %and navigate between local parts of a tree.%It was first formally described by Huet \cite{HuetZipper}.%Typical applications of zippers involve text editor buffers%and proof system databases.%In our setting, the idea is to compactify the representation%of derivatives with zippers, thereby making our algorithm faster.%Some initial results %We first give a brief introduction to what zippers are,%and other works%that apply zippers to derivatives%When dealing with large trees, it would be a waste to %traverse the entire tree if%the operation only%involves a small fraction of it.%The idea is to put the focus on that subtree, turning other parts%of the tree into a context%%%One observation about our derivative-based lexing algorithm is that%the derivative operation sometimes traverses the entire regular expression%unnecessarily:%----------------------------------------------------------------------------------------% SECTION 1%----------------------------------------------------------------------------------------%\section{Adding Support for the Negation Construct, and its Correctness Proof}%We now add support for the negation regular expression:%\[ r ::= \ZERO \mid \ONE% \mid c % \mid r_1 \cdot r_2% \mid r_1 + r_2 % \mid r^* % \mid \sim r%\]%The $\textit{nullable}$ function's clause for it would be %\[%\textit{nullable}(~r) = \neg \nullable(r)%\]%The derivative would be%\[%~r \backslash c = ~ (r \backslash c)%\]% %The most tricky part of lexing for the $~r$ regular expression% is creating a value for it.% For other regular expressions, the value aligns with the % structure of the regular expression:% \[% \vdash \Seq(\Char(a), \Char(b)) : a \cdot b% \]%But for the $~r$ regular expression, $s$ is a member of it if and only if%$s$ does not belong to $L(r)$. %That means when there%is a match for the not regular expression, it is not possible to generate how the string $s$ matched%with $r$. %What we can do is preserve the information of how $s$ was not matched by $r$,%and there are a number of options to do this.%%We could give a partial value when there is a partial match for the regular expression inside%the $\mathbf{not}$ construct.%For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,%A value for it could be % \[% \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)% \]% The above example demonstrates what value to construct % when the string $s$ is at most a real prefix% of the strings in $L(r)$. When $s$ instead is not a prefix of any strings% in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$% constructor.% % Another option would be to either store the string $s$ that resulted in % a mis-match for $r$ or a dummy value as a placeholder:% \[% \vdash \textit{Not}(abcd) : ~( r_1 )% \]%or% \[% \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )% \] % We choose to implement this as it is most straightforward:% \[% \mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})% \]% %%\begin{center}% \begin{tabular}{lcl}% $\ntset \; r \; (n+1) \; c::cs $ & $\dn$ & $\nupdates \;% cs \; r \; [\Some \; ([c], n)]$\\% $\ntset \; r\; 0 \; \_$ & $\dn$ & $\None$\\% $\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\% \end{tabular}%\end{center}