# HG changeset patch # User Chengsong # Date 1669479490 0 # Node ID 7af4e2420a8c3ab5c376b2444330485d29d148be # Parent 94db2636a296a5caad28c160eb12a2a9159ba60d ready to submit~~ diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/Chapters/Cubic.tex --- a/ChengsongTanPhdThesis/Chapters/Cubic.tex Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Cubic.tex Sat Nov 26 16:18:10 2022 +0000 @@ -16,7 +16,7 @@ Nevertheless, we outline the ideas we intend to use for the proof. We present further improvements -made to our lexer algorithm $\blexersimp$. +for our lexer algorithm $\blexersimp$. We devise a stronger simplification algorithm, called $\bsimpStrong$, which can prune away similar components in two regular expressions at the same @@ -24,6 +24,9 @@ even if these regular expressions are not exactly the same. We call the lexer that uses this stronger simplification function $\blexerStrong$. +Unfortunately we did not have time to +work out the proofs, like in +the previous chapters. We conjecture that both \begin{center} $\blexerStrong \;r \; s = \blexer\; r\;s$ @@ -32,10 +35,11 @@ \begin{center} $\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$ \end{center} -hold, but formalising -them is still work in progress. -We give reasons why the correctness and cubic size bound proofs -can be achieved, +hold, but a formalisation +is still future work. +We give an informal justification +why the correctness and cubic size bound proofs +can be achieved by exploring the connection between the internal data structure of our $\blexerStrong$ and Animirov's partial derivatives.\\ @@ -61,10 +65,11 @@ aa \cdot a^*+ a \cdot a^* + aa\cdot a^* \] contains three terms, -expressing three possibilities it will match future input. +expressing three possibilities for how it can match some input. The first and the third terms are identical, which means we can eliminate -the latter as we know it will not be picked up by $\bmkeps$. -In $\bsimps$, the $\distinctBy$ function takes care of this. +the latter as it will not contribute to a POSIX value. +In $\bsimps$, the $\distinctBy$ function takes care of +such instances. The criteria $\distinctBy$ uses for removing a duplicate $a_2$ in the list \begin{center} @@ -74,24 +79,29 @@ \begin{center} $\rerase{a_1} = \rerase{a_2}$. \end{center} -It can be characterised as the $LD$ -rewrite rule in \ref{rrewriteRules}.\\ +It is characterised as the $LD$ +rewrite rule in figure \ref{rrewriteRules}.\\ The problem , however, is that identical components -in two slightly different regular expressions cannot be removed: -\begin{figure}[H] -\[ +in two slightly different regular expressions cannot be removed. +Consider the simplification +\begin{equation} + \label{eqn:partialDedup} (a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1 -\] -\caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup} -\end{figure} -\noindent -A simplification like this actually +\end{equation} +where the $(a+\ldots)\cdot r_1$ is deleted in the right alternative. +This is permissible because we have $(a+\ldots)\cdot r_1$ in the left +alternative. +The difficulty is that such ``buried'' +alternatives-sequences are not easily recognised. +But simplification like this actually cannot be omitted, -as without it the size could blow up even with our $\textit{bsimp}$ -function: for the chapter \ref{Finite} example +as without it the size of derivatives can still +blow up even with our $\textit{bsimp}$ +function: +consider again the example $\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$, -by just setting n to a small number, -we get exponential growth that does not stop before it becomes huge: +and set $n$ to a relatively small number, +we get exponential growth: \begin{figure}[H] \centering \begin{tikzpicture} @@ -104,22 +114,20 @@ \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data}; \end{axis} \end{tikzpicture} -\caption{Runtime of $\blexersimp$ for matching +\caption{Size of derivatives of $\blexersimp$ for matching $\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp} \end{figure} \noindent -We would like to apply the rewriting at some stage -\begin{figure}[H] +One possible approach would be to apply the rewriting +rule \[ (a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1 \] -\caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp} -\end{figure} \noindent in our $\simp$ function, -so that it makes the simplification in \ref{partialDedup} possible. +so that it makes the simplification in \eqref{eqn:partialDedup} possible. Translating the rule into our $\textit{bsimp}$ function simply involves adding a new clause to the $\textit{bsimp}_{ASEQ}$ function: \begin{center} @@ -131,9 +139,7 @@ &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\ \end{tabular} \end{center} - - - +\noindent Unfortunately, if we introduce them in our setting we would lose the POSIX property of our calculated values. @@ -141,10 +147,7 @@ \begin{center} $(a + ab)(bc + c)$ \end{center} -and the string -\begin{center} - $ab$, -\end{center} +and the string $ab$, then our algorithm generates the following correct POSIX value \begin{center} @@ -165,11 +168,11 @@ $a\cdot(b c + c) + ab \cdot (bc + c)$, \end{center} which becomes a regular expression with a -totally different structure--the original +quite different structure--the original was a sequence, and now it becomes an alternative. -With an alternative the maximum munch rule no longer works.\\ +With an alternative the maximal munch rule no longer works.\\ A method to reconcile this is to do the -transformation in \ref{desiredSimp} ``non-invasively'', +transformation in \eqref{eqn:partialDedup} ``non-invasively'', meaning that we traverse the list of regular expressions \begin{center} $rs_a@[a]@rs_c$ @@ -181,11 +184,16 @@ using a function similar to $\distinctBy$, but this time we allow a more general list rewrite: -\begin{mathpar}\label{cubicRule} +\begin{figure}[H] +\begin{mathpar} \inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c \stackrel{s}{\rightsquigarrow } rs_a@[\textit{prune} \; a \; rs_a]@rs_c } \end{mathpar} +\caption{The rule capturing the pruning simplification needed to achieve +a cubic bound} +\label{fig:cubicRule} +\end{figure} %L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a) where $\textit{prune} \;a \; acc$ traverses $a$ without altering the structure of $a$, removing components in $a$ @@ -205,45 +213,11 @@ $[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$. \end{center} We implemented -function $\textit{prune}$ in Scala, -and incorporated into our lexer, -by replacing the $\simp$ function -with a stronger version called $\bsimpStrong$ -that prunes regular expressions. +function $\textit{prune}$ in Scala: \begin{figure}[H] - \begin{lstlisting} - def atMostEmpty(r: Rexp) : Boolean = r match { - case ZERO => true - case ONE => true - case STAR(r) => atMostEmpty(r) - case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) - case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) - case CHAR(_) => false - } - - - def isOne(r: Rexp) : Boolean = r match { - case ONE => true - case SEQ(r1, r2) => isOne(r1) && isOne(r2) - case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne) - case STAR(r0) => atMostEmpty(r0) - case CHAR(c) => false - case ZERO => false - } - - //r = r' ~ tail' : If tail' matches tail => returns r' - def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match { - case SEQ(r1, r2) => - if(r2 == tail) - r1 - else - ZERO - case r => ZERO - } - def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{ - case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match + case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match { //all components have been removed, meaning this is effectively a duplicate //flats will take care of removing this AZERO @@ -265,94 +239,250 @@ case r => if(acc(erase(r))) AZERO else r } \end{lstlisting} -\caption{pruning function together with its helper functions} +\caption{The function $\textit{prune}$ } \end{figure} \noindent -The benefits of using -$\textit{prune}$ such as refining the finiteness bound -to a cubic bound has not been formalised yet. -Therefore we choose to use Scala code rather than an Isabelle-style formal -definition like we did for $\simp$, as the definitions might change -to suit proof needs. -In the rest of the chapter we will use this convention consistently. +The function $\textit{prune}$ +is a stronger version of $\textit{distinctBy}$. +It does not just walk through a list looking for exact duplicates, +but prunes sub-expressions recursively. +It manages proper contexts by the helper functions +$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$. \begin{figure}[H] \begin{lstlisting} - def distinctWith(rs: List[ARexp], - pruneFunction: (ARexp, Set[Rexp]) => ARexp, - acc: Set[Rexp] = Set()) : List[ARexp] = - rs match{ - case Nil => Nil - case r :: rs => - if(acc(erase(r))) - distinctWith(rs, pruneFunction, acc) - else { - val pruned_r = pruneFunction(r, acc) - pruned_r :: - distinctWith(rs, - pruneFunction, - turnIntoTerms(erase(pruned_r)) ++: acc - ) - } + def atMostEmpty(r: Rexp) : Boolean = r match { + case ZERO => true + case ONE => true + case STAR(r) => atMostEmpty(r) + case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) + case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2) + case CHAR(_) => false + } + + def isOne(r: Rexp) : Boolean = r match { + case ONE => true + case SEQ(r1, r2) => isOne(r1) && isOne(r2) + case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2)) + case STAR(r0) => atMostEmpty(r0) + case CHAR(c) => false + case ZERO => false + } + + def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = + if (r == tail) + ONE + else { + r match { + case SEQ(r1, r2) => + if(r2 == tail) + r1 + else + ZERO + case r => ZERO + } } + + +\end{lstlisting} +\caption{The helper functions of $\textit{prune}$} +\end{figure} +\noindent +Suppose we feed +\begin{center} + $r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$ +\end{center} +and +\begin{center} + $acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$ +\end{center} +as the input for $\textit{prune}$. +The end result will be +\[ + b\cdot(g\cdot(a\cdot(d\cdot e))) +\] +where the underlined components in $r$ are eliminated. +Looking more closely, at the topmost call +\[ + \textit{prune} \quad (\ONE+ + (f+b)\cdot g)\cdot (a\cdot(d\cdot e)) \quad + \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \} +\] +The sequence clause will be called, +where a sub-call +\[ + \textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \} +\] +is made. The terms in the new accumulator $\{\ONE,\; f\cdot g \}$ come from +the two calls to $\textit{removeSeqTail}$: +\[ + \textit{removeSeqTail} \quad\;\; a \cdot(d\cdot e) \quad\;\; a \cdot(d\cdot e) +\] +and +\[ + \textit{removeSeqTail} \quad \;\; + f\cdot(g\cdot (a \cdot(d\cdot e)))\quad \;\; a \cdot(d\cdot e). +\] +The idea behind $\textit{removeSeqTail}$ is that +when pruning recursively, we need to ``zoom in'' +to sub-expressions, and this ``zoom in'' needs to be performed +on the +accumulators as well, otherwise we will be comparing +apples with oranges. +The sub-call +$\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$ +is simpler, which will trigger the alternative clause, causing +a pruning on each element in $(\ONE+(f+b)\cdot g)$, +leaving us $b\cdot g$ only. + +Our new lexer with stronger simplification +uses $\textit{prune}$ by making it +the core component of the deduplicating function +called $\textit{distinctWith}$. +$\textit{DistinctWith}$ ensures that all verbose +parts of a regular expression are pruned away. + +\begin{figure}[H] +\begin{lstlisting} + def turnIntoTerms(r: Rexp): List[Rexp] = r match { + case SEQ(r1, r2) => + turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2)) + case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2) + case ZERO => Nil + case _ => r :: Nil + } + + def distinctWith(rs: List[ARexp], + pruneFunction: (ARexp, Set[Rexp]) => ARexp, + acc: Set[Rexp] = Set()) : List[ARexp] = + rs match{ + case Nil => Nil + case r :: rs => + if(acc(erase(r))) + distinctWith(rs, pruneFunction, acc) + else { + val pruned_r = pruneFunction(r, acc) + pruned_r :: + distinctWith(rs, + pruneFunction, + turnIntoTerms(erase(pruned_r)) ++: acc + ) + } + } + \end{lstlisting} \caption{A Stronger Version of $\textit{distinctBy}$} \end{figure} \noindent -The function $\textit{prune}$ is used in $\distinctWith$. -$\distinctWith$ is a stronger version of $\distinctBy$ -which not only removes duplicates as $\distinctBy$ would -do, but also uses the $\textit{pruneFunction}$ -argument to prune away verbose components in a regular expression.\\ +Once a regular expression has been pruned, +all its components will be added to the accumulator +to remove any future regular expressions' duplicate components. + +The function $\textit{bsimpStrong}$ +is very much the same as $\textit{bsimp}$, just with +$\textit{distinctBy}$ replaced +by $\textit{distinctWith}$. \begin{figure}[H] \begin{lstlisting} - //a stronger version of simp + def bsimpStrong(r: ARexp): ARexp = { r match { case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match { - //normal clauses same as simp case (AZERO, _) => AZERO case (_, AZERO) => AZERO case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) - //bs2 can be discarded case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil case (r1s, r2s) => ASEQ(bs1, r1s, r2s) } case AALTS(bs1, rs) => { - //distinctBy(flat_res, erase) distinctWith(flats(rs.map(bsimpStrong(_))), prune) match { case Nil => AZERO case s :: Nil => fuse(bs1, s) case rs => AALTS(bs1, rs) } } - //stars that can be treated as 1 case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs) case r => r } } -\end{lstlisting} -\caption{The function $\bsimpStrong$ and $\bdersStrongs$} -\end{figure} -\noindent -$\distinctWith$, is in turn used in $\bsimpStrong$: -\begin{figure}[H] -\begin{lstlisting} - //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3) - def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { + def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { case Nil => r case c::s => bdersStrong(s, bsimpStrong(bder(c, r))) } + \end{lstlisting} -\caption{The function $\bsimpStrong$ and $\bdersStrongs$} +\caption{The function +$\textit{bsimpStrong}$: a stronger version of $\textit{bsimp}$} \end{figure} \noindent +The benefits of using +$\textit{prune}$ refining the finiteness bound +to a cubic bound has not been formalised yet. +Therefore we choose to use Scala code rather than an Isabelle-style formal +definition like we did for $\simp$, as the definitions might change +to suit our proof needs. +In the rest of the chapter we will use this convention consistently. + +%The function $\textit{prune}$ is used in $\distinctWith$. +%$\distinctWith$ is a stronger version of $\distinctBy$ +%which not only removes duplicates as $\distinctBy$ would +%do, but also uses the $\textit{pruneFunction}$ +%argument to prune away verbose components in a regular expression.\\ +%\begin{figure}[H] +%\begin{lstlisting} +% //a stronger version of simp +% def bsimpStrong(r: ARexp): ARexp = +% { +% r match { +% case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match { +% //normal clauses same as simp +% case (AZERO, _) => AZERO +% case (_, AZERO) => AZERO +% case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s) +% //bs2 can be discarded +% case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil +% case (r1s, r2s) => ASEQ(bs1, r1s, r2s) +% } +% case AALTS(bs1, rs) => { +% //distinctBy(flat_res, erase) +% distinctWith(flats(rs.map(bsimpStrong(_))), prune) match { +% case Nil => AZERO +% case s :: Nil => fuse(bs1, s) +% case rs => AALTS(bs1, rs) +% } +% } +% //stars that can be treated as 1 +% case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs) +% case r => r +% } +% } +% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { +% case Nil => r +% case c::s => bdersStrong(s, bsimpStrong(bder(c, r))) +% } +%\end{lstlisting} +%\caption{The function $\bsimpStrong$ and $\bdersStrongs$} +%\end{figure} +%\noindent +%$\distinctWith$, is in turn used in $\bsimpStrong$: +%\begin{figure}[H] +%\begin{lstlisting} +% //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3) +% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match { +% case Nil => r +% case c::s => bdersStrong(s, bsimpStrong(bder(c, r))) +% } +%\end{lstlisting} +%\caption{The function $\bsimpStrong$ and $\bdersStrongs$} +%\end{figure} +%\noindent We conjecture that the above Scala function $\bdersStrongs$, written $\bdersStrong{\_}{\_}$ as an infix notation, satisfies the following property: \begin{conjecture} $\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$ \end{conjecture} +\noindent The stronger version of $\blexersimp$'s code in Scala looks like: \begin{figure}[H] @@ -376,8 +506,9 @@ \end{figure} \noindent We call this lexer $\blexerStrong$. -$\blexerStrong$ is able to drastically reduce the -internal data structure size which could +This version is able to drastically reduce the +internal data structure size which +otherwise could trigger exponential behaviours in $\blexersimp$. \begin{figure}[H] @@ -424,77 +555,88 @@ The idea is to maintain key lemmas in chapter \ref{Bitcoded2} like $r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ -with the new rewriting rule \ref{cubicRule} . +with the new rewriting rule +shown in figure \ref{fig:cubicRule} . In the next sub-section, we will describe why we -believe a cubic bound can be achieved. -We give an introduction to the +believe a cubic size bound can be achieved with +the stronger simplification. +For this we give a short introduction to the partial derivatives, -which was invented by Antimirov \cite{Antimirov95}, -and then link it with the result of the function +which were invented by Antimirov \cite{Antimirov95}, +and then link them with the result of the function $\bdersStrongs$. \subsection{Antimirov's partial derivatives} Partial derivatives were first introduced by Antimirov \cite{Antimirov95}. -It does derivatives in a similar way as suggested by Brzozowski, +Partial derivatives are very similar +to Brzozowski derivatives, but splits children of alternative regular expressions into -multiple independent terms, causing the output to become a +multiple independent terms. This means the output of +partial derivatives become a set of regular expressions: \begin{center} \begin{tabular}{lcl} - $\partial_x \; (a \cdot b)$ & - $\dn$ & $\partial_x \; a\cdot b \cup - \partial_x \; b \; \textit{if} \; \; \nullable\; a$\\ - & & $\partial_x \; a\cdot b \quad\quad + $\partial_x \; (r_1 \cdot r_2)$ & + $\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup + \partial_x \; r_2 \; \textit{if} \; \; \nullable\; r_1$\\ + & & $(\partial_x \; r_1)\cdot r_2 \quad\quad \textit{otherwise}$\\ - $\partial_x \; r^*$ & $\dn$ & $\partial_x \; r \cdot r^*$\\ + $\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\ $\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \; \textit{then} \; \{ \ONE\} \;\;\textit{else} \; \varnothing$\\ - $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\ + $\partial_x(r_1+r_2)$ & $=$ & $\partial_x(r_1) \cup \partial_x(r_2)$\\ $\partial_x(\ONE)$ & $=$ & $\varnothing$\\ $\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\ \end{tabular} \end{center} \noindent The $\cdot$ between for example -$\partial_x \; a\cdot b $ +$(\partial_x \; r_1) \cdot r_2 $ is a shorthand notation for the cartesian product -$\partial_x \; a \times \{ b\}$. +$(\partial_x \; r_1) \times \{ r_2\}$. %Each element in the set generated by a partial derivative %corresponds to a (potentially partial) match %TODO: define derivatives w.r.t string s -Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together +Rather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ together using the $\sum$ constructor, Antimirov put them into -a set. This causes maximum de-duplication to happen, -allowing us to understand what are the "atomic" components of it. -For example, To compute what regular expression $x^*(xx + y)^*$'s -derivative against $x$ is made of, one can do a partial derivative -of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ +a set. This means many subterms will be de-duplicated +because they are sets, +For example, to compute what regular expression $x^*(xx + y)^*$'s +derivative w.r.t. $x$ is, one can compute a partial derivative +and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$ from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$. +The partial derivative w.r.t. a string is defined recursively: +\[ + \partial_{c::cs} r \dn \bigcup_{r'\in (\partial_c r)} + \partial_{cs} r' +\] +Given an alphabet $\Sigma$, we denote the set of all possible strings +from this alphabet as $\Sigma^*$. The set of all possible partial derivatives is defined as the union of derivatives w.r.t all the strings in the universe: \begin{center} \begin{tabular}{lcl} - $\textit{PDER}_{UNIV} \; r $ & $\dn $ & $\bigcup_{w \in A^*}\partial_w \; r$ + $\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$ \end{tabular} \end{center} \noindent - -Back to our +Consider now again our pathological case where the derivatives +grow with a rather aggressive simplification \begin{center} $((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$ \end{center} -example, if we denote this regular expression as $A$, +example, if we denote this regular expression as $r$, we have that \begin{center} -$\textit{PDER}_{UNIV} \; A = +$\textit{PDER}_{\Sigma^*} \; r = \bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{ (\underbrace{a \ldots a}_{\text{j a's}}\cdot -(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot A \}$, +(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$, \end{center} with exactly $n * (n + 1) / 2$ terms. This is in line with our speculation that only $n*(n+1)/2$ terms are @@ -504,14 +646,14 @@ Using a suitable transformation $f$, we have \begin{center} $\forall s.\; f \; (r \bdersStrong \; s) \subseteq - \textit{PDER}_{UNIV} \; r$ + \textit{PDER}_{\Sigma^*} \; r$ \end{center} \end{conjecture} \noindent -because our \ref{cubicRule} will keep only one copy of each term, +because our \ref{fig:cubicRule} will keep only one copy of each term, where the function $\textit{prune}$ takes care of maintaining a set like structure similar to partial derivatives. -It is anticipated we might need to adjust $\textit{prune}$ +We might need to adjust $\textit{prune}$ slightly to make sure all duplicate terms are eliminated, which should be doable. @@ -519,15 +661,17 @@ terms' sizes is bounded by the cubic of the size of that regular expression: \begin{property}\label{pderBound} - $\llbracket \textit{PDER}_{UNIV} \; r \rrbracket \leq O((\llbracket r \rrbracket)^3)$ + $\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \leq O(\llbracket r \rrbracket^3)$ \end{property} -This property was formalised by Urban, and the details are in the PDERIVS.thy file -in our repository. +This property was formalised by Wu et al. \cite{Wu2014}, and the +details can be found in the archive of formal proofs. \footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html} Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound} -would yield us a cubic bound for our $\blexerStrong$ algorithm: +would yield us also a cubic bound for our $\blexerStrong$ algorithm: \begin{conjecture}\label{strongCubic} $\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$ \end{conjecture} +\noindent +We leave this as future work. %To get all the "atomic" components of a regular expression's possible derivatives, @@ -748,3 +892,4 @@ % $\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\ % \end{tabular} %\end{center} + diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/Chapters/Inj.tex --- a/ChengsongTanPhdThesis/Chapters/Inj.tex Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Inj.tex Sat Nov 26 16:18:10 2022 +0000 @@ -698,7 +698,7 @@ to indicate that a value $v$ could be generated from a lexing algorithm with input $r$. They call it the value inhabitation relation, defined by the rules. -\begin{figure}[H]\label{fig:inhab} +\begin{figure}[H] \begin{mathpar} \inferrule{\mbox{}}{\vdash \Char \; c : \mathbf{c}} \hspace{2em} @@ -712,7 +712,7 @@ \inferrule{\forall v \in vs. \vdash v:r \land |v| \neq []}{\vdash \Stars \; vs : r^*} \end{mathpar} -\caption{The inhabitation relation for values and regular expressions} +\caption{The inhabitation relation for values and regular expressions}\label{fig:inhab} \end{figure} \noindent The condition $|v| \neq []$ in the premise of star's rule diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/Chapters/Introduction.tex --- a/ChengsongTanPhdThesis/Chapters/Introduction.tex Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/Introduction.tex Sat Nov 26 16:18:10 2022 +0000 @@ -703,7 +703,7 @@ Therefore, we take correctness and runtime claims made about these solutions with a grain of salt. -In the work reported in \cite{CSL2022} and here, +In the work reported in \cite{FoSSaCS2023} and here, we add better support using derivatives for bounded regular expressions $r^{\{n\}}$. The results @@ -909,7 +909,7 @@ used by Linux and OS X distributions) contain bugs or do not meet the specification they claim to adhere to. Kuklewicz maintains a unit test repository which lists some -problems with existing regular expression engines \ref{KuklewiczHaskell}. +problems with existing regular expression engines \cite{KuklewiczHaskell}. In some cases, they either fail to generate a result when there exists a match, or give results that are inconsistent with the POSIX standard. diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/Chapters/RelatedWork.tex --- a/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/Chapters/RelatedWork.tex Sat Nov 26 16:18:10 2022 +0000 @@ -26,6 +26,7 @@ and we do not make use of an equivalence checker that exploits the ACI-equivalent terms. + Central to this thesis is the work by Sulzmann and Lu \cite{Sulzmann2014}. They first introduced the elegant and simple idea of injection-based lexing and bit-coded lexing. diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/example.bib --- a/ChengsongTanPhdThesis/example.bib Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/example.bib Sat Nov 26 16:18:10 2022 +0000 @@ -48,6 +48,14 @@ publisher={Princeton, NJ} } +@inproceedings{Paulson2015, + author = {L.~C.~Paulson}, + booktitle = {Proc.~of the 25th International Conference on Automated Deduction (CADE)}, + pages = {231--245}, + series = {LNAI}, + title = {{A} {F}ormalisation of {F}inite {A}utomata {U}sing {H}ereditarily {F}inite {S}ets}, + volume = {9195}, + year = {2015}} @article{Murugesan2014, @@ -92,7 +100,7 @@ @inproceedings{Berglund18, abstract = {Whereas Perl-compatible regular expression matchers typically exhibit some variation of leftmost-greedy semantics, those conforming to the posix standard are prescribed leftmost-longest semantics. However, the posix standard leaves some room for interpretation, and Fowler and Kuklewicz have done experimental work to confirm differences between various posix matchers. The Boost library has an interesting take on the posix standard, where it maximises the leftmost match not with respect to subexpressions of the regular expression pattern, but rather, with respect to capturing groups. In our work, we provide the first formalisation of Boost semantics, and we analyse the complexity of regular expression matching when using Boost semantics.}, address = {Cham}, - author = {M.~Berglund and W.~Bester and van der M.~Brink}, + author = {M.~Berglund and W.~Bester and B.~Van Der Merwe}, booktitle = {Theoretical Aspects of Computing -- ICTAC 2018}, editor = {Fischer, Bernd and Uustalu, Tarmo}, isbn = {978-3-030-02508-3}, @@ -156,7 +164,7 @@ @article{BERGLUND2022, abstract = {Most modern regular expression matching libraries (one of the rare exceptions being Google's RE2) allow backreferences, operations which bind a substring to a variable, allowing it to be matched again verbatim. However, both real-world implementations and definitions in the literature use different syntactic restrictions and have differences in the semantics of the matching of backreferences. Our aim is to compare these various flavors by considering the classes of formal languages that each can describe, establishing, as a result, a hierarchy of language classes. Beyond the hierarchy itself, some complexity results are given, and as part of the effort on comparing language classes new pumping lemmas are established, old classes are extended to new ones, and several incidental results on the nature of these language classes are given.}, - author = {M.~Berglund and van der M.~Brink }, + author = {M.~Berglund and B.~Van Der Merwe}, doi = {https://doi.org/10.1016/j.tcs.2022.10.041}, issn = {0304-3975}, journal = {Theoretical Computer Science}, @@ -343,20 +351,11 @@ -@inproceedings{Paulson2015,, -author = {L.~Paulson}, -year = {2015}, -month = {05}, -pages = {}, -title = {A Formalisation of Finite Automata Using Hereditarily Finite Sets}, -isbn = {978-3-319-21400-9}, -doi = {10.1007/978-3-319-21401-6_15} -} @inproceedings{Zippy2020, abstract = {In this paper, we present an efficient, functional, and formally verified parsing algorithm for LL(1) context-free expressions based on the concept of derivatives of formal languages. Parsing with derivatives is an elegant parsing technique, which, in the general case, suffers from cubic worst-case time complexity and slow performance in practice. We specialise the parsing with derivatives algorithm to LL(1) context-free expressions, where alternatives can be chosen given a single token of lookahead. We formalise the notion of LL(1) expressions and show how to efficiently check the LL(1) property. Next, we present a novel linear-time parsing with derivatives algorithm for LL(1) expressions operating on a zipper-inspired data structure. We prove the algorithm correct in Coq and present an implementation as a part of Scallion, a parser combinators framework in Scala with enumeration and pretty printing capabilities.}, address = {New York, NY, USA}, - author = {R.~Edelmannand H.~Jad and K.~Viktor}, + author = {R.~Edelmann and H.~Jad and K.~Viktor}, booktitle = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation}, doi = {10.1145/3385412.3385992}, isbn = {9781450376136}, @@ -426,7 +425,7 @@ } @INPROCEEDINGS{Sidhu2001, - author={Sidhu, R. and Prasanna, V.K.}, + author={R.~Sidhu and V.K.~Prasanna}, booktitle={The 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'01)}, title={Fast Regular Expression Matching Using FPGAs}, year={2001}, @@ -504,7 +503,7 @@ bdsk-url-1 = {https://doi.org/10.1145/1544012.1544037}} @book{Sakarovitch2009, - author = {Sakarovitch, Jacques}, + author = {J.~Sakarovitch}, doi = {10.1017/CBO9781139195218}, editor = {Thomas, ReubenTranslator}, place = {Cambridge}, @@ -513,13 +512,13 @@ year = {2009}, bdsk-url-1 = {https://doi.org/10.1017/CBO9781139195218}} -@unpublished{CSL2022, - author = {Chengsong Tan and Christian Urban}, +@unpublished{FoSSaCS2023, + author = {C.~Tan and C.~Urban}, note = {submitted}, title = {POSIX Lexing with Bitcoded Derivatives}} @inproceedings{Verbatim, - author = {Egolf, Derek and Lasser, Sam and Fisher, Kathleen}, + author = {D.~Egolf and S.~Lasser and K.~Fisher}, booktitle = {2021 IEEE Security and Privacy Workshops (SPW)}, doi = {10.1109/SPW53761.2021.00022}, pages = {92-100}, @@ -551,7 +550,7 @@ @inproceedings{Verbatimpp, abstract = {Lexers and parsers are attractive targets for attackers because they often sit at the boundary between a software system's internals and the outside world. Formally verified lexers can reduce the attack surface of these systems, thus making them more secure. One recent step in this direction is the development of Verbatim, a verified lexer based on the concept of Brzozowski derivatives. Two limitations restrict the tool's usefulness. First, its running time is quadratic in the length of its input string. Second, the lexer produces tokens with a simple "tag and string" representation, which limits the tool's ability to integrate with parsers that operate on more expressive token representations. In this work, we present a suite of extensions to Verbatim that overcomes these limitations while preserving the tool's original correctness guarantees. The lexer achieves effectively linear performance on a JSON benchmark through a combination of optimizations that, to our knowledge, has not been previously verified. The enhanced version of Verbatim also enables users to augment their lexical specifications with custom semantic actions, and it uses these actions to produce semantically rich tokens---i.e., tokens that carry values with arbitrary, user-defined types. All extensions were implemented and verified with the Coq Proof Assistant.}, address = {New York, NY, USA}, - author = {Egolf, Derek and Lasser, Sam and Fisher, Kathleen}, + author = {D.~Egolf and S.~Lasser and K.~Fisher}, booktitle = {Proceedings of the 11th ACM SIGPLAN International Conference on Certified Programs and Proofs}, doi = {10.1145/3497775.3503694}, isbn = {9781450391825}, @@ -567,7 +566,7 @@ bdsk-url-1 = {https://doi.org/10.1145/3497775.3503694}} @article{Turo_ov__2020, - author = {Lenka Turo{\v{n}}ov{\'{a}} and Luk{\'{a}}{\v{s}} Hol{\'{\i}}k and Ond{\v{r}}ej Leng{\'{a}}l and Olli Saarikivi and Margus Veanes and Tom{\'{a}}{\v{s}} Vojnar}, + author = {L.~Turo{\v{n}}ov{\'{a}} and L.~Hol{\'{\i}}k and O.~Leng{\'{a}}l and O.~Saarikivi and M.~Veanes and T.~Vojnar}, date-added = {2022-05-23 18:43:04 +0100}, date-modified = {2022-05-23 18:43:04 +0100}, doi = {10.1145/3428286}, @@ -584,14 +583,14 @@ bdsk-url-2 = {https://doi.org/10.1145/3428286}} @article{Rathnayake2014StaticAF, - author = {Asiri Rathnayake and Hayo Thielecke}, + author = {A.~Rathnayake and H.~Thielecke}, journal = {ArXiv}, title = {Static Analysis for Regular Expression Exponential Runtime via Substructural Logics}, volume = {abs/1405.7058}, year = {2014}} @inproceedings{Weideman2017Static, - author = {Nicolaas Weideman}, + author = {N.~Weideman}, title = {Static analysis of regular expressions}, year = {2017}} @@ -599,7 +598,7 @@ abstract = {We describe the formalization of a regular expression (RE) parsing algorithm that produces a bit representation of its parse tree in the dependently typed language Agda. The algorithm computes bit-codes using Brzozowski derivatives and we prove that produced codes are equivalent to parse trees ensuring soundness and completeness w.r.t an inductive RE semantics. We include the certified algorithm in a tool developed by us, named verigrep, for regular expression based search in the style of the well known GNU grep. Practical experiments conducted with this tool are reported.}, address = {New York, NY, USA}, articleno = {4}, - author = {Ribeiro, Rodrigo and Bois, Andr\'{e} Du}, + author = {R.~Ribeiro and A.~D.~Bois}, booktitle = {Proceedings of the 21st Brazilian Symposium on Programming Languages}, date-modified = {2022-03-16 16:38:47 +0000}, doi = {10.1145/3125374.3125381}, @@ -658,7 +657,7 @@ bdsk-file-1 = {YnBsaXN0MDDSAQIDBFxyZWxhdGl2ZVBhdGhZYWxpYXNEYXRhXxA1Li4vLi4vLi4vTXkgTWFjIChNYWNCb29rLVBybykvRGVza3RvcC9mcml0ei1wYXBlci5wZGZPEQF+AAAAAAF+AAIAAAxNYWNpbnRvc2ggSEQAAAAAAAAAAAAAAAAAAAAAAAAAQkQAAf////8PZnJpdHotcGFwZXIucGRmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA/////wAAAAAAAAAAAAAAAAADAAMAAAogY3UAAAAAAAAAAAAAAAAAB0Rlc2t0b3AAAAIAQi86VXNlcnM6Y3N0YW46RHJvcGJveDpNeSBNYWMgKE1hY0Jvb2stUHJvKTpEZXNrdG9wOmZyaXR6LXBhcGVyLnBkZgAOACAADwBmAHIAaQB0AHoALQBwAGEAcABlAHIALgBwAGQAZgAPABoADABNAGEAYwBpAG4AdABvAHMAaAAgAEgARAASAEBVc2Vycy9jc3Rhbi9Ecm9wYm94L015IE1hYyAoTWFjQm9vay1Qcm8pL0Rlc2t0b3AvZnJpdHotcGFwZXIucGRmABMAAS8AABUAAgAM//8AAAAIAA0AGgAkAFwAAAAAAAACAQAAAAAAAAAFAAAAAAAAAAAAAAAAAAAB3g==}} @software{regexploit2021, - author = {Ben Caller, Luca Carettoni}, + author = {B.~Caller, L.~Carettoni}, date-added = {2020-11-24 00:00:00 +0000}, date-modified = {2021-05-07 00:00:00 +0000}, keywords = {ReDos static analyser}, @@ -669,7 +668,7 @@ bdsk-url-1 = {https://github.com/doyensec/regexploit}} @software{pcre, - author = {Philip Hazel}, + author = {P.~Hazel}, date-added = {1997-01-01 00:00:00 +0000}, date-modified = {2021-06-14 00:00:00 +0000}, keywords = {Perl Compatible Regular Expressions}, @@ -686,7 +685,7 @@ year = {2022}} @misc{KuklewiczHaskell, - author = {Kuklewicz}, + author = {C.~Kuklewicz}, keywords = {Buggy C POSIX Lexing Libraries}, title = {Regex Posix}, url = {https://wiki.haskell.org/Regex_Posix}, @@ -694,7 +693,7 @@ bdsk-url-1 = {https://wiki.haskell.org/Regex_Posix}} @misc{regex101, - author = {Firas Dib}, + author = {D.~Firas}, keywords = {regex tester debugger}, title = {regex101}, url = {https://regex101.com/}, @@ -702,7 +701,7 @@ bdsk-url-1 = {https://regex101.com/}} @misc{atomEditor, - author = {Dunno}, + author = {N.~Sobo}, keywords = {text editor}, title = {Atom Editor}, url = {https://atom.io}, @@ -710,7 +709,7 @@ bdsk-url-1 = {https://atom.io}} @techreport{grathwohl2014crash, - author = {Grathwohl, Niels Bj{\o}rn Bugge and Henglein, Fritz and Rasmussen, Ulrik Terp}, + author = {N.~B.~B.~Grathwohl and F.~Henglein and U.~T.~Rasmussen}, institution = {Technical report, University of Copenhagen}, title = {A Crash-Course in Regular Expression Parsing and Regular Expressions as Types}, year = {2014}} @@ -718,7 +717,7 @@ @inproceedings{xml2015, abstract = {Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.}, address = {New York, NY, USA}, - author = {Bj\"{o}rklund, Henrik and Martens, Wim and Timm, Thomas}, + author = {H~.Bj\"{o}rklund and W.~Martens and T.~Timm}, booktitle = {Proceedings of the 24th ACM International on Conference on Information and Knowledge Management}, doi = {10.1145/2806416.2806434}, isbn = {9781450337946}, @@ -876,14 +875,6 @@ volume = 52, year = 2009} -@inproceedings{Paulson2015, - author = {L.~C.~Paulson}, - booktitle = {Proc.~of the 25th International Conference on Automated Deduction (CADE)}, - pages = {231--245}, - series = {LNAI}, - title = {{A} {F}ormalisation of {F}inite {A}utomata {U}sing {H}ereditarily {F}inite {S}ets}, - volume = {9195}, - year = {2015}} @article{Wu2014, author = {C.~Wu and X.~Zhang and C.~Urban}, diff -r 94db2636a296 -r 7af4e2420a8c ChengsongTanPhdThesis/main.tex --- a/ChengsongTanPhdThesis/main.tex Tue Nov 22 12:50:04 2022 +0000 +++ b/ChengsongTanPhdThesis/main.tex Sat Nov 26 16:18:10 2022 +0000 @@ -317,7 +317,27 @@ \begin{acknowledgements} \addchaptertocentry{\acknowledgementname} % Add the acknowledgements to the table of contents -The acknowledgments and the people to thank go here, don't forget to include your project advisor\ldots +I would like to express my deepest thanks to my supervisor Doctor Christian Urban, +who have been always extremely supportive thoughout my PhD, in all sorts of ways. +Supervisionwise, Christian always +thinks in terms of the best interests for the student, to which I am eternally grateful for. +I would also like to thank Doctor Ning Zhang, who have always been very gentle and caring to me, +quick to lend a +helping hand at difficult times. +I want to thank Doctor Kathrin Stark, my SIGPLAN mentor, for offering brilliant advice +at the late stage of my PhD. My transition from a PhD student to a postdoc researcher +could not have been so smooth without Kathrin's mentoring. +I want to thank Jeanna Wheeler, my UMO mentor, for helping me regulate my mental health +and productivity, by being always encouraging +and compassionate in her sessions. + +I want to thank my father Haiyan Tan and my mother Yunan Cheng, +for their unconditional love, and who I have not seen +face to face for three years. +I really miss you. +I want to thank my friends Yuying Chen, Kai Zeng, Rui Luo, Jingyi Liu, Qingtian Ye, and many others, +who have always been very patient and compassionate, giving clever advice when I turned to +them for help. \end{acknowledgements} %---------------------------------------------------------------------------------------- @@ -447,7 +467,6 @@ \include{Chapters/Bitcoded2} \include{Chapters/Finite} \include{Chapters/Cubic} -\include{Chapters/Further} \include{Chapters/RelatedWork} \include{Chapters/Future} @@ -455,12 +474,12 @@ % THESIS CONTENT - APPENDICES %---------------------------------------------------------------------------------------- -\appendix % Cue to tell LaTeX that the following "chapters" are Appendices +%\appendix % Cue to tell LaTeX that the following "chapters" are Appendices % Include the appendices of the thesis as separate files from the Appendices folder % Uncomment the lines as you write the Appendices -\include{Appendices/AppendixA} +%\include{Appendices/AppendixA} %\include{Appendices/AppendixB} %\include{Appendices/AppendixC} diff -r 94db2636a296 -r 7af4e2420a8c thys2/blexer2.sc --- a/thys2/blexer2.sc Tue Nov 22 12:50:04 2022 +0000 +++ b/thys2/blexer2.sc Sat Nov 26 16:18:10 2022 +0000 @@ -661,17 +661,22 @@ } //r = r' ~ tail' : If tail' matches tail => returns r' - def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match { - case SEQ(r1, r2) => - if(r2 == tail) - r1 - else - ZERO - case r => ZERO - } + def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = + if (r == tail) + ONE + else { + r match { + case SEQ(r1, r2) => + if(r2 == tail) + r1 + else + ZERO + case r => ZERO + } + } def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{ - case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match + case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match { //all components have been removed, meaning this is effectively a duplicate //flats will take care of removing this AZERO @@ -911,13 +916,14 @@ } def turnIntoTerms(r: Rexp): List[Rexp] = r match { - case SEQ(r1, r2) => if(isOne1(r1)) - turnIntoTerms(r2) - else - turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2)) - case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2) - case ZERO => Nil - case _ => r :: Nil + case SEQ(r1, r2) => + // if(isOne1(r1)) + // turnIntoTerms(r2) + // else + turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2)) + case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2) + case ZERO => Nil + case _ => r :: Nil } def furtherSEQ(r11: Rexp, r2: Rexp) : List[Rexp] = { @@ -1544,25 +1550,29 @@ //new ce3 : SEQ(CHAR(b),ALTS(ALTS(ALTS(ONE,CHAR(a)),SEQ(CHAR(c),ONE)),SEQ(STAR(ZERO),SEQ(ONE,CHAR(b))))) //circular double-nullable STAR(SEQ((STAR("b") | "a" | "b"), ("b" | ONE))) def counterexample_check() { - val r = STAR(SEQ(ALTS(STAR(CHAR('b')),ALTS(CHAR('b'),CHAR('a'))), - ALTS(ALTS(CHAR('b'),CHAR('c')),STAR(ZERO))))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))), + val r : Rexp = SEQ(ALTS(ONE, + SEQ(ALTS(CHAR('f'), CHAR('b')), CHAR('g'))), SEQ(CHAR('a'), SEQ(CHAR('d'), CHAR('e'))))//(1+(f +b)·g)·(a·(d·e))//STAR(SEQ(ALTS(STAR(CHAR('b')),ALTS(CHAR('b'),CHAR('a'))), + val acc : Set[Rexp] = Set(SEQ(CHAR('a'), SEQ(CHAR('d'), CHAR('e'))), SEQ(SEQ(CHAR('f'), CHAR('g')), SEQ(CHAR('a'), SEQ(CHAR('d'), CHAR('e')))) ) + val a = internalise(r) + aprint(prune(a, acc)); + //ALTS(ALTS(CHAR('b'),CHAR('c')),STAR(ZERO))))//SEQ(SEQ(STAR(ALTS(CHAR('a'),CHAR('b'))), //ALTS(ALTS(CHAR('c'),CHAR('b')),STAR(ONE))),STAR(CHAR('b'))) - val s = "ab" - val bdStrong5 = bdersStrong6(s.toList, internalise(r)) - val bdStrong5Set = breakIntoTerms(erase(bdStrong5)) - val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r)) - val apdersSet = pdersSet.map(internalise) - println("original regex ") - rprint(r) - println("after strong bsimp") - aprint(bdStrong5) - println("turned into a set %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ") - rsprint(bdStrong5Set) - println("after pderUNIV") - rsprint(pdersSet.toList) - println("pderUNIV distinctBy6") - //asprint(distinctBy6(apdersSet.toList)) - rsprint((pdersSet.toList).flatMap(turnIntoTerms)) + // val s = "ab" + // val bdStrong5 = bdersStrong6(s.toList, internalise(r)) + // val bdStrong5Set = breakIntoTerms(erase(bdStrong5)) + // val pdersSet = pderUNIV(r)//.map(r => erase(bsimp(internalise(r))))//.flatMap(r => turnIntoTerms(r))//.map(oneSimp).flatMap(r => breakIntoTerms(r)) + // val apdersSet = pdersSet.map(internalise) + // println("original regex ") + // rprint(r) + // println("after strong bsimp") + // aprint(bdStrong5) + // println("turned into a set %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ") + // rsprint(bdStrong5Set) + // println("after pderUNIV") + // rsprint(pdersSet.toList) + // println("pderUNIV distinctBy6") + // //asprint(distinctBy6(apdersSet.toList)) + // rsprint((pdersSet.toList).flatMap(turnIntoTerms)) // rsprint(turnIntoTerms(pdersSet.toList(3))) // println("NO 3 not into terms") // rprint((pdersSet.toList())) @@ -1586,7 +1596,7 @@ } - lf_display() + // lf_display() def breakable(r: Rexp) : Boolean = r match { case SEQ(ALTS(_, _), _) => true @@ -1657,3 +1667,5 @@ +println("hi!!!!!") +counterexample_check() \ No newline at end of file