% ChengsongTanPhdThesis/Chapters/Cubic.tex
% Chapter Template

%We also present the idempotency property proof
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
%This reinforces our claim that the fixpoint construction
%originally required by Sulzmann and Lu can be removed in $\blexersimp$.
%Last but not least, we present our efforts and challenges we met
%in further improving the algorithm by data
%structures such as zippers.
%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
%TODO: search for isabelle proofs of algorithms that check equivalence 

\chapter{A Better Size Bound for Derivatives} % Main chapter title

\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
%algorithm to include constructs such as bounded repetitions and negations.
\lstset{style=myScalastyle}


This chapter is a ``work-in-progress''
chapter which records
extensions to our $\blexersimp$.
We make a conjecture that the finite
size bound from the previous chapter can be
improved to a cubic bound.
We implemented our conjecture in Scala.
We intend to formalise this part in Isabelle/HOL at a
later stage.
%we have not been able to finish due to time constraints of the PhD.
Nevertheless, we outline the ideas we intend to use for the proof.

\section{A Stronger Version of Simplification}

We present further improvements
for our lexer algorithm $\blexersimp$.
We devise a stronger simplification algorithm,
called $\bsimpStrong$, which can prune away
similar components in two regular expressions at the same
alternative level,
even if these regular expressions are not exactly the same.
We call the lexer that uses this stronger simplification function
$\blexerStrong$.
%Unfortunately we did not have time to 
%work out the proofs, like in
%the previous chapters.
We conjecture that both
\begin{center}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{center}
and
\begin{center}
	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
\end{center}
hold.
%but a formalisation
%is still future work.
We give an informal justification
why the correctness and cubic size bound proofs
can be achieved
by exploring the connection between the internal
data structure of our $\blexerStrong$ and
Antimirov's partial derivatives.

In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
For example, the regular expression
\[
	aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
\]
$a_2$ in the list
\begin{center}
	$rs_a@[a_1]@rs_b@[a_2]@rs_c$
\end{center}
is that
the two erased regular expressions are equal
\begin{center}
	$\rerase{a_1} = \rerase{a_2}$.
\end{center}
This is characterised as the $LD$
rewrite rule in figure \ref{rrewriteRules}.
The problem, however, is that identical components
in two slightly different regular expressions cannot be removed
by the $LD$ rule.
Consider the stronger simplification
\begin{equation}
	\label{eqn:partialDedup}
	(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
\end{equation}
where the underlined $a$ in $(\underline{a}+c+e)\cdot r_1$ is deleted from the right alternative
$a+c+e$.
This is permissible because we have $(a+\ldots)\cdot r_1$ in the left
alternative.
The difficulty is that such ``buried''
alternatives-sequences are not easily recognised.
But simplification like this actually
cannot be omitted if we want to have a better bound.
For example, the size of derivatives can still
blow up even with our $\textit{bsimp}$
function:
consider again the example
$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,
and set $n$ to a relatively small number like $n=5$; then we get the following
exponential growth:
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}[
    %xlabel={$n$},
    ylabel={size},
    ]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}
\caption{Size of derivatives of $\blexersimp$ from chapter 5 for matching
	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$
	with strings
	of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
\end{figure}
\noindent
rule
\[
	(a+b+d) \cdot r_1  \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
\]
\noindent
which pushes the sequence into the alternatives
in our $\simp$ function.
This would then make the simplification shown in \eqref{eqn:partialDedup} possible.
Translating this rule into our $\textit{bsimp}$ function would simply
involve adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:
\begin{center}
	\begin{tabular}{@{}lcl@{}}
		$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
						       && $\ldots$\\
   &&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum (
   &&$\quad\textit{case} \; (a_1', a_2') \Rightarrow   _{bs}a_1' \cdot a_2'$ \\
	\end{tabular}
\end{center}
\noindent
Unfortunately,
if we introduce this clause in our
setting we would lose the POSIX property of our calculated values.
For example, given the regular expression
\begin{center}
	$(a + ab)(bc + c)$
\end{center}
	$\Seq \; (\Right \; ab) \; (\Right \; c)$.
\end{center}
Essentially it matches the string with the longer Right-alternative
in the first sequence (and
then the `rest' with the character regular expression $c$ from the second sequence).
If we add the simplification above, however,
then we would obtain the following value
\begin{center}
	$\Left \; (\Seq \; a \; (\Left \; bc))$
\end{center}
where the $\Left$-alternatives get priority.
This violates the POSIX rules.
The reason for getting this undesired value
is that the new rule splits this regular expression up into
a topmost alternative
\begin{center}
	$a\cdot(b c + c) + ab \cdot (bc + c)$,
\end{center}
which is a regular expression with a
quite different meaning: the original regular expression
is a sequence, but the simplified regular expression is an alternative.
With an alternative the maximal munch rule no longer works.


A method to reconcile this problem is to do the
transformation in \eqref{eqn:partialDedup} ``non-invasively'',
meaning that we traverse the list of regular expressions
%\begin{center}
%	$rs_a@[a]@rs_c$
%\end{center}
inside alternatives
\begin{center}
	$\sum ( rs_a@[a]@rs_c)$
\end{center}
using a function similar to $\distinctBy$,
but this time
we allow the following more general rewrite rule:
\begin{equation}
\label{eqn:cubicRule}
%\mbox{
%\begin{mathpar}	
	\inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c
			\stackrel{s}{\rightsquigarrow }
		rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
%\end{mathpar}
%\caption{The rule capturing the pruning simplification needed to achieve
%a cubic bound}
\end{equation}
\noindent
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
where $\textit{prune} \;a \; acc$ traverses $a$
without altering the structure of $a$, but removing components in $a$
that have appeared in the accumulator $acc$.
For example
\begin{center}
	$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
\end{center}
\begin{center}
	$(r_g+r_h)r_d$
\end{center}
because $r_gr_d$ and
$r_hr_d$ are the only terms
that have not appeared in the accumulator list
\begin{center}
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
\end{center}
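As a concrete instance of this pruning, consider the alternative list from
\eqref{eqn:partialDedup}, written as a two-element list: the term $a \cdot r_1$
already appears inside the first element, so
\[
	[(a+b+d) \cdot r_1]@[(a+c+e) \cdot r_1]
	\stackrel{s}{\rightsquigarrow}
	[(a+b+d) \cdot r_1]@[\textit{prune} \; ((a+c+e) \cdot r_1) \; [(a+b+d) \cdot r_1]]
\]
which evaluates to $[(a+b+d) \cdot r_1, \; (c+e) \cdot r_1]$, matching the
right-hand side of \eqref{eqn:partialDedup}.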
We implemented the
function $\textit{prune}$ in Scala (see figure \ref{fig:pruneFunc}).
The function $\textit{prune}$
is a stronger version of $\textit{distinctBy}$.
It does not just walk through a list looking for exact duplicates,
but prunes sub-expressions recursively.
It manages proper contexts by the helper functions
$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$.
\begin{figure}%[H]
\begin{lstlisting}[numbers=left]
    def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
      case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match
      {
        //all components have been removed, meaning this is effectively a duplicate
        //flats will take care of removing this AZERO 
        }
        //this does the duplicate component removal task
      case r => if(acc(erase(r))) AZERO else r
    }
\end{lstlisting}
\caption{The function $\textit{prune}$ is called recursively in the
alternative case (line 2) and in the sequence case (line 12).
In the alternative case we keep all the accumulators the same, but
in the sequence case we are making necessary changes to the accumulators
to allow correct de-duplication.}\label{fig:pruneFunc}
\end{figure}
\noindent

\begin{figure}
\begin{lstlisting}[numbers=left]
    def atMostEmpty(r: Rexp) : Boolean = r match {
      case ZERO => true
      case ONE => true
      case STAR(r) => atMostEmpty(r)
      case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
      if (r == tail)
        ONE
      else {
        r match {
          case SEQ(r1, r2) => 
            if(r2 == tail) r1 else ZERO
          case r => ZERO
        }
      }


\end{lstlisting}
\caption{The helper functions of $\textit{prune}$.}
\end{figure}
\noindent
Suppose we feed
\begin{center}
	$r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$
\end{center}
and
\begin{center}
	$acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$
\end{center}
as the input to $\textit{prune}$.
The end result will be
\[
	b\cdot(g\cdot(a\cdot(d\cdot e)))
\]
where the underlined components in $r$ are eliminated.
and
\[
	\textit{removeSeqTail} \quad \;\;
	f\cdot(g\cdot (a \cdot(d\cdot e)))\quad  \;\; a \cdot(d\cdot e).
\]

The idea behind $\textit{removeSeqTail}$ is that
when pruning recursively, we need to ``zoom in''
to sub-expressions, and this ``zoom in'' needs to be performed
on the
accumulators as well, otherwise the deletion will not work.
The sub-call
$\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$
is simpler, which will trigger the alternative clause, causing
a pruning on each element in $(\ONE+(f+b)\cdot g)$,
leaving us with $b\cdot g$ only.
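For concreteness, the sequence clause of $\textit{prune}$ (elided in the
listing above) can be sketched as follows. This is our reconstruction, not
the verbatim implementation; in particular the use of $\textit{fuse}$ to
attach the bitcodes $bs$ when the head is pruned to $\ONE$ is an assumption:

\begin{lstlisting}
    // sketch of the sequence clause of prune: "zoom into" r1, while
    // adjusting each accumulator entry with removeSeqTail
    case ASEQ(bs, r1, r2) =>
      prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
        case AZERO => AZERO  // nothing left: the whole sequence is a duplicate
        case r1p if isOne(erase(r1p)) => fuse(bs, r2)  // head pruned to ONE
        case r1p => ASEQ(bs, r1p, r2)
      }
\end{lstlisting}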

Our new lexer with stronger simplification
uses $\textit{prune}$ by making it
the core component of the deduplicating function
called $\textit{distinctWith}$.
$\textit{DistinctWith}$ ensures that all verbose
parts of a regular expression are pruned away.

\begin{figure}[H]
\begin{lstlisting}[numbers=left]
    def turnIntoTerms(r: Rexp): List[Rexp] = r match {
      case SEQ(r1, r2)  => 
        turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2))
          case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
          case ZERO => Nil
                )
            }
        }

\end{lstlisting}
\caption{A stronger version of $\textit{distinctBy}$.}
\end{figure}
\noindent
Once a regular expression has been pruned,
all its components will be added to the accumulator
to remove any future regular expressions' duplicate components.
      }
\end{lstlisting}
\end{figure}
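Although the full listing of $\textit{distinctWith}$ is not reproduced here,
its wiring can be sketched as follows; the recursion pattern is our sketch,
with $\textit{prune}$, $\textit{erase}$ and $\textit{turnIntoTerms}$ as in
the earlier figures:

\begin{lstlisting}
    // sketch: de-duplicate a list of annotated regular expressions,
    // pruning each element against the components seen so far
    def distinctWith(rs: List[ARexp], acc: Set[Rexp]): List[ARexp] =
      rs match {
        case Nil => Nil
        case r :: rest =>
          if (acc(erase(r))) distinctWith(rest, acc)  // exact duplicate
          else {
            val rp = prune(r, acc)
            // all components of the pruned term enter the accumulator
            rp :: distinctWith(rest, acc ++ turnIntoTerms(erase(rp)))
          }
      }
\end{lstlisting}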
\noindent
We call this lexer $\blexerStrong$.
This version is able to reduce the
size of the derivatives which
otherwise
triggered exponential behaviour in
$\blexersimp$.
Consider again the runtime for matching
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
of the form $\protect\underbrace{aa..a}_{n}$.
They produce the following graphs ($\blexerStrong$ on the left-hand side and
$\blexersimp$ on the right-hand side).
\begin{figure}[H]
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{l}{}
\end{tabular}
\caption{Runtime for matching
	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
	of the form $\protect\underbrace{aa..a}_{n}$.}\label{fig:aaaaaStarStar}
\end{figure}
\noindent
We hope the correctness is preserved.
%We would like to preserve the correctness like the one 
%we had for $\blexersimp$:
\begin{conjecture}\label{cubicConjecture}
	$\blexerStrong \;r \; s = \blexer\; r\;s$
\end{conjecture}
\noindent
The proof idea is to maintain the key lemmas in
chapter \ref{Bitcoded2}, such as
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$,
with the new rewriting rule
shown in \eqref{eqn:cubicRule}.

In the next sub-section,
we will describe why we
believe a cubic size bound can be achieved with
the stronger simplification.
$\bdersStrongs$.

\subsection{Antimirov's partial derivatives}
Partial derivatives were first introduced by
Antimirov \cite{Antimirov95}.
They are very similar
to Brzozowski derivatives,
but split children of alternative regular expressions into
multiple independent terms. This means the output of
partial derivatives is a
set of regular expressions, defined as follows:
\begin{center}
	\begin{tabular}{lcl@{\hspace{-5mm}}l}
		$\partial_x \; (r_1 \cdot r_2)$ &
		$\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup
		\partial_x \; r_2 \;$ & $ \textit{if} \; \; \nullable\; r_1$\\
				      & & $(\partial_x \; r_1)\cdot r_2 \quad\quad$ & $
		      \textit{otherwise}$\\
		$\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\
		$\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \;
		\textit{then} \;
		\{ \ONE\} \;\;\textit{else} \; \varnothing$\\
		$\partial_x(\ONE)$ & $\dn$ & $\varnothing$\\
		$\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\
	\end{tabular}
\end{center}
\noindent
   597 The $\cdot$ between for example 
   620 The $\cdot$ in the example 
   598 $(\partial_x \; r_1) \cdot r_2 $ 
   621 $(\partial_x \; r_1) \cdot r_2 $ 
   599 is a shorthand notation for the cartesian product 
   622 is a shorthand notation for the cartesian product 
   600 $(\partial_x \; r_1) \times \{ r_2\}$.
   623 $(\partial_x \; r_1) \times \{ r_2\}$.
%Each element in the set generated by a partial derivative
%corresponds to a (potentially partial) match
%TODO: define derivatives w.r.t string s
Rather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ together
using the $\sum$ constructor, Antimirov puts them into
a set. This means many subterms are de-duplicated,
because sets cannot contain two identical elements.
For example, to compute the derivative of the regular expression
$x^*(xx + y)^*$
w.r.t.~$x$, one can compute the partial derivatives
$\partial_x(x^*) \cdot (xx + y)^*$ and $\partial_x((xx + y)^*)$,
which yield the two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y)^* \}$.

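To make this definition concrete, the following is a small executable sketch of the partial derivative function (a Python toy with a tuple encoding of our own choosing; the names \texttt{pder}, \texttt{seq} and so on are illustrative and not part of the formalisation in this thesis). It reproduces the singleton-set example above:

```python
# Toy regular expressions as nested tuples:
# ZERO, ONE, ('chr', c), ('alt', r1, r2), ('seq', r1, r2), ('star', r).
ZERO, ONE = ('zero',), ('one',)

def seq(r1, r2):
    # Smart constructor: collapse ONE . r to r (and ZERO . r to ZERO),
    # mirroring the implicit simplification used in the examples.
    if r1 == ZERO or r2 == ZERO:
        return ZERO
    if r1 == ONE:
        return r2
    if r2 == ONE:
        return r1
    return ('seq', r1, r2)

def nullable(r):
    tag = r[0]
    if tag in ('one', 'star'):
        return True
    if tag == 'alt':
        return nullable(r[1]) or nullable(r[2])
    if tag == 'seq':
        return nullable(r[1]) and nullable(r[2])
    return False  # zero, chr

def pder(x, r):
    # Partial derivative w.r.t. a single character: a *set* of regexes.
    tag = r[0]
    if tag == 'chr':
        return {ONE} if r[1] == x else set()
    if tag == 'alt':
        return pder(x, r[1]) | pder(x, r[2])
    if tag == 'star':
        return {seq(p, r) for p in pder(x, r[1])}
    if tag == 'seq':
        s = {seq(p, r[2]) for p in pder(x, r[1])}
        return s | pder(x, r[2]) if nullable(r[1]) else s
    return set()  # zero, one

# The example from the text: the derivative of x*(xx+y)* w.r.t. x
x, y = ('chr', 'x'), ('chr', 'y')
R = ('star', ('alt', ('seq', x, x), y))   # (xx + y)*
r = ('seq', ('star', x), R)               # x* (xx + y)*
# pder('x', r) yields the two terms  x*.(xx+y)*  and  x.(xx+y)*
```

Note how the sequence case splits into two union branches exactly when the left component is nullable, matching the first clause of the table above.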
The partial derivative w.r.t.~a string is defined recursively:
\[
	\partial_{[]} \; r \dn \{ r \} \qquad
	\partial_{c::cs} \; r \dn \bigcup_{r'\in (\partial_c \; r)}
	\partial_{cs} \; r'
\]
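For instance, continuing the example above and writing $R$ for $(xx+y)^*$, the derivative of $x^* \cdot R$ w.r.t.~the string $xx$ unfolds in two character steps (identifying $\ONE \cdot r$ with $r$, as in the examples above):
\begin{center}
	\begin{tabular}{lcl}
		$\partial_{x} \; (x^* \cdot R)$ & $=$ & $\{x^* \cdot R,\; x \cdot R\}$\\
		$\partial_{xx} \; (x^* \cdot R)$ & $=$ & $\partial_{x}(x^* \cdot R) \,\cup\, \partial_{x}(x \cdot R)$\\
		& $=$ & $\{x^* \cdot R,\; x \cdot R\} \cup \{R\}$
	\end{tabular}
\end{center}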
Given an alphabet $\Sigma$, we write $\Sigma^*$ for the set of all possible strings
over this alphabet.
The set of all possible partial derivatives is then defined
as the union of the derivatives w.r.t.~all strings:
\begin{center}
	\begin{tabular}{lcl}
		$\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$
	\end{tabular}
\end{center}
\noindent
Consider now again our pathological case, to which we apply the more
aggressive simplification:
\begin{center}
	$((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$
\end{center}
Let us abbreviate this regular expression as $r$;
then we have that
\begin{center}
	$\textit{PDER}_{\Sigma^*} \; r =
	\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{
	(\underbrace{a \ldots a}_{\text{j a's}}\cdot
	(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$
\end{center}
The union on the right-hand side has $n * (n + 1) / 2$ terms.
This leads us to believe that the maximum number of terms needed
in our derivatives is also only $n * (n + 1) / 2$.
Therefore we conjecture that $\bsimpStrong$ is also able to achieve this
upper limit in general:
\begin{conjecture}\label{bsimpStrongInclusionPder}
	Using a suitable transformation $f$, we have that
	\begin{center}
		$\forall s.\; f \; (r \bdersStrong  \; s) \subseteq
		 \textit{PDER}_{\Sigma^*} \; r$
	\end{center}
	holds.
\end{conjecture}
\noindent
The reason is that our rule \eqref{eqn:cubicRule} will keep only one copy of each term,
where the function $\textit{prune}$ takes care of maintaining
a set-like structure, similar to partial derivatives.
%We might need to adjust $\textit{prune}$
%slightly to make sure all duplicate terms are eliminated,
%which should be doable.

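As a sanity check on these term counts, the following self-contained Python toy (again our own encoding, not the thesis implementation) computes the set of all partial derivatives of a regular expression by fixpoint iteration. Antimirov also showed that the number of distinct partial derivative terms is at most the alphabetic width of the regular expression plus one; for the pathological family above, whose alphabetic width is $1 + 2 + \ldots + n$, this gives at most $n(n+1)/2 + 1$ terms, which the sketch checks for small $n$:

```python
# Toy regexes as tuples: ZERO, ONE, ('chr', c), ('alt', r1, r2),
# ('seq', r1, r2), ('star', r); seq simplifies ONE . r to r.
ZERO, ONE = ('zero',), ('one',)

def seq(r1, r2):
    if r1 == ZERO or r2 == ZERO: return ZERO
    if r1 == ONE: return r2
    if r2 == ONE: return r1
    return ('seq', r1, r2)

def nullable(r):
    return (r[0] in ('one', 'star')
            or (r[0] == 'alt' and (nullable(r[1]) or nullable(r[2])))
            or (r[0] == 'seq' and nullable(r[1]) and nullable(r[2])))

def pder(x, r):
    # partial derivative w.r.t. one character, as a set of regexes
    if r[0] == 'chr':
        return {ONE} if r[1] == x else set()
    if r[0] == 'alt':
        return pder(x, r[1]) | pder(x, r[2])
    if r[0] == 'star':
        return {seq(p, r) for p in pder(x, r[1])}
    if r[0] == 'seq':
        s = {seq(p, r[2]) for p in pder(x, r[1])}
        return s | pder(x, r[2]) if nullable(r[1]) else s
    return set()  # zero, one

def closure(r, alphabet):
    # all partial derivatives of r w.r.t. any string (including r itself,
    # the derivative w.r.t. the empty string)
    done, todo = set(), {r}
    while todo:
        q = todo.pop()
        done.add(q)
        for c in alphabet:
            todo |= pder(c, q) - done
    return done

def patho(n):
    # ((a* + (aa)* + ... + (a...a)*)*)*  with n summands
    def chain(k):
        r = ('chr', 'a')
        for _ in range(k - 1):
            r = ('seq', ('chr', 'a'), r)
        return r
    alts = ('star', chain(1))
    for i in range(2, n + 1):
        alts = ('alt', alts, ('star', chain(i)))
    return ('star', ('star', alts))

sizes = {n: len(closure(patho(n), 'a')) for n in range(1, 5)}
```

For $n = 1$ the closure has exactly two elements (the expression itself and its single derivative term), and in general the sizes stay within the quadratic bound, in contrast to the exponential blow-up of unsimplified Brzozowski derivatives on this family.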
Antimirov proved that the sum of the sizes of all partial derivative
terms is bounded cubically in the size of the regular
expression:
\begin{property}\label{pderBound}
	$\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \in O(\llbracket r \rrbracket^3)$
\end{property}
\noindent
This property was formalised by Wu et al.~\cite{Wu2014}; the
details can be found in the Archive of Formal Proofs.\footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html}
Once Conjecture \ref{bsimpStrongInclusionPder} is proven, Property \ref{pderBound}
would provide us with a cubic bound for our $\blexerStrong$ algorithm:
\begin{conjecture}\label{strongCubic}
	$\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$
\end{conjecture}
\noindent
We leave this as future work.