ChengsongTanPhdThesis/Chapters/Cubic.tex
changeset 591 b2d0de6aee18
parent 590 988e92a70704
child 592 7f4c353c0f6b
equal deleted inserted replaced
590:988e92a70704 591:b2d0de6aee18
    57 %	SECTION strongsimp
    57 %	SECTION strongsimp
    58 %----------------------------------------------------------------------------------------
    58 %----------------------------------------------------------------------------------------
    59 \section{A Stronger Version of Simplification}
    59 \section{A Stronger Version of Simplification}
    60 %TODO: search for isabelle proofs of algorithms that check equivalence 
    60 %TODO: search for isabelle proofs of algorithms that check equivalence 
    61 In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
    61 In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
    62 For example, the regular expression $aa \cdot a^*+ a \cdot a^*$ contains two terms, 
    62 For example, the regular expression 
    63 which expresses two possibilities it will match future input.
    63 \[
    64 It will either match 2 $a$'s, then 0 or more $a$'s, in other words, at least 2 more $a$'s
    64 	aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
    65 \begin{figure}\label{string_3parts1}
    65 \]
    66 \begin{center}
    66 contains three terms, 
    67 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    67 expressing three possibilities it will match future input.
    68     \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
    68 The first and the third terms are identical, which means we can eliminate
    69         {Consumed Input
    69 the latter as we know it will not be picked up by $\bmkeps$.
    70         \nodepart{two} Expects $aa$
    70 In $\bsimps$, the $\distinctBy$ function takes care of this.
    71          \nodepart{three} Then expects $a^*$};
    71 The criteria $\distinctBy$ uses for removing a duplicate
    72 
    72 $a_2$ in the list
       
    73 \begin{center}
       
    74 	$rs_a@[a_1]@rs_b@[a_2]@rs_c$
       
    75 \end{center}
       
    76 is that 
       
    77 \begin{center}
       
    78 	$\rerase{a_1} = \rerase{a_2}$.
       
    79 \end{center}
       
    80 It can be characterised as the $LD$ 
       
    81 rewrite rule in \ref{rrewriteRules}.\\
       
    82 The problem , however, is that identical components
       
    83 in two slightly different regular expressions cannot be removed:
       
    84 \begin{figure}[H]
       
    85 \[
       
    86 	(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
       
    87 \]
       
    88 \caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup}
       
    89 \end{figure}
       
    90 \noindent
       
    91 A simplification like this actually
       
    92 cannot be omitted,
       
    93 as without it the size could blow up even with our $\simp$ function:
       
    94 \begin{figure}[H]
       
    95 \centering
       
    96 \begin{tikzpicture}
       
    97 \begin{axis}[
       
    98     %xlabel={$n$},
       
    99     myplotstyle,
       
   100     xlabel={input length},
       
   101     ylabel={size},
       
   102     ]
       
   103 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
       
   104 \end{axis}
    73 \end{tikzpicture}
   105 \end{tikzpicture}
    74 \end{center}
   106 \caption{Runtime of $\blexersimp$ for matching 
    75 \end{figure}
   107 	$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ 
    76 \noindent
   108 	with strings 
    77 Or it will match at least 1 more $a$:
   109 	of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
    78 \begin{figure}\label{string_3parts2}
   110 \end{figure}
    79 \begin{center}
   111 \noindent
    80 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   112 We would like to apply the rewriting at some stage
    81     \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
   113 \begin{figure}[H]
    82         {Consumed
       
    83         \nodepart{two} Expects $a$
       
    84          \nodepart{three} Then expects $a^*$};
       
    85 
       
    86 \end{tikzpicture}
       
    87 \end{center}
       
    88 \end{figure}
       
    89 If these two possibilities are identical, we can eliminate
       
    90 the second one as we know the second one corresponds to 
       
    91 a match that is not $\POSIX$.
       
    92 
       
    93 
       
    94 If two identical regexes 
       
    95 happen to be grouped into different sequences, one cannot use
       
    96 the $a + a \rightsquigarrow a$ rule to eliminate them, even if they 
       
    97 are "parallel" to each other:
       
    98 \[
   114 \[
    99 (a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
   115 	(a+b+d) \cdot r_1  \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
   100 \]
   116 \]
   101 and that's because they are followed by possibly 
   117 \caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp}
   102 different "suffixes" $r_1$ and $r_2$, and if 
   118 \end{figure}
   103 they do turn out to be different then doing 
   119 \noindent
   104 \[
   120 in our $\simp$ function,
   105 (a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
   121 so that it makes the simplification in \ref{partialDedup} possible.
   106 \]
   122 For example,
   107 might cause a possible match where the string is in $L(a \cdot r_2)$ to be lost.
   123 can a function like the below one be used?
   108 
   124 %TODO: simp' that spills things
   109 Here is an example for this.
   125 
   110 Assume we have the derivative regex 
   126 Unfortunately,
   111 \[( r_1  +  r_2  +  r_3)\cdot r+ 
   127 if we introduce them in our
   112 ( r_1  +  r_5  +  r_6)\cdot( r_1  +  r_2  +  r_3)^*
   128 setting we would lose the POSIX property of our calculated values. 
   113 \]
   129 For example given the regular expression 
   114 
   130 \begin{center}
   115 occurring in an intermediate step in our bit-coded lexing algorithm.
   131 	$(a + ab)(bc + c)$
   116 
   132 \end{center}
   117 In this expression, there will be 6 "parallel" terms if we break up regexes 
   133 and the string
   118 of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
   134 \begin{center}
   119 \begin{align}
   135 	$ab$,
   120 (\nonumber \\
   136 \end{center}
   121  r_1  + &    \label{term:1} \\
   137 then our algorithm generates the following
   122  r_2  + &  \label{term:2} \\
   138 correct POSIX value
   123  r_3 &  \label{term:3} \\
   139 \begin{center}
   124 &	)\cdot r\; + \; (\nonumber \\
   140 	$\Seq \; (\Right \; ab) \; (\Right \; c)$.
   125  r_1  + & \label{term:4} \\
   141 \end{center}
   126  r_5  + & \label{term:5} \\
   142 Essentially it matches the string with the longer Right-alternative
   127  r_6 \label{term:6}\\
   143 in the first sequence (and
   128 &      )\cdot r\nonumber
   144 then the 'rest' with the character regular expression $c$ from the second sequence).
   129 \end{align}
   145 If we add the simplification above, then we obtain the following value
   130 
   146 \begin{center}
   131 They have been labelled, and each label denotes  
   147 	$\Left \; (\Seq \; a \; (\Left \; bc))$
   132 one term, for example, \ref{term:1} denotes
   148 \end{center}
   133 \[
   149 where the $\Left$-alternatives get priority. 
   134  r_1 \cdot r
   150 However this violates the POSIX rules.
   135 \]
   151 The reason for getting this undesired value
   136 \noindent
   152 is that the new rule splits this regular expression up into 
   137 and \ref{term:3} denotes
   153 \begin{center}
   138 \[
   154 	$a\cdot(b c + c) + ab \cdot (bc + c)$,
   139  r_3\cdot r.
   155 \end{center}
   140 \]
   156 which becomes a regular expression with a 
   141 From a lexer's point of view, \ref{term:4} will never 
   157 totally different structure--the original
   142 be picked up in later phases of matching because there
   158 was a sequence, and now it becomes an alternative.
   143 is a term \ref{term:1} giving identical matching information.
   159 With an alternative the maximum munch rule no longer works.\\
   144 The first term \ref{term:1} will match a string in $L(r_1 \cdot  r)$, 
   160 A method to reconcile this is to do the 
   145 and so on for term \ref{term:2}, \ref{term:3}, etc.
   161 transformation in \ref{desiredSimp} ``non-invasively'',
   146 
   162 meaning that we traverse the list of regular expressions
   147 \mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
   163 \begin{center}
   148 \begin{center}\label{string_2parts}
   164 	$rs_a@[a]@rs_c$
   149 
   165 \end{center}
   150 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   166 in the alternative
   151     \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
   167 \begin{center}
   152         {$r_{x1}$
   168 	$\sum ( rs_a@[a]@rs_c)$
   153          \nodepart{two} $r_1\cdot r$ };
   169 \end{center}
   154 %\caption{term 1 \ref{term:1}'s matching configuration}
   170 using  a function similar to $\distinctBy$,
   155 \end{tikzpicture}
   171 but this time 
   156 
   172 we allow a more general list rewrite:
   157 \end{center}
   173 	\begin{mathpar}
   158 For term 1 \ref{term:1}, whatever was before the current
   174 		\inferrule{\vspace{0mm} }{rs_a@[a]@rs_c 
   159 input position was fully matched and the regex corresponding
   175 			\stackrel{s}{\rightsquigarrow }
   160 to it reduced to $\ONE$, 
   176 		rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
   161 and in the next input token, it will start on $r_1\cdot r$.
   177 	\end{mathpar}
   162 The resulting value will be something of a similar configuration:
   178 %L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
   163 \begin{center}\label{value_2parts}
   179 where $\textit{prune} \;a \; acc$ traverses $a$
   164 %\caption{term 1 \ref{term:1}'s matching configuration}
   180 without altering the structure of $a$, removing components in $a$
   165 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   181 that have appeared in the accumulator $acc$.
   166     \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
   182 For example
   167         {$v_{x1}$
   183 \begin{center}
   168          \nodepart{two} $v_{r_1\cdot r}$ };
   184 	$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
   169 \end{tikzpicture}
   185 \end{center}
   170 \end{center}
   186 should be equal to 
   171 For term 2 \ref{term:2} we have a similar value partition:
   187 \begin{center}
   172 \begin{center}\label{value_2parts2}
   188 	$(r_g+r_h)r_d$
   173 %\caption{term 1 \ref{term:1}'s matching configuration}
   189 \end{center}
   174 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   190 because $r_gr_d$ and 
   175     \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
   191 $r_hr_d$ are the only terms
   176         {$v_{x2}$
   192 that have not appeared in the accumulator list 
   177          \nodepart{two} $v_{r_2\cdot r}$ };
   193 \begin{center}
   178 \end{tikzpicture}
   194 $[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
   179 \end{center}
   195 \end{center}
   180 \noindent
   196 We implemented 
   181 and so on.
   197 function $\textit{prune}$ in Scala,
   182 We note that for term 4 \ref{term:4} its result value 
   198 and incorporated into our lexer,
   183 after any position beyond  the current one will always
   199 by replacing the $\simp$ function 
   184 be the same:
   200 with a stronger version called $\bsimpStrong$
   185 \begin{center}\label{value_2parts4}
   201 that prunes regular expressions.
   186 %\caption{term 1 \ref{term:1}'s matching configuration}
   202 We call this lexer $\blexerStrong$.
   187 \begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
   203 $\blexerStrong$ is able to drastically reduce the
   188     \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
   204 internal data structure size which could
   189         {$v_{x4}$
   205 trigger exponential behaviours in
   190          \nodepart{two} $v_{r_1\cdot r}$ };
   206 $\blexersimp$.
   191 \end{tikzpicture}
   207 \begin{figure}[H]
   192 \end{center}
       
   193 And $\POSIX$ rules says that we can eliminate at least one of them.
       
   194 Our algorithm always puts the regex with the longest initial sub-match at the left of the 
       
   195 alternatives, so we safely throw away \ref{term:4}.
       
   196 The fact that term 1 and 4 are not immediately in the same alternative
       
   197 expression does not prevent them from being two duplicate matches at
       
   198 the current point of view.
       
   199 To implement this idea into an algorithm, we define a "pruning function"
       
   200 $\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
       
   201 $\textit{acc}$, which acts as an element
       
   202 is a stronger version of $\textit{distinctBy}$.
       
   203 Here is a concise version of it (in the style of Scala):
       
   204 \begin{figure}[H]
       
   205 \begin{lstlisting}
       
   206 def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) : 
       
   207 List[ARexp] =  rs match {
       
   208   case Nil => Nil
       
   209   case r :: rs =>
       
   210     if(acc.contains(erase(r)))
       
   211       distinctByPlus(rs, acc)
       
   212     else 
       
   213       prune(r, acc) :: distinctByPlus(rs, prune(r, acc) +: acc)
       
   214 }
       
   215 
       
   216 \end{lstlisting}
       
   217 \caption{the distinctByPlus function}
       
   218 \end{figure}
       
   219 But for the function $\textit{prune}$ there is a difficulty:
       
   220 how could one define the $L$ function in a "computable" way,
       
   221 so that they generate a (lazy infinite) set of strings in $L(r)$.
       
   222 We also need a function that tests whether $L(r_1) \subseteq L(r_2)$
       
   223 is true.
       
   224 For the moment we cut corners and do not define these two functions
       
   225 as they are not used by Antimirov (and will probably not contribute
       
   226 to a bound better than Antimirov's cubic bound).
       
   227 Rather, we do not try to eliminate in every instance of regular expressions
       
   228 that have "language duplicates", but only eliminate the "exact duplicates".
       
   229 For this we need to break up terms $(a+b)\cdot c$ into two
       
   230 terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
       
   231 \begin{figure}[H]
       
   232 \begin{lstlisting}
       
   233 def distinctWith(rs: List[ARexp], 
       
   234                 pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
       
   235                 acc: Set[Rexp] = Set()) : List[ARexp] = 
       
   236   rs match{
       
   237     case Nil => Nil
       
   238     case r :: rs => 
       
   239       if(acc(erase(r)))
       
   240         distinctWith(rs, pruneFunction, acc)
       
   241       else {
       
   242         val pruned_r = pruneFunction(r, acc)
       
   243         pruned_r :: 
       
   244         distinctWith(rs, 
       
   245           pruneFunction, 
       
   246           turnIntoTerms(erase(pruned_r)) ++: acc
       
   247         )
       
   248       }
       
   249   }
       
   250 \end{lstlisting}
       
   251 \caption{A Stronger Version of $\textit{distinctBy}$}
       
   252 \end{figure}
       
   253 \noindent
       
   254 This process is done by the function $\textit{turnIntoTerms}$:
       
   255 \begin{figure}
       
   256 \begin{verbatim}
       
   257 def turnIntoTerms(r: Rexp): List[Rexp] = r match {
       
   258   case SEQ(r1, r2)  => if(isOne(r1)) 
       
   259                           turnIntoTerms(r2) 
       
   260                        else 
       
   261                           turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
       
   262   case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
       
   263   case ZERO => Nil
       
   264   case _ => r :: Nil
       
   265 }
       
   266 \end{verbatim}
       
   267 \caption{function that breaks up regular expression into "atomic" terms}
       
   268 \end{figure}
       
   269 
       
   270 \noindent
       
   271 The pruning function can be defined recursively:
       
   272 \begin{figure}
       
   273 \begin{verbatim}
       
   274 def prune7(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
       
   275   case AALTS(bs, rs) => rs.map(r => prune7(r, acc)).filter(_ != ZERO) match
       
   276   {
       
   277     case Nil => AZERO
       
   278     case r::Nil => fuse(bs, r)
       
   279     case rs1 => AALTS(bs, rs1)
       
   280   }
       
   281   case ASEQ(bs, r1, r2) => prune7(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
       
   282     case AZERO => AZERO
       
   283     case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
       
   284     case r1p => ASEQ(bs, r1p, r2)
       
   285   }
       
   286   case r => if(acc(erase(r))) AZERO else r
       
   287 }
       
   288 \end{verbatim}
       
   289 \caption{pruning function}
       
   290 \end{figure}
       
   291 
       
   292 
       
   293 
       
   294 \begin{figure}
       
   295 \centering
   208 \centering
   296 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   209 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
   297 \begin{tikzpicture}
   210 \begin{tikzpicture}
   298 \begin{axis}[
   211 \begin{axis}[
   299     %xlabel={$n$},
   212     %xlabel={$n$},
   314     ]
   227     ]
   315 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
   228 \addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
   316 \end{axis}
   229 \end{axis}
   317 \end{tikzpicture}\\
   230 \end{tikzpicture}\\
   318 \multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
   231 \multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings 
   319            of the form $\underbrace{aa..a}_{n}$.}
   232 	   of the form $\underbrace{aa..a}_{n}$.}
   320 \end{tabular}    
   233 \end{tabular}    
   321 \caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
   234 \caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
   322 \end{figure}
   235 \end{figure}
       
   236 \begin{figure}[H]
       
   237 \begin{lstlisting}
       
   238     def atMostEmpty(r: Rexp) : Boolean = r match {
       
   239       case ZERO => true
       
   240       case ONE => true
       
   241       case STAR(r) => atMostEmpty(r)
       
   242       case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
       
   243       case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
       
   244       case CHAR(_) => false
       
   245     }
       
   246 
       
   247 
       
   248     def isOne(r: Rexp) : Boolean = r match {
       
   249       case ONE => true
       
   250       case SEQ(r1, r2) => isOne(r1) && isOne(r2)
       
   251       case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne)
       
   252       case STAR(r0) => atMostEmpty(r0)
       
   253       case CHAR(c) => false
       
   254       case ZERO => false
       
   255     }
       
   256 
       
   257     //r = r' ~ tail' : If tail' matches tail => returns r'
       
   258     def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match {
       
   259       case SEQ(r1, r2) => 
       
   260         if(r2 == tail) 
       
   261           r1
       
   262         else
       
   263           ZERO
       
   264       case r => ZERO
       
   265     }
       
   266 
       
   267     def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
       
   268       case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match
       
   269       {
       
   270         //all components have been removed, meaning this is effectively a duplicate
       
   271         //flats will take care of removing this AZERO 
       
   272         case Nil => AZERO
       
   273         case r::Nil => fuse(bs, r)
       
   274         case rs1 => AALTS(bs, rs1)
       
   275       }
       
   276       case ASEQ(bs, r1, r2) => 
       
   277         //remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1
       
   278         prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
       
   279           //after pruning, returns 0
       
   280           case AZERO => AZERO
       
   281           //after pruning, got r1'.r2, where r1' is equal to 1
       
   282           case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
       
   283           //assemble the pruned head r1p with r2
       
   284           case r1p => ASEQ(bs, r1p, r2)
       
   285         }
       
   286         //this does the duplicate component removal task
       
   287       case r => if(acc(erase(r))) AZERO else r
       
   288     }
       
   289 \end{lstlisting}
       
   290 \caption{pruning function together with its helper functions}
       
   291 \end{figure}
       
   292 \noindent
       
   293 The benefits of using 
       
   294 $\textit{prune}$ such as refining the finiteness bound
       
   295 to a cubic bound has not been formalised yet.
       
   296 Therefore we choose to use Scala code rather than an Isabelle-style formal 
       
   297 definition like we did for $\simp$, as the definitions might change
       
   298 to suit proof needs.
       
   299 In the rest of the chapter we will use this convention consistently.
       
   300 \begin{figure}[H]
       
   301 \begin{lstlisting}
       
   302   def distinctWith(rs: List[ARexp], 
       
   303     pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
       
   304     acc: Set[Rexp] = Set()) : List[ARexp] = 
       
   305       rs match{
       
   306         case Nil => Nil
       
   307         case r :: rs => 
       
   308           if(acc(erase(r)))
       
   309             distinctWith(rs, pruneFunction, acc)
       
   310           else {
       
   311             val pruned_r = pruneFunction(r, acc)
       
   312             pruned_r :: 
       
   313             distinctWith(rs, 
       
   314               pruneFunction, 
       
   315               turnIntoTerms(erase(pruned_r)) ++: acc
       
   316               )
       
   317           }
       
   318       }
       
   319 \end{lstlisting}
       
   320 \caption{A Stronger Version of $\textit{distinctBy}$}
       
   321 \end{figure}
       
   322 \noindent
       
   323 The function $\textit{prune}$ is used in $\distinctWith$.
       
   324 $\distinctWith$ is a stronger version of $\distinctBy$
       
   325 which not only removes duplicates as $\distinctBy$ would
       
   326 do, but also uses the $\textit{pruneFunction}$
       
   327 argument to prune away verbose components in a regular expression.\\
       
   328 \begin{figure}[H]
       
   329 \begin{lstlisting}
       
   330    //a stronger version of simp
       
   331     def bsimpStrong(r: ARexp): ARexp =
       
   332     {
       
   333       r match {
       
   334         case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
       
   335           //normal clauses same as simp
       
   336         case (AZERO, _) => AZERO
       
   337         case (_, AZERO) => AZERO
       
   338         case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
       
   339         //bs2 can be discarded
       
   340         case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
       
   341         case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
       
   342         }
       
   343         case AALTS(bs1, rs) => {
       
   344           //distinctBy(flat_res, erase)
       
   345           distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
       
   346             case Nil => AZERO
       
   347             case s :: Nil => fuse(bs1, s)
       
   348             case rs => AALTS(bs1, rs)
       
   349           }
       
   350         }
       
   351         //stars that can be treated as 1
       
   352         case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
       
   353         case r => r
       
   354       }
       
   355     }
       
   356 \end{lstlisting}
       
   357 \caption{The function $\bsimpStrong$ and $\bdersStrongs$}
       
   358 \end{figure}
       
   359 \noindent
       
   360 $\distinctWith$, is in turn used in $\bsimpStrong$:
       
   361 \begin{figure}[H]
       
   362 \begin{lstlisting}
       
   363       //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)
       
   364       def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
       
   365         case Nil => r
       
   366         case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
       
   367       }
       
   368 \end{lstlisting}
       
   369 \caption{The function $\bsimpStrong$ and $\bdersStrongs$}
       
   370 \end{figure}
       
   371 \noindent
       
   372 We conjecture that the above Scala function $\bdersStrongs$,
       
   373 written $\bdersStrong{\_}{\_}$ as an infix notation, 
       
   374 satisfies the following property:
       
   375 \begin{conjecture}
       
   376 	$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
       
   377 \end{conjecture}
       
   378 The stronger version of $\blexersimp$'s
       
   379 code in Scala looks like: 
       
   380 \begin{figure}[H]
       
   381 \begin{lstlisting}
       
   382       def strongBlexer(r: Rexp, s: String) : Option[Val] = {
       
   383         Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)
       
   384       }
       
   385       def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {
       
   386         case Nil => {
       
   387           if (bnullable(r)) {
       
   388             mkepsBC(r)
       
   389           }
       
   390           else
       
   391             throw new Exception("Not matched")
       
   392         }
       
   393         case c::cs => {
       
   394           strong_blex_simp(strongBsimp(bder(c, r)), cs)
       
   395         }
       
   396       }
       
   397 \end{lstlisting}
       
   398 \end{figure}
       
   399 \noindent
       
   400 We would like to preserve the correctness like the one 
       
   401 we had for $\blexersimp$:
       
   402 \begin{conjecture}\label{cubicConjecture}
       
   403 	$\blexerStrong \;r \; s = \blexer\; r\;s$
       
   404 \end{conjecture}
       
   405 
       
   406 To implement this idea into an algorithm, we define a "pruning function"
       
   407 $\textit{prune}$, that takes away parts of all terms in $r$ that belongs to
       
   408 $\textit{acc}$, which acts as an element
       
   409 is a stronger version of $\textit{distinctBy}$.
       
   410 Here is a concise version of it (in the style of Scala):
       
   411 
       
   412 
       
   413 
       
   414 
       
   415 \noindent
       
   416 The pruning function can be defined recursively:
       
   417 
       
   418 
       
   419 
       
   420 
       
   421 This suggests a link towrads "partial derivatives"
       
   422  introduced by Antimirov \cite{Antimirov95}.
       
   423  
       
   424  \section{Antimirov's partial derivatives}
       
   425  The idea behind Antimirov's partial derivatives
       
   426 is to do derivatives in a similar way as suggested by Brzozowski, 
       
   427 but maintain a set of regular expressions instead of a single one:
       
   428 
       
   429 %TODO: antimirov proposition 3.1, needs completion
       
   430  \begin{center}  
       
   431  \begin{tabular}{ccc}
       
   432  $\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
       
   433 $\partial_x(\ONE)$ & $=$ & $\phi$
       
   434 \end{tabular}
       
   435 \end{center}
       
   436 
       
   437 Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
       
   438 using the alternatives constructor, Antimirov cleverly chose to put them into
       
   439 a set instead.  This breaks the terms in a derivative regular expression up, 
       
   440 allowing us to understand what are the "atomic" components of it.
       
   441 For example, To compute what regular expression $x^*(xx + y)^*$'s 
       
   442 derivative against $x$ is made of, one can do a partial derivative
       
   443 of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
       
   444 from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
       
   445 To get all the "atomic" components of a regular expression's possible derivatives,
       
   446 there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
       
   447 whatever character is available at the head of the string inside the language of a
       
   448 regular expression, and gives back the character and the derivative regular expression
       
   449 as a pair (which he called "monomial"):
       
   450  \begin{center}  
       
   451  \begin{tabular}{ccc}
       
   452  $\lf(\ONE)$ & $=$ & $\phi$\\
       
   453 $\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
       
   454  $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
       
   455  $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
       
   456 \end{tabular}
       
   457 \end{center}
       
   458 %TODO: completion
       
   459 
       
   460 There is a slight difference in the last three clauses compared
       
   461 with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular 
       
   462 expression $r$ with every element inside $\textit{rset}$ to create a set of 
       
   463 sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates 
       
   464 on a set of monomials (which Antimirov called "linear form") and a regular 
       
   465 expression, and returns a linear form:
       
   466  \begin{center}  
       
   467  \begin{tabular}{ccc}
       
   468  $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
       
   469  $l \bigodot (\ONE)$ & $=$ & $l$\\
       
   470  $\phi \bigodot t$ & $=$ & $\phi$\\
       
   471  $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
       
   472  $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
       
   473   $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
       
   474  $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
       
   475  $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
       
   476 \end{tabular}
       
   477 \end{center}
       
   478 %TODO: completion
       
   479 
       
   480  Some degree of simplification is applied when doing $\bigodot$, for example,
       
   481  $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
       
   482  and  $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
       
   483   $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
       
   484   and so on.
       
   485   
       
   486   With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with 
       
   487   an iterative procedure:
       
   488    \begin{center}  
       
   489  \begin{tabular}{llll}
       
   490 $\textit{while}$ & $(\Delta_i \neq \phi)$  &                &          \\
       
   491  		       &  $\Delta_{i+1}$           &        $ =$ & $\lf(\Delta_i) - \PD_i$ \\
       
   492 		       &  $\PD_{i+1}$              &         $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
       
   493 $\partial_{UNIV}(r)$ & $=$ & $\PD$ &		     
       
   494 \end{tabular}
       
   495 \end{center}
       
   496 
       
   497 
       
   498  $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
       
   499 
       
   500 
   323 
   501 
   324 
   502 
   325 
   503 
   326 %----------------------------------------------------------------------------------------
   504 %----------------------------------------------------------------------------------------
   327 %	SECTION 1
   505 %	SECTION 1
   343 The derivative would be
   521 The derivative would be
   344 \[
   522 \[
   345 ~r \backslash c = ~ (r \backslash c)
   523 ~r \backslash c = ~ (r \backslash c)
   346 \]
   524 \]
   347  
   525  
   348 The most tricky part of lexing for the $~r$ regex
   526 The most tricky part of lexing for the $~r$ regular expression
   349  is creating a value for it.
   527  is creating a value for it.
   350  For other regular expressions, the value aligns with the 
   528  For other regular expressions, the value aligns with the 
   351  structure of the regex:
   529  structure of the regular expression:
   352  \[
   530  \[
   353  \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
   531  \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
   354  \]
   532  \]
   355 But for the $~r$ regex, $s$ is a member of it if and only if
   533 But for the $~r$ regular expression, $s$ is a member of it if and only if
   356 $s$ does not belong to $L(r)$. 
   534 $s$ does not belong to $L(r)$. 
   357 That means when there
   535 That means when there
   358 is a match for the not regex, it is not possible to generate how the string $s$ matched
   536 is a match for the not regular expression, it is not possible to generate how the string $s$ matched
   359 with $r$. 
   537 with $r$. 
   360 What we can do is preserve the information of how $s$ was not matched by $r$,
   538 What we can do is preserve the information of how $s$ was not matched by $r$,
   361 and there are a number of options to do this.
   539 and there are a number of options to do this.
   362 
   540 
   363 We could give a partial value when there is a partial match for the regex inside
   541 We could give a partial value when there is a partial match for the regular expression inside
   364 the $\mathbf{not}$ construct.
   542 the $\mathbf{not}$ construct.
   365 For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
   543 For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
   366 A value for it could be 
   544 A value for it could be 
   367  \[
   545  \[
   368  \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
   546  \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)