ChengsongTanPhdThesis/Chapters/Cubic.tex
%----------------------------------------------------------------------------------------
%	SECTION strongsimp
%----------------------------------------------------------------------------------------
\section{A Stronger Version of Simplification}
%TODO: search for isabelle proofs of algorithms that check equivalence 
In our bit-coded lexing algorithm, with or without simplification,
two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair
are always expressed in the ``derivative regular expression'' as two
disjoint alternative terms at the current (sub-)regex level. The fact that they
are parallel tells us that when we retrieve the information from this (sub-)regex,
there will always be a choice of which alternative term to take.
As an example, the regular expression $aa \cdot a^* + a \cdot a^*$ (omitting bit-codes)
expresses two possibilities for how it will match future input.
It will either match two $a$'s and then zero or more $a$'s, in other words, at least two more $a$'s:
       
\begin{figure}\label{string_3parts1}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
        {Consumed Input
        \nodepart{two} Expects $aa$
         \nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
       
\noindent
Or it will match at least one more $a$:
\begin{figure}\label{string_3parts2}
\begin{center}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=3]
        {Consumed
        \nodepart{two} Expects $a$
         \nodepart{three} Then expects $a^*$};
\end{tikzpicture}
\end{center}
\end{figure}
       
If these two possibilities are identical, we can eliminate
the second one, as we know it corresponds to
a match that is not $\POSIX$.
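\noindent
For instance, taking the derivative of $aa \cdot a^* + a \cdot a^*$ with respect to the
string $aa$ and applying the usual simplifications leaves us with
\[
a^* \;+\; a^*,
\]
where the second $a^*$ can be thrown away, because the value it would eventually
contribute cannot be the $\POSIX$ one.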
       

If two identical regexes
happen to be grouped into different sequences, one cannot use
the $a + a \rightsquigarrow a$ rule to eliminate them, even if they
are ``parallel'' to each other:
\[
(a+b) \cdot r_1 + (a+c) \cdot r_2 \rightsquigarrow (a+b) \cdot r_1 + (c) \cdot r_2
\]
This is because they are followed by possibly
different ``suffixes'' $r_1$ and $r_2$: if
$r_1$ and $r_2$ do turn out to be different, then the rewrite above
might cause a possible match, where the string is in $L(a \cdot r_2)$, to be lost.
For example, with $r_1 = b$ and $r_2 = c$, the string $ac$ is matched by the
left-hand side but not by the right-hand side of the rewrite.

Here is a larger example of this.
Assume we have the derivative regex
\[
( r_1  +  r_2  +  r_3)\cdot r \;+\; ( r_1  +  r_5  +  r_6)\cdot r,
\]
where $r$ stands for $( r_1  +  r_2  +  r_3)^*$,
occurring in an intermediate step in our bit-coded lexing algorithm.

In this expression, there will be six ``parallel'' terms if we break up regexes
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$):
\begin{align}
(\nonumber \\
 r_1  + &    \label{term:1} \\
 r_2  + &  \label{term:2} \\
 r_3 &  \label{term:3} \\
&	)\cdot r \; + \; (\nonumber \\
 r_1  + & \label{term:4} \\
 r_5  + & \label{term:5} \\
 r_6 & \label{term:6}\\
&      )\cdot r\nonumber
\end{align}

They have been labelled, and each label denotes
one term; for example, \ref{term:1} denotes
\[
 r_1 \cdot r
\]
\noindent
and \ref{term:3} denotes
\[
 r_3\cdot r.
\]
From a lexer's point of view, \ref{term:4} will never
be picked up in later phases of matching because there
is a term \ref{term:1} giving identical matching information.
The first term \ref{term:1} will match a string in $L(r_1 \cdot r)$,
and so on for terms \ref{term:2}, \ref{term:3}, etc.

\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
\begin{center}\label{string_2parts}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$r_{x1}$
         \nodepart{two} $r_1\cdot r$ };
%\caption{term 1 \ref{term:1}'s matching configuration}
\end{tikzpicture}
\end{center}
For term 1 \ref{term:1}, whatever came before the current
input position has been fully matched and the regex corresponding
to it has been reduced to $\ONE$;
from the next input character onwards, matching will continue with $r_1\cdot r$.
The resulting value will have a similar configuration:
\begin{center}\label{value_2parts}
%\caption{term 1 \ref{term:1}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x1}$
         \nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
       
For term 2 \ref{term:2} we have a similar value partition:
\begin{center}\label{value_2parts2}
%\caption{term 2 \ref{term:2}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x2}$
         \nodepart{two} $v_{r_2\cdot r}$ };
\end{tikzpicture}
\end{center}
\noindent
and so on.
We note that for term 4 \ref{term:4} the result value
at any position beyond the current one will always
be the same as that of term 1 \ref{term:1}:
\begin{center}\label{value_2parts4}
%\caption{term 4 \ref{term:4}'s matching configuration}
\begin{tikzpicture}[every node/.append style={draw, rounded corners, inner sep=10pt}]
    \node [rectangle split, rectangle split horizontal, rectangle split parts=2]
        {$v_{x4}$
         \nodepart{two} $v_{r_1\cdot r}$ };
\end{tikzpicture}
\end{center}
The $\POSIX$ rules say that we can eliminate at least one of them.
Our algorithm always puts the regex with the longest initial sub-match at the left of the
alternatives, so we can safely throw away \ref{term:4}.
The fact that terms 1 and 4 are not immediately in the same alternative
expression does not prevent them from being two duplicate matches from
the current point of view.
       
To implement this idea as an algorithm, we define a ``pruning function''
$\textit{prune}$ that takes away those parts of a term $r$ that already belong to
the accumulator $\textit{acc}$. This gives
a stronger version of $\textit{distinctBy}$.
Here is a concise version of it (in the style of Scala):
\begin{verbatim}
def distinctByPlus(rs: List[ARexp], acc: Set[Rexp] = Set()) : 
List[ARexp] = rs match {
  case Nil => Nil
  case r :: rs =>
    if (acc.contains(erase(r)))
      distinctByPlus(rs, acc)
    else {
      val pruned = prune(r, acc)
      pruned :: distinctByPlus(rs, acc + erase(pruned))
    }
}
\end{verbatim}
       
But for the function $\textit{prune}$ there is a difficulty:
how could one define the $L$ function in a ``computable'' way,
so that it generates a (lazy, possibly infinite) set of strings in $L(r)$?
We would also need a function that tests whether $L(r_1) \subseteq L(r_2)$
holds.
For the moment we cut corners and do not define these two functions,
as they are not used by Antimirov (and will probably not contribute
to a bound better than Antimirov's cubic bound).
Rather than trying to eliminate every instance of regular expressions
that are ``language duplicates'', we only eliminate the ``exact duplicates''.
For this we need to break up terms $(a+b)\cdot c$ into the two
terms $a\cdot c$ and $b\cdot c$ before we add them to the accumulator:
       
\begin{verbatim}
def distinctWith(rs: List[ARexp], 
                pruneFunction: (ARexp, Set[Rexp]) => ARexp, 
                acc: Set[Rexp] = Set()) : List[ARexp] = 
  rs match {
    case Nil => Nil
    case r :: rs => 
      if (acc(erase(r)))
        distinctWith(rs, pruneFunction, acc)
      else {
        val pruned_r = pruneFunction(r, acc)
        pruned_r :: 
        distinctWith(rs, 
          pruneFunction, 
          turnIntoTerms(erase(pruned_r)) ++: acc
        )
      }
  }
\end{verbatim}
       
\noindent
This process is done by the function $\textit{turnIntoTerms}$:
\begin{verbatim}
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
  case SEQ(r1, r2)  => if (isOne(r1)) 
                          turnIntoTerms(r2) 
                       else 
                          turnIntoTerms(r1).map(r11 => SEQ(r11, r2))
  case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
  case ZERO => Nil
  case _ => r :: Nil
}
\end{verbatim}
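\noindent
For instance, assuming the usual constructors $\textit{CHAR}$, $\textit{SEQ}$ and
$\textit{ALTS}$ of our Scala implementation, breaking up the term $(a+b)\cdot c$
yields the two sequence terms $a \cdot c$ and $b \cdot c$:
\begin{verbatim}
// illustration only (constructor names as in our Scala implementation)
val a = CHAR('a'); val b = CHAR('b'); val c = CHAR('c')
turnIntoTerms(SEQ(ALTS(a, b), c))
// => List(SEQ(CHAR('a'), CHAR('c')), SEQ(CHAR('b'), CHAR('c')))
\end{verbatim}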
       
\noindent
The pruning function can be defined recursively:
\begin{verbatim}
def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match {
  case AALTS(bs, rs) => 
    rs.map(r => prune(r, acc)).filter(_ != AZERO) match {
      case Nil => AZERO
      case r :: Nil => fuse(bs, r)
      case rs1 => AALTS(bs, rs1)
    }
  case ASEQ(bs, r1, r2) => 
    prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
      case AZERO => AZERO
      case r1p if (isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
      case r1p => ASEQ(bs, r1p, r2)
    }
  case r => if (acc(erase(r))) AZERO else r
}
\end{verbatim}
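\noindent
The helper $\textit{removeSeqTail}$ is not shown here. A minimal sketch of what it
could look like, under the assumption that it merely strips a given sequence
``tail'' from a term (and maps terms without that tail to $\textit{ZERO}$, so that
they can never license any pruning), is:
\begin{verbatim}
// a sketch only -- the assumption is that removeSeqTail strips the
// common tail: removeSeqTail(SEQ(r1, tail), tail) gives r1, the tail
// itself becomes ONE, and everything else is mapped to ZERO
def removeSeqTail(r: Rexp, tail: Rexp): Rexp = r match {
  case SEQ(r1, r2) if r2 == tail => r1
  case r if r == tail => ONE
  case _ => ZERO
}
\end{verbatim}
\noindent
The stronger simplification used for the $\textit{bsimpStrong}$ curve below can then,
for instance, call $\textit{distinctWith}$ with $\textit{prune}$ as the pruning
function in place of the plain $\textit{distinctBy}$.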
       

\begin{figure}
\centering
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=800,
    ytick={0,200,...,800},
    scaled ticks=false,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={$bsimpStrong$ size growth},  
    legend pos=north west,
    legend cell align=left]
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
\end{axis}
\end{tikzpicture}
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    ylabel={size},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=30,
    ymax=42000,
    ytick={0,10000,...,40000},
    scaled ticks=true,
    axis lines=left,
    width=5cm,
    height=4cm, 
    legend entries={$bsimp$ size growth},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{2}{c}{Graphs: Derivative size when matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ against strings 
           of the form $\underbrace{aa..a}_{n}$.}
\end{tabular}    
\caption{Size growth of derivatives under $bsimpStrong$ compared with $bsimp$.} \label{fig:aaaaaStarStar}
\end{figure}



%----------------------------------------------------------------------------------------
%	SECTION 1
constructor.

Another option would be to store either the string $s$ that resulted in
a mismatch for $r$ or a dummy value as a placeholder:
\[
\vdash \textit{Not}(abcd) : ~(r_1)
\]
or
\[
\vdash \textit{Not}(\textit{Dummy}) : ~(r_1)
\]
We choose the latter, as it is the most straightforward to implement:
\[
\mkeps(~(r))  = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
\]
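\noindent
In the Scala implementation this clause could look roughly as follows. This is a
sketch only: the constructor names $\textit{Not}$, $\textit{Dummy}$ and
$\textit{Error}$ are illustrative stand-ins for whatever the actual value datatype
provides.
\begin{verbatim}
// illustrative sketch of the clause above (names are placeholders)
sealed trait Val
case object Dummy extends Val
case class Not(v: Val) extends Val
case object Error extends Val   // signals that no value exists

def mkepsNot(r: Rexp): Val =
  if (nullable(r)) Error else Not(Dummy)
\end{verbatim}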