ChengsongTanPhdThesis/Chapters/Finite.tex
changeset 576 3e1b699696b6
parent 564 3cbcd7cda0a9
child 577 f47fc4840579
equal deleted inserted replaced
575:3178f0e948ac 576:3e1b699696b6
     5 \label{Finite} 
     5 \label{Finite} 
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
     6 %  In Chapter 4 \ref{Chapter4} we give the second guarantee
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     7 %of our bitcoded algorithm, that is a finite bound on the size of any 
     8 %regex's derivatives. 
     8 %regex's derivatives. 
     9 
     9 
    10 In this chapter we give a guarantee in terms of time complexity:
    10 In this chapter we give a guarantee in terms of size: 
    11 given a regular expression $r$, for any string $s$ 
    11 given an annotated regular expression $a$, for any string $s$ 
    12 our algorithm's internal data structure is finitely bounded.
    12 our algorithm's internal data structure 
    13 Note that it is not immediately obvious that $\llbracket \bderssimp{r}{s} \rrbracket$ (the internal
    13 $\bderssimp{a}{s}$'s size
    14 data structure used in our $\blexer$)
    14 \begin{center}
    15 is bounded by a constant $N_r$, where $N$ only depends on the regular expression
    15 $\llbracket \bderssimp{a}{s} \rrbracket$
    16 $r$, not the string $s$.
    16 \end{center}
    17 When doing a time complexity analysis of any 
    17 \noindent
    18 lexer/parser based on Brzozowski derivatives, one needs to take into account that
    18 is finitely bounded
    19 not only the "derivative steps".
    19 by a constant $N_a$ that only depends on $a$.
    20 
    20 
    21 %TODO: get a grpah for internal data structure growing arbitrary large
    21 \section{Formalising About Size}
    22 
    22 In our lexer $\blexersimp$,
    23 
    23 The regular expression is repeatedly being taken derivative of
    24 To obtain such a proof, we need to 
    24 and then simplified.
       
    25 
       
    26 \begin{center}
       
    27 \begin{tikzpicture}[scale=2,node distance=1.4cm,
       
    28                     every node/.style={minimum size=9mm}]
       
    29 \node (r0)  {$r$};
       
    30 \node (r1) [rectangle, draw=black, thick, right=of r0]{$r_1$};
       
    31 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};
       
    32 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$r_{1s}$};
       
    33 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {simp};
       
    34 \node (r2) [rectangle, draw=black, thick,  right=of r1s]{$r_2$};
       
    35 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};
       
    36 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=7mm]{$r_{2s}$};
       
    37 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$simp$};
       
    38 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=7mm]{$r_{ns}$};
       
    39 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};
       
    40 \node (v) [circle, draw = blue, thick, right=of rns,minimum size=7mm]{$v$};
       
    41 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {collect+decode};
       
    42 \end{tikzpicture}
       
    43 \end{center}
       
    44 \noindent
       
    45 As illustrated in the picture, 
       
    46 each time the derivative is taken derivative of, it grows,
       
    47 and when it is being simplified, it shrinks.
       
    48 The blue ones are the regular expressions after simplification,
       
    49 which would be smaller than before.
       
    50 We give a mechanised proof that after simplification 
       
    51 the regular expression's size (the blue ones)
       
    52 $\bderssimp{a}{s}$ will never exceed a constant $N_a$ for
       
    53 a fixed $a$. 
       
    54 While it is not yet a direct formalisation of our lexer's complexity,
       
    55 it is a stepping stone towards a complexity proof because
       
    56 if the data structures out lexer has to traverse is large, the program
       
    57 will certainly be slow.
       
    58 
       
    59 The bound is not yet tight, and we seek to improve $N_a$ so that
       
    60 it is polynomial on $\llbracket a \rrbracket$.
       
    61 We believe a formalisation of the complexity-related properties
       
    62 of the algorithm is important in our context, because we want to address
       
    63 catastrophic backtracking, which is not a correctness issue but
       
    64 in essence a performance issue, a formal proof can give
       
    65 the strongest assurance that such issues cannot arise 
       
    66 regardless of what the input is.
       
    67 This level of certainty cannot come from a pencil and paper proof or
       
    68 eimpirical evidence.
       
    69 
       
    70 For example, Sulzmann and Lu's made claimed that their algorithm
       
    71 with bitcodes and simplifications can lex in linear time with respect
       
    72 to the input string. This assumes that each
       
    73 derivative operation takes constant time. 
       
    74 However, it turns out that on certain cases their lexer 
       
    75 will have an indefinite size explosion, causing the running time 
       
    76 of each derivative step to grow arbitrarily large.
       
    77 Here in our proof we state that such explosions cannot happen 
       
    78 with our simplification function.
       
    79 
       
    80 Here is a bird's eye view of how the proof of finiteness works: 
       
    81 \begin{center}
       
    82   \begin{tikzpicture}[scale=1,font=\bf,
       
    83                       node/.style={
       
    84                       rectangle,rounded corners=3mm,
       
    85                       ultra thick,draw=black!50,minimum height=18mm, 
       
    86                       minimum width=20mm,
       
    87                       top color=white,bottom color=black!20}]
       
    88 
       
    89 
       
    90 		      \node (0) at (-5,0) [node, text width=1.8cm, text centered] {$\llbracket \bderssimp{a}{s} \rrbracket$};
       
    91 		      \node (A) at (0,0) [node,text width=1.6cm,  text centered] {$\llbracket \rderssimp{r}{s} \rrbracket_r$};
       
    92 		      \node (B) at (3,0) [node,text width=3.0cm, anchor=west, minimum width = 40mm] {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
       
    93   \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};
       
    94  
       
    95   \draw [->,line width=0.5mm] (0) -- node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$r = a \downarrow_r$} (A); 
       
    96   \draw [->,line width=0.5mm] (A) -- node [above,pos=0.35] {$\quad =\ldots=$} (B); 
       
    97   \draw [->,line width=0.5mm] (B) -- node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); 
       
    98   \end{tikzpicture}
       
    99 \end{center}
       
   100 \noindent
       
   101 We explain the above steps one by one:
    25 \begin{itemize}
   102 \begin{itemize}
    26 \item
   103 \item
    27 Define an new datatype for regular expressions that makes it easy
   104 We first use a new datatype, called $\textit{rrexp}$s, whose 
    28 to reason about the size of an annotated regular expression.
   105 inductive type definition and derivative and simplification operations are
       
   106 almost identical to those of the annotated regular expressions,
       
   107 except that no bitcodes are attached.
       
   108 This new datatype is more straightforward to tweak 
       
   109 compared with an annotated regular expression.
    29 \item
   110 \item
    30 A set of equalities for this new datatype that enables one to
   111 We have a set of equalities for this new datatype that enables one to
    31 rewrite $\bderssimp{r_1 \cdot r_2}{s}$ and $\bderssimp{r^*}{s}$ etc.
   112 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ and $\rderssimp{r^*}{s}$ etc.
    32 by their children regexes $r_1$, $r_2$, and $r$.
   113 by a combinatioin of derivatives of their 
       
   114 children regular expressions ($r_1$, $r_2$, and $r$, respectively).
       
   115 These equalities are chained together to get into a shape
       
   116 that is very easy to estimate, which we call the \emph{Closed Forms}
       
   117 of the derivatives.
    33 \item
   118 \item
    34 Using those equalities to actually get those rewriting equations, which we call
   119 This closed form is controlled by terms that
    35 "closed forms".
   120 are easier to deal with, wich are in turn bounded loosely
    36 \item
   121 by a large yet constant number.
    37 Bound the closed forms, thereby bounding the original
       
    38 $\blexersimp$'s internal data structures.
       
    39 \end{itemize}
   122 \end{itemize}
       
   123 We give details of these steps in the next sections.
       
   124 The first step is to use something simpler than annotated regular expressions. 
    40 
   125 
    41 \section{the $\mathbf{r}$-rexp datatype and the size functions}
   126 \section{the $\mathbf{r}$-rexp datatype and the size functions}
    42 
   127 We want to prove the size property about annotated regular expressions.
    43 We have a size function for bitcoded regular expressions, written
   128 The size is 
    44 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
   129 written $\llbracket r\rrbracket$, whose intuitive definition is as below
    45 
       
    46 \begin{center}
   130 \begin{center}
    47 \begin{tabular}{ccc}
   131 \begin{tabular}{ccc}
    48 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
   132 	$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
    49 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
   133 	$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
    50 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
   134 	$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
    51 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
   135 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
    52 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
   136 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as   + 1$\\
    53 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
   137 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$
    54 \end{tabular}
   138 \end{tabular}
    55 \end{center}
   139 \end{center}
    56 
   140 \noindent
    57 \noindent
   141 We first note that $\llbracket \_ \rrbracket$
       
   142 is unaware of bitcodes, since 
       
   143 it only counts the number of nodes
       
   144 if we regard $r$ as a tree.
       
   145 This suggests that if we define a new datatype that is similar to plain $\rexp$s,
       
   146 \[			\rrexp ::=   \RZERO \mid  \RONE
       
   147 			 \mid  \RCHAR{c}  
       
   148 			 \mid  \RSEQ{r_1}{r_2}
       
   149 			 \mid  \RALTS{rs}
       
   150 			 \mid \RSTAR{r}        
       
   151 \]
       
   152 and define 
       
   153 \begin{center}
       
   154 \begin{tabular}{lcl}
       
   155 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
       
   156 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
       
   157 	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
       
   158 $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
       
   159 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
       
   160 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
       
   161 \end{tabular}
       
   162 \end{center}
       
   163 \noindent
       
   164 we could calculate the size of annotated regular expressions in terms of
       
   165 its un-annotated counterpart: 
       
   166 \begin{lemma}
       
   167 $\rsize{\rerase a} = \asize a$
       
   168 \end{lemma}
       
   169 \begin{proof}
       
   170 	By routine induction on the structure of $a$.
       
   171 \end{proof}
       
   172 \noindent
       
   173 We call them \emph{r}-regular expressions:
       
   174 we use
       
   175 $\rrexp$ as its type name, so as to make a distinction  
       
   176 with plain regular expressions.
       
   177 In $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
       
   178 but keep everything else intact.
       
   179 To transform an annotated regular expression into an r-regular expression,
       
   180 we use the function \emph{rerase}, written $\downarrow_r$. 
       
   181 The $r$ in the subscript of $\downarrow$ is to 
       
   182 differentiate with the $\downarrow$ for the $\erase$ operation.
       
   183 Before we introduce more functions related to r-regular expressions,
       
   184 we first give out the reason why we take all the trouble 
       
   185 defining a new datatype in the first place.
       
   186 \subsection{Motivation Behind a New Datatype}
       
   187 The main difference between a plain regular expression
       
   188 and an r-regular expression is that it can take
       
   189 non-binary arguments for its alternative constructor.
       
   190 This turned out to be necessary if we want our proofs to be
       
   191 simple.
       
   192 We initially started by using 
       
   193 plain regular expressions and tried to prove
       
   194 equalities like 
       
   195 \begin{center}
       
   196 	$\llbracket a \rrbracket = \llbracket a\downarrow \rrbracket_p$
       
   197 \end{center}
       
   198 and 
       
   199 \[
       
   200 	\llbracket a \backslash_{bsimps} s \rrbracket = 
       
   201 	\llbracket a \downarrow \backslash_s s
       
   202 \]
       
   203 One might be able to prove that 
       
   204 $\llbracket a \downarrow \rrbracket_p \leq \llbracket a \rrbracket$.
       
   205 $\rrexp$ give the exact correspondence between an annotated regular expression
       
   206 and its (r-)erased version:
       
   207 
       
   208 This does not hold for plain $\rexp$s. 
       
   209 
       
   210 Of course, the bits which encode the lexing information would grow linearly with respect 
       
   211 to the input, which should be taken into account when we wish to tackle the runtime comlexity.
       
   212 But at the current stage 
       
   213 we can safely ignore them.
    58 Similarly there is a size function for plain regular expressions:
   214 Similarly there is a size function for plain regular expressions:
    59 \begin{center}
   215 \begin{center}
    60 \begin{tabular}{ccc}
   216 \begin{tabular}{ccc}
    61 	$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
   217 	$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
    62 	$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   218 	$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
   101 \noindent
   257 \noindent
   102 An alternative regular expression with an empty list of children
   258 An alternative regular expression with an empty list of children
   103  is turned into a $\ZERO$ during the
   259  is turned into a $\ZERO$ during the
   104 $\erase$ function, thereby changing the size and structure of the regex.
   260 $\erase$ function, thereby changing the size and structure of the regex.
   105 Therefore the equality in question does not hold.
   261 Therefore the equality in question does not hold.
       
   262 
   106 These will likely be fixable if we really want to use plain $\rexp$s for dealing
   263 These will likely be fixable if we really want to use plain $\rexp$s for dealing
   107 with size, but we choose a more straightforward (or stupid) method by 
   264 with size, but we choose a more straightforward (or stupid) method by 
   108 defining a new datatype that is similar to plain $\rexp$s but can take
       
   109 non-binary arguments for its alternative constructor,
       
   110  which we call $\rrexp$ to denote
       
   111 the difference between it and plain regular expressions. 
       
   112 \[			\rrexp ::=   \RZERO \mid  \RONE
       
   113 			 \mid  \RCHAR{c}  
       
   114 			 \mid  \RSEQ{r_1}{r_2}
       
   115 			 \mid  \RALTS{rs}
       
   116 			 \mid \RSTAR{r}        
       
   117 \]
       
   118 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
       
   119 but keep everything else intact.
       
   120 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
       
   121 We denote the operation of erasing the bits and turning an annotated regular expression 
       
   122 into an $\rrexp{}$ as $\rerase{}$.
       
   123 \begin{center}
       
   124 \begin{tabular}{lcl}
       
   125 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
       
   126 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
       
   127 	$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
       
   128 $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
       
   129 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
       
   130 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$
       
   131 \end{tabular}
       
   132 \end{center}
       
   133 \noindent
       
   134 $\rrexp$ give the exact correspondence between an annotated regular expression
       
   135 and its (r-)erased version:
       
   136 \begin{lemma}
       
   137 $\rsize{\rerase a} = \asize a$
       
   138 \end{lemma}
       
   139 \noindent
       
   140 This does not hold for plain $\rexp$s. 
       
   141 
   265 
   142 Similarly we could define the derivative  and simplification on 
   266 Similarly we could define the derivative  and simplification on 
   143 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
   267 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
   144 except that now they can operate on alternatives taking multiple arguments.
   268 except that now they can operate on alternatives taking multiple arguments.
   145 
   269 
  1164 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
  1288 	 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\
  1165 $\sflat r$ & $=$ & $ r$
  1289 $\sflat r$ & $=$ & $ r$
  1166 \end{tabular}
  1290 \end{tabular}
  1167 \end{center}
  1291 \end{center}
  1168 \noindent
  1292 \noindent
  1169 $\sflataux{\_}$ breaks up nested alternative regexes 
  1293 $\sflataux{\_}$ breaks up nested alternative regular expressions 
  1170 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1294 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
  1171 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
  1295 into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
  1172 It will return the singleton list $[r]$ otherwise.
  1296 It will return the singleton list $[r]$ otherwise.
  1173 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1297 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
  1174 the output type a regular expression, not a list.
  1298 the output type a regular expression, not a list.
  1261 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
  1385 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
  1262 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
  1386 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
  1263 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
  1387 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
  1264 count the possible size explosions of $r \backslash c$ themselves.
  1388 count the possible size explosions of $r \backslash c$ themselves.
  1265 
  1389 
  1266 Thanks to $\rflts$ and $\rDistinct$, we are able to open up regexes like
  1390 Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like
  1267 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
  1391 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + 
  1268 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
  1392 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
  1269 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
  1393 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', 
  1270 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
  1394 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
  1271 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
  1395 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
  1331 \end{center}
  1455 \end{center}
  1332 \noindent
  1456 \noindent
  1333 %MAYBE TODO: introduce createdByStar
  1457 %MAYBE TODO: introduce createdByStar
  1334 Again these definitions are tailor-made for dealing with alternatives that have
  1458 Again these definitions are tailor-made for dealing with alternatives that have
  1335 originated from a star's derivatives, so we do not attempt to open up all possible 
  1459 originated from a star's derivatives, so we do not attempt to open up all possible 
  1336 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
  1460 regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
  1337 elements.
  1461 elements.
  1338 We give a predicate for such "star-created" regular expressions:
  1462 We give a predicate for such "star-created" regular expressions:
  1339 \begin{center}
  1463 \begin{center}
  1340 \begin{tabular}{lcr}
  1464 \begin{tabular}{lcr}
  1341          &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
  1465          &    &       $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\
  1739 
  1863 
  1740 Test data in the graphs from randomly generated regular expressions
  1864 Test data in the graphs from randomly generated regular expressions
  1741 shows that the giant bounds are far from being hit.
  1865 shows that the giant bounds are far from being hit.
  1742 %a few sample regular experessions' derivatives
  1866 %a few sample regular experessions' derivatives
  1743 %size change
  1867 %size change
  1744 %TODO: giving regex1_size_change.data showing a few regexes' size changes 
  1868 %TODO: giving regex1_size_change.data showing a few regular expressions' size changes 
  1745 %w;r;t the input characters number, where the size is usually cubic in terms of original size
  1869 %w;r;t the input characters number, where the size is usually cubic in terms of original size
  1746 %a*, aa*, aaa*, .....
  1870 %a*, aa*, aaa*, .....
  1747 %randomly generated regexes
  1871 %randomly generated regular expressions
  1748 \begin{center}
  1872 \begin{center}
  1749 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  1873 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
  1750 \begin{tikzpicture}
  1874 \begin{tikzpicture}
  1751 \begin{axis}[
  1875 \begin{axis}[
  1752     xlabel={number of $a$'s},
  1876     xlabel={number of $a$'s},
  1807     legend pos=north west,
  1931     legend pos=north west,
  1808     legend cell align=left]
  1932     legend cell align=left]
  1809 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
  1933 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
  1810 \end{axis}
  1934 \end{axis}
  1811 \end{tikzpicture}\\
  1935 \end{tikzpicture}\\
  1812 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regexes $w.r.t.$ input string length.}
  1936 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.}
  1813 \end{tabular}    
  1937 \end{tabular}    
  1814 \end{center}  
  1938 \end{center}  
  1815 
  1939 
  1816 
  1940 
  1817 
  1941 
  1974 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
  2098 	Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
  1975 	where $a$ is not nullable.\\
  2099 	where $a$ is not nullable.\\
  1976 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
  2100 	$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
  1977 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
  2101 	$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
  1978 \end{center}
  2102 \end{center}
  1979 We start with two syntactically different regexes,
  2103 We start with two syntactically different regular expressions,
  1980 and end up with the same derivative result.
  2104 and end up with the same derivative result.
  1981 This is not surprising as we have such 
  2105 This is not surprising as we have such 
  1982 equality as below in the style of Arden's lemma:\\
  2106 equality as below in the style of Arden's lemma:\\
  1983 \begin{center}
  2107 \begin{center}
  1984 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
  2108 	$L(A^*B) = L(A\cdot A^* \cdot B + B)$
  2004   exponential $w.r.t$ the number of characters, 
  2128   exponential $w.r.t$ the number of characters, 
  2005 but will eventually level off when the string $s$ is long enough.
  2129 but will eventually level off when the string $s$ is long enough.
  2006 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
  2130 If they grow to a size exponential $w.r.t$ the original regex, our algorithm
  2007 would still be slow.
  2131 would still be slow.
  2008 And unfortunately, we have concrete examples
  2132 And unfortunately, we have concrete examples
  2009 where such regexes grew exponentially large before levelling off:
  2133 where such regular expressions grew exponentially large before levelling off:
  2010 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2134 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + 
  2011 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
  2135 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum
  2012  size that is  exponential on the number $n$ 
  2136  size that is  exponential on the number $n$ 
  2013 under our current simplification rules:
  2137 under our current simplification rules:
  2014 %TODO: graph of a regex whose size increases exponentially.
  2138 %TODO: graph of a regex whose size increases exponentially.