ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 505 5ce3bd8e5696
parent 503 35b80e37dfe7
child 506 69ad05398894
equal deleted inserted replaced
503:35b80e37dfe7 505:5ce3bd8e5696
    14 \section{Properties of $\backslash c$}
    14 \section{Properties of $\backslash c$}
    15 
    15 
    16 To have a clear idea of what we can and 
    16 To have a clear idea of what we can and 
    17 need to prove about the algorithms involving
    17 need to prove about the algorithms involving
    18 Brzozowski's derivatives, there are a few 
    18 Brzozowski's derivatives, there are a few 
    19 properties we need to be clear about 
    19 properties we need to be clear about .
    20 it.
       
    21 \subsection{function $\backslash c$ is not 1-to-1}
    20 \subsection{function $\backslash c$ is not 1-to-1}
    22 \begin{center}
    21 \begin{center}
    23 The derivative $w.r.t$ character $c$ is not one-to-one.
    22 The derivative $w.r.t$ character $c$ is not one-to-one.
    24 Formally,
    23 Formally,
    25 	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
    24 	$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
    51 	
    50 	
    52 %-----------------------------------
    51 %-----------------------------------
    53 %	SUBSECTION 1
    52 %	SUBSECTION 1
    54 %-----------------------------------
    53 %-----------------------------------
    55 \subsection{Subsection 1}
    54 \subsection{Subsection 1}
    56 
    55 To be completed.
    57 Nunc posuere quam at lectus tristique eu ultrices augue venenatis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam erat volutpat. Vivamus sodales tortor eget quam adipiscing in vulputate ante ullamcorper. Sed eros ante, lacinia et sollicitudin et, aliquam sit amet augue. In hac habitasse platea dictumst.
       
    58 
       
    59 
       
    60 
    56 
    61 
    57 
    62 
    58 
    63 
    59 
    64 
    60 
    96 \end{center}
    92 \end{center}
    97 %TODO: FINISH definition of rerase
    93 %TODO: FINISH definition of rerase
    98 Similarly we could define the derivative  and simplification on 
    94 Similarly we could define the derivative  and simplification on 
    99 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
    95 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
   100 except that now they can operate on alternatives taking multiple arguments.
    96 except that now they can operate on alternatives taking multiple arguments.
   101 %TODO: definition of rder rsimp (maybe only the alternative clause)
    97 
       
    98 \begin{center}
       
    99 \begin{tabular}{lcr}
       
   100 $\RALTS{rs} \backslash c$ & $=$ &  $\RALTS{map\; (\_ \backslash c) \;rs}$\\
       
   101 (other clauses omitted)
       
   102 \end{tabular}
       
   103 \end{center}
       
   104 
       
   105 Now that $\rrexp$s do not have bitcodes on them, we can do the 
       
   106 duplicate removal without  an equivalence relation:
       
   107 \begin{center}
       
   108 \begin{tabular}{lcl}
       
   109 $\rdistinct{r :: rs}{rset}$ & $=$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
       
   110            			    &        & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$
       
   111 \end{tabular}
       
   112 \end{center}
       
   113 %TODO: definition of rsimp (maybe only the alternative clause)
       
   114 
       
   115 
   102 The reason why these definitions  mirror precisely the corresponding operations
   116 The reason why these definitions  mirror precisely the corresponding operations
   103 on annotated regualar expressions is that we can calculate sizes more easily:
   117 on annotated regualar expressions is that we can calculate sizes more easily:
   104 
   118 
   105 \begin{lemma}
   119 \begin{lemma}
   106 $\rsize{\rerase a} = \asize a$
   120 $\rsize{\rerase a} = \asize a$
   115 \begin{lemma}
   129 \begin{lemma}
   116 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
   130 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
   117 \end{lemma}
   131 \end{lemma}
   118 Unless stated otherwise in this chapter all $\textit{rexp}$s without
   132 Unless stated otherwise in this chapter all $\textit{rexp}$s without
   119  bitcodes are seen as $\rrexp$s.
   133  bitcodes are seen as $\rrexp$s.
       
   134  We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably
       
   135  as the former suits people's intuitive way of stating a binary alternative
       
   136  regular expression.
   120 
   137 
   121  Suppose
   138  Suppose
   122 we have a size function for bitcoded regular expressions, written
   139 we have a size function for bitcoded regular expressions, written
   123 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
   140 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
   124 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
   141 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
   250  for a formal proof, we need something
   267  for a formal proof, we need something
   251 stronger than language equivalence.
   268 stronger than language equivalence.
   252 With the help of $\sflat{\_}$ we can state the equation in Indianpaper
   269 With the help of $\sflat{\_}$ we can state the equation in Indianpaper
   253 more rigorously:
   270 more rigorously:
   254 \begin{lemma}
   271 \begin{lemma}
   255 $\sflat{(r_1 \cdot r_2) \backslash s} = \AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
   272 $\sflat{(r_1 \cdot r_2) \backslash s} = \RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
   256 \end{lemma}
   273 \end{lemma}
   257 With this property we can prove the finiteness of the size of the regex $(r_1 \cdot r_2) \backslash s$,
   274 
   258 much like a recursive function's termination proof.
   275 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string:
   259 The function $\vsuf{\_}{\_}$ is defined this way:
   276 
   260 %TODO: DEF of vsuf
       
   261 \begin{center}
   277 \begin{center}
   262 \begin{tabular}{lcl}
   278 \begin{tabular}{lcl}
   263 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
   279 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
   264 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then}  (\vsuf{cs}{(rder c r1)}) @ [c :: cs]$\\
   280 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\
   265                                      && $\textit{else}  (\vsuf{cs}{rder c r1})  $
   281                                      && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) })  $
   266 \end{tabular}
   282 \end{tabular}
   267 \end{center}                   
   283 \end{center}                   
   268 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
   284 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
   269 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
   285 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
   270 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
   286 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
   271 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r$, but instead of producing 
   287 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r_1 \cdot r_2$, but instead of producing 
   272 the entire result of  $(r_1 \cdot r_2) \backslash s$, 
   288 the entire result of  $(r_1 \cdot r_2) \backslash s$, 
   273 it only stores all the $r_2 \backslash s''$ terms.
   289 it only stores all the $s''$ such that $r_2 \backslash s''$  are occurring terms in $(r_1\cdot r_2)\backslash s$.
   274 
   290 
   275 With this we can also add simplifications to both sides and get
   291 With this we can also add simplifications to both sides and get
   276 \begin{lemma}
   292 \begin{lemma}
   277 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
   293 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
   278 \end{lemma}
   294 \end{lemma}
   279 Together with the idempotency property of $\rsimp$,
   295 Together with the idempotency property of $\rsimp$,
   280 %TODO: state the idempotency property of rsimp
   296 %TODO: state the idempotency property of rsimp
       
   297 \begin{lemma}
       
   298 $\rsimp(r) = \rsimp (\rsimp(r))$
       
   299 \end{lemma}
   281 it is possible to convert the above lemma to obtain a "closed form"
   300 it is possible to convert the above lemma to obtain a "closed form"
   282 for our lexer's intermediate result without bitcodes:
   301 for  derivatives nested with simplification:
   283 \begin{lemma}
   302 \begin{lemma}
   284 $\rderssimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
   303 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
   285 \end{lemma}
   304 \end{lemma}
   286 And now the reason we have (2) in section 1 is clear:
   305 And now the reason we have (1) in section 1 is clear:
   287 $\rsize{\rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}}$
   306 $\rsize{\rsimp{\RALTS{ (r_1 \backslash s) \cdot r_2 :: (\map \;(r_2 \backslash \_) \; (\vsuf{s}{r_1}))}}}$, 
   288 is bounded by     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
   307 is equal to $\rsize{\rsimp{\RALTS{ ((r_1 \backslash s) \cdot r_2 :: (\map \; (r_2 \backslash \_) \; (\textit{Suffix}(r1, s))))}}}$
   289     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ ,
   308     as $\vsuf{s}{r_1}$ is one way of expressing the list $\textit{Suffix}( r_1, s)$.
   290     as $\vsuf{s}{r_1}$ is a subset of s $\textit{Suffix}( r_1, s)])$.
       
   291 
   309 
   292 %----------------------------------------------------------------------------------------
   310 %----------------------------------------------------------------------------------------
   293 %	SECTION 3
   311 %	SECTION 3
   294 %----------------------------------------------------------------------------------------
   312 %----------------------------------------------------------------------------------------
   295 
   313 
   296 \section{interaction between $\distinctWith$ and $\flts$}
   314 \section{interaction between $\distinctWith$ and $\flts$}
   297 Note that it is not immediately obvious that 
   315 Note that it is not immediately obvious that 
   298 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $
   316 \begin{center}
   299 
   317 $\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket   \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket  $.
   300 Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.
   318 \end{center}
       
   319 The intuition is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of 
       
   320 duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. 
       
   321 
       
   322 %----------------------------------------------------------------------------------------
       
   323 %	SECTION 4
       
   324 %----------------------------------------------------------------------------------------
       
   325 \section{a bound for the star regular expression}
       
   326 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using
       
   327 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then 
       
   328 the property of the $\distinct$ function.
       
   329 Now we try to get a bound on $r^* \backslash s$ as well.
       
   330 Again, we first look at how a star's derivatives evolve, if they grow maximally: 
       
   331 \begin{center}
       
   332 
       
   333 $r^* \quad \longrightarrow_{\backslash c}  \quad   (r\backslash c)  \cdot  r^* \quad \longrightarrow_{\backslash c'}  \quad
       
   334 r \backslash cc'  \cdot r^* + r \backslash c' \cdot r^*  \quad \longrightarrow_{\backslash c''} \quad 
       
   335 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)   \quad \longrightarrow_{\backslash c'''}
       
   336 \quad \ldots$
       
   337 
       
   338 \end{center}
       
   339 When we have a string $s = c :: c' :: c'' \ldots$  such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, 
       
   340 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable,
       
   341 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size
       
   342 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not 
       
   343 count the possible size explosions of $r \backslash c$ themselves.
       
   344 
       
   345 Thanks to $\flts$ and $\distinctWith$, we are able to open up regexes like
       
   346 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ 
       
   347 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$
       
   348 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$).
       
   349 For this we define $\hflataux{\_}$ and $\hflat{\_}$, similar to $\sflataux{\_}$ and $\sflat{\_}$:
       
   350 %TODO: definitions of  and \hflataux \hflat
       
   351  \begin{center}  
       
   352  \begin{tabular}{ccc}
       
   353  $\hflataux{r_1 + r_2}$ & $=$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
       
   354 $\hflataux r$ & $=$ & $ [r]$
       
   355 \end{tabular}
       
   356 \end{center}
       
   357 
       
   358  \begin{center}  
       
   359  \begin{tabular}{ccc}
       
   360  $\hflat{r_1 + r_2}$ & $=$ & $\RALTS{\hflataux{r_1} @ \hflataux{r_2}}$\\
       
   361 $\hflat r$ & $=$ & $ r$
       
   362 \end{tabular}
       
   363 \end{center}
       
   364 Again these definitions are tailor-made for dealing with alternatives that have
       
   365 originated from a star's derivatives, so we don't attempt to open up all possible 
       
   366 regexes of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2
       
   367 elements.
       
   368 We give a predicate for such "star-created" regular expressions:
       
   369 \begin{center}
       
   370 \begin{tabular}{lcr}
       
   371          &    &       $\createdByStar{\RSEQ{ra}{\RSTAR{rb}}}$\\
       
   372  $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$
       
   373  \end{tabular}
       
   374  \end{center}
       
   375  
       
   376 One might wonder the actual bound rather than the loose bound we gave
       
   377 for the convenience of a easier proof.
       
   378 How much can the regex $r^* \backslash s$ grow? As hinted earlier,
       
   379  they can grow at a maximum speed
       
   380 of exponential $w.r.t$ the number of characters.
       
   381 But they will eventually level off when the string $s$ is long enough,
       
   382 as suggested by the finiteness bound proof.
       
   383 
       
   384 %TODO: give out a graph showing how the size of the regex grows and levels off at a size exponential to the original regex
       
   385 
       
   386