ChengsongTanPhdThesis/Chapters/Chapter2.tex
changeset 503 35b80e37dfe7
parent 500 4d9eecfc936a
child 505 5ce3bd8e5696
equal deleted inserted replaced
502:1ab693d6342f 503:35b80e37dfe7
    69 %-----------------------------------
    69 %-----------------------------------
    70 
    70 
    71 \section{Finiteness Property}
    71 \section{Finiteness Property}
    72 In this section let us sketch our argument for why the size of the simplified
    72 In this section let us sketch our argument for why the size of the simplified
    73 derivatives with the aggressive simplification function can be finitely bounded.
    73 derivatives with the aggressive simplification function can be finitely bounded.
    74 For convenience, we use a new datatype which we call $\textit{rrexp}$ to denote
    74 
       
    75 For convenience, we use a new datatype which we call $\rrexp$ to denote
    75 the difference between it and annotated regular expressions. 
    76 the difference between it and annotated regular expressions. 
    76 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, but keeps
    77 \[			\rrexp ::=   \RZERO \mid  \RONE
    77 everything else intact.
    78 			 \mid  \RCHAR{c}  
    78 It is similar to annotated regular expression being $\erase$ed, but with all its structure being intact
    79 			 \mid  \RSEQ{r_1}{r_2}
       
    80 			 \mid  \RALTS{rs}
       
    81 			 \mid \RSTAR{r}        
       
    82 \]
       
    83 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, 
       
    84 but keep everything else intact.
       
    85 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
    79 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
    86 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
    80 $\AALTS$.
    87 $\ALTS$).
    81 We denote the operation of erasing the bits and turning an annotated regular expression 
    88 We denote the operation of erasing the bits and turning an annotated regular expression 
    82 into an $\rrexp$ as $\rerase$.
    89 into an $\rrexp{}$ as $\rerase{}$.
    83 %TODO: definition of rerase
    90 \begin{center}
    84 That we can bound the size of annotated regular expressions by 
    91 \begin{tabular}{lcr}
    85 $\rrexp$ is that the bitcodes grow linearly w.r.t the input string, and don't contribute to the structural size of 
    92 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\
    86 an annotated regular expression:
    93 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
    87 $\rsize (\rerase a) = \asize a$
    94 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
       
    95 \end{tabular}
       
    96 \end{center}
       
    97 %TODO: FINISH definition of rerase
       
    98 Similarly we could define the derivative  and simplification on 
       
    99 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, 
       
   100 except that now they can operate on alternatives taking multiple arguments.
       
   101 %TODO: definition of rder rsimp (maybe only the alternative clause)
       
   102 The reason why these definitions  mirror precisely the corresponding operations
       
   103 on annotated regualar expressions is that we can calculate sizes more easily:
       
   104 
       
   105 \begin{lemma}
       
   106 $\rsize{\rerase a} = \asize a$
       
   107 \end{lemma}
       
   108 A similar equation holds for annotated regular expressions' simplification
       
   109 and the plain regular expressions' simplification:
       
   110 \begin{lemma}
       
   111 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
       
   112 \end{lemma}
       
   113 Putting these two together we get a property that allows us to estimate the 
       
   114 size of an annotated regular expression derivative using its un-annotated counterpart:
       
   115 \begin{lemma}
       
   116 $\asize{\bderssimp{r}{s}} =  \rsize{\rderssimp{\rerase{r}}{s}}$
       
   117 \end{lemma}
       
   118 Unless stated otherwise in this chapter all $\textit{rexp}$s without
       
   119  bitcodes are seen as $\rrexp$s.
    88 
   120 
    89  Suppose
   121  Suppose
    90 we have a size function for bitcoded regular expressions, written
   122 we have a size function for bitcoded regular expressions, written
    91 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
   123 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
    92 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
   124 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
    93 there exists a bound $N$
   125 there exists a bound $N$
    94 such that 
   126 such that 
    95 
   127 
    96 \begin{center}
   128 \begin{center}
    97 $\forall s. \; \llbracket{\bderssimp r s}\rrbracket \leq N$
   129 $\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
    98 \end{center}
   130 \end{center}
    99 
   131 
   100 \noindent
   132 \noindent
   101 We prove this by induction on $r$. The base cases for $\AZERO$,
   133 We prove this by induction on $r$. The base cases for $\AZERO$,
   102 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
   134 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
   103 for sequences of the form $\ASEQ \textit{bs} r_1 r_2$. In this case our induction
   135 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
   104 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp(r_1, s) \rrbracket \leq N_1$ and
   136 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
   105 $\exists N_2. \forall s. \; \llbracket \bderssimp(r_2, s) \rrbracket \leq N_2$. We can reason as follows
   137 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
   106 %
   138 %
   107 \begin{center}
   139 \begin{center}
   108 \begin{tabular}{lcll}
   140 \begin{tabular}{lcll}
   109 & & $ \llbracket \bderssimp( (\ASEQ\, \textit{bs} \, r_1 \,r_2),  s) \rrbracket$\\
   141 & & $ \llbracket   \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
   110 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;((ASEQ [] (\bderssimp r_1 s) r_2) ::
   142 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
   111     [\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
   143     [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
   112 & $\leq$ &
   144 & $\leq$ &
   113     $\llbracket\textit{\distinctWith}\,((\ASEQ [] (\bderssimp r_1 s) r_2$) ::
   145     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
   114     [$\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
   146     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
   115 & $\leq$ & $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket +
   147 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
   116              \llbracket\textit{distinctWith}\,[\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
   148              \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
   117 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
   149 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
   118       \llbracket \distinctWith\,[ \bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
   150       \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
   119 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
   151 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
   120 \end{tabular}
   152 \end{tabular}
   121 \end{center}
   153 \end{center}
   122 
   154 
   123 % tell Chengsong about Indian paper of closed forms of derivatives
   155 % tell Chengsong about Indian paper of closed forms of derivatives
   124 
   156 
   125 \noindent
   157 \noindent
   126 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp r_1 s'$ is nullable ($s'$ being a suffix of $s$).
   158 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
   127 The reason why we could write the derivatives of a sequence this way is described in section 2.
   159 The reason why we could write the derivatives of a sequence this way is described in section 2.
   128 The term (2) is used to control (1) since it we know that you can obtain an overall
   160 The term (2) is used to control (1). 
       
   161 That is because one can obtain an overall
   129 smaller regex list
   162 smaller regex list
   130 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
   163 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
   131 Section 3 is dedicated to its proof.
   164 Section 3 is dedicated to its proof.
   132 In (3) we know that  $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket$ is 
   165 In (3) we know that  $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is 
   133 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
   166 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
   134 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
   167 than $N_2$. The list length after  $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
   135 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
   168 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
   136 We reason similarly for  $\STAR$.\medskip
   169 We reason similarly for  $\STAR$.\medskip
   137 
   170 
   153 %----------------------------------------------------------------------------------------
   186 %----------------------------------------------------------------------------------------
   154 %	SECTION 2
   187 %	SECTION 2
   155 %----------------------------------------------------------------------------------------
   188 %----------------------------------------------------------------------------------------
   156 
   189 
   157 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
   190 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
   158 There is a nice property about the compound regular expression 
   191 To embark on getting the "closed forms" of regexes, we first
   159 $r_1 \cdot r_2$ in general,
   192 define a few auxiliary definitions to make expressing them smoothly.
   160 that the derivatives of it against a string $s$ can be described by
   193 
   161 the derivatives w.r.t $r_1$ and $r_2$ over substrings of $s$:
   194  \begin{center}  
   162 \begin{lemma}
   195  \begin{tabular}{ccc}
   163 $\textit{sflat}\_{aux} ( (r_1 \cdot r_2) \backslash s) = (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf s r_1))$
   196  $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
   164 \end{lemma}
   197 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
   165 
   198 $\sflataux r$ & $=$ & $ [r]$
       
   199 \end{tabular}
       
   200 \end{center}
       
   201 The intuition behind $\sflataux{\_}$ is to break up nested regexes 
       
   202 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
       
   203 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
       
   204 It will return the singleton list $[r]$ otherwise.
       
   205 
       
   206 $\sflat{\_}$ works the same  as $\sflataux{\_}$, except that it keeps
       
   207 the output type a regular expression, not a list:
       
   208  \begin{center} 
       
   209  \begin{tabular}{ccc}
       
   210  $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
       
   211 $\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
       
   212 $\sflat r$ & $=$ & $ [r]$
       
   213 \end{tabular}
       
   214 \end{center}
       
   215 $\sflataux{\_}$  and $\sflat{\_}$ is only recursive in terms of the
       
   216  first element of the list of children of
       
   217 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for  dealing with the regular expression
       
   218 $r_1 \cdot r_2 \backslash s$.
       
   219 
       
   220 With $\sflat{\_}$ and $\sflataux{\_}$ we make 
       
   221 precise what  "closed forms" we have for the sequence derivatives and their simplifications,
       
   222 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
       
   223 and $\bderssimp{(r_1\cdot r_2)}{s}$,
       
   224 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
       
   225 and  $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$  ranges over 
       
   226 the substring of $s$?
       
   227 First let's look at a series of derivatives steps on a sequence 
       
   228 regular expression,  (assuming) that each time the first
       
   229 component of the sequence is always nullable):
       
   230 \begin{center}
       
   231 
       
   232 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c}  \quad   r_1  \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
       
   233 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c''   \longrightarrow_{\backslash c''} \quad
       
   234  \ldots$
       
   235 
       
   236 \end{center}
       
   237 %TODO: cite indian paper
       
   238 Indianpaper have  come up with a slightly more formal way of putting the above process:
       
   239 \begin{center}
       
   240 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
       
   241 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
       
   242 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
       
   243 \end{center}
       
   244 where  $\delta(b, r)$ will produce $r$
       
   245 if $b$ evaluates to true, 
       
   246 and $\ZERO$ otherwise.
       
   247 
       
   248  But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
       
   249  equivalence. To make this intuition useful 
       
   250  for a formal proof, we need something
       
   251 stronger than language equivalence.
       
   252 With the help of $\sflat{\_}$ we can state the equation in Indianpaper
       
   253 more rigorously:
       
   254 \begin{lemma}
       
   255 $\sflat{(r_1 \cdot r_2) \backslash s} = \AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
       
   256 \end{lemma}
       
   257 With this property we can prove the finiteness of the size of the regex $(r_1 \cdot r_2) \backslash s$,
       
   258 much like a recursive function's termination proof.
       
   259 The function $\vsuf{\_}{\_}$ is defined this way:
       
   260 %TODO: DEF of vsuf
       
   261 \begin{center}
       
   262 \begin{tabular}{lcl}
       
   263 $\vsuf{[]}{\_} $ & $=$ &  $[]$\\
       
   264 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then}  (\vsuf{cs}{(rder c r1)}) @ [c :: cs]$\\
       
   265                                      && $\textit{else}  (\vsuf{cs}{rder c r1})  $
       
   266 \end{tabular}
       
   267 \end{center}                   
       
   268 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
       
   269 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
       
   270 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
       
   271 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r$, but instead of producing 
       
   272 the entire result of  $(r_1 \cdot r_2) \backslash s$, 
       
   273 it only stores all the $r_2 \backslash s''$ terms.
       
   274 
       
   275 With this we can also add simplifications to both sides and get
       
   276 \begin{lemma}
       
   277 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
       
   278 \end{lemma}
       
   279 Together with the idempotency property of $\rsimp$,
       
   280 %TODO: state the idempotency property of rsimp
       
   281 it is possible to convert the above lemma to obtain a "closed form"
       
   282 for our lexer's intermediate result without bitcodes:
       
   283 \begin{lemma}
       
   284 $\rderssimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
       
   285 \end{lemma}
       
   286 And now the reason we have (2) in section 1 is clear:
       
   287 $\rsize{\rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}}$
       
   288 is bounded by     $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
       
   289     [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ ,
       
   290     as $\vsuf{s}{r_1}$ is a subset of s $\textit{Suffix}( r_1, s)])$.
   166 
   291 
   167 %----------------------------------------------------------------------------------------
   292 %----------------------------------------------------------------------------------------
   168 %	SECTION 3
   293 %	SECTION 3
   169 %----------------------------------------------------------------------------------------
   294 %----------------------------------------------------------------------------------------
   170 
   295