69 %----------------------------------- |
69 %----------------------------------- |
70 |
70 |
71 \section{Finiteness Property} |
71 \section{Finiteness Property} |
72 In this section let us sketch our argument for why the size of the simplified |
72 In this section let us sketch our argument for why the size of the simplified |
73 derivatives with the aggressive simplification function can be finitely bounded. |
73 derivatives with the aggressive simplification function can be finitely bounded. |
74 For convenience, we use a new datatype which we call $\textit{rrexp}$ to denote |
74 |
|
75 For convenience, we use a new datatype which we call $\rrexp$ to denote |
75 the difference between it and annotated regular expressions. |
76 the difference between it and annotated regular expressions. |
76 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, but keeps |
77 \[ \rrexp ::= \RZERO \mid \RONE |
77 everything else intact. |
78 \mid \RCHAR{c} |
78 It is similar to annotated regular expression being $\erase$ed, but with all its structure being intact |
79 \mid \RSEQ{r_1}{r_2} |
|
80 \mid \RALTS{rs} |
|
81 \mid \RSTAR{r} |
|
82 \] |
|
83 For $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
|
84 but keep everything else intact. |
|
85 It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved |
79 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
86 (the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton |
80 $\AALTS$. |
87 $\ALTS$). |
81 We denote the operation of erasing the bits and turning an annotated regular expression |
88 We denote the operation of erasing the bits and turning an annotated regular expression |
82 into an $\rrexp$ as $\rerase$. |
89 into an $\rrexp{}$ as $\rerase{}$. |
83 %TODO: definition of rerase |
90 \begin{center} |
84 That we can bound the size of annotated regular expressions by |
91 \begin{tabular}{lcr} |
85 $\rrexp$ is that the bitcodes grow linearly w.r.t the input string, and don't contribute to the structural size of |
92 $\rerase{\AZERO}$ & $=$ & $\RZERO$\\ |
86 an annotated regular expression: |
93 $\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
87 $\rsize (\rerase a) = \asize a$ |
94 $\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$ |
|
95 \end{tabular} |
|
96 \end{center} |
|
97 %TODO: FINISH definition of rerase |
|
98 Similarly we could define the derivative and simplification on |
|
99 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
|
100 except that now they can operate on alternatives taking multiple arguments. |
|
101 %TODO: definition of rder rsimp (maybe only the alternative clause) |
|
102 The reason why these definitions mirror precisely the corresponding operations |
|
103 on annotated regualar expressions is that we can calculate sizes more easily: |
|
104 |
|
105 \begin{lemma} |
|
106 $\rsize{\rerase a} = \asize a$ |
|
107 \end{lemma} |
|
108 A similar equation holds for annotated regular expressions' simplification |
|
109 and the plain regular expressions' simplification: |
|
110 \begin{lemma} |
|
111 $\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$ |
|
112 \end{lemma} |
|
113 Putting these two together we get a property that allows us to estimate the |
|
114 size of an annotated regular expression derivative using its un-annotated counterpart: |
|
115 \begin{lemma} |
|
116 $\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$ |
|
117 \end{lemma} |
|
118 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
|
119 bitcodes are seen as $\rrexp$s. |
88 |
120 |
89 Suppose |
121 Suppose |
90 we have a size function for bitcoded regular expressions, written |
122 we have a size function for bitcoded regular expressions, written |
91 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
123 $\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree |
92 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$ |
124 (we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$ |
93 there exists a bound $N$ |
125 there exists a bound $N$ |
94 such that |
126 such that |
95 |
127 |
96 \begin{center} |
128 \begin{center} |
97 $\forall s. \; \llbracket{\bderssimp r s}\rrbracket \leq N$ |
129 $\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$ |
98 \end{center} |
130 \end{center} |
99 |
131 |
100 \noindent |
132 \noindent |
101 We prove this by induction on $r$. The base cases for $\AZERO$, |
133 We prove this by induction on $r$. The base cases for $\AZERO$, |
102 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
134 $\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is |
103 for sequences of the form $\ASEQ \textit{bs} r_1 r_2$. In this case our induction |
135 for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction |
104 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp(r_1, s) \rrbracket \leq N_1$ and |
136 hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and |
105 $\exists N_2. \forall s. \; \llbracket \bderssimp(r_2, s) \rrbracket \leq N_2$. We can reason as follows |
137 $\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows |
106 % |
138 % |
107 \begin{center} |
139 \begin{center} |
108 \begin{tabular}{lcll} |
140 \begin{tabular}{lcll} |
109 & & $ \llbracket \bderssimp( (\ASEQ\, \textit{bs} \, r_1 \,r_2), s) \rrbracket$\\ |
141 & & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\ |
110 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;((ASEQ [] (\bderssimp r_1 s) r_2) :: |
142 & $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} :: |
111 [\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
143 [\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\ |
112 & $\leq$ & |
144 & $\leq$ & |
113 $\llbracket\textit{\distinctWith}\,((\ASEQ [] (\bderssimp r_1 s) r_2$) :: |
145 $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
114 [$\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
146 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\ |
115 & $\leq$ & $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket + |
147 & $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket + |
116 \llbracket\textit{distinctWith}\,[\bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
148 \llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\ |
117 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
149 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + |
118 \llbracket \distinctWith\,[ \bderssimp r_2 s' \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
150 \llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\ |
119 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
151 & $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5) |
120 \end{tabular} |
152 \end{tabular} |
121 \end{center} |
153 \end{center} |
122 |
154 |
123 % tell Chengsong about Indian paper of closed forms of derivatives |
155 % tell Chengsong about Indian paper of closed forms of derivatives |
124 |
156 |
125 \noindent |
157 \noindent |
126 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp r_1 s'$ is nullable ($s'$ being a suffix of $s$). |
158 where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$). |
127 The reason why we could write the derivatives of a sequence this way is described in section 2. |
159 The reason why we could write the derivatives of a sequence this way is described in section 2. |
128 The term (2) is used to control (1) since it we know that you can obtain an overall |
160 The term (2) is used to control (1). |
|
161 That is because one can obtain an overall |
129 smaller regex list |
162 smaller regex list |
130 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
163 by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it. |
131 Section 3 is dedicated to its proof. |
164 Section 3 is dedicated to its proof. |
132 In (3) we know that $\llbracket\ASEQ [] (\bderssimp r_1 s) r_2\rrbracket$ is |
165 In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is |
133 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
166 bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller |
134 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
167 than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands |
135 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
168 for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them). |
136 We reason similarly for $\STAR$.\medskip |
169 We reason similarly for $\STAR$.\medskip |
137 |
170 |
153 %---------------------------------------------------------------------------------------- |
186 %---------------------------------------------------------------------------------------- |
154 % SECTION 2 |
187 % SECTION 2 |
155 %---------------------------------------------------------------------------------------- |
188 %---------------------------------------------------------------------------------------- |
156 |
189 |
157 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings} |
190 \section{"Closed Forms" of regular expressions' derivatives w.r.t strings} |
158 There is a nice property about the compound regular expression |
191 To embark on getting the "closed forms" of regexes, we first |
159 $r_1 \cdot r_2$ in general, |
192 define a few auxiliary definitions to make expressing them smoothly. |
160 that the derivatives of it against a string $s$ can be described by |
193 |
161 the derivatives w.r.t $r_1$ and $r_2$ over substrings of $s$: |
194 \begin{center} |
162 \begin{lemma} |
195 \begin{tabular}{ccc} |
163 $\textit{sflat}\_{aux} ( (r_1 \cdot r_2) \backslash s) = (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf s r_1))$ |
196 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
164 \end{lemma} |
197 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
165 |
198 $\sflataux r$ & $=$ & $ [r]$ |
|
199 \end{tabular} |
|
200 \end{center} |
|
201 The intuition behind $\sflataux{\_}$ is to break up nested regexes |
|
202 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
|
203 into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
|
204 It will return the singleton list $[r]$ otherwise. |
|
205 |
|
206 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
207 the output type a regular expression, not a list: |
|
208 \begin{center} |
|
209 \begin{tabular}{ccc} |
|
210 $\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
|
211 $\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\ |
|
212 $\sflat r$ & $=$ & $ [r]$ |
|
213 \end{tabular} |
|
214 \end{center} |
|
215 $\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the |
|
216 first element of the list of children of |
|
217 an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression |
|
218 $r_1 \cdot r_2 \backslash s$. |
|
219 |
|
220 With $\sflat{\_}$ and $\sflataux{\_}$ we make |
|
221 precise what "closed forms" we have for the sequence derivatives and their simplifications, |
|
222 in other words, how can we construct $(r_1 \cdot r_2) \backslash s$ |
|
223 and $\bderssimp{(r_1\cdot r_2)}{s}$, |
|
224 if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$) |
|
225 and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over |
|
226 the substring of $s$? |
|
227 First let's look at a series of derivatives steps on a sequence |
|
228 regular expression, (assuming) that each time the first |
|
229 component of the sequence is always nullable): |
|
230 \begin{center} |
|
231 |
|
232 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
|
233 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
|
234 \ldots$ |
|
235 |
|
236 \end{center} |
|
237 %TODO: cite indian paper |
|
238 Indianpaper have come up with a slightly more formal way of putting the above process: |
|
239 \begin{center} |
|
240 $r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) + |
|
241 \delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots |
|
242 + \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$ |
|
243 \end{center} |
|
244 where $\delta(b, r)$ will produce $r$ |
|
245 if $b$ evaluates to true, |
|
246 and $\ZERO$ otherwise. |
|
247 |
|
248 But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical |
|
249 equivalence. To make this intuition useful |
|
250 for a formal proof, we need something |
|
251 stronger than language equivalence. |
|
252 With the help of $\sflat{\_}$ we can state the equation in Indianpaper |
|
253 more rigorously: |
|
254 \begin{lemma} |
|
255 $\sflat{(r_1 \cdot r_2) \backslash s} = \AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$ |
|
256 \end{lemma} |
|
257 With this property we can prove the finiteness of the size of the regex $(r_1 \cdot r_2) \backslash s$, |
|
258 much like a recursive function's termination proof. |
|
259 The function $\vsuf{\_}{\_}$ is defined this way: |
|
260 %TODO: DEF of vsuf |
|
261 \begin{center} |
|
262 \begin{tabular}{lcl} |
|
263 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
264 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} (\vsuf{cs}{(rder c r1)}) @ [c :: cs]$\\ |
|
265 && $\textit{else} (\vsuf{cs}{rder c r1}) $ |
|
266 \end{tabular} |
|
267 \end{center} |
|
268 It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable, |
|
269 and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in |
|
270 the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$. |
|
271 In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r$, but instead of producing |
|
272 the entire result of $(r_1 \cdot r_2) \backslash s$, |
|
273 it only stores all the $r_2 \backslash s''$ terms. |
|
274 |
|
275 With this we can also add simplifications to both sides and get |
|
276 \begin{lemma} |
|
277 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
278 \end{lemma} |
|
279 Together with the idempotency property of $\rsimp$, |
|
280 %TODO: state the idempotency property of rsimp |
|
281 it is possible to convert the above lemma to obtain a "closed form" |
|
282 for our lexer's intermediate result without bitcodes: |
|
283 \begin{lemma} |
|
284 $\rderssimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$ |
|
285 \end{lemma} |
|
286 And now the reason we have (2) in section 1 is clear: |
|
287 $\rsize{\rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}}$ |
|
288 is bounded by $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) :: |
|
289 [$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ , |
|
290 as $\vsuf{s}{r_1}$ is a subset of s $\textit{Suffix}( r_1, s)])$. |
166 |
291 |
167 %---------------------------------------------------------------------------------------- |
292 %---------------------------------------------------------------------------------------- |
168 % SECTION 3 |
293 % SECTION 3 |
169 %---------------------------------------------------------------------------------------- |
294 %---------------------------------------------------------------------------------------- |
170 |
295 |