46 \noindent |
46 \noindent |
47 In our lexer $\blexersimp$, |
47 In our lexer $\blexersimp$, |
48 The regular expression is repeatedly being taken derivative of |
48 The regular expression is repeatedly being taken derivative of |
49 and then simplified. |
49 and then simplified. |
50 \begin{figure}[H] |
50 \begin{figure}[H] |
51 \begin{tikzpicture}[scale=2, |
51 \begin{tikzpicture}[scale=2, |
52 every node/.style={minimum size=11mm}, |
52 every node/.style={minimum size=11mm}, |
53 ->,>=stealth',shorten >=1pt,auto,thick |
53 ->,>=stealth',shorten >=1pt,auto,thick |
54 ] |
54 ] |
55 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
55 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
56 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
56 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
57 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
57 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
58 |
58 |
59 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$}; |
59 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$}; |
60 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$}; |
60 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$}; |
61 |
61 |
62 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$}; |
62 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$}; |
63 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
63 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
64 |
64 |
65 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$}; |
65 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$}; |
66 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$}; |
66 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$}; |
67 |
67 |
68 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$}; |
68 \node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$}; |
69 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
69 \draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
70 |
70 |
71 \node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$}; |
71 \node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$}; |
72 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode}; |
72 \draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode}; |
73 \end{tikzpicture} |
73 \end{tikzpicture} |
74 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks} |
74 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks} |
75 \end{figure} |
75 \end{figure} |
76 \noindent |
76 \noindent |
77 Each time |
77 Each time |
78 a derivative is taken, a regular expression might grow a bit, |
78 a derivative is taken, a regular expression might grow a bit, |
79 but simplification always takes care that |
79 but simplification always takes care that |
86 |
86 |
87 \noindent |
87 \noindent |
88 Sulzmann and Lu's assumed something similar about their algorithm, |
88 Sulzmann and Lu's assumed something similar about their algorithm, |
89 though in fact their algorithm's size might be better depicted by the following graph: |
89 though in fact their algorithm's size might be better depicted by the following graph: |
90 \begin{figure}[H] |
90 \begin{figure}[H] |
91 \begin{tikzpicture}[scale=2, |
91 \begin{tikzpicture}[scale=2, |
92 every node/.style={minimum size=11mm}, |
92 every node/.style={minimum size=11mm}, |
93 ->,>=stealth',shorten >=1pt,auto,thick |
93 ->,>=stealth',shorten >=1pt,auto,thick |
94 ] |
94 ] |
95 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
95 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
96 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
96 \node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
97 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
97 \draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
98 |
98 |
99 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$}; |
99 \node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$}; |
100 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$}; |
100 \draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$}; |
101 |
101 |
102 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$}; |
102 \node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$}; |
103 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
103 \draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
104 |
104 |
105 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$}; |
105 \node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$}; |
106 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$}; |
106 \draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$}; |
107 |
107 |
108 \node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$}; |
108 \node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$}; |
109 \draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$}; |
109 \draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$}; |
110 |
110 |
111 \node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$}; |
111 \node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$}; |
112 \draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$}; |
112 \draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$}; |
113 |
113 |
114 \node (rnn) [right = of rns, minimum size = 1mm]{}; |
114 \node (rnn) [right = of rns, minimum size = 1mm]{}; |
115 \draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$}; |
115 \draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$}; |
116 |
116 |
117 \end{tikzpicture} |
117 \end{tikzpicture} |
118 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
118 \caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
119 \end{figure} |
119 \end{figure} |
120 \noindent |
120 \noindent |
121 That is, on certain cases their lexer |
121 That is, on certain cases their lexer |
122 will have an indefinite size explosion, causing the running time |
122 will have an indefinite size explosion, causing the running time |
123 of each derivative step to grow arbitrarily large (for example |
123 of each derivative step to grow arbitrarily large (for example |
130 We give details of the proof in the next sections. |
130 We give details of the proof in the next sections. |
131 |
131 |
132 \subsection{Overview of the Proof} |
132 \subsection{Overview of the Proof} |
133 Here is a bird's eye view of how the proof of finiteness works, |
133 Here is a bird's eye view of how the proof of finiteness works, |
134 which involves three steps: |
134 which involves three steps: |
135 \begin{center} |
135 \begin{figure}[H] |
136 \begin{tikzpicture}[scale=1,font=\bf, |
136 \begin{tikzpicture}[scale=1,font=\bf, |
137 node/.style={ |
137 node/.style={ |
138 rectangle,rounded corners=3mm, |
138 rectangle,rounded corners=3mm, |
139 ultra thick,draw=black!50,minimum height=18mm, |
139 ultra thick,draw=black!50,minimum height=18mm, |
140 minimum width=20mm, |
140 minimum width=20mm, |
141 top color=white,bottom color=black!20}] |
141 top color=white,bottom color=black!20}] |
142 |
142 |
143 |
143 |
144 \node (0) at (-5,0) |
144 \node (0) at (-5,0) |
145 [node, text width=1.8cm, text centered] |
145 [node, text width=1.8cm, text centered] |
146 {$\llbracket \bderssimp{a}{s} \rrbracket$}; |
146 {$\llbracket \bderssimp{a}{s} \rrbracket$}; |
147 \node (A) at (0,0) |
147 \node (A) at (0,0) |
148 [node,text width=1.6cm, text centered] |
148 [node,text width=1.6cm, text centered] |
149 {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
149 {$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
150 \node (B) at (3,0) |
150 \node (B) at (3,0) |
151 [node,text width=3.0cm, anchor=west, minimum width = 40mm] |
151 [node,text width=3.0cm, anchor=west, minimum width = 40mm] |
152 {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
152 {$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
153 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
153 \node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
154 |
154 |
155 \draw [->,line width=0.5mm] (0) -- |
155 \draw [->,line width=0.5mm] (0) -- |
156 node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); |
156 node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); |
157 \draw [->,line width=0.5mm] (A) -- |
157 \draw [->,line width=0.5mm] (A) -- |
158 node [above,pos=0.35] {$\quad =\ldots=$} (B); |
158 node [above,pos=0.35] {$\quad =\ldots=$} (B); |
159 \draw [->,line width=0.5mm] (B) -- |
159 \draw [->,line width=0.5mm] (B) -- |
160 node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
160 node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
161 \end{tikzpicture} |
161 \end{tikzpicture} |
162 \end{center} |
162 %\caption{ |
|
163 \end{figure} |
163 \noindent |
164 \noindent |
164 We explain the steps one by one: |
165 We explain the steps one by one: |
165 \begin{itemize} |
166 \begin{itemize} |
166 \item |
167 \item |
167 We first introduce the operations such as |
168 We first introduce the operations such as |
168 derivatives, simplification, size calculation, etc. |
169 derivatives, simplification, size calculation, etc. |
169 associated with $\rrexp$s, which we have given |
170 associated with $\rrexp$s, which we have given |
170 a very brief introduction to in chapter \ref{Bitcoded2}. |
171 a very brief introduction to in chapter \ref{Bitcoded2}. |
|
172 The operations on $\rrexp$s are identical to those on |
|
173 annotated regular expressions except that they are unaware |
|
174 of bitcodes. This means that all proofs about size of $\rrexp$s will apply to |
|
175 annotated regular expressions. |
171 \item |
176 \item |
172 We have a set of equalities for this new datatype that enables one to |
177 We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, |
173 rewrite $\rderssimp{r_1 \cdot r_2}{s}$ |
178 where $\textit{ClosedForm}(r, s)$ is entirely |
174 and $\rderssimp{r^*}{s}$ (the most involved |
179 written in the derivatives of their children regular |
175 inductive cases) |
180 expressions. |
176 by a combinatioin of derivatives of their |
181 We call the right-hand-side the \emph{Closed Form} |
177 children regular expressions ($r_1$, $r_2$, and $r$, respectively), |
182 of the derivative $\rderssimp{r}{s}$. |
178 which we call the \emph{Closed Forms} |
|
179 of the derivatives. |
|
180 \item |
183 \item |
181 The Closed Forms of the regular expressions |
184 We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. |
182 are controlled by terms that |
185 The key observation is that $\distinctBy$'s output is |
183 are easier to deal with. |
186 a list with a constant length bound. |
184 Using inductive hypothesis, these terms |
|
185 are in turn bounded loosely |
|
186 by a large yet constant number. |
|
187 \end{itemize} |
187 \end{itemize} |
188 We give details of these steps in the next sections. |
188 We give details of these steps in the next sections. |
189 The first step is to use |
189 The first step is to use |
190 $\textit{rrexp}$s, |
190 $\textit{rrexp}$s, |
191 something simpler than |
191 something simpler than |
192 annotated regular expressions. |
192 annotated regular expressions. |
193 |
193 |
194 \section{the $\textit{rrexp}$ Datatype and Its Size Functions} |
194 \section{The $\textit{Rrexp}$ Datatype and Its Lexing-Related Functions} |
195 |
195 We first recap the definition of the new datatype $\rrexp$, called |
196 We first recap a bit about the new datatype |
196 \emph{r-regular expressions}, |
197 we defined in \ref{rrexpDef}, |
197 which we first defined in \ref{rrexpDef}. |
198 called $\textit{rrexp}$s. |
198 R-regular expressions are similar to basic regular expressions. |
199 We chose $\rrexp$ over |
199 We call them \emph{r}-regular expressions |
200 the basic regular expressions |
200 to make a distinction |
201 because it is more straightforward to tweak |
201 with plain regular expressions. |
202 into the shape we want |
|
203 compared with an annotated regular expression. |
|
204 We want to prove the size property about annotated regular expressions. |
|
205 The size is |
|
206 written $\llbracket r\rrbracket$, whose intuitive definition is as below |
|
207 \noindent |
|
208 We first note that $\llbracket \_ \rrbracket$ |
|
209 is unaware of bitcodes, since |
|
210 it only counts the number of nodes |
|
211 if we regard $r$ as a tree. |
|
212 This suggests that if we define a new datatype that is similar to plain $\rexp$s, |
|
213 \[ \rrexp ::= \RZERO \mid \RONE |
202 \[ \rrexp ::= \RZERO \mid \RONE |
214 \mid \RCHAR{c} |
203 \mid \RCHAR{c} |
215 \mid \RSEQ{r_1}{r_2} |
204 \mid \RSEQ{r_1}{r_2} |
216 \mid \RALTS{rs} |
205 \mid \RALTS{rs} |
217 \mid \RSTAR{r} |
206 \mid \RSTAR{r} |
218 \] |
207 \] |
219 and define |
208 The size of an r-regular expression is |
220 \begin{center} |
209 written $\llbracket r\rrbracket_r$, |
221 \begin{tabular}{lcl} |
210 whose definition mirrors that of an annotated regular expression. |
222 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ |
211 \begin{center} |
223 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ |
212 \begin{tabular}{ccc} |
224 $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ |
213 $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ |
225 $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
214 $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ |
226 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
215 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ |
227 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a}^*$ |
216 $\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ |
228 \end{tabular} |
217 $\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ |
229 \end{center} |
218 $\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$. |
230 \noindent |
219 \end{tabular} |
231 we could calculate the size of annotated regular expressions in terms of |
220 \end{center} |
232 its un-annotated counterpart: |
221 \noindent |
233 \begin{lemma} |
222 The $r$ in the subscript of $\llbracket \rrbracket_r$ is to |
234 $\rsize{\rerase a} = \asize a$ |
223 differentiate with the same operation for annotated regular expressions. |
235 \end{lemma} |
224 Adding $r$ as subscript will be used in |
236 \begin{proof} |
225 other operations as well. |
237 By routine induction on the structure of $a$. |
226 |
238 \end{proof} |
227 The transformation from an annotated regular expression |
239 \noindent |
228 to an r-regular expression is straightforward. |
240 We call them \emph{r}-regular expressions: |
229 \begin{center} |
241 we use |
230 \begin{tabular}{lcl} |
242 $\rrexp$ as its type name, so as to make a distinction |
231 $\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ |
243 with plain regular expressions. |
232 $\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ |
244 In $\rrexp$ we throw away the bitcodes on the annotated regular expressions, |
233 $\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ |
245 but keep everything else intact. |
234 $\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
246 To transform an annotated regular expression into an r-regular expression, |
235 $\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
247 we use the function \emph{rerase}, written $\downarrow_r$. |
236 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
248 The $r$ in the subscript of $\downarrow$ is to |
237 \end{tabular} |
249 differentiate with the $\downarrow$ for the $\erase$ operation. |
238 \end{center} |
|
239 \noindent |
|
240 $\textit{Rerase}$ throws away the bitcodes on the annotated regular expressions, |
|
241 but keeps everything else intact. |
|
242 |
250 Before we introduce more functions related to r-regular expressions, |
243 Before we introduce more functions related to r-regular expressions, |
251 we first give out the reason why we take all the trouble |
244 we first give out the reason why we take all the trouble |
252 defining a new datatype in the first place. |
245 defining a new datatype in the first place. |
|
246 We could calculate the size of annotated regular expressions in terms of |
|
247 their un-annotated $\rrexp$ counterpart: |
|
248 \begin{lemma} |
|
249 $\rsize{\rerase a} = \asize a$ |
|
250 \end{lemma} |
|
251 \begin{proof} |
|
252 By routine structural induction on $a$. |
|
253 \end{proof} |
|
254 \noindent |
253 \subsection{Motivation Behind a New Datatype} |
255 \subsection{Motivation Behind a New Datatype} |
254 The main difference between a plain regular expression |
256 The main difference between a plain regular expression |
255 and an r-regular expression is that it can take |
257 and an r-regular expression is that it can take |
256 non-binary arguments for its alternative constructor. |
258 non-binary arguments for its alternative constructor. |
257 This turned out to be necessary if we want our proofs to be |
259 This turned out to be necessary if we want our proofs to be |
337 Similarly we could define the derivative and simplification on |
339 Similarly we could define the derivative and simplification on |
338 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
340 $\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1, |
339 except that now they can operate on alternatives taking multiple arguments. |
341 except that now they can operate on alternatives taking multiple arguments. |
340 |
342 |
341 \begin{center} |
343 \begin{center} |
342 \begin{tabular}{lcr} |
344 \begin{tabular}{lcr} |
343 $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\ |
345 $(\RALTS{rs})\; \backslash c$ & $\dn$ & $\RALTS{\map\; (\_ \backslash c) \;rs}$\\ |
344 (other clauses omitted) |
346 (other clauses omitted) |
345 With the new $\rrexp$ datatype in place, one can define its size function, |
347 With the new $\rrexp$ datatype in place, one can define its size function, |
346 which precisely mirrors that of the annotated regular expressions: |
348 which precisely mirrors that of the annotated regular expressions: |
347 \end{tabular} |
349 \end{tabular} |
348 \end{center} |
350 \end{center} |
349 \noindent |
351 \noindent |
350 \begin{center} |
352 \begin{center} |
351 \begin{tabular}{ccc} |
353 \begin{tabular}{ccc} |
352 $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ |
354 $\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ |
353 $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ |
355 $\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ |
354 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ |
356 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ |
355 $\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ |
357 $\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ |
356 $\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ |
358 $\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ |
357 $\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$ |
359 $\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$ |
358 \end{tabular} |
360 \end{tabular} |
359 \end{center} |
361 \end{center} |
360 \noindent |
362 \noindent |
361 |
363 |
362 \subsection{Lexing Related Functions for $\rrexp$} |
364 \subsection{Lexing Related Functions for $\rrexp$} |
363 Everything else for $\rrexp$ will be precisely the same for annotated expressions, |
365 Everything else for $\rrexp$ will be precisely the same for annotated expressions, |
364 except that they do not involve rectifying and augmenting bit-encoded tokenization information. |
366 except that they do not involve rectifying and augmenting bit-encoded tokenization information. |
365 As expected, most functions are simpler, such as the derivative: |
367 As expected, most functions are simpler, such as the derivative: |
366 \begin{center} |
368 \begin{center} |
367 \begin{tabular}{@{}lcl@{}} |
369 \begin{tabular}{@{}lcl@{}} |
368 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
370 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
369 $(\ONE)\,\backslash_r c$ & $\dn$ & |
371 $(\ONE)\,\backslash_r c$ & $\dn$ & |
370 $\textit{if}\;c=d\; \;\textit{then}\; |
372 $\textit{if}\;c=d\; \;\textit{then}\; |
371 \ONE\;\textit{else}\;\ZERO$\\ |
373 \ONE\;\textit{else}\;\ZERO$\\ |
372 $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & |
374 $(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & |
373 $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ |
375 $\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ |
374 $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & |
376 $(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & |
375 $\textit{if}\;\textit{rnullable}\,r_1$\\ |
377 $\textit{if}\;\textit{rnullable}\,r_1$\\ |
376 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ |
378 & &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ |
377 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ |
379 & &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ |
378 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ |
380 & &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ |
379 $(r^*)\,\backslash_r c$ & $\dn$ & |
381 $(r^*)\,\backslash_r c$ & $\dn$ & |
380 $( r\,\backslash_r c)\cdot |
382 $( r\,\backslash_r c)\cdot |
381 (_{[]}r^*))$ |
383 (_{[]}r^*))$ |
382 \end{tabular} |
384 \end{tabular} |
383 \end{center} |
385 \end{center} |
384 \noindent |
386 \noindent |
385 The simplification function is simplified without annotation causing superficial differences. |
387 The simplification function is simplified without annotation causing superficial differences. |
386 Duplicate removal without an equivalence relation: |
388 Duplicate removal without an equivalence relation: |
387 \begin{center} |
389 \begin{center} |
388 \begin{tabular}{lcl} |
390 \begin{tabular}{lcl} |
389 $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ |
391 $\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ |
390 $\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
392 $\rdistinct{r :: rs}{rset}$ & $\dn$ & $\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
391 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
393 & & $\textit{else}\; r::\rdistinct{rs}{(rset \cup \{r\})}$ |
392 \end{tabular} |
394 \end{tabular} |
393 \end{center} |
395 \end{center} |
394 %TODO: definition of rsimp (maybe only the alternative clause) |
396 %TODO: definition of rsimp (maybe only the alternative clause) |
395 \noindent |
397 \noindent |
396 The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to |
398 The prefix $r$ in front of $\rdistinct{}{}$ is used mainly to |
397 differentiate with $\textit{distinct}$, which is a built-in predicate |
399 differentiate with $\textit{distinct}$, which is a built-in predicate |
416 \noindent |
418 \noindent |
417 In the following content, we will focus on $\rrexp$'s size bound. |
419 In the following content, we will focus on $\rrexp$'s size bound. |
418 We will piece together this bound and show the same bound for annotated regular |
420 We will piece together this bound and show the same bound for annotated regular |
419 expressions in the end. |
421 expressions in the end. |
420 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
422 Unless stated otherwise in this chapter all $\textit{rexp}$s without |
421 bitcodes are seen as $\rrexp$s. |
423 bitcodes are seen as $\rrexp$s. |
422 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
424 We also use $r_1 + r_2$ and $\RALTS{[r_1, r_2]}$ interchageably |
423 as the former suits people's intuitive way of stating a binary alternative |
425 as the former suits people's intuitive way of stating a binary alternative |
424 regular expression. |
426 regular expression. |
425 |
427 |
426 |
428 |
427 |
429 |
428 %----------------------------------- |
430 %----------------------------------- |
429 % SECTION ? |
431 % SECTION ? |
430 %----------------------------------- |
432 %----------------------------------- |
431 \subsection{Finiteness Proof Using $\rrexp$s} |
433 \subsection{Finiteness Proof Using $\rrexp$s} |
432 Now that we have defined the $\rrexp$ datatype, and proven that its size changes |
434 Now that we have defined the $\rrexp$ datatype, and proven that its size changes |
433 w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions, |
435 w.r.t derivatives and simplifications mirrors precisely those of annotated regular expressions, |
434 we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$. |
436 we aim to bound the size of $r \backslash s$ for any $\rrexp$ $r$. |
435 Once we have a bound like: |
437 Once we have a bound like: |
436 \[ |
438 \[ |
437 \llbracket r \backslash_{rsimp} s \rrbracket_r \leq N_r |
439 \llbracket r \backslash_{rsimp} s \rrbracket_r \leq N_r |
438 \] |
440 \] |
439 \noindent |
441 \noindent |
440 we could easily extend that to |
442 we could easily extend that to |
441 \[ |
443 \[ |
442 \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r. |
444 \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r. |
443 \] |
445 \] |
444 |
446 |
445 \subsection{Roadmap to a Bound for $\textit{Rrexp}$} |
447 \subsection{Roadmap to a Bound for $\textit{Rrexp}$} |
446 |
448 |
447 The way we obtain the bound for $\rrexp$s is by two steps: |
449 The way we obtain the bound for $\rrexp$s is by two steps: |
448 \begin{itemize} |
450 \begin{itemize} |
449 \item |
451 \item |
450 First, we rewrite $r\backslash s$ into something else that is easier |
452 First, we rewrite $r\backslash s$ into something else that is easier |
996 to mean atomic simplification transitions |
998 to mean atomic simplification transitions |
997 of $\rrexp$s and $\rrexp$ lists, respectively. |
999 of $\rrexp$s and $\rrexp$ lists, respectively. |
998 |
1000 |
999 |
1001 |
1000 |
1002 |
1001 List of one-step rewrite rules for $\rrexp$ ($\hrewrite$): |
1003 List of one-step rewrite rules for $\rrexp$ ($\hrewrite$): |
1002 |
1004 |
1003 |
1005 |
1004 \begin{center} |
1006 \begin{center} |
1005 \begin{mathpar} |
1007 \begin{mathpar} |
1006 \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
1008 \inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
1007 |
1009 |
1008 \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
1010 \inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
1009 |
1011 |
1010 \inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ |
1012 \inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ |
1011 |
1013 |
1012 \inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} |
1014 \inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} |
1013 |
1015 |
1014 \inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ |
1016 \inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ |
1015 |
1017 |
1016 \inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} |
1018 \inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} |
1017 |
1019 |
1018 \inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} |
1020 \inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} |
1019 |
1021 |
1020 \inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} |
1022 \inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} |
1021 |
1023 |
1022 \inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} |
1024 \inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} |
1023 |
1025 |
1024 \inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} |
1026 \inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} |
1025 |
1027 |
1026 \inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c} |
1028 \inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c} |
1027 |
1029 |
1028 \end{mathpar} |
1030 \end{mathpar} |
1029 \end{center} |
1031 \end{center} |
1030 |
1032 |
1031 |
1033 |
1032 List of rewrite rules for a list of regular expressions, |
1034 List of rewrite rules for a list of regular expressions, |
1033 where each element can rewrite in many steps to the other (scf stands for |
1035 where each element can rewrite in many steps to the other (scf stands for |
1034 li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the |
1036 li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the |
1035 $\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions. |
1037 $\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions. |
1036 |
1038 |
1037 \begin{center} |
1039 \begin{center} |
1038 \begin{mathpar} |
1040 \begin{mathpar} |
1039 \inferrule{}{[] \scfrewrites [] } |
1041 \inferrule{}{[] \scfrewrites [] } |
1040 \inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} |
1042 \inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} |
1041 \end{mathpar} |
1043 \end{mathpar} |
1042 \end{center} |
1044 \end{center} |
1043 %frewrite |
1045 %frewrite |
1044 List of one-step rewrite rules for flattening |
1046 List of one-step rewrite rules for flattening |
1045 a list of regular expressions($\frewrite$): |
1047 a list of regular expressions($\frewrite$): |
1046 \begin{center} |
1048 \begin{center} |
1047 \begin{mathpar} |
1049 \begin{mathpar} |
1048 \inferrule{}{\RZERO :: rs \frewrite rs \\} |
1050 \inferrule{}{\RZERO :: rs \frewrite rs \\} |
1049 |
1051 |
1050 \inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} |
1052 \inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} |
1051 |
1053 |
1052 \inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} |
1054 \inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} |
1053 \end{mathpar} |
1055 \end{mathpar} |
1054 \end{center} |
1056 \end{center} |
1055 |
1057 |
1056 Lists of one-step rewrite rules for flattening and de-duplicating |
1058 Lists of one-step rewrite rules for flattening and de-duplicating |
1057 a list of regular expressions ($\grewrite$): |
1059 a list of regular expressions ($\grewrite$): |
1058 \begin{center} |
1060 \begin{center} |
1059 \begin{mathpar} |
1061 \begin{mathpar} |
1060 \inferrule{}{\RZERO :: rs \grewrite rs \\} |
1062 \inferrule{}{\RZERO :: rs \grewrite rs \\} |
1061 |
1063 |
1062 \inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} |
1064 \inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} |
1063 |
1065 |
1064 \inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} |
1066 \inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} |
1065 |
1067 |
1066 \inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} |
1068 \inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} |
1067 \end{mathpar} |
1069 \end{mathpar} |
1068 \end{center} |
1070 \end{center} |
1069 |
1071 |
1070 \noindent |
1072 \noindent |
1071 The reason why we take the trouble of defining |
1073 The reason why we take the trouble of defining |
1072 two separate list rewriting definitions $\frewrite$ and $\grewrite$ |
1074 two separate list rewriting definitions $\frewrite$ and $\grewrite$ |
1530 originated from a star's derivatives, so we do not attempt to open up all possible |
1532 originated from a star's derivatives, so we do not attempt to open up all possible |
1531 regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
1533 regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
1532 elements. |
1534 elements. |
1533 We give a predicate for such "star-created" regular expressions: |
1535 We give a predicate for such "star-created" regular expressions: |
1534 \begin{center} |
1536 \begin{center} |
1535 \begin{tabular}{lcr} |
1537 \begin{tabular}{lcr} |
1536 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
1538 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
1537 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
1539 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
1538 \end{tabular} |
1540 \end{tabular} |
1539 \end{center} |
1541 \end{center} |
1540 |
1542 |
1541 These definitions allows us the flexibility to talk about |
1543 These definitions allows us the flexibility to talk about |
1542 regular expressions in their most convenient format, |
1544 regular expressions in their most convenient format, |
1543 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
1545 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
1544 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
1546 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
1545 These definitions help express that certain classes of syntatically |
1547 These definitions help express that certain classes of syntatically |
1546 distinct regular expressions are actually the same under simplification. |
1548 distinct regular expressions are actually the same under simplification. |
1547 This is not entirely true for annotated regular expressions: |
1549 This is not entirely true for annotated regular expressions: |
1548 %TODO: bsimp bders \neq bderssimp |
1550 %TODO: bsimp bders \neq bderssimp |
1549 \begin{center} |
1551 \begin{center} |
1550 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
1552 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
1551 \end{center} |
1553 \end{center} |
1552 For bit-codes, the order in which simplification is applied |
1554 For bit-codes, the order in which simplification is applied |
1553 might cause a difference in the location they are placed. |
1555 might cause a difference in the location they are placed. |
1554 If we want something like |
1556 If we want something like |
1555 \begin{center} |
1557 \begin{center} |
1556 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
1558 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
1557 \end{center} |
1559 \end{center} |
1558 Some "canonicalization" procedure is required, |
1560 Some "canonicalization" procedure is required, |
1559 which either pushes all the common bitcodes to nodes |
1561 which either pushes all the common bitcodes to nodes |
1560 as senior as possible: |
1562 as senior as possible: |
1561 \begin{center} |
1563 \begin{center} |
1562 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
1564 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
1563 \end{center} |
1565 \end{center} |
1564 or does the reverse. However bitcodes are not of interest if we are talking about |
1566 or does the reverse. However bitcodes are not of interest if we are talking about |
1565 the $\llbracket r \rrbracket$ size of a regex. |
1567 the $\llbracket r \rrbracket$ size of a regex. |
1566 Therefore for the ease and simplicity of producing a |
1568 Therefore for the ease and simplicity of producing a |
1567 proof for a size bound, we are happy to restrict ourselves to |
1569 proof for a size bound, we are happy to restrict ourselves to |
1568 unannotated regular expressions, and obtain such equalities as |
1570 unannotated regular expressions, and obtain such equalities as |
1569 \begin{lemma} |
1571 \begin{lemma} |
1570 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
1572 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
1571 \end{lemma} |
1573 \end{lemma} |
1572 |
1574 |
1573 \begin{proof} |
1575 \begin{proof} |
1574 By using the rewriting relation $\rightsquigarrow$ |
1576 By using the rewriting relation $\rightsquigarrow$ |
1575 \end{proof} |
1577 \end{proof} |
1576 %TODO: rsimp sflat |
1578 %TODO: rsimp sflat |
1577 And from this we obtain a proof that a star's derivative will be the same |
1579 And from this we obtain a proof that a star's derivative will be the same |
1578 as if it had all its nested alternatives created during deriving being flattened out: |
1580 as if it had all its nested alternatives created during deriving being flattened out: |
1579 For example, |
1581 For example, |
1580 \begin{lemma} |
1582 \begin{lemma} |
1581 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
1583 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
1582 \end{lemma} |
1584 \end{lemma} |
1583 \begin{proof} |
1585 \begin{proof} |
1584 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
1586 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
1585 \end{proof} |
1587 \end{proof} |
1586 % The simplification of a flattened out regular expression, provided it comes |
1588 % The simplification of a flattened out regular expression, provided it comes |
1587 %from the derivative of a star, is the same as the one nested. |
1589 %from the derivative of a star, is the same as the one nested. |
1588 |
1590 |
1589 |
1591 |
1590 |
1592 |
1591 We first introduce an inductive property |
1593 We first introduce an inductive property |
1592 for $\starupdate$ and $\hflataux{\_}$, |
1594 for $\starupdate$ and $\hflataux{\_}$, |
1593 it says if we do derivatives of $r^*$ |
1595 it says if we do derivatives of $r^*$ |
1756 |
1758 |
1757 |
1759 |
1758 Now we are ready to control the sizes of |
1760 Now we are ready to control the sizes of |
1759 $r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. |
1761 $r_1 \cdot r_2 \backslash s$, $r^* \backslash s$. |
1760 \begin{theorem} |
1762 \begin{theorem} |
1761 For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ |
1763 For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ |
1762 \end{theorem} |
1764 \end{theorem} |
1763 \noindent |
1765 \noindent |
1764 \begin{proof} |
1766 \begin{proof} |
1765 We prove this by induction on $r$. The base cases for $\RZERO$, |
1767 We prove this by induction on $r$. The base cases for $\RZERO$, |
1766 $\RONE $ and $\RCHAR{c}$ are straightforward. |
1768 $\RONE $ and $\RCHAR{c}$ are straightforward. |
1767 In the sequence $r_1 \cdot r_2$ case, |
1769 In the sequence $r_1 \cdot r_2$ case, |
1768 the inductive hypotheses state |
1770 the inductive hypotheses state |
1769 $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and |
1771 $\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and |
1770 $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. |
1772 $\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. |
1771 |
1773 |
1772 When the string $s$ is not empty, we can reason as follows |
1774 When the string $s$ is not empty, we can reason as follows |
1773 % |
1775 % |
1774 \begin{center} |
1776 \begin{center} |
1775 \begin{tabular}{lcll} |
1777 \begin{tabular}{lcll} |
1776 & & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\ |
1778 & & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\ |
1777 & $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; |
1779 & $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; |
1778 \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ |
1780 \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ |
1779 & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; |
1781 & $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimp} s \cdot r_2 \; \; :: \; \; |
1780 \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\ |
1782 \map \; (r_2\backslash_{rsimp} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\ |
1781 & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\ |
1783 & $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\ |
1782 \end{tabular} |
1784 \end{tabular} |
1783 \end{center} |
1785 \end{center} |
1784 \noindent |
1786 \noindent |
1785 (1) is by the corollary \ref{seqEstimate1}. |
1787 (1) is by the corollary \ref{seqEstimate1}. |
1786 (2) is by \ref{altsSimpControl}. |
1788 (2) is by \ref{altsSimpControl}. |
1822 (5) is by the lemma \ref{starClosedForm}. |
1824 (5) is by the lemma \ref{starClosedForm}. |
1823 (6) is by \ref{altsSimpControl}. |
1825 (6) is by \ref{altsSimpControl}. |
1824 (7) is by \ref{finiteSizeNCorollary}. |
1826 (7) is by \ref{finiteSizeNCorollary}. |
1825 Combining with the case when $s = []$, one gets |
1827 Combining with the case when $s = []$, one gets |
1826 \begin{center} |
1828 \begin{center} |
1827 \begin{tabular}{lcll} |
1829 \begin{tabular}{lcll} |
1828 $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
1830 $\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
1829 * (1 + (N + n_r)) $ & (8)\\ |
1831 * (1 + (N + n_r)) $ & (8)\\ |
1830 \end{tabular} |
1832 \end{tabular} |
1831 \end{center} |
1833 \end{center} |
1832 \noindent |
1834 \noindent |
1833 |
1835 |
1834 The alternative case is slightly less involved. |
1836 The alternative case is slightly less involved. |
1835 The inductive hypothesis |
1837 The inductive hypothesis |
1836 is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$. |
1838 is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$. |
1837 In the case when $s = c::cs$, we have |
1839 In the case when $s = c::cs$, we have |
1838 \begin{center} |
1840 \begin{center} |
1839 \begin{tabular}{lcll} |
1841 \begin{tabular}{lcll} |
1840 & & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\ |
1842 & & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\ |
1841 & $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) )} \rrbracket_r $ & (9) \\ |
1843 & $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) )} \rrbracket_r $ & (9) \\ |
1842 & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) ) \rrbracket_r $ & (10) \\ |
1844 & $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimp} s) \; rs) ) \rrbracket_r $ & (10) \\ |
1843 & $\leq$ & $1 + N * (length \; rs) $ & (11)\\ |
1845 & $\leq$ & $1 + N * (length \; rs) $ & (11)\\ |
1844 \end{tabular} |
1846 \end{tabular} |
1845 \end{center} |
1847 \end{center} |
1846 \noindent |
1848 \noindent |
1847 (9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis. |
1849 (9) is by \ref{altsClosedForm}, (10) by \ref{rsimpSize} and (11) by inductive hypothesis. |
1848 |
1850 |
1849 Combining with the case when $s = []$, one gets |
1851 Combining with the case when $s = []$, one gets |
1850 \begin{center} |
1852 \begin{center} |
1851 \begin{tabular}{lcll} |
1853 \begin{tabular}{lcll} |
1852 $\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ |
1854 $\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ |
1853 & (12)\\ |
1855 & (12)\\ |
1854 \end{tabular} |
1856 \end{tabular} |
1855 \end{center} |
1857 \end{center} |
1856 (4), (8), and (12) are all the inductive cases proven. |
1858 (4), (8), and (12) are all the inductive cases proven. |
1857 \end{proof} |
1859 \end{proof} |
1858 |
1860 |
1859 |
1861 |
1860 \begin{corollary} |
1862 \begin{corollary} |
1861 For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$ |
1863 For any regex $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$ |
1862 \end{corollary} |
1864 \end{corollary} |
1863 \begin{proof} |
1865 \begin{proof} |
1864 By \ref{sizeRelations}. |
1866 By \ref{sizeRelations}. |
1865 \end{proof} |
1867 \end{proof} |
1866 \noindent |
1868 \noindent |
1892 What guarantee does this bound give us? |
1894 What guarantee does this bound give us? |
1893 |
1895 |
1894 Whatever the regex is, it will not grow indefinitely. |
1896 Whatever the regex is, it will not grow indefinitely. |
1895 Take our previous example $(a + aa)^*$ as an example: |
1897 Take our previous example $(a + aa)^*$ as an example: |
1896 \begin{center} |
1898 \begin{center} |
1897 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
1899 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
1898 \begin{tikzpicture} |
1900 \begin{tikzpicture} |
1899 \begin{axis}[ |
1901 \begin{axis}[ |
1900 xlabel={number of $a$'s}, |
1902 xlabel={number of $a$'s}, |
1901 x label style={at={(1.05,-0.05)}}, |
1903 x label style={at={(1.05,-0.05)}}, |
1902 ylabel={regex size}, |
1904 ylabel={regex size}, |
1903 enlargelimits=false, |
1905 enlargelimits=false, |
1904 xtick={0,5,...,30}, |
1906 xtick={0,5,...,30}, |
1905 xmax=33, |
1907 xmax=33, |
1906 ymax= 40, |
1908 ymax= 40, |
1907 ytick={0,10,...,40}, |
1909 ytick={0,10,...,40}, |
1908 scaled ticks=false, |
1910 scaled ticks=false, |
1909 axis lines=left, |
1911 axis lines=left, |
1910 width=5cm, |
1912 width=5cm, |
1911 height=4cm, |
1913 height=4cm, |
1912 legend entries={$(a + aa)^*$}, |
1914 legend entries={$(a + aa)^*$}, |
1913 legend pos=north west, |
1915 legend pos=north west, |
1914 legend cell align=left] |
1916 legend cell align=left] |
1915 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
1917 \addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
1916 \end{axis} |
1918 \end{axis} |
1917 \end{tikzpicture} |
1919 \end{tikzpicture} |
1918 \end{tabular} |
1920 \end{tabular} |
1919 \end{center} |
1921 \end{center} |
1920 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
1922 We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
1921 with our simplification |
1923 with our simplification |
1922 rules very effectively. |
1924 rules very effectively. |
1923 |
1925 |
1924 |
1926 |
1925 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
1927 In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
1926 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
1928 is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
1939 %TODO: giving regex1_size_change.data showing a few regular expressions' size changes |
1941 %TODO: giving regex1_size_change.data showing a few regular expressions' size changes |
1940 %w;r;t the input characters number, where the size is usually cubic in terms of original size |
1942 %w;r;t the input characters number, where the size is usually cubic in terms of original size |
1941 %a*, aa*, aaa*, ..... |
1943 %a*, aa*, aaa*, ..... |
1942 %randomly generated regular expressions |
1944 %randomly generated regular expressions |
1943 \begin{center} |
1945 \begin{center} |
1944 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
1946 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
1945 \begin{tikzpicture} |
1947 \begin{tikzpicture} |
1946 \begin{axis}[ |
1948 \begin{axis}[ |
1947 xlabel={number of $a$'s}, |
1949 xlabel={number of $a$'s}, |
1948 x label style={at={(1.05,-0.05)}}, |
1950 x label style={at={(1.05,-0.05)}}, |
1949 ylabel={regex size}, |
1951 ylabel={regex size}, |
1950 enlargelimits=false, |
1952 enlargelimits=false, |
1951 xtick={0,5,...,30}, |
1953 xtick={0,5,...,30}, |
1952 xmax=33, |
1954 xmax=33, |
1953 ymax=1000, |
1955 ymax=1000, |
1954 ytick={0,100,...,1000}, |
1956 ytick={0,100,...,1000}, |
1955 scaled ticks=false, |
1957 scaled ticks=false, |
1956 axis lines=left, |
1958 axis lines=left, |
1957 width=5cm, |
1959 width=5cm, |
1958 height=4cm, |
1960 height=4cm, |
1959 legend entries={regex1}, |
1961 legend entries={regex1}, |
1960 legend pos=north west, |
1962 legend pos=north west, |
1961 legend cell align=left] |
1963 legend cell align=left] |
1962 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
1964 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
1963 \end{axis} |
1965 \end{axis} |
1964 \end{tikzpicture} |
1966 \end{tikzpicture} |
1965 & |
1967 & |
1966 \begin{tikzpicture} |
1968 \begin{tikzpicture} |
1967 \begin{axis}[ |
1969 \begin{axis}[ |
1968 xlabel={$n$}, |
1970 xlabel={$n$}, |
1969 x label style={at={(1.05,-0.05)}}, |
1971 x label style={at={(1.05,-0.05)}}, |
1970 %ylabel={time in secs}, |
1972 %ylabel={time in secs}, |
1971 enlargelimits=false, |
1973 enlargelimits=false, |
1972 xtick={0,5,...,30}, |
1974 xtick={0,5,...,30}, |
1973 xmax=33, |
1975 xmax=33, |
1974 ymax=1000, |
1976 ymax=1000, |
1975 ytick={0,100,...,1000}, |
1977 ytick={0,100,...,1000}, |
1976 scaled ticks=false, |
1978 scaled ticks=false, |
1977 axis lines=left, |
1979 axis lines=left, |
1978 width=5cm, |
1980 width=5cm, |
1979 height=4cm, |
1981 height=4cm, |
1980 legend entries={regex2}, |
1982 legend entries={regex2}, |
1981 legend pos=north west, |
1983 legend pos=north west, |
1982 legend cell align=left] |
1984 legend cell align=left] |
1983 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
1985 \addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
1984 \end{axis} |
1986 \end{axis} |
1985 \end{tikzpicture} |
1987 \end{tikzpicture} |
1986 & |
1988 & |
1987 \begin{tikzpicture} |
1989 \begin{tikzpicture} |
1988 \begin{axis}[ |
1990 \begin{axis}[ |
1989 xlabel={$n$}, |
1991 xlabel={$n$}, |
1990 x label style={at={(1.05,-0.05)}}, |
1992 x label style={at={(1.05,-0.05)}}, |
1991 %ylabel={time in secs}, |
1993 %ylabel={time in secs}, |
1992 enlargelimits=false, |
1994 enlargelimits=false, |
1993 xtick={0,5,...,30}, |
1995 xtick={0,5,...,30}, |
1994 xmax=33, |
1996 xmax=33, |
1995 ymax=1000, |
1997 ymax=1000, |
1996 ytick={0,100,...,1000}, |
1998 ytick={0,100,...,1000}, |
1997 scaled ticks=false, |
1999 scaled ticks=false, |
1998 axis lines=left, |
2000 axis lines=left, |
1999 width=5cm, |
2001 width=5cm, |
2000 height=4cm, |
2002 height=4cm, |
2001 legend entries={regex3}, |
2003 legend entries={regex3}, |
2002 legend pos=north west, |
2004 legend pos=north west, |
2003 legend cell align=left] |
2005 legend cell align=left] |
2004 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
2006 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
2005 \end{axis} |
2007 \end{axis} |
2006 \end{tikzpicture}\\ |
2008 \end{tikzpicture}\\ |
2007 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} |
2009 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} |
2008 \end{tabular} |
2010 \end{tabular} |
2009 \end{center} |
2011 \end{center} |
2010 \noindent |
2012 \noindent |
2011 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2013 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2012 original size. |
2014 original size. |
2013 We will discuss improvements to this bound in the next chapter. |
2015 We will discuss improvements to this bound in the next chapter. |
2107 backtracking is much less likely to occur in our $\blexersimp$ |
2109 backtracking is much less likely to occur in our $\blexersimp$ |
2108 algorithm. |
2110 algorithm. |
2109 We refine $\blexersimp$ with $\blexerStrong$ in the next chapter |
2111 We refine $\blexersimp$ with $\blexerStrong$ in the next chapter |
2110 so that the bound becomes polynomial. |
2112 so that the bound becomes polynomial. |
2111 \end{itemize} |
2113 \end{itemize} |
2112 |
2114 |
2113 %---------------------------------------------------------------------------------------- |
2115 %---------------------------------------------------------------------------------------- |
2114 % SECTION 4 |
2116 % SECTION 4 |
2115 %---------------------------------------------------------------------------------------- |
2117 %---------------------------------------------------------------------------------------- |
2116 |
2118 |
2117 |
2119 |
2118 |
2120 |
2119 |
2121 |
2120 |
2122 |
2121 |
2123 |
2122 |
2124 |
2123 |
2125 |
2124 One might wonder the actual bound rather than the loose bound we gave |
2126 One might wonder the actual bound rather than the loose bound we gave |
2125 for the convenience of an easier proof. |
2127 for the convenience of an easier proof. |
2126 How much can the regex $r^* \backslash s$ grow? |
2128 How much can the regex $r^* \backslash s$ grow? |
2127 As earlier graphs have shown, |
2129 As earlier graphs have shown, |
2128 %TODO: reference that graph where size grows quickly |
2130 %TODO: reference that graph where size grows quickly |
2129 they can grow at a maximum speed |
2131 they can grow at a maximum speed |
2130 exponential $w.r.t$ the number of characters, |
2132 exponential $w.r.t$ the number of characters, |
2131 but will eventually level off when the string $s$ is long enough. |
2133 but will eventually level off when the string $s$ is long enough. |
2132 If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
2134 If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
2133 would still be slow. |
2135 would still be slow. |
2134 And unfortunately, we have concrete examples |
2136 And unfortunately, we have concrete examples |
2135 where such regular expressions grew exponentially large before levelling off: |
2137 where such regular expressions grew exponentially large before levelling off: |
2136 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2138 $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2137 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
2139 (\underbrace{a \ldots a}_{\text{n a's}})^*$ will already have a maximum |
2138 size that is exponential on the number $n$ |
2140 size that is exponential on the number $n$ |
2139 under our current simplification rules: |
2141 under our current simplification rules: |
2140 %TODO: graph of a regex whose size increases exponentially. |
2142 %TODO: graph of a regex whose size increases exponentially. |
2141 \begin{center} |
2143 \begin{center} |
2142 \begin{tikzpicture} |
2144 \begin{tikzpicture} |
2143 \begin{axis}[ |
2145 \begin{axis}[ |
2144 height=0.5\textwidth, |
2146 height=0.5\textwidth, |
2145 width=\textwidth, |
2147 width=\textwidth, |
2146 xlabel=number of a's, |
2148 xlabel=number of a's, |
2147 xtick={0,...,9}, |
2149 xtick={0,...,9}, |
2148 ylabel=maximum size, |
2150 ylabel=maximum size, |
2149 ymode=log, |
2151 ymode=log, |
2150 log basis y={2} |
2152 log basis y={2} |
2151 ] |
2153 ] |
2152 \addplot[mark=*,blue] table {re-chengsong.data}; |
2154 \addplot[mark=*,blue] table {re-chengsong.data}; |
2153 \end{axis} |
2155 \end{axis} |
2154 \end{tikzpicture} |
2156 \end{tikzpicture} |
2155 \end{center} |
2157 \end{center} |
2156 |
2158 |
2157 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
2159 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
2158 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2160 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2159 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
2161 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
2163 different derivatives, despite those difference being minor. |
2165 different derivatives, despite those difference being minor. |
2164 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
2166 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
2165 will therefore contain the following terms (after flattening out all nested |
2167 will therefore contain the following terms (after flattening out all nested |
2166 alternatives): |
2168 alternatives): |
2167 \begin{center} |
2169 \begin{center} |
2168 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
2170 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
2169 $(1 \leq m' \leq m )$ |
2171 $(1 \leq m' \leq m )$ |
2170 \end{center} |
2172 \end{center} |
2171 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
2173 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
2172 With each new input character taking the derivative against the intermediate result, more and more such distinct |
2174 With each new input character taking the derivative against the intermediate result, more and more such distinct |
2173 terms will accumulate, |
2175 terms will accumulate, |
2174 until the length reaches $L.C.M.(1, \ldots, n)$. |
2176 until the length reaches $L.C.M.(1, \ldots, n)$. |
2175 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
2177 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
2176 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2178 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2177 |
2179 |
2178 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2180 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2179 where $m' \neq m''$ \\ |
2181 where $m' \neq m''$ \\ |
2180 as they are slightly different. |
2182 as they are slightly different. |
2181 This means that with our current simplification methods, |
2183 This means that with our current simplification methods, |
2182 we will not be able to control the derivative so that |
2184 we will not be able to control the derivative so that |
2183 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
2185 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
2184 as there are already exponentially many terms. |
2186 as there are already exponentially many terms. |
2185 These terms are similar in the sense that the head of those terms |
2187 These terms are similar in the sense that the head of those terms |
2186 are all consisted of sub-terms of the form: |
2188 are all consisted of sub-terms of the form: |
2187 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
2189 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
2188 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
2190 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
2189 $n * (n + 1) / 2$ such terms. |
2191 $n * (n + 1) / 2$ such terms. |
2190 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
2192 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
2191 can be described by 6 terms: |
2193 can be described by 6 terms: |
2192 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
2194 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
2193 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
2195 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
2194 The total number of different "head terms", $n * (n + 1) / 2$, |
2196 The total number of different "head terms", $n * (n + 1) / 2$, |
2195 is proportional to the number of characters in the regex |
2197 is proportional to the number of characters in the regex |
2196 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
2198 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
2197 This suggests a slightly different notion of size, which we call the |
2199 This suggests a slightly different notion of size, which we call the |
2198 alphabetic width: |
2200 alphabetic width: |
2199 %TODO: |
2201 %TODO: |
2200 (TODO: Alphabetic width def.) |
2202 (TODO: Alphabetic width def.) |
2201 |
2203 |
2202 |
2204 |
2203 Antimirov\parencite{Antimirov95} has proven that |
2205 Antimirov\parencite{Antimirov95} has proven that |
2204 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
2206 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
2205 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
2207 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
2206 created by doing derivatives of $r$ against all possible strings. |
2208 created by doing derivatives of $r$ against all possible strings. |
2207 If we can make sure that at any moment in our lexing algorithm our |
2209 If we can make sure that at any moment in our lexing algorithm our |