5 \label{Finite} |
5 \label{Finite} |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
6 % In Chapter 4 \ref{Chapter4} we give the second guarantee |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
7 %of our bitcoded algorithm, that is a finite bound on the size of any |
8 %regex's derivatives. |
8 %regex's derivatives. |
9 |
9 |
10 \begin{figure} |
10 |
11 \begin{center} |
11 In this chapter we give a guarantee in terms of size of derivatives: |
12 \begin{tabular}{ccc} |
|
13 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
|
14 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
|
15 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
|
16 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
|
17 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
|
18 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
|
19 \end{tabular} |
|
20 \end{center} |
|
21 \caption{The size function of bitcoded regular expressions}\label{brexpSize} |
|
22 \end{figure} |
|
23 In this chapter we give a guarantee in terms of size: |
|
24 given an annotated regular expression $a$, for any string $s$ |
12 given an annotated regular expression $a$, for any string $s$ |
25 our algorithm $\blexersimp$'s internal annotated regular expression |
13 our algorithm $\blexersimp$'s internal annotated regular expression |
26 size is finitely bounded |
14 size is finitely bounded |
27 by a constant $N_a$ that only depends on $a$: |
15 by a constant that only depends on $a$. |
|
16 Formally we show that there exists an $N_a$ such that |
28 \begin{center} |
17 \begin{center} |
29 $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
18 $\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
30 \end{center} |
19 \end{center} |
31 \noindent |
20 \noindent |
32 where the size of an annotated regular expression is defined |
21 where the size ($\llbracket \_ \rrbracket$) of |
33 in terms of the number of nodes in its tree structure (see figure \ref{brexpSize}). |
22 an annotated regular expression is defined |
|
23 in terms of the number of nodes in its |
|
24 tree structure (its recursive definition given in the next page). |
34 We believe this size bound |
25 We believe this size bound |
35 is important in the context of POSIX lexing, because |
26 is important in the context of POSIX lexing, because |
36 \begin{itemize} |
27 \begin{itemize} |
37 \item |
28 \item |
38 It is a stepping stone towards an ``absence of catastrophic-backtracking'' |
29 It is a stepping stone towards the goal |
39 guarantee. |
30 of eliminating ``catastrophic backtracking''. |
40 If the internal data structures used by our algorithm cannot grow very large, |
31 If the internal data structures used by our algorithm |
41 then our algorithm (which traverses those structures) cannot be too slow. |
32 grows beyond a finite bound, then clearly |
|
33 the algorithm (which traverses these structures) will |
|
34 be slow. |
42 The next step would be to refine the bound $N_a$ so that it |
35 The next step would be to refine the bound $N_a$ so that it |
43 is polynomial on $\llbracket a\rrbracket$. |
36 is polynomial on $\llbracket a\rrbracket$. |
44 \item |
37 \item |
45 Having it formalised gives us a higher confidence that |
38 Having the finite bound formalised |
|
39 gives us a higher confidence that |
46 our simplification algorithm $\simp$ does not ``mis-behave'' |
40 our simplification algorithm $\simp$ does not ``mis-behave'' |
47 like $\simpsulz$ does. |
41 like $\simpsulz$ does. |
48 The bound is universal, which is an advantage over work which |
42 The bound is universal for a given regular expression, |
49 only gives empirical evidence on some test cases. |
43 which is an advantage over work which |
|
44 only gives empirical evidence on some test cases (see Verbatim++\cite{Verbatimpp}). |
50 \end{itemize} |
45 \end{itemize} |
|
46 In the next section we describe in more detail |
|
47 what the finite bound means in our algorithm |
|
48 and why the size of the internal data structures of |
|
49 a typical derivative-based lexer such as |
|
50 Sulzmann and Lu's need formal treatment. |
|
51 |
51 \section{Formalising About Size} |
52 \section{Formalising About Size} |
52 \noindent |
53 \noindent |
53 In our lexer ($\blexersimp$), |
54 In our lexer ($\blexersimp$), |
54 we take an annotated regular expression as input, |
55 we take an annotated regular expression as input, |
55 and repeately take derivative of and simplify it: |
56 and repeately take derivative of and simplify it. |
56 \begin{figure}[H] |
57 \begin{figure} |
|
58 \begin{center} |
|
59 \begin{tabular}{lcl} |
|
60 $\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
|
61 $\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
|
62 $\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
|
63 $\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
|
64 $\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
|
65 $\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
|
66 \end{tabular} |
|
67 \end{center} |
|
68 \caption{The size function of bitcoded regular expressions}\label{brexpSize} |
|
69 \end{figure} |
|
70 |
|
71 \begin{figure} |
57 \begin{tikzpicture}[scale=2, |
72 \begin{tikzpicture}[scale=2, |
58 every node/.style={minimum size=11mm}, |
73 every node/.style={minimum size=11mm}, |
59 ->,>=stealth',shorten >=1pt,auto,thick |
74 ->,>=stealth',shorten >=1pt,auto,thick |
60 ] |
75 ] |
61 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
76 \node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
177 We explain the steps one by one: |
193 We explain the steps one by one: |
178 \begin{itemize} |
194 \begin{itemize} |
179 \item |
195 \item |
180 We first introduce the operations such as |
196 We first introduce the operations such as |
181 derivatives, simplification, size calculation, etc. |
197 derivatives, simplification, size calculation, etc. |
182 associated with $\rrexp$s, which we have given |
198 associated with $\rrexp$s, which we have introduced |
183 a very brief introduction to in chapter \ref{Bitcoded2}. |
199 in chapter \ref{Bitcoded2}. |
184 The operations on $\rrexp$s are identical to those on |
200 The operations on $\rrexp$s are identical to those on |
185 annotated regular expressions except that they are unaware |
201 annotated regular expressions except that they dispense with |
186 of bitcodes. This means that all proofs about size of $\rrexp$s will apply to |
202 bitcodes. This means that all proofs about size of $\rrexp$s will apply to |
187 annotated regular expressions. |
203 annotated regular expressions, because the size of a regular |
|
204 expression is independent of the bitcodes. |
188 \item |
205 \item |
189 We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, |
206 We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, |
190 where $\textit{ClosedForm}(r, s)$ is entirely |
207 where $\textit{ClosedForm}(r, s)$ is entirely |
191 written in the derivatives of their children regular |
208 given as the derivatives of their children regular |
192 expressions. |
209 expressions. |
193 We call the right-hand-side the \emph{Closed Form} |
210 We call the right-hand-side the \emph{Closed Form} |
194 of the derivative $\rderssimp{r}{s}$. |
211 of the derivative $\rderssimp{r}{s}$. |
195 \item |
212 \item |
196 We estimate $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. |
213 Formally we give an estimate of |
|
214 $\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. |
197 The key observation is that $\distinctBy$'s output is |
215 The key observation is that $\distinctBy$'s output is |
198 a list with a constant length bound. |
216 a list with a constant length bound. |
199 \end{itemize} |
217 \end{itemize} |
200 We will expand on these steps in the next sections.\\ |
218 We will expand on these steps in the next sections.\\ |
201 |
219 |
202 \section{The $\textit{Rrexp}$ Datatype} |
220 \section{The $\textit{Rrexp}$ Datatype} |
203 The first step is to define |
221 The first step is to define |
204 $\textit{rrexp}$s. |
222 $\textit{rrexp}$s. |
205 They are without bitcodes, |
223 They are annotated regular expressions without bitcodes, |
206 allowing a much simpler size bound proof. |
224 allowing a much simpler size bound proof. |
207 Of course, the bits which encode the lexing information |
225 %Of course, the bits which encode the lexing information |
208 would grow linearly with respect |
226 %would grow linearly with respect |
209 to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
227 %to the input, which should be taken into account when we wish to tackle the runtime comlexity. |
210 But for the sake of the structural size |
228 %But for the sake of the structural size |
211 we can safely ignore them.\\ |
229 %we can safely ignore them.\\ |
212 To recapitulate, the datatype |
230 The datatype |
213 definition of the $\rrexp$, called |
231 definition of the $\rrexp$, called |
214 \emph{r-regular expressions}, |
232 \emph{r-regular expressions}, |
215 was initially defined in \ref{rrexpDef}. |
233 was initially defined in \ref{rrexpDef}. |
216 The reason for the prefix $r$ is |
234 The reason for the prefix $r$ is |
217 to make a distinction |
235 to make a distinction |
252 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
270 $\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
253 \end{tabular} |
271 \end{tabular} |
254 \end{center} |
272 \end{center} |
255 |
273 |
256 \subsection{Why a New Datatype?} |
274 \subsection{Why a New Datatype?} |
257 The reason we take all the trouble |
275 The reason we |
258 defining a new datatype is that $\erase$ makes things harder. |
276 define a new datatype is that |
|
277 the $\erase$ function |
|
278 does not preserve the structure of annotated |
|
279 regular expressions. |
259 We initially started by using |
280 We initially started by using |
260 plain regular expressions and tried to prove |
281 plain regular expressions and tried to prove |
261 the lemma \ref{rsizeAsize}, |
282 lemma \ref{rsizeAsize}, |
262 however the $\erase$ function unavoidbly messes with the structure of the |
283 however the $\erase$ function messes with the structure of the |
263 annotated regular expression. |
284 annotated regular expression. |
264 The $+$ constructor |
285 The $+$ constructor |
265 of basic regular expressions is binary whereas $\sum$ |
286 of basic regular expressions is only binary, whereas $\sum$ |
266 takes a list, and one has to convert between them: |
287 takes a list. Therefore we need to convert between |
267 \begin{center} |
288 annotated and normal regular expressions as follows: |
268 \begin{tabular}{ccc} |
289 \begin{center} |
|
290 \begin{tabular}{lcl} |
269 $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ |
291 $\erase \; _{bs}\sum [] $ & $\dn$ & $\ZERO$\\ |
270 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
292 $\erase \; _{bs}\sum [a]$ & $\dn$ & $a$\\ |
271 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
293 $\erase \; _{bs}\sum a :: as$ & $\dn$ & $a + (\erase \; _{[]} \sum as)\quad \text{if $as$ length over 1}$ |
272 \end{tabular} |
294 \end{tabular} |
273 \end{center} |
295 \end{center} |
274 \noindent |
296 \noindent |
275 An alternative regular expression with an empty argument list |
297 As can be seen alternative regular expression with an empty argument list |
276 will be turned into a $\ZERO$. |
298 will be turned into a $\ZERO$. |
277 The singleton alternative $\sum [r]$ would have $r$ during the |
299 The singleton alternative $\sum [r]$ becomes $r$ during the |
278 $\erase$ function. |
300 $\erase$ function. |
279 The annotated regular expression $\sum[a, b, c]$ would turn into |
301 The annotated regular expression $\sum[a, b, c]$ would turn into |
280 $(a+(b+c))$. |
302 $(a+(b+c))$. |
281 All these operations change the size and structure of |
303 All these operations change the size and structure of |
282 an annotated regular expressions, adding unnecessary |
304 an annotated regular expressions, adding unnecessary |
283 complexities to the size bound proof.\\ |
305 complexities to the size bound proof.\\ |
284 For example, if we define the size of a basic plain regular expression |
306 For example, if we define the size of a basic plain regular expression |
285 in the usual way, |
307 in the usual way, |
286 \begin{center} |
308 \begin{center} |
287 \begin{tabular}{ccc} |
309 \begin{tabular}{lcl} |
288 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
310 $\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
289 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
311 $\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
290 $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ |
312 $\llbracket r_1 \cdot r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ |
291 $\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ |
313 $\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ |
292 $\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ |
314 $\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ |
315 $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
337 $\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
316 and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
338 and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
317 but we found our approach more straightforward.\\ |
339 but we found our approach more straightforward.\\ |
318 |
340 |
319 \subsection{Functions for R-regular Expressions} |
341 \subsection{Functions for R-regular Expressions} |
320 We shall define the r-regular expression version |
342 In this section we shall define the r-regular expression version |
321 of $\blexer$ and $\blexersimp$ related functions. |
343 of $\blexer$, and $\blexersimp$ related functions. |
322 We use $r$ as the prefix or subscript to differentiate |
344 We use $r$ as the prefix or subscript to differentiate |
323 with the bitcoded version. |
345 with the bitcoded version. |
324 For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ |
346 %For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ |
325 as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. |
347 %as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. |
326 As promised, they are much simpler than their bitcoded counterparts. |
348 %As promised, they are much simpler than their bitcoded counterparts. |
327 %The operations on r-regular expressions are |
349 %The operations on r-regular expressions are |
328 %almost identical to those of the annotated regular expressions, |
350 %almost identical to those of the annotated regular expressions, |
329 %except that no bitcodes are used. For example, |
351 %except that no bitcodes are used. For example, |
330 The derivative operation becomes simpler:\\ |
352 The derivative operation for an r-regular expression is\\ |
331 \begin{center} |
353 \begin{center} |
332 \begin{tabular}{@{}lcl@{}} |
354 \begin{tabular}{@{}lcl@{}} |
333 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
355 $(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
334 $(\ONE)\,\backslash_r c$ & $\dn$ & |
356 $(\ONE)\,\backslash_r c$ & $\dn$ & |
335 $\textit{if}\;c=d\; \;\textit{then}\; |
357 $\textit{if}\;c=d\; \;\textit{then}\; |
545 \end{center} |
569 \end{center} |
546 We call the right-hand-side the |
570 We call the right-hand-side the |
547 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
571 \emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
548 Second, we will bound the closed form of r-regular expressions |
572 Second, we will bound the closed form of r-regular expressions |
549 using some estimation techniques |
573 using some estimation techniques |
550 and then piece it together |
574 and then apply |
551 with lemma \ref{sizeRelations} to show the bitcoded regular expressions |
575 lemma \ref{sizeRelations} to show that the bitcoded regular expressions |
552 in our $\blexersimp$ are finitely bounded. |
576 in our $\blexersimp$ are finitely bounded. |
553 |
577 |
554 We will flesh out the first step of the proof we |
578 We will describe in detail the first step of the proof |
555 sketched just now in the next section. |
579 in the next section. |
556 |
580 |
557 \section{Closed Forms} |
581 \section{Closed Forms} |
558 In this section we introduce in detail |
582 In this section we introduce in detail |
559 how the closed forms are obtained for regular expressions' |
583 how we express the string derivatives |
560 derivatives. |
584 of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string |
|
585 rather than a single character) in a different way than |
|
586 our previous definition. |
|
587 In previous chapters, the derivative of a |
|
588 regular expression $r$ w.r.t a string $s$ |
|
589 was recursively defined on the string: |
|
590 \begin{center} |
|
591 $r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$ |
|
592 \end{center} |
|
593 The problem is that |
|
594 this definition does not provide much information |
|
595 on what $r \backslash_s s$ looks like. |
|
596 If we are interested in the size of a derivative like |
|
597 $(r_1 \cdot r_2)\backslash s$, |
|
598 we have to somehow get a more concrete form to begin. |
|
599 We call such more concrete representations the ``closed forms'' of |
|
600 string derivatives as opposed to their original definitions. |
|
601 The terminology ``closed form'' is borrowed from mathematics, |
|
602 which usually describe expressions that are solely comprised of |
|
603 well-known and easy-to-compute operations such as |
|
604 additions, multiplications, exponential functions. |
|
605 |
561 We start by proving some basic identities |
606 We start by proving some basic identities |
562 involving the simplification functions for r-regular expressions. |
607 involving the simplification functions for r-regular expressions. |
563 After that we introduce the rewrite relations |
608 After that we introduce the rewrite relations |
564 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ |
609 $\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ |
565 $\rightsquigarrow_f$ and $\rightsquigarrow_g$. |
610 $\rightsquigarrow_f$ and $\rightsquigarrow_g$. |
566 These relations involves similar techniques in chapter \ref{Bitcoded2}. |
611 These relations involves similar techniques as in chapter \ref{Bitcoded2} |
|
612 for annotated regular expressions. |
567 Finally, we use these identities to establish the |
613 Finally, we use these identities to establish the |
568 closed forms of the alternative regular expression, |
614 closed forms of the alternative regular expression, |
569 the sequence regular expression, and the star regular expression. |
615 the sequence regular expression, and the star regular expression. |
570 %$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
616 %$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
571 |
617 |
572 |
618 |
573 |
619 |
574 \subsection{Some Basic Identities} |
620 \subsection{Some Basic Identities} |
575 |
621 |
576 We now introduce lemmas |
622 In what follows we will often convert between lists |
577 that are repeatedly used in later proofs. |
623 and sets. |
578 Note that for the $\textit{rdistinct}$ function there |
624 We use Isabelle's $set$ to refere to the |
579 will be a lot of conversion from lists to sets. |
|
580 We use $set$ to refere to the |
|
581 function that converts a list $rs$ to the set |
625 function that converts a list $rs$ to the set |
582 containing all the elements in $rs$. |
626 containing all the elements in $rs$. |
583 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} |
627 \subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} |
584 The $\textit{rdistinct}$ function, as its name suggests, will |
628 The $\textit{rdistinct}$ function, as its name suggests, will |
585 de-duplicate an r-regular expression list. |
629 de-duplicate an r-regular expression list. |
818 \end{proof} |
865 \end{proof} |
819 |
866 |
820 \subsubsection{Simplified $\textit{Rrexp}$s are Good} |
867 \subsubsection{Simplified $\textit{Rrexp}$s are Good} |
821 We formalise the notion of ``good" regular expressions, |
868 We formalise the notion of ``good" regular expressions, |
822 which means regular expressions that |
869 which means regular expressions that |
823 are not fully simplified. For alternative regular expressions that means they |
870 are fully simplified in terms of our $\textit{rsimp}$ function. |
824 do not contain any nested alternatives like |
871 For alternative regular expressions that means they |
825 \[ r_1 + (r_2 + r_3) \], un-removed $\RZERO$s like \[\RZERO + r\] |
872 do not contain any nested alternatives, un-eliminated $\RZERO$s |
826 or duplicate elements in a children regular expression list like \[ \sum [r, r, \ldots]\]: |
873 or duplicate elements (for example, |
|
874 $r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$). |
|
875 The clauses for $\good$ are: |
827 \begin{center} |
876 \begin{center} |
828 \begin{tabular}{@{}lcl@{}} |
877 \begin{tabular}{@{}lcl@{}} |
829 $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ |
878 $\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ |
830 $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ |
879 $\good\; \RONE$ & $\dn$ & $\textit{true}$\\ |
831 $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ |
880 $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ |
832 $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ |
881 $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ |
833 $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ |
882 $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ |
834 $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & |
883 $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & |
835 $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ |
884 $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ |
836 & & $\textit{and}\; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \textit{and}\; \, \textit{nonAlt}\; r')$\\ |
885 & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\ |
837 $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ |
886 $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ |
838 $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ |
887 $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ |
839 $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ |
888 $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ |
840 $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ |
889 $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ |
841 $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ |
890 $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ |
842 \end{tabular} |
891 \end{tabular} |
843 \end{center} |
892 \end{center} |
844 \noindent |
893 \noindent |
845 The predicate $\textit{nonAlt}$ evaluates to true when the regular expression is not an |
894 We omit the recursive definition of the predicate $\textit{nonAlt}$, |
|
895 which evaluates to true when the regular expression is not an |
846 alternative, and false otherwise. |
896 alternative, and false otherwise. |
847 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that |
897 The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that |
848 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, |
898 its non-empty argument list of expressions are all good themsleves, and $\textit{nonAlt}$, |
849 and unique: |
899 and unique: |
850 \begin{lemma}\label{rsimpaltsGood} |
900 \begin{lemma}\label{rsimpaltsGood} |
1234 $\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites |
1298 $\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites |
1235 \rflts \; (\map \; (\_ \backslash x) \; rs)$ |
1299 \rflts \; (\map \; (\_ \backslash x) \; rs)$ |
1236 \end{lemma} |
1300 \end{lemma} |
1237 \noindent |
1301 \noindent |
1238 and then the equivalence between two terms |
1302 and then the equivalence between two terms |
1239 that can reduce in many steps to each other. |
1303 that can reduce in many steps to each other: |
1240 \begin{lemma}\label{frewritesSimpeq} |
1304 \begin{lemma}\label{frewritesSimpeq} |
1241 If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal |
1305 If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal |
1242 \sum (\rDistinct \; rs_2 \; \varnothing)$. |
1306 \sum (\rDistinct \; rs_2 \; \varnothing)$. |
1243 \end{lemma} |
1307 \end{lemma} |
1244 \noindent |
1308 \noindent |
1245 |
1309 These two lemmas can both be proven using a straightforward induction (and |
|
1310 the proofs for them are therefore omitted). |
|
1311 |
|
1312 Now the above equalities can be derived like a breeze: |
1246 \begin{corollary} |
1313 \begin{corollary} |
1247 $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
1314 $\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
1248 \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1315 \sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1249 $ |
1316 $ |
1250 \end{corollary} |
1317 \end{corollary} |
1251 |
1318 \begin{proof} |
|
1319 By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}. |
|
1320 \end{proof} |
1252 But this trick will not work for $\grewrites$. |
1321 But this trick will not work for $\grewrites$. |
1253 For example, a rewriting step in proving |
1322 For example, a rewriting step in proving |
1254 closed forms is: |
1323 closed forms is: |
1255 \begin{center} |
1324 \begin{center} |
1256 $\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ |
1325 $\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ |
1257 $=$ \\ |
1326 $=$ \\ |
1258 $\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ |
1327 $\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ |
1259 \noindent |
1328 \noindent |
1260 \end{center} |
1329 \end{center} |
1261 For this one would hope to have a rewriting relation between the two lists involved, |
1330 For this, one would hope to have a rewriting relation between the two lists involved, |
1262 similar to \ref{earlyLaterDerFrewrites}. However, it turns out that |
1331 similar to \ref{earlyLaterDerFrewrites}. However, it turns out that |
1263 \begin{center} |
1332 \begin{center} |
1264 $\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; |
1333 $\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; |
1265 (\_ \backslash x) \; rs) \; ( rset \backslash x)$ |
1334 (\_ \backslash x) \; rs) \; ( rset \backslash x)$ |
1266 \end{center} |
1335 \end{center} |
1267 \noindent |
1336 \noindent |
1268 does $\mathbf{not}$ hold in general. |
1337 does $\mathbf{not}$ hold in general. |
1269 For this rewriting step we will introduce some slightly more cumbersome |
1338 For this rewriting step we will introduce some slightly more cumbersome |
1270 proof technique in later sections. |
1339 proof technique later. |
1271 The point is that $\frewrite$ |
1340 The point is that $\frewrite$ |
1272 allows us to prove equivalence in a straightforward way that is |
1341 allows us to prove equivalence in a straightforward way that is |
1273 not possible for $\grewrite$. |
1342 not possible for $\grewrite$. |
1274 |
1343 |
1275 |
1344 |
1276 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
1345 \subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
1277 In this part, we present lemmas stating |
1346 In this part, we present lemmas stating |
1278 pairs of r-regular expressions or r-regular expression lists |
1347 pairs of r-regular expressions and r-regular expression lists |
1279 where one could rewrite from one in many steps to the other. |
1348 where one could rewrite from one in many steps to the other. |
1280 Most of the proofs to these lemmas are straightforward, using |
1349 Most of the proofs to these lemmas are straightforward, using |
1281 an induction on the inductive cases of the corresponding rewriting relations. |
1350 an induction on the corresponding rewriting relations. |
1282 These proofs will therefore be omitted when this is the case. |
1351 These proofs will therefore be omitted when this is the case. |
1283 We present in the below lemma a few pairs of terms that are rewritable via |
1352 We present in the following lemma a few pairs of terms that are rewritable via |
1284 $\grewrites$: |
1353 $\grewrites$: |
1285 \begin{lemma}\label{gstarRdistinctGeneral} |
1354 \begin{lemma}\label{gstarRdistinctGeneral} |
1286 \mbox{} |
1355 \mbox{} |
1287 \begin{itemize} |
1356 \begin{itemize} |
1288 \item |
1357 \item |
1412 Before we obtain them, some preliminary definitions |
1484 Before we obtain them, some preliminary definitions |
1413 are needed to make proof statements concise. |
1485 are needed to make proof statements concise. |
1414 |
1486 |
1415 |
1487 |
1416 \subsubsection{Closed Form for Sequence Regular Expressions} |
1488 \subsubsection{Closed Form for Sequence Regular Expressions} |
1417 First let's look at a series of derivatives steps on a sequence |
1489 For the sequence regular expression, |
1418 regular expression, assuming that each time the first |
1490 let's first look at a series of derivative steps on it |
1419 component of the sequence is always nullable): |
1491 (assuming that each time when a derivative is taken, |
1420 \begin{center} |
1492 the head of the sequence is always nullable): |
1421 |
1493 \begin{center} |
1422 $r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\ |
1494 \begin{tabular}{llll} |
1423 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad |
1495 $r_1 \cdot r_2$ & |
1424 \ldots$ |
1496 $\longrightarrow_{\backslash c}$ & |
1425 |
1497 $r_1\backslash c \cdot r_2 + r_2 \backslash c$ & |
|
1498 $ \longrightarrow_{\backslash c'} $ \\ |
|
1499 \\ |
|
1500 $(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & |
|
1501 $\longrightarrow_{\backslash c''} $ & |
|
1502 $((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') |
|
1503 + r_2 \backslash cc'c''$ & |
|
1504 $ \longrightarrow_{\backslash c''} \quad \ldots$\\ |
|
1505 \end{tabular} |
1426 \end{center} |
1506 \end{center} |
1427 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as |
1507 Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expresssed as |
1428 a giant alternative taking a list of terms |
1508 a giant alternative taking a list of terms |
1429 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, |
1509 $[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, |
1430 where the head of the list is always the term |
1510 where the head of the list is always the term |
1431 representing a match involving only $r_1$, and the tail of the list consisting of |
1511 representing a match involving only $r_1$, and the tail of the list consisting of |
1432 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. |
1512 terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. |
1433 This intuition is also echoed by IndianPaper, where they gave |
1513 This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, |
|
1514 where they gave |
1434 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
1515 a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
1435 \begin{center} |
1516 \begin{center} |
1436 \begin{tabular}{c} |
1517 \begin{tabular}{lc} |
1437 $(r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) \myequiv$\\ |
1518 $L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\ |
1438 \rule{0pt}{3ex} $((r_1 \backslash_r c_1) \cdot r_2 + (\delta\; (\rnullable \; r_1) \; r_2 \backslash_r c_1)) \backslash_r (c_2 :: \ldots c_n) |
1519 \\ |
1439 \myequiv$\\ |
1520 \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + |
1440 \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + (\delta \; (\rnullable \; r_1) \; r_2 \backslash_r c_1c_2)) |
1521 (\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r |
1441 + (\delta \ (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2)) \backslash_r (c_3 \ldots c_n) |
1522 (c_2 :: \ldots c_n) ]$ & |
1442 $ |
1523 $=$\\ |
1443 \end{tabular} |
1524 \\ |
1444 \end{center} |
1525 \rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + |
1445 \noindent |
1526 (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
1446 The equality in above should be interpretated |
1527 $ & \\ |
1447 as language equivalence. |
1528 \\ |
1448 The $\delta$ function works similarly to that of |
1529 $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) |
1449 a Kronecker delta function: |
1530 \backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\ |
1450 \[ \delta \; b\; r\] |
1531 \end{tabular} |
1451 will produce $r$ |
1532 \end{center} |
1452 if $b$ evaluates to true, |
1533 \noindent |
1453 and $\RZERO$ otherwise. |
1534 The $\delta$ function |
1454 Note that their formulation |
1535 returns $r$ when the boolean condition |
1455 \[ |
1536 $b$ evaluates to true and |
1456 ((r_1 \backslash_r \, c_1c_2 \cdot r_2 + (\delta \; (\rnullable) \; r_1, r_2 \backslash_r c_1c_2) |
1537 $\ZERO$ otherwise: |
1457 + (\delta \; (\rnullable \; r_1 \backslash_r c)\; r_2 \backslash_r c_2) |
1538 \begin{center} |
1458 \] |
1539 \begin{tabular}{lcl} |
|
1540 $\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\ |
|
1541 & $\dn$ & $\ZERO \quad otherwise$ |
|
1542 \end{tabular} |
|
1543 \end{center} |
|
1544 \noindent |
|
1545 Note that the term |
|
1546 \begin{center} |
|
1547 \begin{tabular}{lc} |
|
1548 \rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + |
|
1549 (\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
|
1550 $ & \\ |
|
1551 \\ |
|
1552 $\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) |
|
1553 \backslash_r (c_3 \ldots c_n)$ &\\ |
|
1554 \end{tabular} |
|
1555 \end{center} |
|
1556 \noindent |
1459 does not faithfully |
1557 does not faithfully |
1460 represent what the intermediate derivatives would actually look like |
1558 represent what the intermediate derivatives would actually look like |
1461 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not |
1559 when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not |
1462 nullable in the head of the sequence. |
1560 nullable in the head of the sequence. |
1463 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, |
1561 For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, |
1464 the regular expression would not look like |
1562 the regular expression would not look like |
1465 \[ |
1563 \[ |
1466 (r_1 \backslash_r c_1c_2 + \RZERO ) + \RZERO, |
1564 r_1 \backslash_r c_1c_2 |
1467 \] |
1565 \] |
1468 but actually $r_1 \backslash_r c_1c_2$, the redundant $\RZERO$s will not be created in the |
1566 instead of |
|
1567 \[ |
|
1568 (r_1 \backslash_r c_1c_2 + \ZERO ) + \ZERO. |
|
1569 \] |
|
1570 The redundant $\ZERO$s will not be created in the |
1469 first place. |
1571 first place. |
1470 In a closed-form one would want to take into account this |
1572 In a closed-form one needs to take into account this (because |
1471 and generate the list of |
1573 closed forms require exact equality rather than language equivalence) |
1472 regular expressions $r_2 \backslash_r s''$ with |
1574 and only generate the |
1473 string pairs $(s', s'')$ where $s'@s'' = s$ and |
1575 $r_2 \backslash_r s''$ terms satisfying the property |
1474 $r_1 \backslash s'$ nullable. |
1576 \begin{center} |
1475 We denote the list consisting of such |
1577 $\exists s'. such \; that \; s'@s'' = s \;\; \land \;\; |
1476 strings $s''$ as $\vsuf{s}{r_1}$. |
1578 r_1 \backslash s' \; is \; nullable$. |
1477 |
1579 \end{center} |
1478 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string: |
1580 Given the arguments $s$ and $r_1$, we denote the list of strings |
|
1581 $s''$ satisfying the above property as $\vsuf{s}{r_1}$. |
|
1582 The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ |
|
1583 Perhaps a better name of it would be ``NullablePrefixSuffix'' |
|
1584 to differentiate with the list of \emph{all} prefixes of $s$, but |
|
1585 that is a bit too long for a function name and we are yet to find |
|
1586 a more concise and easy-to-understand name.} |
1479 \begin{center} |
1587 \begin{center} |
1480 \begin{tabular}{lcl} |
1588 \begin{tabular}{lcl} |
1481 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
1589 $\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
1482 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
1590 $\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
1483 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
1591 && $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
1484 \end{tabular} |
1592 \end{tabular} |
1485 \end{center} |
1593 \end{center} |
1486 \noindent |
1594 \noindent |
1487 The list is sorted in the order $r_2\backslash s''$ |
1595 The list starts with shorter suffixes |
1488 appears in $(r_1\cdot r_2)\backslash s$. |
1596 and ends with longer ones (in other words, the string elements $s''$ |
|
1597 in the list $\vsuf{s}{r_1}$ are sorted |
|
1598 in the same order as that of the terms $r_2\backslash s''$ |
|
1599 appearing in $(r_1\cdot r_2)\backslash s$). |
1489 In essence, $\vsuf{\_}{\_}$ is doing a |
1600 In essence, $\vsuf{\_}{\_}$ is doing a |
1490 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
1601 "virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
1491 the entire result $(r_1 \cdot r_2) \backslash s$, |
1602 the entire result $(r_1 \cdot r_2) \backslash s$, |
1492 it only stores all the strings $s''$ such that $r_2 \backslash s''$ |
1603 it only stores strings, |
1493 are occurring terms in $(r_1\cdot r_2)\backslash s$. |
1604 with each string $s''$ representing a term such that $r_2 \backslash s''$ |
1494 |
1605 is occurring in $(r_1\cdot r_2)\backslash s$. |
1495 To make the closed form representation |
1606 |
1496 more straightforward, |
1607 With $\textit{Suffix}$ we are ready to express the |
1497 the flattetning function $\sflat{\_}$ is used to enable the transformation from |
1608 sequence regular expression's closed form, |
|
1609 but before doing so |
|
1610 more devices are needed. |
|
1611 The first thing is the flattening function $\sflat{\_}$, |
|
1612 which takes an alternative regular expression and produces a flattened version |
|
1613 of that alternative regular expression. |
|
1614 It is needed to convert |
1498 a left-associative nested sequence of alternatives into |
1615 a left-associative nested sequence of alternatives into |
1499 a flattened list: |
1616 a flattened list: |
1500 \[ |
1617 \[ |
1501 \sum [r_1, r_2, r_3, \ldots] \stackrel{\sflat{\_}}{\rightarrow} |
1618 \sum(\ldots ((r_1 + r_2) + r_3) + \ldots) |
1502 (\ldots ((r_1 + r_2) + r_3) + \ldots) |
1619 \stackrel{\sflat{\_}}{\rightarrow} |
|
1620 \sum[r_1, r_2, r_3, \ldots] |
1503 \] |
1621 \] |
1504 \noindent |
1622 \noindent |
1505 The definitions $\sflat{\_}$, $\sflataux{\_}$ are given below. |
1623 The definitions of $\sflat{\_}$ and helper functions |
|
1624 $\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below. |
1506 \begin{center} |
1625 \begin{center} |
1507 \begin{tabular}{ccc} |
1626 \begin{tabular}{lcl} |
1508 $\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\ |
1627 $\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\ |
1509 $\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\ |
1628 $\sflataux{\sum []}$ & $ \dn $ & $ []$\\ |
1510 $\sflataux r$ & $=$ & $ [r]$ |
1629 $\sflataux r$ & $\dn$ & $ [r]$ |
1511 \end{tabular} |
1630 \end{tabular} |
1512 \end{center} |
1631 \end{center} |
1513 |
1632 |
1514 \begin{center} |
1633 \begin{center} |
1515 \begin{tabular}{ccc} |
1634 \begin{tabular}{lcl} |
1516 $\sflat{(\sum r :: rs)}$ & $=$ & $\sum (\sflataux{r} @ rs)$\\ |
1635 $\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\ |
1517 $\sflat{\sum []}$ & $ = $ & $ \sum []$\\ |
1636 $\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\ |
1518 $\sflat r$ & $=$ & $ r$ |
1637 $\sflat r$ & $\dn$ & $ r$ |
|
1638 \end{tabular} |
|
1639 \end{center} |
|
1640 |
|
1641 \begin{center} |
|
1642 \begin{tabular}{lcl} |
|
1643 $\sflataux{[]}'$ & $ \dn $ & $ []$\\ |
|
1644 $\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ |
|
1645 $\sflataux{r :: rs}$ & $\dn$ & $ r::rs$ |
1519 \end{tabular} |
1646 \end{tabular} |
1520 \end{center} |
1647 \end{center} |
1521 \noindent |
1648 \noindent |
1522 $\sflataux{\_}$ breaks up nested alternative regular expressions |
1649 $\sflataux{\_}$ breaks up nested alternative regular expressions |
1523 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1650 of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
1525 It will return the singleton list $[r]$ otherwise. |
1652 It will return the singleton list $[r]$ otherwise. |
1526 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
1653 $\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
1527 the output type a regular expression, not a list. |
1654 the output type a regular expression, not a list. |
1528 $\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the |
1655 $\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the |
1529 first element of the list. |
1656 first element of the list. |
1530 |
1657 $\sflataux{\_}'$ takes a list of regular expressions as input, and outputs |
1531 With $\sflataux{}$ a preliminary to the closed form can be stated, |
1658 a list of regular expressions. |
1532 where the derivative of $r_1 \cdot r_2 \backslash s$ can be |
1659 The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have |
1533 flattened into a list whose head and tail meet the description |
1660 $\textit{createdBySequence}$ defined: |
1534 we gave earlier. |
1661 \begin{center} |
|
1662 \begin{mathpar} |
|
1663 \inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)} |
|
1664 |
|
1665 \inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \; |
|
1666 (r_1 + r_2)} |
|
1667 \end{mathpar} |
|
1668 \end{center} |
|
1669 \noindent |
|
1670 The predicate $\textit{createdBySequence}$ is used to describe the shape of |
|
1671 the derivative regular expressions $(r_1\cdot r_2) \backslash s$: |
|
1672 \begin{lemma}\label{recursivelyDerseq} |
|
1673 It is always the case that |
|
1674 \begin{center} |
|
1675 $\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $ |
|
1676 \end{center} |
|
1677 holds. |
|
1678 \end{lemma} |
|
1679 \begin{proof} |
|
1680 By a reverse induction on the string $s$, where the inductive cases are $[]$ |
|
1681 and $xs @ [x]$. |
|
1682 \end{proof} |
|
1683 \noindent |
|
1684 If we have a regular expression $r$ whose shpae |
|
1685 fits into those described by $\textit{createdBySequence}$, |
|
1686 then we can convert between |
|
1687 $r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with |
|
1688 $\sflataux{\_}'$: |
|
1689 \begin{lemma}\label{sfauIdemDer} |
|
1690 If $\textit{createdBySequence} \; r$, then |
|
1691 \begin{center} |
|
1692 $\sflataux{ r \backslash_r c} = |
|
1693 \llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$ |
|
1694 \end{center} |
|
1695 holds. |
|
1696 \end{lemma} |
|
1697 \begin{proof} |
|
1698 By a simple induction on the inductive cases of $\textit{createdBySequence}. |
|
1699 $ |
|
1700 \end{proof} |
|
1701 |
|
1702 Now we are ready to express |
|
1703 the shape of $r_1 \cdot r_2 \backslash s$ |
1535 \begin{lemma}\label{seqSfau0} |
1704 \begin{lemma}\label{seqSfau0} |
1536 $\sflataux{\rders{(r_1 \cdot r_2) \backslash s }} = (r_1 \backslash_r s) \cdot r_2 |
1705 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
1537 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
1706 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
1538 \end{lemma} |
1707 \end{lemma} |
1539 \begin{proof} |
1708 \begin{proof} |
1540 By an induction on the string $s$, where the inductive cases |
1709 By a reverse induction on the string $s$, where the inductive cases |
1541 are split as $[]$ and $xs @ [x]$. |
1710 are $[]$ and $xs @ [x]$. |
1542 Note the key identify holds: |
1711 For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) |
|
1712 \backslash_r xs)$ holds from lemma \ref{recursivelyDerseq}, |
|
1713 which can be used to prove |
1543 \[ |
1714 \[ |
1544 \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; |
1715 \map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; |
1545 \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) |
1716 \map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) |
1546 \] |
1717 \] |
1547 = |
1718 = |
1548 \[ |
1719 \[ |
1549 \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) |
1720 \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) |
1550 \] |
1721 \] |
1551 This enables the inductive case to go through. |
1722 using lemma \ref{sfauIdemDer}. |
|
1723 This equality enables the inductive case to go through. |
1552 \end{proof} |
1724 \end{proof} |
1553 \noindent |
1725 \noindent |
1554 Note that this lemma does $\mathbf{not}$ depend on any |
1726 This lemma says that $(r_1\cdot r_2)\backslash s$ |
1555 specific definitions we used, |
1727 can be flattened into a list whose head and tail meet the description |
1556 allowing people investigating derivatives to get an alternative |
1728 we gave earlier. |
1557 view of what $r_1 \cdot r_2$ is. |
1729 %Note that this lemma does $\mathbf{not}$ depend on any |
1558 |
1730 %specific definitions we used, |
1559 Now we are able to use this for the intuition that |
1731 %allowing people investigating derivatives to get an alternative |
1560 the different ways in which regular expressions are |
1732 %view of what $r_1 \cdot r_2$ is. |
1561 nested do not matter under $\rsimp{}$: |
1733 |
1562 \begin{center} |
1734 We now use $\textit{createdBySequence}$ and |
1563 $\rsimp{r} \stackrel{?}{\sequal} \rsimp{r'}$ if $r = \sum [r_1, r_2, r_3, \ldots]$ |
1735 $\sflataux{\_}$ to describe an intuition |
1564 and $r' =(\ldots ((r_1 + r_2) + r_3) + \ldots)$ |
1736 behind the closed form of the sequence closed form. |
1565 \end{center} |
1737 If two regular expressions only differ in the way their |
1566 Simply wrap with $\sum$ constructor and add |
1738 alternatives are nested, then we should be able to get the same result |
1567 simplifications to both sides of \ref{seqSfau0} |
1739 once we apply simplification to both of them: |
1568 and one gets |
1740 \begin{lemma}\label{sflatRsimpeq} |
1569 \begin{corollary}\label{seqClosedFormGeneral} |
1741 If $r$ is created from a sequence through |
|
1742 a series of derivatives |
|
1743 (i.e. if $\textit{createdBySequence} \; r$ holds), |
|
1744 and that $\sflataux{r} = rs$, |
|
1745 then we have |
|
1746 that |
|
1747 \begin{center} |
|
1748 $\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$ |
|
1749 \end{center} |
|
1750 holds. |
|
1751 \end{lemma} |
|
1752 \begin{proof} |
|
1753 By an induction on the inductive cases of $\textit{createdBySequence}$. |
|
1754 \end{proof} |
|
1755 |
|
1756 Now we are ready for the closed form |
|
1757 for the sequence regular expressions (without the inner applications |
|
1758 of simplifications): |
|
1759 \begin{lemma}\label{seqClosedFormGeneral} |
1570 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } |
1760 $\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } |
1571 =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: |
1761 =\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: |
1572 \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
1762 \map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
1573 \end{corollary} |
1763 \end{lemma} |
|
1764 \begin{proof} |
|
1765 We know that |
|
1766 $\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
|
1767 :: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r1))$ |
|
1768 holds |
|
1769 by lemma \ref{seqSfau0}. |
|
1770 This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. |
|
1771 \end{proof} |
1574 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1772 Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1575 it is possible to convert the above lemma to obtain a "closed form" |
1773 it is possible to convert the above lemma to obtain the |
|
1774 proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$: |
1576 for derivatives nested with simplification: |
1775 for derivatives nested with simplification: |
1577 \begin{lemma}\label{seqClosedForm} |
1776 \begin{theorem}\label{seqClosedForm} |
1578 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
1777 $\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
1579 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ |
1778 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ |
1580 \end{lemma} |
1779 \end{theorem} |
1581 \begin{proof} |
1780 \begin{proof} |
1582 By a case analysis of string $s$. |
1781 By a case analysis of the string $s$. |
1583 When $s$ is empty list, the rewrite is straightforward. |
1782 When $s$ is an empty list, the rewrite is straightforward. |
1584 When $s$ is a list, one could use the corollary \ref{seqSfau0}, |
1783 When $s$ is a non-empty list, the |
1585 and lemma \ref{Simpders} to rewrite the left-hand-side. |
1784 lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply, |
1586 \end{proof} |
1785 making the proof go through. |
1587 As a corollary for this closed form, one can estimate the size |
1786 \end{proof} |
1588 of the sequence derivative $r_1 \cdot r_2 \backslash_r s$ using |
|
1589 an easier-to-handle expression: |
|
1590 \begin{corollary}\label{seqEstimate1} |
|
1591 \begin{center} |
|
1592 |
|
1593 $\llbracket \rderssimp{(r_1 \cdot r_2)}{s} \rrbracket_r = \llbracket \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
|
1594 :: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))} \rrbracket_r$ |
|
1595 |
|
1596 \end{center} |
|
1597 \end{corollary} |
|
1598 \noindent |
|
1599 \subsubsection{Closed Forms for Star Regular Expressions} |
1787 \subsubsection{Closed Forms for Star Regular Expressions} |
1600 We have shown how to control the size of the sequence regular expression $r_1\cdot r_2$ using |
1788 The closed form for the star regular expression involves similar tricks |
1601 the "closed form" of $(r_1 \cdot r_2) \backslash s$ and then |
1789 for the sequence regular expression. |
1602 the property of the $\distinct$ function. |
1790 The $\textit{Suffix}$ function is now replaced by something |
1603 Now we try to get a bound on $r^* \backslash s$ as well. |
1791 slightly more complex, because the growth pattern of star |
1604 Again, we first look at how a star's derivatives evolve, if they grow maximally: |
1792 regular expressions' derivatives is a bit different: |
1605 \begin{center} |
1793 \begin{center} |
1606 |
1794 \begin{tabular}{lclc} |
1607 $r^* \quad \longrightarrow_{\backslash c} \quad (r\backslash c) \cdot r^* \quad \longrightarrow_{\backslash c'} \quad |
1795 $r^* $ & $\longrightarrow_{\backslash c}$ & |
1608 r \backslash cc' \cdot r^* + r \backslash c' \cdot r^* \quad \longrightarrow_{\backslash c''} \quad |
1796 $(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\ |
1609 (r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) \quad \longrightarrow_{\backslash c'''} |
1797 \\ |
1610 \quad \ldots$ |
1798 $r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ & |
1611 |
1799 $\longrightarrow_{\backslash c''}$ & |
|
1800 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
|
1801 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & |
|
1802 $\longrightarrow_{\backslash c'''}$ \\ |
|
1803 \\ |
|
1804 $\ldots$\\ |
|
1805 \end{tabular} |
1612 \end{center} |
1806 \end{center} |
1613 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
1807 When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
1614 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
1808 $r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
1615 the number of terms in $r^* \backslash s$ will grow exponentially, causing the size |
1809 the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly |
1616 of the derivatives $r^* \backslash s$ to grow exponentially, even if we do not |
1810 in the sequence case. |
1617 count the possible size explosions of $r \backslash c$ themselves. |
1811 The good news is that the function $\textit{rsimp}$ will again |
1618 |
1812 ignore the difference between differently nesting patterns of alternatives, |
1619 Thanks to $\rflts$ and $\rDistinct$, we are able to open up regular expressions like |
1813 and the exponentially growing star derivative like |
1620 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
1814 \begin{center} |
1621 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
1815 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
1622 into $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', |
1816 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
1623 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
1817 \end{center} |
1624 and then de-duplicate terms of the form $r\backslash s' \cdot r^*$ ($s'$ being a substring of $s$). |
1818 can be treated as |
|
1819 \begin{center} |
|
1820 $\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', |
|
1821 r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
1822 \end{center} |
|
1823 which can be de-duplicated by $\rDistinct$ |
|
1824 and therefore bounded finitely. |
|
1825 |
|
1826 and then de-duplicate terms of the form ($s'$ being a substring of $s$). |
1625 This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
1827 This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
1626 where the crux is to get an equivalent form of |
1828 |
1627 $\rderssimp{r^*}{s}$ with shape $\rsimp{\sum rs}$. |
1829 Now the crux of this section is finding a suitable description |
1628 This requires generating |
1830 for $rs$ where |
1629 all possible sub-strings $s'$ of $s$ |
1831 \begin{center} |
1630 such that $r\backslash s' \cdot r^*$ will appear |
1832 $\rderssimp{r^*}{s} = \rsimp{\sum rs}$. |
1631 as a term in $(r^*) \backslash s$. |
1833 \end{center} |
1632 The first function we define is a single-step |
1834 holds. |
1633 updating function $\starupdate$, which takes three arguments as input: |
1835 In addition, the list $rs$ |
1634 the new character $c$ to take derivative with, |
1836 shall be in the form of |
1635 the regular expression |
1837 $\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$. |
1636 $r$ directly under the star $r^*$, and the |
1838 The $Ss$ is a list of strings, and for example in the sequence |
1637 list of strings $sSet$ for the derivative $r^* \backslash s$ |
1839 closed form it is specified as $\textit{Suffix} \; s \; r_1$. |
1638 up til this point |
1840 To get $Ss$ for the star regular expression, |
1639 such that $(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ |
1841 we need to introduce $\starupdate$ and $\starupdates$: |
1640 (the equality is not exact, more on this later). |
|
1641 \begin{center} |
1842 \begin{center} |
1642 \begin{tabular}{lcl} |
1843 \begin{tabular}{lcl} |
1643 $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
1844 $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
1644 $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
1845 $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
1645 & & $\textit{if} \; |
1846 & & $\textit{if} \; |
1681 $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
1898 $\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
1682 $\hflat{r}$ & $\dn$ & $r$ |
1899 $\hflat{r}$ & $\dn$ & $r$ |
1683 \end{tabular} |
1900 \end{tabular} |
1684 \end{center} |
1901 \end{center} |
1685 \noindent |
1902 \noindent |
1686 %MAYBE TODO: introduce createdByStar |
1903 These definitions are tailor-made for dealing with alternatives that have |
1687 Again these definitions are tailor-made for dealing with alternatives that have |
1904 originated from a star's derivatives. |
1688 originated from a star's derivatives, so we do not attempt to open up all possible |
1905 A typical star derivative always has the structure of a balanced binary tree: |
1689 regular expressions of the form $\RALTS{rs}$, where $\textit{rs}$ might not contain precisely 2 |
1906 \begin{center} |
1690 elements. |
1907 $(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
1691 We give a predicate for such "star-created" regular expressions: |
1908 (r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
1692 \begin{center} |
1909 \end{center} |
1693 \begin{tabular}{lcr} |
1910 All of the nested structures of alternatives |
1694 & & $\createdByStar{(\RSEQ{ra}{\RSTAR{rb}}) }$\\ |
1911 generated from derivatives are binary, and therefore |
1695 $\createdByStar{r_1} \land \createdByStar{r_2} $ & $ \Longrightarrow$ & $\createdByStar{(r_1 + r_2)}$ |
1912 $\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. |
1696 \end{tabular} |
1913 $\hflat{\_}$ ``untangles'' like the following: |
1697 \end{center} |
1914 \[ |
1698 |
1915 \sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; |
1699 These definitions allows us the flexibility to talk about |
1916 \stackrel{\hflat{\_}}{\longrightarrow} \; |
1700 regular expressions in their most convenient format, |
1917 \RALTS{[r_1, r_2, \ldots, r_n]} |
1701 for example, flattened out $\RALTS{[r_1, r_2, \ldots, r_n]} $ |
1918 \] |
1702 instead of binary-nested: $((r_1 + r_2) + (r_3 + r_4)) + \ldots$. |
1919 Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, |
1703 These definitions help express that certain classes of syntatically |
1920 with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists |
1704 distinct regular expressions are actually the same under simplification. |
1921 and merges each of the element lists to form a flattened list.}: |
1705 This is not entirely true for annotated regular expressions: |
1922 \begin{lemma}\label{stupdateInduct1} |
1706 %TODO: bsimp bders \neq bderssimp |
1923 \mbox |
1707 \begin{center} |
1924 For a list of strings $Ss$, the following hold. |
1708 $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
1925 \begin{itemize} |
1709 \end{center} |
1926 \item |
1710 For bit-codes, the order in which simplification is applied |
1927 If we do a derivative on the terms |
1711 might cause a difference in the location they are placed. |
1928 $r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), |
1712 If we want something like |
1929 the result will be the same as if we apply $\starupdate$ to $Ss$. |
1713 \begin{center} |
1930 \begin{center} |
1714 $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
1931 \begin{tabular}{c} |
1715 \end{center} |
1932 $\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) |
1716 Some "canonicalization" procedure is required, |
1933 \circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; |
1717 which either pushes all the common bitcodes to nodes |
1934 $\\ |
1718 as senior as possible: |
1935 \\ |
1719 \begin{center} |
1936 $=$ \\ |
1720 $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
1937 \\ |
1721 \end{center} |
1938 $\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; |
1722 or does the reverse. However bitcodes are not of interest if we are talking about |
1939 (\starupdate \; x \; r \; Ss)$ |
1723 the $\llbracket r \rrbracket$ size of a regex. |
1940 \end{tabular} |
1724 Therefore for the ease and simplicity of producing a |
1941 \end{center} |
1725 proof for a size bound, we are happy to restrict ourselves to |
1942 \item |
1726 unannotated regular expressions, and obtain such equalities as |
1943 $\starupdates$ is ``composable'' w.r.t a derivative. |
1727 \begin{lemma} |
1944 It piggybacks the character $x$ to the tail of the string |
1728 $\rsimp{r_1 + r_2} = \rsimp{\RALTS{\hflataux{r_1} @ \hflataux{r_2}}}$ |
1945 $xs$. |
|
1946 \begin{center} |
|
1947 \begin{tabular}{c} |
|
1948 $\textit{concat} \; (\map \; \hflataux{\_} \; |
|
1949 (\map \; (\_\backslash_r x) \; |
|
1950 (\map \; (\lambda s.\;\; (r \backslash_r s) \cdot |
|
1951 (r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ |
|
1952 \\ |
|
1953 $=$\\ |
|
1954 \\ |
|
1955 $\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; |
|
1956 (\starupdates \; (xs @ [x]) \; r \; Ss)$ |
|
1957 \end{tabular} |
|
1958 \end{center} |
|
1959 \end{itemize} |
|
1960 \end{lemma} |
|
1961 |
|
1962 \begin{proof} |
|
1963 Part 1 is by induction on $Ss$. |
|
1964 Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. |
|
1965 \end{proof} |
|
1966 |
|
1967 |
|
1968 Like $\textit{createdBySequence}$, we need |
|
1969 a predicate for ``star-created'' regular expressions: |
|
1970 \begin{center} |
|
1971 \begin{mathpar} |
|
1972 \inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } |
|
1973 |
|
1974 \inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } |
|
1975 \end{mathpar} |
|
1976 \end{center} |
|
1977 \noindent |
|
1978 All regular expressions created by taking derivatives of |
|
1979 $r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: |
|
1980 \begin{lemma}\label{starDersCbs} |
|
1981 $\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. |
|
1982 \end{lemma} |
|
1983 \begin{proof} |
|
1984 By a reverse induction on $s$. |
|
1985 \end{proof} |
|
1986 If a regular expression conforms to the shape of a star's derivative, |
|
1987 then we can push an application of $\hflataux{\_}$ inside a derivative of it: |
|
1988 \begin{lemma}\label{hfauPushin} |
|
1989 If $\textit{createdByStar} \; r$ holds, then |
|
1990 \begin{center} |
|
1991 $\hflataux{r \backslash_r c} = \textit{concat} \; ( |
|
1992 \map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ |
|
1993 \end{center} |
|
1994 holds. |
|
1995 \end{lemma} |
|
1996 \begin{proof} |
|
1997 By an induction on the inductivev cases of $\textit{createdByStar}$. |
|
1998 \end{proof} |
|
1999 %This is not entirely true for annotated regular expressions: |
|
2000 %%TODO: bsimp bders \neq bderssimp |
|
2001 %\begin{center} |
|
2002 % $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
2003 %\end{center} |
|
2004 %For bit-codes, the order in which simplification is applied |
|
2005 %might cause a difference in the location they are placed. |
|
2006 %If we want something like |
|
2007 %\begin{center} |
|
2008 % $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
2009 %\end{center} |
|
2010 %Some "canonicalization" procedure is required, |
|
2011 %which either pushes all the common bitcodes to nodes |
|
2012 %as senior as possible: |
|
2013 %\begin{center} |
|
2014 % $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
2015 %\end{center} |
|
2016 %or does the reverse. However bitcodes are not of interest if we are talking about |
|
2017 %the $\llbracket r \rrbracket$ size of a regex. |
|
2018 %Therefore for the ease and simplicity of producing a |
|
2019 %proof for a size bound, we are happy to restrict ourselves to |
|
2020 %unannotated regular expressions, and obtain such equalities as |
|
2021 %TODO: rsimp sflat |
|
2022 % The simplification of a flattened out regular expression, provided it comes |
|
2023 %from the derivative of a star, is the same as the one nested. |
|
2024 |
|
2025 |
|
2026 |
|
2027 Now we introduce an inductive property |
|
2028 for $\starupdate$ and $\hflataux{\_}$. |
|
2029 \begin{lemma}\label{starHfauInduct} |
|
2030 If we do derivatives of $r^*$ |
|
2031 with a string that starts with $c$, |
|
2032 then flatten it out, |
|
2033 we obtain a list |
|
2034 of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, |
|
2035 where $sS = \starupdates \; s \; r \; [[c]]$. Namely, |
|
2036 \begin{center} |
|
2037 $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = |
|
2038 \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
|
2039 (\starupdates \; s \; r_0 \; [[c]])$ |
|
2040 \end{center} |
|
2041 holds. |
|
2042 \end{lemma} |
|
2043 \begin{proof} |
|
2044 By an induction on $s$, the inductive cases |
|
2045 being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
|
2046 \end{proof} |
|
2047 \noindent |
|
2048 |
|
2049 $\hflataux{\_}$ has a similar effect as $\textit{flatten}$: |
|
2050 \begin{lemma}\label{hflatauxGrewrites} |
|
2051 $a :: rs \grewrites \hflataux{a} @ rs$ |
|
2052 \end{lemma} |
|
2053 \begin{proof} |
|
2054 By induction on $a$. $rs$ is set to take arbitrary values. |
|
2055 \end{proof} |
|
2056 It is also not surprising that $\textit{rsimp}$ will cover |
|
2057 the effect of $\hflataux{\_}$: |
|
2058 \begin{lemma}\label{cbsHfauRsimpeq1} |
|
2059 $\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ |
1729 \end{lemma} |
2060 \end{lemma} |
1730 |
2061 |
1731 \begin{proof} |
2062 \begin{proof} |
1732 By using the rewriting relation $\rightsquigarrow$ |
2063 By using the rewriting relation $\rightsquigarrow$ |
1733 \end{proof} |
2064 \end{proof} |
1734 %TODO: rsimp sflat |
|
1735 And from this we obtain a proof that a star's derivative will be the same |
2065 And from this we obtain a proof that a star's derivative will be the same |
1736 as if it had all its nested alternatives created during deriving being flattened out: |
2066 as if it had all its nested alternatives created during deriving being flattened out: |
1737 For example, |
2067 For example, |
1738 \begin{lemma} |
2068 \begin{lemma}\label{hfauRsimpeq2} |
1739 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
2069 $\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
1740 \end{lemma} |
2070 \end{lemma} |
1741 \begin{proof} |
2071 \begin{proof} |
1742 By structural induction on $r$, where the induction rules are these of $\createdByStar{_}$. |
2072 By structural induction on $r$, where the induction rules |
1743 \end{proof} |
2073 are these of $\createdByStar{\_}$. |
1744 % The simplification of a flattened out regular expression, provided it comes |
2074 Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
1745 %from the derivative of a star, is the same as the one nested. |
2075 \end{proof} |
1746 |
2076 |
1747 |
2077 |
1748 |
2078 %Here is a corollary that states the lemma in |
1749 We first introduce an inductive property |
2079 %a more intuitive way: |
1750 for $\starupdate$ and $\hflataux{\_}$, |
2080 %\begin{corollary} |
1751 it says if we do derivatives of $r^*$ |
2081 % $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
1752 with a string that starts with $c$, |
2082 % (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
1753 then flatten it out, |
2083 %\end{corollary} |
1754 we obtain a list |
2084 %\noindent |
1755 of the shape $\sum_{s' \in sSet} (r\backslash_r s') \cdot r^*$, |
2085 %Note that this is also agnostic of the simplification |
1756 where $sSet = \starupdates \; s \; r \; [[c]]$. |
2086 %function we defined, and is therefore of more general interest. |
1757 \begin{lemma}\label{starHfauInduct} |
2087 |
1758 $\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} = |
|
1759 \map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
|
1760 (\starupdates \; s \; r_0 \; [[c]])$ |
|
1761 \end{lemma} |
|
1762 \begin{proof} |
|
1763 By an induction on $s$, the inductive cases |
|
1764 being $[]$ and $s@[c]$. |
|
1765 \end{proof} |
|
1766 \noindent |
|
1767 Here is a corollary that states the lemma in |
|
1768 a more intuitive way: |
|
1769 \begin{corollary} |
|
1770 $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
|
1771 (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
|
1772 \end{corollary} |
|
1773 \noindent |
|
1774 Note that this is also agnostic of the simplification |
|
1775 function we defined, and is therefore of more general interest. |
|
1776 |
|
1777 Now adding the $\rsimp{}$ bit for closed forms, |
|
1778 we have |
|
1779 \begin{lemma} |
|
1780 $a :: rs \grewrites \hflataux{a} @ rs$ |
|
1781 \end{lemma} |
|
1782 \noindent |
|
1783 giving us |
|
1784 \begin{lemma}\label{cbsHfauRsimpeq1} |
|
1785 $\rsimp{a+b} = \rsimp{(\sum \hflataux{a} @ \hflataux{b})}$. |
|
1786 \end{lemma} |
|
1787 \noindent |
|
1788 This yields |
|
1789 \begin{lemma}\label{hfauRsimpeq2} |
|
1790 $\rsimp{r} = \rsimp{(\sum \hflataux{r})}$ |
|
1791 \end{lemma} |
|
1792 \noindent |
|
1793 Together with the rewriting relation |
2088 Together with the rewriting relation |
1794 \begin{lemma}\label{starClosedForm6Hrewrites} |
2089 \begin{lemma}\label{starClosedForm6Hrewrites} |
1795 $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
2090 We have the following set of rewriting relations or equalities: |
1796 \scfrewrites |
2091 \begin{itemize} |
1797 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
2092 \item |
1798 \end{lemma} |
2093 $\textit{rsimp} \; (r^* \backslash_r (c::s)) |
1799 \noindent |
2094 \sequal |
1800 We obtain the closed form for star regular expression: |
2095 \sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( |
1801 \begin{lemma}\label{starClosedForm} |
2096 \starupdates \; s \; r \; [ c::[]] ) ) )$ |
|
2097 \item |
|
2098 $r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( ( |
|
2099 \sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; |
|
2100 (\starupdates \;s \; r \; [ c::[] ])))))$ |
|
2101 \item |
|
2102 $\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) |
|
2103 \sequal |
|
2104 \sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; |
|
2105 r^*) \; Ss) )$ |
|
2106 \item |
|
2107 $\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
|
2108 \scfrewrites |
|
2109 \map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
|
2110 \item |
|
2111 $( ( \sum ( ( \map \ (\lambda s. \;\; |
|
2112 (\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; |
|
2113 s \; r \; [ c::[] ])))))$\\ |
|
2114 $\sequal$\\ |
|
2115 $( ( \sum ( ( \map \ (\lambda s. \;\; |
|
2116 ( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \; |
|
2117 s \; r \; [ c::[] ]))))$\\ |
|
2118 \end{itemize} |
|
2119 \end{lemma} |
|
2120 \begin{proof} |
|
2121 Part 1 leads to part 2. |
|
2122 The rest of them are routine. |
|
2123 \end{proof} |
|
2124 \noindent |
|
2125 Next the closed form for star regular expressions can be derived: |
|
2126 \begin{theorem}\label{starClosedForm} |
1802 $\rderssimp{r^*}{c::s} = |
2127 $\rderssimp{r^*}{c::s} = |
1803 \rsimp{ |
2128 \rsimp{ |
1804 (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
2129 (\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
1805 (\starupdates \; s\; r \; [[c]]) |
2130 (\starupdates \; s\; r \; [[c]]) |
1806 ) |
2131 ) |
1807 ) |
2132 ) |
1808 } |
2133 } |
1809 $ |
2134 $ |
1810 \end{lemma} |
2135 \end{theorem} |
1811 \begin{proof} |
2136 \begin{proof} |
1812 By an induction on $s$. |
2137 By an induction on $s$. |
1813 The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, and \ref{hfauRsimpeq2} |
2138 The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} |
|
2139 and \ref{hfauRsimpeq2} |
1814 are used. |
2140 are used. |
|
2141 In \ref{starClosedForm6Hrewrites}, the equalities are |
|
2142 used to link the LHS and RHS. |
1815 \end{proof} |
2143 \end{proof} |
1816 |
2144 |
1817 |
2145 |
1818 |
2146 |
1819 |
2147 |
1879 %\end{center} |
2207 %\end{center} |
1880 \section{Bounding Closed Forms} |
2208 \section{Bounding Closed Forms} |
1881 |
2209 |
1882 In this section, we introduce how we formalised the bound |
2210 In this section, we introduce how we formalised the bound |
1883 on closed forms. |
2211 on closed forms. |
1884 We first prove that functions such as $\rflts$ |
2212 We first show that in general regular expressions up to a certain |
|
2213 size are finite. |
|
2214 Then we prove that functions such as $\rflts$ |
1885 will not cause the size of r-regular expressions to grow. |
2215 will not cause the size of r-regular expressions to grow. |
1886 Putting this together with a general bound |
2216 Putting this together with a general bound |
1887 on the finiteness of distinct regular expressions |
2217 on the finiteness of distinct regular expressions |
1888 smaller than a certain size, we obtain a bound on |
2218 up to a certain size, we obtain a bound on |
1889 the closed forms. |
2219 the closed forms. |
|
2220 |
|
2221 \subsection{Finiteness of Distinct Regular Expressions} |
|
2222 We define the set of regular expressions whose size are no more than |
|
2223 a certain size $N$ as $\textit{sizeNregex} \; N$: |
|
2224 \[ |
|
2225 \textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \} |
|
2226 \] |
|
2227 We have that $\textit{sizeNregex} \; N$ is always a finite set: |
|
2228 \begin{lemma}\label{finiteSizeN} |
|
2229 $\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. |
|
2230 \end{lemma} |
|
2231 \begin{proof} |
|
2232 By splitting the set $\textit{sizeNregex} \; (N + 1)$ into |
|
2233 subsets by their categories: |
|
2234 $\{\ZERO, \ONE, c\}$, $\{* `` \textit{sizeNregex} \; N\}$, |
|
2235 and so on. Each of these subsets are finitely bounded. |
|
2236 \end{proof} |
|
2237 \noindent |
|
2238 From this we get a corollary that |
|
2239 if forall $r \in rs$, $\rsize{r} \leq N$, then the output of |
|
2240 $\rdistinct{rs}{\varnothing}$ is a list of regular |
|
2241 expressions of finite size depending on $N$ only. |
|
2242 \begin{corollary}\label{finiteSizeNCorollary} |
|
2243 $\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds, |
|
2244 where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \; |
|
2245 N)$. |
|
2246 \end{corollary} |
|
2247 \begin{proof} |
|
2248 For all $r$ in |
|
2249 $\textit{set} \; (\rdistinct{rs}{\varnothing})$, |
|
2250 it is always the case that $\rsize{r} \leq N$. |
|
2251 In addition, the list length is bounded by |
|
2252 $c_N$, yielding the desired bound. |
|
2253 \end{proof} |
|
2254 \noindent |
|
2255 This fact will be handy in estimating the closed form sizes. |
|
2256 %We have proven that the size of the |
|
2257 %output of $\textit{rdistinct} \; rs' \; \varnothing$ |
|
2258 %is bounded by a constant $N * c_N$ depending only on $N$, |
|
2259 %provided that each of $rs'$'s element |
|
2260 %is bounded by $N$. |
1890 |
2261 |
1891 \subsection{$\textit{rsimp}$ Does Not Increment the Size} |
2262 \subsection{$\textit{rsimp}$ Does Not Increment the Size} |
1892 Although it seems evident, we need a series |
2263 Although it seems evident, we need a series |
1893 of non-trivial lemmas to establish that functions such as $\rflts$ |
2264 of non-trivial lemmas to establish that functions such as $\rflts$ |
1894 do not cause the regular expressions to grow. |
2265 do not cause the regular expressions to grow. |
2295 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
2627 Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
2296 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
2628 inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
2297 $f(x) = x * 2^x$. |
2629 $f(x) = x * 2^x$. |
2298 This means the bound we have will surge up at least |
2630 This means the bound we have will surge up at least |
2299 tower-exponentially with a linear increase of the depth. |
2631 tower-exponentially with a linear increase of the depth. |
2300 For a regex of depth $n$, the bound |
2632 |
2301 would be approximately $4^n$. |
2633 One might be quite skeptical about what this non-elementary |
2302 |
2634 bound can bring us. |
2303 Test data in the graphs from randomly generated regular expressions |
2635 It turns out that the giant bounds are far from being hit. |
2304 shows that the giant bounds are far from being hit. |
2636 Here we have some test data from randomly generated regular expressions: |
2305 %a few sample regular experessions' derivatives |
2637 \begin{figure}[H] |
2306 %size change |
2638 \begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}} |
2307 %TODO: giving regex1_size_change.data showing a few regular expressions' size changes |
|
2308 %w;r;t the input characters number, where the size is usually cubic in terms of original size |
|
2309 %a*, aa*, aaa*, ..... |
|
2310 %randomly generated regular expressions |
|
2311 \begin{figure}{H} |
|
2312 \begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
|
2313 \begin{tikzpicture} |
2639 \begin{tikzpicture} |
2314 \begin{axis}[ |
2640 \begin{axis}[ |
2315 xlabel={number of characters}, |
2641 xlabel={$n$}, |
2316 x label style={at={(1.05,-0.05)}}, |
2642 x label style={at={(1.05,-0.05)}}, |
2317 ylabel={regex size}, |
2643 ylabel={regex size}, |
2318 enlargelimits=false, |
2644 enlargelimits=false, |
2319 xtick={0,5,...,30}, |
2645 xtick={0,5,...,30}, |
2320 xmax=33, |
2646 xmax=33, |
2321 %ymax=1000, |
2647 %ymax=1000, |
2322 %ytick={0,100,...,1000}, |
2648 %ytick={0,100,...,1000}, |
2323 scaled ticks=false, |
2649 scaled ticks=false, |
2324 axis lines=left, |
2650 axis lines=left, |
2325 width=5cm, |
2651 width=4.75cm, |
2326 height=4cm, |
2652 height=3.8cm, |
2327 legend entries={regex1}, |
2653 legend entries={regex1}, |
2328 legend pos=north west, |
2654 legend pos=north east, |
2329 legend cell align=left] |
2655 legend cell align=left] |
2330 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
2656 \addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
2331 \end{axis} |
2657 \end{axis} |
2332 \end{tikzpicture} |
2658 \end{tikzpicture} |
2333 & |
2659 & |
2334 \begin{tikzpicture} |
2660 \begin{tikzpicture} |
2335 \begin{axis}[ |
2661 \begin{axis}[ |
2336 xlabel={$n$}, |
2662 xlabel={$n$}, |
2337 x label style={at={(1.05,-0.05)}}, |
2663 x label style={at={(1.05,-0.05)}}, |
2338 %ylabel={time in secs}, |
2664 %ylabel={time in secs}, |
2362 xmax=33, |
2688 xmax=33, |
2363 %ymax=1000, |
2689 %ymax=1000, |
2364 %ytick={0,100,...,1000}, |
2690 %ytick={0,100,...,1000}, |
2365 scaled ticks=false, |
2691 scaled ticks=false, |
2366 axis lines=left, |
2692 axis lines=left, |
2367 width=5cm, |
2693 width=4.75cm, |
2368 height=4cm, |
2694 height=3.8cm, |
2369 legend entries={regex3}, |
2695 legend entries={regex3}, |
2370 legend pos=north west, |
2696 legend pos=south east, |
2371 legend cell align=left] |
2697 legend cell align=left] |
2372 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
2698 \addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
2373 \end{axis} |
2699 \end{axis} |
2374 \end{tikzpicture}\\ |
2700 \end{tikzpicture}\\ |
2375 \multicolumn{3}{c}{Graphs: size change of 3 randomly generated regular expressions $w.r.t.$ input string length.} |
2701 \multicolumn{3}{c}{} |
2376 \end{tabular} |
2702 \end{tabular} |
|
2703 \caption{Graphs: size change of 3 randomly generated |
|
2704 regular expressions $w.r.t.$ input string length. |
|
2705 The x axis represents the length of input.} |
2377 \end{figure} |
2706 \end{figure} |
2378 \noindent |
2707 \noindent |
2379 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2708 Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
2380 original size. |
2709 original size. |
2381 We will discuss improvements to this bound in the next chapter. |
2710 We will discuss improvements to this bound in the next chapter. |
2382 |
2711 |
2383 |
2712 |
2384 |
2713 |
2385 \section{Possible Further Improvements} |
2714 \subsection{Possible Further Improvements} |
2386 There are two problems with this finiteness result, though. |
2715 There are two problems with this finiteness result, though.\\ |
2387 \begin{itemize} |
2716 (i) |
2388 \item |
2717 First, it is not yet a direct formalisation of our lexer's complexity, |
2389 First, It is not yet a direct formalisation of our lexer's complexity, |
|
2390 as a complexity proof would require looking into |
2718 as a complexity proof would require looking into |
2391 the time it takes to execute {\bf all} the operations |
2719 the time it takes to execute {\bf all} the operations |
2392 involved in the lexer (simp, collect, decode), not just the derivative. |
2720 involved in the lexer (simp, collect, decode), not just the derivative.\\ |
2393 \item |
2721 (ii) |
2394 Second, the bound is not yet tight, and we seek to improve $N_a$ so that |
2722 Second, the bound is not yet tight, and we seek to improve $N_a$ so that |
2395 it is polynomial on $\llbracket a \rrbracket$. |
2723 it is polynomial on $\llbracket a \rrbracket$.\\ |
2396 \end{itemize} |
2724 Still, we believe this contribution is useful, |
2397 Still, we believe this contribution is fruitful, |
|
2398 because |
2725 because |
2399 \begin{itemize} |
2726 \begin{itemize} |
2400 \item |
2727 \item |
2401 |
2728 |
2402 The size proof can serve as a cornerstone for a complexity |
2729 The size proof can serve as a starting point for a complexity |
2403 formalisation. |
2730 formalisation. |
2404 Derivatives are the most important phases of our lexer algorithm. |
2731 Derivatives are the most important phases of our lexer algorithm. |
2405 Size properties about derivatives covers the majority of the algorithm |
2732 Size properties about derivatives covers the majority of the algorithm |
2406 and is therefore a good indication of complexity of the entire program. |
2733 and is therefore a good indication of complexity of the entire program. |
2407 \item |
2734 \item |
2454 \addplot[mark=*,blue] table {re-chengsong.data}; |
2784 \addplot[mark=*,blue] table {re-chengsong.data}; |
2455 \end{axis} |
2785 \end{axis} |
2456 \end{tikzpicture} |
2786 \end{tikzpicture} |
2457 \end{center} |
2787 \end{center} |
2458 |
2788 |
2459 For convenience we use $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
2789 For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
2460 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2790 to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
2461 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
2791 (\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
2462 The exponential size is triggered by that the regex |
2792 The exponential size is triggered by that the regex |
2463 $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
2793 $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
2464 inside the $(\ldots) ^*$ having exponentially many |
2794 inside the $(\ldots) ^*$ having exponentially many |
2465 different derivatives, despite those difference being minor. |
2795 different derivatives, despite those difference being minor. |
2466 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
2796 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
2467 will therefore contain the following terms (after flattening out all nested |
2797 will therefore contain the following terms (after flattening out all nested |
2468 alternatives): |
2798 alternatives): |
2469 \begin{center} |
2799 \begin{center} |
2470 $(\oplus_{i = 1]{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* })\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
2800 $(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
|
2801 [1mm] |
2471 $(1 \leq m' \leq m )$ |
2802 $(1 \leq m' \leq m )$ |
2472 \end{center} |
2803 \end{center} |
2473 These terms are distinct for $m' \leq L.C.M.(1, \ldots, n)$ (will be explained in appendix). |
2804 There at at least exponentially |
|
2805 many such terms.\footnote{To be exact, these terms are |
|
2806 distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted, |
|
2807 but the point is that the number is exponential. |
|
2808 } |
2474 With each new input character taking the derivative against the intermediate result, more and more such distinct |
2809 With each new input character taking the derivative against the intermediate result, more and more such distinct |
2475 terms will accumulate, |
2810 terms will accumulate. |
2476 until the length reaches $L.C.M.(1, \ldots, n)$. |
2811 The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
2477 $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
2812 \begin{center} |
2478 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2813 $(\sum_{i = 1}^{n} |
2479 |
2814 (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot |
2480 $(\oplus_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
2815 (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot |
2481 where $m' \neq m''$ \\ |
2816 (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
2817 $(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot |
|
2818 (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot |
|
2819 (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
|
2820 \end{center} |
|
2821 \noindent |
|
2822 where $m' \neq m''$ |
2482 as they are slightly different. |
2823 as they are slightly different. |
2483 This means that with our current simplification methods, |
2824 This means that with our current simplification methods, |
2484 we will not be able to control the derivative so that |
2825 we will not be able to control the derivative so that |
2485 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial %\leq O((\llbracket r\rrbacket)^c)$ |
2826 $\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$ |
2486 as there are already exponentially many terms. |
|
2487 These terms are similar in the sense that the head of those terms |
2827 These terms are similar in the sense that the head of those terms |
2488 are all consisted of sub-terms of the form: |
2828 are all consisted of sub-terms of the form: |
2489 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
2829 $(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
2490 For $\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
2830 For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
2491 $n * (n + 1) / 2$ such terms. |
2831 $n * (n + 1) / 2$ such terms. |
2492 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
2832 For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
2493 can be described by 6 terms: |
2833 can be described by 6 terms: |
2494 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
2834 $a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
2495 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
2835 $aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
2496 The total number of different "head terms", $n * (n + 1) / 2$, |
2836 The total number of different "head terms", $n * (n + 1) / 2$, |
2497 is proportional to the number of characters in the regex |
2837 is proportional to the number of characters in the regex |
2498 $(\oplus_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
2838 $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
2499 This suggests a slightly different notion of size, which we call the |
2839 If we can improve our deduplication process so that it becomes smarter |
|
2840 and only keep track of these $n * (n+1) /2$ terms, then we can keep |
|
2841 the size growth polynomial again. |
|
2842 This example also suggests a slightly different notion of size, which we call the |
2500 alphabetic width: |
2843 alphabetic width: |
2501 %TODO: |
2844 \begin{center} |
2502 (TODO: Alphabetic width def.) |
2845 \begin{tabular}{lcl} |
|
2846 $\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\ |
|
2847 $\textit{awidth} \; \ONE$ & $\dn$ & $0$\\ |
|
2848 $\textit{awidth} \; c$ & $\dn$ & $1$\\ |
|
2849 $\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; |
|
2850 r_1 + \textit{awidth} \; r_2$\\ |
|
2851 $\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \; |
|
2852 r_1 + \textit{awidth} \; r_2$\\ |
|
2853 $\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\ |
|
2854 \end{tabular} |
|
2855 \end{center} |
|
2856 |
2503 |
2857 |
2504 |
2858 |
2505 Antimirov\parencite{Antimirov95} has proven that |
2859 Antimirov\parencite{Antimirov95} has proven that |
2506 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$. |
2860 $\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$, |
2507 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
2861 where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
2508 created by doing derivatives of $r$ against all possible strings. |
2862 created by doing derivatives of $r$ against all possible strings. |
2509 If we can make sure that at any moment in our lexing algorithm our |
2863 If we can make sure that at any moment in our lexing algorithm our |
2510 intermediate result hold at most one copy of each of the |
2864 intermediate result hold at most one copy of each of the |
2511 subterms then we can get the same bound as Antimirov's. |
2865 subterms then we can get the same bound as Antimirov's. |