% author: Chengsong
% date: Mon, 10 Jul 2023 19:29:22 +0100
% changeset 664: ba44144875b1
% Chapter Template

\chapter{Finiteness Bound} % Main chapter title

\label{Finite}
% In Chapter 4 \ref{Chapter4} we give the second guarantee
%of our bitcoded algorithm, that is a finite bound on the size of any
%regex's derivatives.
%(this is chapter 5 now)

In this chapter we give a bound on the size of
the calculated derivatives:
given an annotated regular expression $a$, for any string $s$
the derivatives computed by our algorithm $\blexersimp$
are bounded in size
by a constant that only depends on $a$.
Formally we show that there exists a constant integer $N_a$ such that
\begin{center}
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{center}
\noindent
where the size ($\llbracket \_ \rrbracket$) of
an annotated regular expression is defined
in terms of the number of nodes in its
tree structure (its recursive definition is given in Figure~\ref{brexpSize}).
We believe this size bound
is important in the context of POSIX lexing because
\marginpar{Addressing Gerog comment: ``how does this relate to backtracking?''}
\begin{itemize}
\item
It is a stepping stone towards the goal
of eliminating ``catastrophic backtracking''.
The derivative-based lexing algorithm avoids backtracking
by trading space for time.
Backtracking algorithms
save other possibilities on a stack when exploring one possible
path of matching. Catastrophic backtracking typically occurs
when the number of steps increases exponentially with respect
to the input length. In other words, the runtime is $O((c_r)^n)$ in the input
string length $n$, where the base $c_r$ of the exponent is determined by the
regular expression $r$.
%so that they
%can be traversed in the future in a DFS manner,
%different matchings are stored as sub-expressions
%in a regular expression derivative.
Derivatives save these possibilities as sub-expressions
and traverse those during future derivatives. If we denote the size
of intermediate derivatives as $S_{r,n}$ (where the subscripts
$r,n$ indicate that $S$ depends on them), then the runtime of
derivative-based approaches would be $O(S_{r,n} * n)$.
We observe that if $S_{r,n}$ continuously grows with $n$ (for example
growing exponentially fast), then this
is just as bad as catastrophic backtracking.
Our finiteness bound seeks to find a constant integer
upper bound $C$ (which in our case is $N_a$ where $a = r^\uparrow$) of $S_{r,n}$,
so that the complexity of the algorithm can be seen as linear ($O(C * n)$).
Even if $C$ is still large in our current work, it is a constant
rather than a number that keeps increasing with the input length $n$.
More importantly, this constant $C$ can potentially
be reduced as we optimise our simplification procedure.
%and showing the potential
%improvements can be by the notion of partial derivatives.

%If the internal data structures used by our algorithm
%grows beyond a finite bound, then clearly
%the algorithm (which traverses these structures) will
%be slow.
%The next step is to refine the bound $N_a$ so that it
%is not just finite but polynomial in $\llbracket a\rrbracket$.
\item
Having the finite bound formalised
gives us higher confidence that
our simplification algorithm $\simp$ does not ``misbehave''
like $\textit{simpSL}$ does.
The bound is universal for a given regular expression,
which is an advantage over work which
only gives empirical evidence on
some test cases (see for example the Verbatim++ work \cite{Verbatimpp}).
\end{itemize}
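\noindent
The first point can be made concrete with a rough calculation. It is only a sketch,
under the illustrative assumption (not part of the formalisation) that the $i$-th
derivative step costs time proportional to the size $S_{r,i}$ of the derivative it traverses.
If $S_{r,i} \leq C$ for all $i$, then the total work on an input of length $n$ is
\[
\sum_{i=1}^{n} S_{r,i} \;\leq\; \sum_{i=1}^{n} C \;=\; C * n,
\]
whereas if $S_{r,i}$ grows like $c^i$ for some $c > 1$, the same sum becomes
\[
\sum_{i=1}^{n} c^i \;=\; \frac{c^{n+1} - c}{c - 1},
\]
which is exponential in $n$.
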
\noindent
We then extend our $\blexersimp$
to support bounded repetitions ($r^{\{n\}}$).
We update our formalisation of
the correctness and finiteness properties to
include this new construct.
We show that we can out-compete other verified lexers such as
Verbatim++ on bounded regular expressions.

In the next section we describe in more detail
what the finite bound means in our algorithm
and why the size of the internal data structures of
a typical derivative-based lexer such as
Sulzmann and Lu's needs formal treatment.

\section{Formalising Size Bound of Derivatives}
\noindent
In our lexer ($\blexersimp$),
we take an annotated regular expression as input,
and repeatedly take its derivative and simplify the result.
\begin{figure}
\begin{center}
\begin{tabular}{lcl}
$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\
$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\
$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\textit{sum}\;(\map \; (\llbracket \_ \rrbracket)\; as) + 1$\\
$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$.
\end{tabular}
\end{center}
\caption{The size function of bitcoded regular expressions}\label{brexpSize}
\end{figure}
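
\noindent
As a small illustration of this size function, consider the sequence
${}_{bs}({}_{bs_1}\mathbf{c} \cdot {}_{bs_2}({}_{bs_3}\mathbf{d})^*)$.
The expression is chosen only for illustration, and its bitcodes
$bs, bs_1, bs_2, bs_3$ are arbitrary.
Unfolding the definition gives
\[
\begin{array}{lcl}
\llbracket {}_{bs}({}_{bs_1}\mathbf{c} \cdot {}_{bs_2}({}_{bs_3}\mathbf{d})^*) \rrbracket
& = & \llbracket {}_{bs_1}\mathbf{c} \rrbracket + \llbracket {}_{bs_2}({}_{bs_3}\mathbf{d})^* \rrbracket + 1\\
& = & 1 + (\llbracket {}_{bs_3}\mathbf{d} \rrbracket + 1) + 1\\
& = & 1 + (1 + 1) + 1 \;=\; 4,
\end{array}
\]
which is the number of nodes in the tree (one sequence, one star and two characters).
The bitcodes play no role in the result.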

\begin{figure}
\begin{tikzpicture}[scale=2,
every node/.style={minimum size=11mm},
->,>=stealth',shorten >=1pt,auto,thick
]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$};

\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$};

\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$};
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$};

\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$};
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode};
\end{tikzpicture}
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks}
\end{figure}

\noindent
Each time
a derivative is taken, the regular expression might grow.
However, the simplification that is applied immediately afterwards will often shrink it so that
the overall size of the derivatives stays relatively small.
This intuition is depicted by the relative size
change between the black and blue nodes in Figure~\ref{simpShrinks}:
after $\simp$ the node shrinks.
Our proof states that all the blue nodes
stay below a size bound $N_a$ determined by the input $a$.
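
\noindent
The following hand-worked sketch illustrates this picture. It is stated on plain regular
expressions, ignoring bitcodes, and assumes the usual derivative rules together with
simplifications that remove $\ONE \cdot r$, flatten nested alternatives and delete
duplicates; the exact rules of $\simp$ may differ in detail.
Take $r = c^* \cdot c^*$, which has size $5$. One derivative step gives
\[
(c^* \cdot c^*) \backslash c \;=\; (\ONE \cdot c^*) \cdot c^* + \ONE \cdot c^*,
\]
of size $12$, and simplification shrinks it to $c^* \cdot c^* + c^*$ of size $8$.
Taking the derivative with respect to $c$ again and simplifying yields
\[
((\ONE \cdot c^*) \cdot c^* + \ONE \cdot c^*) + \ONE \cdot c^*
\;\longrightarrow\; c^* \cdot c^* + c^* + c^*
\;\longrightarrow\; c^* \cdot c^* + c^*,
\]
so the simplified derivative stays at size $8$ however many further $c$s are read.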

\noindent
Sulzmann and Lu assumed a similar picture of their algorithm,
though in fact the sizes in their algorithm might be better depicted by the following graph:
\begin{figure}[H]
\begin{tikzpicture}[scale=2,
every node/.style={minimum size=11mm},
->,>=stealth',shorten >=1pt,auto,thick
]
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$};
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$};
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$};

\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$};
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$};

\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$};
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$};

\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$};
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$};

\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$};
\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$};

\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$};
\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$};

\node (rnn) [right = of rns, minimum size = 1mm]{};
\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$};

\end{tikzpicture}
\caption{Regular expression size change during Sulzmann and Lu's algorithm}\label{sulzShrinks}
\end{figure}
\noindent
The picture means that in some cases their lexer (where they use $\simpsulz$
as the simplification function)
will have a size explosion, causing the running time
of each derivative step to grow continuously (for example
in \ref{SulzmannLuLexerTime}).
They tested the run time of their
lexer on particular examples such as $(a+b+ab)^*$
and claimed that their algorithm is linear w.r.t.\ the input.
With our mechanised proof, we avoid this type of unintentional
generalisation.

Before delving into the details of the formalisation,
we are going to provide an overview of it in the following subsection.


\subsection{Overview of the Proof}
\marginpar{trying to make it more intuitive
and provide more insights into proof}
The most important idea in this chapter %intuition
is what we call the ``closed forms'' of
regular expression derivatives with respect to strings.
In short, a closed form allows us to express $r \backslash_{rsimps} s$
as a different recursive function so that the induction on the size bound can go through.
A simple induction on $s$ or $r$ fails for $r\backslash_{rsimps} s$, but
works for $\textit{ClosedForm}(r,s)$.



Assume we have a regular expression $r$, be it an alternative,
a sequence or a star. The idea is that if we try to take several derivatives
of it on paper, we end up getting a list of subexpressions,
something like
%omitting certain
%nested structures of those expressions:
\[
r\backslash s = r_1 + r_2 + r_3 + \ldots + r_n,
\]
if we omit the way these regular expressions need to be nested,
where each $r_i$ ($i \in \{1, \ldots, n\}$) is related to some fragments
of $r$ and $s$.
The second important observation is that the list %of regular expressions
$[r_1, \ldots, r_n]$ %is not
cannot grow indefinitely because its elements all come from $r$, and the derivatives
of the same regular expression are finite up to some isomorphisms.
We prove that the simplifications of $\blexersimp$ %make use of
are powerful enough to counteract the effect of the nested structure of alternatives
and to eliminate duplicates
such that indeed the list in $a\backslash s$ does not grow unboundedly.
We call the precise formalisation for the shape of
\[
r_1 + r_2 + r_3 + \ldots + r_n
\]
``closed form''.
The name was chosen because turning the recursive relation
\[
a \backslash_{bsimps} (c\!::\!s) \dn (\textit{bsimp} \; (a\backslash c)) \backslash_{bsimps} s
\]
into some easier-to-estimate forms
like
\[
\sum (a_1\backslash s \cdot a_2) :: (\map \; (a_2\backslash\_) \;
(\textit{Suffix} \; s \; a_1))
%\backslash_{bsimp
\]
was reminiscent of
%similar to t
solving recurrence relations like
$T(n) = 2\,T(n/2) + n$ to obtain
their closed forms.
%$T \; n = n \ln n + (s \; n)$ ($s \; n$ is
%some higher-order terms).
%(for example we know $T$ is $\Theta (n \ln n)$).
Just like the closed form of a recursive definition makes estimating
its growth possible, the closed
form of $a \backslash_{bsimps} s$
allows us to prove the existence of a size bound.
Note that \ref{eq:approx} is only an approximate
term to show our point.
The precise formalised formula (\ref{seqClosedForm})
needs to wait until all $\textit{rrexp}$-related
definitions are given.
%but for now we can think of the above as "the sequence
%regular expression $a_1 \cdot a_2$ after derivatives and simplifications
%w.r.t string $s$ looks like
%an alternative of giant list of sub-expressions, where each
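
To spell out the analogy with recurrence relations, here is the textbook calculation,
included only as a reminder and assuming that $n = 2^k$ is a power of two.
Unrolling the recurrence $k$ times gives
\[
T(n) \;=\; 2\,T(n/2) + n \;=\; 4\,T(n/4) + 2n \;=\; \ldots \;=\; 2^k\,T(n/2^k) + k\,n
\;=\; n\,T(1) + n \log_2 n,
\]
so the closed form shows directly that $T$ grows like $n \log n$, which is not
obvious from the recursive definition.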

A high-level overview of the main components of the finiteness proof
is as follows:
\begin{figure}[H]
\begin{tikzpicture}[scale=1,font=\bf,
node/.style={
rectangle,rounded corners=3mm,
ultra thick,draw=black!50,minimum height=18mm,
minimum width=20mm,
top color=white,bottom color=black!20}]


\node (0) at (-5,0)
[node, text width=1.8cm, text centered]
{$\llbracket \bderssimp{a}{s} \rrbracket$};
\node (A) at (0,0)
[node,text width=1.6cm, text centered]
{$\llbracket \rderssimp{r}{s} \rrbracket_r$};
\node (B) at (3,0)
[node,text width=3.0cm, anchor=west, minimum width = 40mm]
{$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$};
\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$};

\draw [->,line width=0.5mm] (0) --
node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A);
\draw [->,line width=0.5mm] (A) --
node [above,pos=0.35] {$\quad =\ldots=$} (B);
\draw [->,line width=0.5mm] (B) --
node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C);
\end{tikzpicture}
%\caption{
\end{figure}
\noindent
We explain the steps one by one:
\begin{itemize}
\item
We first introduce the operations such as
derivatives, simplification, size calculation, etc.
associated with $\rrexp$s, which we have introduced
in Chapter~\ref{Bitcoded2}. As promised, we will discuss
why they are needed in \ref{whyRerase}.
The operations on $\rrexp$s are identical to those on
annotated regular expressions except that they dispense with
bitcodes. This means that all proofs about the size of $\rrexp$s will apply to
annotated regular expressions, because the size of a regular
expression is independent of the bitcodes.
\item
We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$,
where $\textit{ClosedForm}(r, s)$ is entirely
given in terms of the derivatives of the children regular
expressions of $r$.
We call the right-hand side the \emph{Closed Form}
of the derivative $\rderssimp{r}{s}$.
\item
Formally we give an estimate of
$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$.
The key observation is that $\distinctBy$'s output is
a list with a constant length bound.
\end{itemize}
594 | 335 |
We will expand on these steps in the next sections.\\ |
532 | 336 |
|
\section{The $\textit{Rrexp}$ Datatype}
The first step is to define
$\textit{rrexp}$s.
They are annotated regular expressions without bitcodes,
which makes the size bound proof more convenient.
%Of course, the bits which encode the lexing information
%would grow linearly with respect
%to the input, which should be taken into account when we wish to tackle the runtime complexity.
%But for the sake of the structural size
%we can safely ignore them.\\
The datatype
definition of $\rrexp$s, called
\emph{r-regular expressions},
was initially given in \ref{rrexpDef}.
The prefix $r$ distinguishes them
from basic regular expressions.
We repeat the definition of $\rrexp$ here:
\[ \rrexp ::= \RZERO \mid \RONE
\mid \RCHAR{c}
\mid \RSEQ{r_1}{r_2}
\mid \RALTS{rs}
\mid \RSTAR{r}
\]
The size of an r-regular expression is
written $\llbracket r\rrbracket_r$;
its definition mirrors that of an annotated regular expression.
\begin{center}
\begin{tabular}{lcl}
$\llbracket \RONE \rrbracket_r$ & $\dn$ & $1$\\
$\llbracket \RZERO \rrbracket_r$ & $\dn$ & $1$ \\
$\llbracket \RSEQ{r_1}{r_2} \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\
$\llbracket \RCHAR{c} \rrbracket_r $ & $\dn$ & $1$\\
$\llbracket \RALTS{rs} \rrbracket_r $ & $\dn$ & $\textit{sum} \; (\map \; (\llbracket \_ \rrbracket_r)\; rs) + 1$\\
$\llbracket \RSTAR{r} \rrbracket_r $ & $\dn$ & $\llbracket r \rrbracket_r + 1$.
\end{tabular}
\end{center}
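\noindent
As a small worked instance of this definition (the regular expression is chosen
here purely for illustration), unfolding the clauses and adding up the sizes of
the list elements in the alternative case gives
\begin{center}
\begin{tabular}{lcl}
$\llbracket \RSEQ{\RCHAR{a}}{\RALTS{[\RCHAR{b}, \RONE]}} \rrbracket_r$
& $=$ & $\llbracket \RCHAR{a} \rrbracket_r + \llbracket \RALTS{[\RCHAR{b}, \RONE]} \rrbracket_r + 1$\\
& $=$ & $1 + (\llbracket \RCHAR{b} \rrbracket_r + \llbracket \RONE \rrbracket_r + 1) + 1$\\
& $=$ & $1 + 3 + 1 \;=\; 5$,
\end{tabular}
\end{center}
\noindent
that is, every constructor in the tree contributes exactly one to the size.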
\noindent
The subscript $r$ in $\llbracket \_ \rrbracket_r$
distinguishes it from the same operation on annotated regular expressions.
Similar subscripts will be added for operations like $\rerase{}$:
\begin{center}
\begin{tabular}{lcl}
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$
\end{tabular}
\end{center}
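\noindent
For instance, taking arbitrary bitcodes $bs$, $bs_1$ and $bs_2$ (names chosen here
only for illustration), the clauses above give
\begin{center}
$\rerase{(_{bs}\sum [\,_{bs_1}\mathbf{c},\; _{bs_2}\ONE\,])}
\;=\; \RALTS{[\rerase{_{bs_1}\mathbf{c}},\; \rerase{_{bs_2}\ONE}]}
\;=\; \RALTS{[\RCHAR{c},\; \RONE]}$
\end{center}
\noindent
so the tree shape, including the list under the alternative, is left intact and only the bitcodes disappear.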
|

\subsection{Why a New Datatype?}\label{whyRerase}
\marginpar{\em added label so this section can be referenced by other parts of the thesis
so that interested readers can jump to/be reassured that there will be explanations.}
Originally the erase operation $(\_)_\downarrow$ was
used by Ausaf et al. in their proofs related to $\blexer$.
This function was not part of the lexing algorithm; its sole purpose was to
bridge the gap between the $r$
%$\textit{rexp}$
(un-annotated) and $\textit{arexp}$ (annotated)
regular expression datatypes so as to leverage the correctness
theorem of $\lexer$.%to establish the correctness of $\blexer$.
For example, Lemma \ref{retrieveStepwise} %and \ref{bmkepsRetrieve}
uses $\erase$ to convert an annotated regular expression $a$ into
a plain one so that it can be used by $\inj$ to create the desired value
$\inj\; (a)_\downarrow \; c \; v$.

Ideally $\erase$ should only remove the auxiliary information not related to the
structure, namely the
bitcodes. However, there is a complication:
the alternative constructors have different arities in $\textit{arexp}$
and $\textit{r}$:
\begin{center}
\begin{tabular}{lcl}
$\textit{r}$ & $::=$ & $\ldots \;|\; (\_ + \_) \; ::\; "\textit{r} \Rightarrow \textit{r} \Rightarrow \textit{r}" | \ldots$\\
$\textit{arexp}$ & $::=$ & $\ldots\; |\; (\Sigma \_ ) \; ::\; "\textit{arexp} \; list \Rightarrow \textit{arexp}" | \ldots$
\end{tabular}
\end{center}
\noindent
To convert between the two,
$\erase$ has to recursively disassemble the list into nested binary applications of the
$(\_ + \_)$ operator,
handling corner cases such as empty or
singleton alternative lists:
%becomes $r$ during the
%$\erase$ function.
%The annotated regular expression $\sum[a, b, c]$ would turn into
%$(a+(b+c))$.
\begin{center}
\begin{tabular}{lcl}
$ (_{bs}\sum [])_\downarrow $ & $\dn$ & $\ZERO$\\
$ (_{bs}\sum [a])_\downarrow$ & $\dn$ & $a_\downarrow$\\
$ (_{bs}\sum [a_1, a_2])_\downarrow$ & $\dn$ & $(a_1)_\downarrow + (a_2)_\downarrow$\\
$ (_{bs}\sum (a :: as))_\downarrow$ & $\dn$ & $a_\downarrow + (\erase \; (_{[]} \sum as))$
\end{tabular}
\end{center}
\noindent
These operations inevitably change the structure and size of
an annotated regular expression. For example, for a character $x$,
$a_1 = {}_{Z}\sum[x]$ has size 2, but $(a_1)_\downarrow = x_\downarrow$
only has size 1.
%adding unnecessary
%complexities to the size bound proof.
%The reason we
%define a new datatype is that
%the $\erase$ function
%does not preserve the structure of annotated
%regular expressions.
%We initially started by using
%plain regular expressions and tried to prove
%lemma \ref{rsizeAsize},
%however the $\erase$ function messes with the structure of the
%annotated regular expression.
%The $+$ constructor
%of basic regular expressions is only binary, whereas $\sum$
%takes a list. Therefore we need to convert between
%annotated and normal regular expressions as follows:
More precisely, if we define the size of a plain regular expression
in the usual way,
\begin{center}
\begin{tabular}{lcl}
$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\
$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\
$\llbracket r_1 + r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\
$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\
$\llbracket r^* \rrbracket_p $ & $\dn$ & $\llbracket r \rrbracket_p + 1$
\end{tabular}
\end{center}
\noindent
then the property
\begin{center}
$\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$
\end{center}
does not hold.
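\noindent
To see the mismatch concretely, take an alternative with three character
children (an example chosen here for illustration, using the clauses for $\erase$ given above
and assuming the annotated size counts nodes in the same way as $\llbracket \_ \rrbracket_r$):
\begin{center}
\begin{tabular}{lcl}
$\llbracket _{bs}\sum [\,_{bs_1}\mathbf{c_1}, \,_{bs_2}\mathbf{c_2}, \,_{bs_3}\mathbf{c_3}] \rrbracket$ & $=$ & $1 + 1 + 1 + 1 = 4$\\
$(_{bs}\sum [\,_{bs_1}\mathbf{c_1}, \,_{bs_2}\mathbf{c_2}, \,_{bs_3}\mathbf{c_3}])_\downarrow$ & $=$ & $\mathbf{c_1} + (\mathbf{c_2} + \mathbf{c_3})$\\
$\llbracket \mathbf{c_1} + (\mathbf{c_2} + \mathbf{c_3}) \rrbracket_p$ & $=$ & $1 + (1 + 1 + 1) + 1 = 5$
\end{tabular}
\end{center}
\noindent
Here the erased expression is larger than the annotated one, whereas for the
singleton alternative above it is smaller.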
%With $\textit{rerase}$, however,
%only the bitcodes are thrown away.
This led us to define a new regular expression datatype without
bitcodes but with a list-based alternative constructor, together with a new erase function
that is strictly structure-preserving:
\begin{center}
\begin{tabular}{lcl}
$\textit{rrexp}$ & $::=$ & $\ldots\; |\; (\sum \_ ) \; ::\; "\textit{rrexp} \; list \Rightarrow \textit{rrexp}" | \ldots$\\
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$
\end{tabular}
\end{center}
%But
%Everything about the structure remains intact.
%Therefore it does not change the size
%of an annotated regular expression and we have:
\noindent
One might still be able to relate $\llbracket a \rrbracket$ and
$\llbracket a_\downarrow \rrbracket_p$ by some inequality
(the singleton example above shows that
$\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p$ does not hold as it stands)
and then estimate $\llbracket a_\downarrow \rrbracket_p$,
but we found our approach more straightforward.\\

\subsection{Functions for R-regular Expressions}
The downside of our approach is that we need to redefine
several functions for $\rrexp$.
In this section we define the r-regular expression versions
of $\bder$ and of the functions related to $\textit{bsimp}$.
We use $r$ as a prefix or subscript to distinguish them
from the bitcoded versions.
%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$
%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$.
%As promised, they are much simpler than their bitcoded counterparts.
%The operations on r-regular expressions are
%almost identical to those of the annotated regular expressions,
%except that no bitcodes are used. For example,
The derivative operation for an r-regular expression is
\begin{center}
\begin{tabular}{@{}lcl@{}}
$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
$(\ONE)\,\backslash_r c$ & $\dn$ & $\ZERO$\\
$(\mathbf{d})\,\backslash_r c$ & $\dn$ &
$\textit{if}\;c=d\; \;\textit{then}\;
\ONE\;\textit{else}\;\ZERO$\\
$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ &
$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\
$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ &
$\textit{if}\;(\textit{rnullable}\,r_1)$\\
& &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\
& &$\phantom{\textit{then},\;\sum\,}(r_2\,\backslash_r c)]$\\
& &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\
$(r^*)\,\backslash_r c$ & $\dn$ &
$( r\,\backslash_r c)\cdot r^*$
\end{tabular}
\end{center}
\noindent
where we omit the definition of $\textit{rnullable}$.
The generalisation from derivatives w.r.t.\ a character to
derivatives w.r.t.\ strings is given by
\begin{center}
\begin{tabular}{lcl}
$r \backslash_{rs} []$ & $\dn$ & $r$\\
$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$
\end{tabular}
\end{center}
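\noindent
As an illustration (with an example expression chosen here, and using the usual
Brzozowski clauses that $\ONE\backslash_r c$ is $\ZERO$ and that
$\mathbf{d}\backslash_r c$ is $\ONE$ if $c = d$ and $\ZERO$ otherwise), the string derivative of
$\mathbf{a}\cdot\mathbf{b}$ w.r.t.\ the string $ab$ unfolds as
\begin{center}
\begin{tabular}{lcl}
$(\mathbf{a}\cdot\mathbf{b}) \backslash_{rs} (a::b::[])$
& $=$ & $((\mathbf{a}\cdot\mathbf{b}) \backslash_r a) \backslash_{rs} (b::[])$\\
& $=$ & $(\ONE\cdot\mathbf{b}) \backslash_{rs} (b::[])$\\
& $=$ & $((\ONE\cdot\mathbf{b}) \backslash_r b) \backslash_{rs} []$\\
& $=$ & $\sum\,[(\ONE \backslash_r b)\cdot\mathbf{b},\; \mathbf{b}\backslash_r b]$\\
& $=$ & $\sum\,[\ZERO\cdot\mathbf{b},\; \ONE]$
\end{tabular}
\end{center}
\noindent
The result is nullable, as expected, but it still contains the redundant term
$\ZERO\cdot\mathbf{b}$; removing such terms is the job of the simplification functions.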
|

The function $\distinctBy$ for r-regular expressions does not need
a function checking equivalence because
there are no bit annotations.
Therefore we have
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\
$\rdistinct{r :: rs}{rset}$ & $\dn$ &
$\textit{if}\;(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\
& & $\textit{else}\; \;
r::\rdistinct{rs}{(rset \cup \{r\})}$
\end{tabular}
\end{center}
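\noindent
For example, for pairwise distinct r-regular expressions $r_1$, $r_2$ and $r_3$
(an input chosen here for illustration), the accumulator set grows as follows:
\begin{center}
\begin{tabular}{lcl}
$\rdistinct{[r_1, r_2, r_1, r_3, r_2]}{\varnothing}$
& $=$ & $r_1 :: \rdistinct{[r_2, r_1, r_3, r_2]}{\{r_1\}}$\\
& $=$ & $r_1 :: r_2 :: \rdistinct{[r_1, r_3, r_2]}{\{r_1, r_2\}}$\\
& $=$ & $r_1 :: r_2 :: \rdistinct{[r_3, r_2]}{\{r_1, r_2\}}$\\
& $=$ & $r_1 :: r_2 :: r_3 :: \rdistinct{[r_2]}{\{r_1, r_2, r_3\}}$\\
& $=$ & $r_1 :: r_2 :: r_3 :: \rdistinct{[]}{\{r_1, r_2, r_3\}}$\\
& $=$ & $[r_1, r_2, r_3]$
\end{tabular}
\end{center}
\noindent
Only the first occurrence of each element survives, and the original order is kept.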
%TODO: definition of rsimp (maybe only the alternative clause)
\noindent
%We would like to make clear
%a difference between our $\rdistincts$ and
%the Isabelle $\textit {distinct}$ predicate.
%In Isabelle $\textit{distinct}$ is a function that returns a boolean
%rather than a list.
%It tests if all the elements of a list are unique.\\
With $\textit{rdistinct}$ in place,
the flatten function for $\rrexp$ is as follows:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{rflts} \; []$ & $\dn$ & $[]$ \\
$\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\
$\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \; \textit{as'} $ \\
$\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise)
\end{tabular}
\end{center}
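\noindent
For example, assuming the usual base case $\textit{rflts}\;[] = []$ and taking
$r_1$, $r_2$ and $r_3$ to be r-regular expressions where $r_3$ is neither $\ZERO$ nor an alternative
(an input chosen here for illustration), we get
\begin{center}
\begin{tabular}{lcl}
$\textit{rflts}\;[\sum [r_1, r_2],\; \ZERO,\; r_3]$
& $=$ & $[r_1, r_2] \;@\; \textit{rflts}\;[\ZERO,\; r_3]$\\
& $=$ & $[r_1, r_2] \;@\; \textit{rflts}\;[r_3]$\\
& $=$ & $[r_1, r_2] \;@\; (r_3 :: \textit{rflts}\;[])$\\
& $=$ & $[r_1, r_2, r_3]$
\end{tabular}
\end{center}
\noindent
Note that only one level of nesting is opened up: $r_1$ and $r_2$ themselves are not flattened further.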
|
\noindent
The function
$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\
$\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\
$\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\
\end{tabular}
\end{center}
\noindent
Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\rsimpseq \;\; \RZERO \; \_ $ & $\dn$ & $\RZERO$\\
$\rsimpseq \;\; \_ \; \RZERO $ & $\dn$ & $\RZERO$\\
$\rsimpseq \;\; \RONE \; r_2$ & $\dn$ & $r_2$\\
$\rsimpseq \;\; r_1 \; r_2$ & $\dn$ & $r_1 \cdot r_2$\\
\end{tabular}
\end{center}
|
With these functions in place we obtain $\textit{rsimp}$ and $\rderssimp{\_}{\_}$:
\begin{center}
\begin{tabular}{@{}lcl@{}}
$\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $ \textit{rsimp}_{SEQ} \;(\textit{rsimp} \; r_1) \; (\textit{rsimp} \; r_2) $ \\
$\textit{rsimp} \; (\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; (\textit{rdistinct} \; ( \textit{rflts} \; ( \textit{map} \; \textit{rsimp} \; rs)) \; \varnothing) $ \\
$\textit{rsimp} \; r$ & $\dn$ & $\textit{r} \qquad \textit{otherwise}$
\end{tabular}
\end{center}
\begin{center}
\begin{tabular}{@{}lcl@{}}
$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$
\end{tabular}
\end{center}

\begin{center}
\begin{tabular}{@{}lcl@{}}
$r \backslash_{rsimps} \; \; c\!::\!s $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\
$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$
\end{tabular}
\end{center}
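\noindent
As a worked example (the input is chosen here for illustration, and the alternative clause is used in the form
$\textit{rsimp}_{ALTS} \; (\textit{rdistinct} \; (\textit{rflts} \; (\textit{map} \; \textit{rsimp} \; rs)) \; \varnothing)$),
consider simplifying $\sum [\mathbf{a},\; \sum [\RZERO,\; \mathbf{a}]]$.
The inner alternative simplifies as
\begin{center}
\begin{tabular}{lcl}
$\textit{rsimp} \; (\sum [\RZERO,\; \mathbf{a}])$
& $=$ & $\textit{rsimp}_{ALTS} \; (\textit{rdistinct} \; (\textit{rflts} \; [\RZERO,\; \mathbf{a}]) \; \varnothing)$\\
& $=$ & $\textit{rsimp}_{ALTS} \; [\mathbf{a}] \;=\; \mathbf{a}$
\end{tabular}
\end{center}
and therefore
\begin{center}
\begin{tabular}{lcl}
$\textit{rsimp} \; (\sum [\mathbf{a},\; \sum [\RZERO,\; \mathbf{a}]])$
& $=$ & $\textit{rsimp}_{ALTS} \; (\textit{rdistinct} \; (\textit{rflts} \; [\mathbf{a},\; \mathbf{a}]) \; \varnothing)$\\
& $=$ & $\textit{rsimp}_{ALTS} \; [\mathbf{a}] \;=\; \mathbf{a}$
\end{tabular}
\end{center}
\noindent
Flattening removes the $\RZERO$, de-duplication removes the repeated $\mathbf{a}$,
and $\textit{rsimp}_{ALTS}$ unwraps the remaining singleton list.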
|
\noindent
We do not define an r-regular expression version of $\blexersimp$,
as our proof does not depend on it.
We are now ready to show how r-regular expressions allow
us to prove the size bound on bitcoded regular expressions.

\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions}
Everything about the size of annotated regular expressions after applying
$\bsimp$ and $\backslash_{bsimps}$
can be calculated via the size of r-regular expressions after applying
$\rsimp$ and $\backslash_{rsimps}$:
\begin{lemma}\label{sizeRelations}
The following equalities hold:
\begin{itemize}
\item
$\rsize{\rerase a} = \asize a$
\item
$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$
\item
$\asize{\bderssimp{a}{s}} = \rsize{\rderssimp{\rerase{a}}{s}}$
\end{itemize}
\end{lemma}
\begin{proof}
The first part follows from the definition of $(\_)_{\downarrow_r}$.
The second part is by induction on the inductive cases
of $\textit{bsimp}$.
The third part is by induction on the string $s$,
where the inductive step follows from part one.
\end{proof}
|
\noindent
With Lemma \ref{sizeRelations},
we can focus on
estimating only
$\rsize{\rderssimp{\rerase{a}}{s}}$
in later parts because
\begin{center}
$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$
implies
$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$.
\end{center}
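\noindent
Spelled out, this implication is just the third equality of Lemma \ref{sizeRelations}
followed by the assumed bound, where $\asize{\bderssimp{a}{s}}$ is the quantity
written $\llbracket a \backslash_{bsimps} s \rrbracket$ above:
\begin{center}
$\asize{\bderssimp{a}{s}}
\;=\; \rsize{\rderssimp{\rerase{a}}{s}}
\;\leq\; N_r$
\end{center}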
|
%From now on we
%Unless stated otherwise in the rest of this
%chapter all regular expressions without
%bitcodes are seen as r-regular expressions ($\rrexp$s).
%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$,
%we use the notation $r_1 + r_2$
%for brevity.


%-----------------------------------
% SUB SECTION ROADMAP RREXP BOUND
%-----------------------------------

%\subsection{Roadmap to a Bound for $\textit{Rrexp}$}

%The way we obtain the bound for $\rrexp$s is by two steps:
%\begin{itemize}
% \item
% First, we rewrite $r\backslash s$ into something else that is easier
% to bound. This step is crucial for the inductive case
% $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way,
% but after simplification, they will always be equal or smaller to a form consisting of an alternative
% list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application.
% \item
% Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size
% by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only
% reduce the size of a regular expression, not adding to it.
%\end{itemize}
%
%\section{Step One: Closed Forms}
%We transform the function application $\rderssimp{r}{s}$
%into an equivalent
%form $f\; (g \; (\sum rs))$.
%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$.
%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the
%right hand side the "closed form" of $r\backslash s$.
%
%\begin{quote}\it
% Claim: For regular expressions $r_1 \cdot r_2$, we claim that
%\end{quote}
%\noindent
%We explain in detail how we reached those claims.

If we attempt to prove
\begin{center}
$\forall r. \; \exists N_r.\;\; \forall s. \; \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$
\end{center}
by a naive induction on the structure of $r$,
then we get stuck in inductive cases such as
$r_1\cdot r_2$.
The inductive hypotheses are:
\begin{center}
1: $\text{for } r_1, \text{ there exists } N_{r_1} \text{ s.t. }
\forall s. \; \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\
2: $\text{for } r_2, \text{ there exists } N_{r_2} \text{ s.t. }
\forall s. \; \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $
\end{center}
The inductive step to prove would be
\begin{center}
$\text{there exists } N_{r_1\cdot r_2} \text{ s.t. } \forall s. \;
\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$
\end{center}
The problem is that it is not clear what
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like,
and therefore $N_{r_1}$ and $N_{r_2}$ in the
inductive hypotheses cannot be directly used.
%We have already seen that $(r_1 \cdot r_2)\backslash s$
%and $(r^*)\backslash s$ can grow in a wild way.

The point, however, is that these derivatives are equivalent to an alternative over a list of
terms $\sum rs$, where each term in $rs$ is
made of $r_1 \backslash s' $, $r_2\backslash s'$,
and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands
for the set of substrings of $s$).
The list $rs$ is then de-duplicated by $\textit{rdistinct}$
during simplification, which prevents $rs$ from growing indefinitely.
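For instance, assume (purely for illustration) that both $r_1$ and $r_1\backslash_r a$ are nullable.
Unfolding the sequence clause of $\backslash_r$ twice, without any simplification, gives
\begin{center}
\begin{tabular}{lcl}
$(r_1\cdot r_2)\backslash_r a$ & $=$ & $\sum\,[(r_1\backslash_r a)\cdot r_2,\; r_2\backslash_r a]$\\
$((r_1\cdot r_2)\backslash_r a)\backslash_r b$ & $=$ &
$\sum\,[\,\sum\,[((r_1\backslash_r a)\backslash_r b)\cdot r_2,\; r_2\backslash_r b],\;
(r_2\backslash_r a)\backslash_r b\,]$
\end{tabular}
\end{center}
\noindent
Every term is built from a derivative of $r_1$ or $r_2$ w.r.t.\ a substring of $ab$
(here $ab$ and $b$), which is the shape described above.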

Based on this idea, we develop a proof in two steps.
First, we show the following equality (where
$f$ and $g$ are functions that do not increase the size of their input)
\begin{center}
$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$,
\end{center}
where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on.
For example, for $r_1 \cdot r_2$ the equality is
\begin{center}
$ \rderssimp{r_1 \cdot r_2}{s} =
\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$
\end{center}
We call the right-hand side the
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$.
Second, we bound the closed form of r-regular expressions
using some estimation techniques
and then apply
Lemma \ref{sizeRelations} to show that the bitcoded regular expressions
in our $\blexersimp$ are finitely bounded.

We will describe the first step of the proof in detail
in the next section.

\section{Closed Forms}
In this section we describe in detail
how to express the string derivatives
of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string
rather than a single character) in a different way from
our previous definition.
In previous chapters, the derivative of a
regular expression $r$ w.r.t.\ a string $s$
was recursively defined on the string:
\begin{center}
  $r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$
\end{center}
The problem is that
this definition does not provide much information
on what $r \backslash_s s$ looks like.
If we are interested in the size of a derivative like
$(r_1 \cdot r_2)\backslash s$,
we need a more concrete form to begin with.
We call such more concrete representations the ``closed forms'' of
string derivatives, as opposed to their original definitions.
The terminology ``closed form'' is borrowed from mathematics,
where it usually describes expressions that are composed solely of finitely many
well-known and easy-to-compute operations such as
additions, multiplications, and exponential functions.
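A textbook instance, given here only to illustrate the terminology, is the arithmetic sum of the first $n$ natural numbers: it can be defined by recursion on $n$, but it also has the closed form
\begin{center}
  $1 + 2 + \ldots + n = \frac{n \cdot (n + 1)}{2}$,
\end{center}
whose right-hand side can be evaluated directly without unfolding the recursion.
The closed forms developed in this section are meant to play an analogous role for string derivatives.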

We start by proving some basic identities
involving the simplification functions for r-regular expressions.
After that we introduce the rewrite relations
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$,
$\rightsquigarrow_f$ and $\rightsquigarrow_g$.
These relations use techniques similar to those in Chapter \ref{Bitcoded2}
for annotated regular expressions.
Finally, we use these identities to establish the
closed forms of the alternative regular expression,
the sequence regular expression, and the star regular expression.
%$r_1\cdot r_2$, $r^*$ and $\sum rs$.



\subsection{Some Basic Identities}

In what follows we will often convert between lists
and sets.
We use Isabelle's $set$ to refer to the
function that converts a list $rs$ to the set
containing all the elements in $rs$.
\subsubsection{$\textit{rdistinct}$ Does the Job of De-duplication}
The $\textit{rdistinct}$ function, as its name suggests,
de-duplicates a list of r-regular expressions.
It also removes any elements that
are already in the accumulator set.
\begin{lemma}\label{rdistinctDoesTheJob}
  %The function $\textit{rdistinct}$ satisfies the following
  %properties:
  Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its
  recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.}
  for testing
  whether a list's elements are unique. Then the following
  properties about $\textit{rdistinct}$ hold:
  \begin{itemize}
    \item
      If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$.
    \item
      %If list $rs'$ is the result of $\rdistinct{rs}{acc}$,
      $\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$.
    \item
      $\textit{set} \; (\rdistinct{rs}{acc})
      = (\textit{set} \; rs) - acc$
  \end{itemize}
\end{lemma}
\noindent
\begin{proof}
  The first part is by an induction on $rs$.
  The second and third parts can be proven by using the
  inductive cases of $\textit{rdistinct}$.

\end{proof}
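
\noindent
As a small illustration of Lemma \ref{rdistinctDoesTheJob}, assume (this is our reading of the definition, which is not repeated here) that $\textit{rdistinct}$ traverses its input from left to right and adds every element it keeps to the accumulator.
Let $a$, $b$ and $c$ be three pairwise distinct r-regular expressions. Then
\begin{center}
  $\rdistinct{[a, b, a, c]}{\{b\}} = [a, c]$
\end{center}
\noindent
The element $b$ is dropped because it is already in the accumulator, and the second $a$ is dropped
because the first occurrence of $a$ has been added to the accumulator by then.
The output contains no element of the accumulator, is distinct, and its set of elements is
$\{a, b, c\} - \{b\} = \{a, c\}$, in line with the three properties of the lemma.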

\noindent
%$\textit{rdistinct}$ will out all regular expression terms
%that are in the accumulator, therefore
If we prepend a list $rs$ whose elements are all in the accumulator
to another list $rs_a$, and then call $\textit{rdistinct}$
on the concatenated list, the output is the same as if we had called $\textit{rdistinct}$
on $rs_a$ alone:
\begin{lemma}\label{rdistinctConcat}
  The elements appearing in the accumulator will always be removed.
  More precisely,
  \begin{itemize}
    \item
      If $\textit{set} \; rs \subseteq acc$, then
      $\rdistinct{rs@rs_a }{acc} = \rdistinct{rs_a }{acc}$.
    \item
      Similarly, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$,
      then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$.
  \end{itemize}
\end{lemma}

\begin{proof}
  By induction on $rs$ and using Lemma \ref{rdistinctDoesTheJob}.
\end{proof}
\noindent
On the other hand, if an element $r$ does not appear in the input list waiting to be de-duplicated,
then expanding the accumulator to include that element will not cause the output list to change:
\begin{lemma}\label{rdistinctOnDistinct}
  The accumulator can be augmented to include elements not appearing in the input list,
  and the output will not change.
  \begin{itemize}
    \item
      If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{(\{r\} \cup acc)}$.
    \item
      In particular, if $\;\;\textit{isDistinct} \; rs$, then we have
      \[ \rdistinct{rs}{\varnothing} = rs \]
  \end{itemize}
\end{lemma}
\begin{proof}
  The first half is by induction on $rs$. The second half is a corollary of the first.
\end{proof}
\noindent
The function $\textit{rdistinct}$ removes duplicates from anywhere in a list.
Although this seems obvious,
the induction needed to prove it is not entirely straightforward.
\begin{lemma}\label{distinctRemovesMiddle}
  The following two properties hold if $r \in rs$:
  \begin{itemize}
    \item
      $\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\
      and\\
      $\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$
    \item
      $\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\
      and\\
      $\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} =
      \rdistinct{(ab :: rs @ rs'')}{rset'}$
  \end{itemize}
\end{lemma}
\noindent
\begin{proof}
  By induction on $rs$. All other variables are allowed to be arbitrary.
  The second part of the lemma requires the first.
  Note that for each part, the two sub-propositions need to be proven
  at the same time,
  so that the induction goes through.
\end{proof}
\noindent
This allows us to prove a few more equivalence relations involving
$\textit{rdistinct}$ (they will be useful later):
\begin{lemma}\label{rdistinctConcatGeneral}
  \mbox{}
  \begin{itemize}
    \item
      $\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$
    \item
      If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} =
      \rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary
      of this,
    \item
      $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
      ((\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This
      gives another corollary used later:
    \item
      If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{
      (\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $.

  \end{itemize}
\end{lemma}
\begin{proof}
  By Lemmas \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}.
\end{proof}
\noindent
The next lemma is a more general form of Lemma \ref{rdistinctConcat};
it says that
$\textit{rdistinct}$ is composable w.r.t.\ list concatenation:
\begin{lemma}\label{distinctRdistinctAppend}
  If $\;\; \textit{isDistinct} \; rs_1$,
  and $(set \; rs_1) \cap acc = \varnothing$,
  then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not
  have an effect on $rs_1$:
  \[\textit{rdistinct}\; (rs_1 @ rs_a)\;\, acc
  = rs_1@(\textit{rdistinct} \; rs_a \; (acc \cup set \; rs_1))\]
\end{lemma}
\begin{proof}
  By an induction on
  $rs_1$, where $rs_a$ and $acc$ are allowed to be arbitrary.
\end{proof}
\noindent
$\textit{rdistinct}$ needs to be applied only once, and
applying it multiple times does not make any difference:
\begin{corollary}\label{distinctOnceEnough}
  $\textit{rdistinct} \; (rs @ rs_a) \; \{\} = \textit{rdistinct} \; ( (\textit{rdistinct} \;
  rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs))) \; \{\}$
\end{corollary}
\begin{proof}
  By Lemma \ref{distinctRdistinctAppend}.
\end{proof}
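
\noindent
To make Lemma \ref{distinctRdistinctAppend} concrete, here is a hand-computed instance. It assumes that $\textit{rdistinct}$
processes its input from left to right and extends the accumulator with each element it keeps;
$a$, $b$, $c$ and $d$ stand for pairwise distinct r-regular expressions.
With $rs_1 = [a, b]$, $rs_a = [b, c, d]$ and $acc = \{c\}$, both preconditions of the lemma hold, and
\begin{center}
  $\rdistinct{([a, b] @ [b, c, d])}{\{c\}} \;=\; [a, b] @ (\rdistinct{[b, c, d]}{\{a, b, c\}}) \;=\; [a, b, d]$
\end{center}
\noindent
The prefix $[a, b]$ is passed through unchanged, while its elements join the accumulator
and thereby remove the later occurrence of $b$.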

\subsubsection{The Properties of $\textit{Rflts}$}
We give in this subsection some properties
involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and
$\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them.
These will be helpful in later closed-form proofs, when
we want to transform terms in which derivatives
%the ways in which multiple functions involving
%those are composed together
and simplifications are interleaved.

\noindent
%When the function $\textit{Rflts}$
%is applied to the concatenation of two lists; the output can be calculated by first applying the
%functions on two lists separately and then concatenating them together.
$\textit{Rflts}$ is composable in terms of concatenation:
\begin{lemma}\label{rfltsProps}
  The function $\rflts$ has the properties below:
  \begin{itemize}
    \item
      $\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$
    \item
      If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs_1}$, then $\rflts \; (r::rs) = r :: \rflts \; rs$
    \item
      $\rflts \; (rs @ [\RZERO]) = \rflts \; rs$
    \item
      $\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$
    \item
      $\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$
    \item
      If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r])
      = (\rflts \; rs) @ [r]$
    \item
      If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs.
      r_1 \in \rflts \; rs'$.
    \item
      $\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$
  \end{itemize}
\end{lemma}
\noindent
\begin{proof}
  By induction on $rs_1$ in the first sub-lemma, induction on $r$ in the second,
  and induction on $rs$, $rs'$, $rs$, $rs'$ and $rs_a$ in the third, fourth, fifth, sixth and
  last sub-lemma, respectively.
\end{proof}
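
\noindent
A small hand-computed instance may help to read these properties. It assumes that $\rflts$ drops every $\RZERO$,
replaces an element of the form $\sum rs$ by the elements of $rs$ (opening one level of nesting only),
and keeps all other elements; $a$, $b$ and $c$ are characters.
\begin{center}
  $\rflts \; [\RZERO, \sum [a, b], c] \;=\; [a, b, c]$
\end{center}
\noindent
Splitting the input as $[\RZERO, \sum [a, b]] @ [c]$ gives the same result via the first property:
$\rflts \; [\RZERO, \sum [a, b]] @ \rflts \; [c] = [a, b] @ [c]$.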

\noindent
Now we introduce the property that the derivative operation
and $\rsimpalts$
commute. This will be used later on when deriving the closed form for
the alternative regular expression:
\begin{lemma}\label{rderRsimpAltsCommute}
  $\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$
\end{lemma}
\begin{proof}
  By induction on $rs$.
\end{proof}

\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s}
Much like the definition of $L$ on plain regular expressions, one can also
define the language interpretation for $\rrexp$s.
\begin{center}
  \begin{tabular}{lcl}
    $RL \; (\ZERO_r)$ & $\dn$ & $\varnothing$\\
    $RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\
    $RL \; (c)$ & $\dn$ & $\{[c]\}$\\
    $RL \; (\sum rs)$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\
    $RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\
    $RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$
  \end{tabular}
\end{center}
\noindent
The main use of $RL$ is to establish some connections between $\rsimp{}$
and $\rnullable{}$:
\begin{lemma}
  The following properties hold:
  \begin{itemize}
    \item
      If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$.
    \item
      $\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$.
  \end{itemize}
\end{lemma}
\begin{proof}
  The first part is by induction on $r$.
  The second part is true because the property
  \[ RL \; r = RL \; (\rsimp{r})\] holds,
  and because $\rnullable{r}$ holds exactly when $[] \in RL \; r$.
\end{proof}

\subsubsection{Simplified $\textit{Rrexp}$s are Good}
We formalise the notion of ``good'' regular expressions,
which means regular expressions that
are fully simplified in terms of our $\textit{rsimp}$ function.
For alternative regular expressions that means they
do not contain any nested alternatives, un-eliminated $\RZERO$s
or duplicate elements (so that, for example,
$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$ are not good).
The clauses for $\good$ are:
\begin{center}
  \begin{tabular}{@{}lcl@{}}
    $\good\; \RZERO$ & $\dn$ & $\bfalse$\\
    $\good\; \RONE$ & $\dn$ & $\btrue$\\
    $\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\
    $\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\
    $\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\
    $\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ &
    $\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\
    & & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\
    $\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\
    $\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\
    $\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\
    $\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \land \;\, \good \; r_2$\\
    $\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\
  \end{tabular}
\end{center}
\noindent
We omit the recursive definition of the predicate $\textit{nonAlt}$,
which evaluates to true when the regular expression is not an
alternative, and false otherwise.
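
\noindent
To illustrate these clauses, let $a$, $b$ and $c$ be pairwise distinct characters
(written without their constructor for brevity). Unfolding the clauses above by hand gives, for instance:
\begin{center}
  \begin{tabular}{lcll}
    $\good \; (\sum [a, b])$ & $=$ & $\btrue$ & \\
    $\good \; (\sum [a, a])$ & $=$ & $\bfalse$ & (the list is not distinct)\\
    $\good \; (\sum [a, \sum [b, c]])$ & $=$ & $\bfalse$ & (the second element is not $\textit{nonAlt}$)\\
    $\good \; (\sum [a, \RZERO])$ & $=$ & $\bfalse$ & ($\RZERO$ is not good)\\
    $\good \; (\RSEQ{\RONE}{a})$ & $=$ & $\bfalse$ & (the leading $\RONE$ has not been removed)
  \end{tabular}
\end{center}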

\noindent
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that
the elements of its non-empty argument list are all good themselves, all $\textit{nonAlt}$,
and pairwise distinct:
\begin{lemma}\label{rsimpaltsGood}
  If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$,
  then $\good \; (\rsimpalts \; rs)$ if and only if for all $r \in rs. \; \good \; r$.
\end{lemma}
\noindent
We also note that
if a regular expression $r$ is good, then $\rflts$ on the singleton
list $[r]$ will not break goodness:
\begin{lemma}\label{flts2}
  If $\good \; r$, then for all $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$.
\end{lemma}
\begin{proof}
  By an induction on $r$.
\end{proof}
\noindent
The other observation we make about $\rsimp{r}$ is that it never
comes with nested alternatives, which we describe as the $\nonnested$
property:
\begin{center}
  \begin{tabular}{lcl}
    $\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\
    $\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\
    $\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested \; (\sum rs)$\\
    $\nonnested \; \, r $ & $\dn$ & $\btrue$
  \end{tabular}
\end{center}
\noindent
The $\rflts$ function
always opens up nested alternatives,
which ensures that the result of $\rsimp{}$ is non-nested:

\begin{lemma}\label{nonnestedRsimp}
  It is always the case that
  \begin{center}
    $\nonnested \; (\rsimp{r})$
  \end{center}
\end{lemma}
\begin{proof}
  By induction on $r$.
\end{proof}
\noindent
With this we can prove that the list obtained from a list of regular expressions
by simplification, flattening and de-duplication
does not contain any alternative regular expression directly:
\begin{lemma}\label{nonaltFltsRd}
  If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$
  then $\textit{nonAlt} \; x$.
\end{lemma}
\begin{proof}
  By Lemma \ref{nonnestedRsimp}.
\end{proof}
\noindent
The other fact we know is that once $\rsimp{}$ has finished
processing an alternative regular expression, the result will not
contain any $\RZERO$s. This is because all the recursive
calls to the simplification on the children regular expressions
make each child either good or $\RZERO$, $\rflts$ then removes
the $\RZERO$s while leaving only good elements in the list,
and $\textit{rdistinct}$ does not interfere with this result.
\begin{lemma}\label{flts3Obv}
  The following are true:
  \begin{itemize}
    \item
      If for all $r \in rs. \, \good \; r $ or $r = \RZERO$,
      then for all $r \in \rflts\; rs. \, \good \; r$.
    \item
      If $x \in \rdistinct{\rflts\; (\map \; \rsimp{}\; rs)}{\varnothing}$
      and for all $y$ such that $\llbracket y \rrbracket_r$ is less than
      $\llbracket rs \rrbracket_r + 1$, either
      $\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$,
      then $\good \; x$.
  \end{itemize}
\end{lemma}
\begin{proof}
  The first part is by induction, where the inductive cases
  are the inductive cases of $\rflts$.
  The second part is a corollary of the first part.
\end{proof}

This leads to a good structural property of $\rsimp{}$:
after simplification, a regular expression is
either good or $\RZERO$:
\begin{lemma}\label{good1}
  For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$.
\end{lemma}
\begin{proof}
  By an induction on $r$. The inductive measure is the size $\llbracket \_ \rrbracket_r$.
  Lemma \ref{rsimpMono} says that
  $\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to
  $\llbracket r \rrbracket_r$.
  Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ cases,
  the inductive hypothesis applies to the children regular expressions
  $r_1$, $r_2$, etc. The precondition of Lemma \ref{flts3Obv} is satisfied
  by that as well.
  Lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used
  to ensure that goodness is preserved at the topmost level.
\end{proof}
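
\noindent
As a worked illustration of Lemma \ref{good1}, we sketch how $\textit{rsimp}$ processes two small alternatives.
The sketch assumes that $\textit{rsimp}$ on $\sum rs$ computes
$\rsimpalts \; (\rdistinct{(\rflts \; (\map \; \rsimp{} \; rs))}{\varnothing})$,
and that $\rsimpalts$ returns $\RZERO$ for the empty list, the single element for a singleton list,
and $\sum rs$ otherwise; $a$ and $b$ are distinct characters, and the children of both
alternatives are already simplified, so mapping $\rsimp{}$ over them changes nothing.
\begin{center}
  \begin{tabular}{lcl}
    $\rsimp{(\sum [\RZERO, a, \sum [a, b]])}$
    & $=$ & $\rsimpalts \; (\rdistinct{(\rflts \; [\RZERO, a, \sum [a, b]])}{\varnothing})$\\
    & $=$ & $\rsimpalts \; (\rdistinct{[a, a, b]}{\varnothing})$\\
    & $=$ & $\rsimpalts \; [a, b]$\\
    & $=$ & $\sum [a, b]$\\[1ex]
    $\rsimp{(\sum [\RZERO, \RZERO])}$
    & $=$ & $\rsimpalts \; (\rdistinct{(\rflts \; [\RZERO, \RZERO])}{\varnothing})$\\
    & $=$ & $\rsimpalts \; []$\\
    & $=$ & $\RZERO$
  \end{tabular}
\end{center}
\noindent
The first result is good and the second is $\RZERO$, matching the two cases of the lemma.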

\noindent
We shall prove that any good regular expression is
a fixed-point for $\textit{rsimp}$.
First we prove an auxiliary lemma:
\begin{lemma}\label{goodaltsNonalt}
  If $\good \; \sum rs$, then $\rflts\; rs = rs$.
\end{lemma}
\begin{proof}
  By an induction on $\sum rs$. The inductive rules are the cases
  for $\good$.
\end{proof}
\noindent
Now we are ready to prove that good regular expressions are invariant
with respect to $\rsimp{}$:
\begin{lemma}\label{test}
  If $\good \;r$ then $\rsimp{r} = r$.
\end{lemma}
\begin{proof}
  By an induction on the inductive cases of $\good$, using Lemmas
  \ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}.
  Lemma \ref{goodaltsNonalt} is used in the alternative
  case where two or more elements are present in the list.
\end{proof}
\noindent
Below we show a property involving $\rflts$, $\textit{rdistinct}$,
$\rsimp{}$ and $\rsimp_{ALTS}$,
which requires Lemma \ref{good1} to go through smoothly:
\begin{lemma}\label{flattenRsimpalts}
  An application of $\rsimp_{ALTS}$ can be ``absorbed''
  if its output is prepended to a list that is then passed to $\rflts$.
  \begin{center}
    $\rflts \; ( (\rsimp_{ALTS} \;
    (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) ::
    \map \; \rsimp{} \; rs' ) =
    \rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ (
    \map \; \rsimp{} \; rs'))$
  \end{center}

\end{lemma}
\begin{proof}
  By Lemma \ref{good1}.
\end{proof}

We are also ready to prove that $\textit{rsimp}$ is idempotent.
\subsubsection{$\rsimp$ is Idempotent}
The idempotency of $\rsimp$ is very useful for
manipulating regular expression terms into the desired
forms, which makes possible the key rewriting steps
towards closed forms.
\begin{lemma}\label{rsimpIdem}
  $\rsimp{r} = \rsimp{(\rsimp{r})}$
\end{lemma}

\begin{proof}
  By Lemmas \ref{test} and \ref{good1}:
  the result $\rsimp{r}$ is either good, in which case Lemma \ref{test} applies to it,
  or it is $\RZERO$, which $\textit{rsimp}$ leaves unchanged.
\end{proof}
\noindent
This property means we do not have to repeatedly
apply simplification in each step, which justifies
our definition of $\blexersimp$.
This is in contrast to the work of Sulzmann and Lu where
the simplification is applied in a fixpoint manner.


On the other hand, as long as at least one simplification is applied to a regular expression,
we can apply $\rsimp{}$ to it any additional number of times,
wherever we need to, without changing the result:
\begin{corollary}\label{headOneMoreSimp}
  The following properties hold, directly from Lemma \ref{rsimpIdem}:

  \begin{itemize}
    \item
      $\map \; \rsimp{} \; (r :: rs) = \map \; \rsimp{} \; (\rsimp{r} :: rs)$
    \item
      $\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$
  \end{itemize}
\end{corollary}
\noindent
This will be useful in the rewriting steps of the later closed-form proofs.
Similarly, we state the following useful facts:
\begin{lemma}
  The following equalities hold if $r = \rsimp{r'}$ for some $r'$:
  \begin{itemize}
    \item
      If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$.
    \item
      If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$.
    \item
      $\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$.
  \end{itemize}
\end{lemma}
\begin{proof}
  By application of Lemmas \ref{rsimpIdem} and \ref{good1}.
\end{proof}

\noindent
With the idempotency of $\textit{rsimp}$ and its corollaries,
we can start proving some key equalities leading to the
closed forms.
Next we present a few equivalent terms under $\textit{rsimp}$.
To make the notation more concise,
we use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$.
%\begin{center}
%\begin{tabular}{lcl}
%	$a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$
%\end{tabular}
%\end{center}
%\noindent
%\vspace{0em}
\begin{lemma}
  The following equivalences hold:
  \begin{itemize}
    \item
      $\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$
    \item
      $\rsimpalts \; rs \sequal \rsimpalts \; (\map \; \rsimp{} \; rs)$
    \item
      $\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$
    \item
      $\sum ((\sum rs_a) :: rs_b) \sequal \sum (rs_a @ rs_b)$
    \item
      $\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$
  \end{itemize}
\end{lemma}
\begin{proof}
  By induction on the lists involved.
\end{proof}
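
\noindent
For a concrete instance of the fourth equivalence, take $rs_a = [a, b]$ and $rs_b = [a]$ for distinct characters $a$ and $b$.
Assuming that $\textit{rsimp}$ on an alternative flattens with $\rflts$, de-duplicates with $\textit{rdistinct}$
and finally applies $\rsimpalts$, both sides simplify to the same term:
\begin{center}
  \begin{tabular}{lcl}
    $\rsimp{(\sum [\sum [a, b], a])}$ & $=$ & $\rsimpalts \; (\rdistinct{[a, b, a]}{\varnothing}) \;=\; \sum [a, b]$\\
    $\rsimp{(\sum [a, b, a])}$ & $=$ & $\rsimpalts \; (\rdistinct{[a, b, a]}{\varnothing}) \;=\; \sum [a, b]$
  \end{tabular}
\end{center}
\noindent
Hence $\sum [\sum [a, b], a] \sequal \sum [a, b, a]$, although the two terms are syntactically different.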

\noindent
The above allows us to prove
two similar equalities (which are a bit more involved).
They say that we can flatten the elements
before simplification and still get the same result.
\begin{lemma}\label{simpFlatten3}
  One can flatten the inside $\sum$ of a $\sum$ if it is being
  simplified. Concretely,
  \begin{itemize}
    \item
      If for all $r \in rs, rs', rs''$, we have $\good \; r $
      or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal
      \sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary,
    \item
      $\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$
  \end{itemize}
\end{lemma}
\begin{proof}
  By rewriting steps involving the use of Lemmas \ref{test} and \ref{rdistinctConcatGeneral}.
  The second sub-lemma is a corollary of the first.
\end{proof}
%Rewriting steps not put in--too long and complicated-------------------------------
\begin{comment}
	\begin{center}
		$\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\
		$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\
		$\stackrel{by \ref{test}}{=}
		\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{
		\varnothing})$\\
		$\stackrel{by \ref{rdistinctConcatGeneral}}{=}
		\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{(
		\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\

	\end{center}
\end{comment}
%Rewriting steps not put in--too long and complicated-------------------------------
\noindent


We need more equalities like the above to establish the closed forms,
and to obtain them we introduce a few rewrite relations.

\subsection{The Rewrite Relations $\hrewrite$, $\scfrewrites$, $\frewrite$ and $\grewrite$}
Inspired by the success we had in the correctness proof
in Chapter \ref{Bitcoded2},
we follow suit here, defining atomic simplification
steps as ``small-step'' rewriting steps. This allows capturing
similarities between terms that would be otherwise
hard to express.

We use $\hrewrite$ for one-step atomic rewrites of
regular expression simplification,
$\frewrite$ for rewrites of lists of regular expressions that
include all operations carried out in $\rflts$, and $\grewrite$ for
rewrites of lists of regular expressions that are possible in both $\rflts$ and $\textit{rdistinct}$.
Their reflexive transitive closures are used to denote zero or many steps,
as was the case in the previous chapter.
As we have already
done something similar, the presentation of
these rewriting rules will be more concise than that in Chapter \ref{Bitcoded2}.
To differentiate between the rewriting steps for annotated regular expressions
and $\rrexp$s, we add the characters $h$ and $g$ below the squiggly arrow symbol
to mean atomic simplification transitions
of $\rrexp$s and $\rrexp$ lists, respectively.



\begin{figure}[H] |
554 | 1367 |
\begin{center} |
593 | 1368 |
\begin{mathpar} |
1369 |
\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
|
555 | 1370 |
|
593 | 1371 |
\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
555 | 1372 |
|
593 | 1373 |
\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ |
555 | 1374 |
|
593 | 1375 |
\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} |
1376 |
||
1377 |
\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ |
|
555 | 1378 |
|
593 | 1379 |
\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} |
555 | 1380 |
|
593 | 1381 |
\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} |
555 | 1382 |
|
593 | 1383 |
\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} |
555 | 1384 |
|
593 | 1385 |
\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} |
555 | 1386 |
|
593 | 1387 |
\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} |
555 | 1388 |
|
593 | 1389 |
\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum (rs_a @ [r_1] @ rs_b @ [r_2] @ rs_c) \hrewrite \sum (rs_a @ [r_1] @ rs_b @ rs_c)}
555 | 1390 |
|
593 | 1391 |
\end{mathpar} |
555 | 1392 |
\end{center} |
613 | 1393 |
\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite} |
1394 |
\end{figure} |
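
\noindent
As a small illustration of how the rules for $\hrewrite$ compose (a sketch only:
$r_a$ and $r_b$ are hypothetical placeholders for two distinct r-regular expressions
that are neither $\RZERO$ nor alternatives), the rules RALTS0, RALTSNested and
RALTSDelete can be chained as follows:
\begin{center}
\begin{tabular}{lcl}
$\sum [\RZERO,\; \sum [r_a, r_b],\; r_a]$ & $\hrewrite$ & $\sum [\sum [r_a, r_b],\; r_a]$ \\
 & $\hrewrite$ & $\sum [r_a,\; r_b,\; r_a]$ \\
 & $\hrewrite$ & $\sum [r_a,\; r_b]$
\end{tabular}
\end{center}
\noindent
Taken together, the three steps give
$\sum [\RZERO, \sum [r_a, r_b], r_a] \hrewrites \sum [r_a, r_b]$.
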
|
554 | 1395 |
|
557 | 1396 |
|
613 | 1397 |
As with $\rightsquigarrow_s$, it is
convenient to define a rewrite relation $\scfrewrites$ on lists of regular expressions,
where each element of the first list rewrites in zero or more steps to the
corresponding element of the second (scf stands for
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions.
|
557 | 1402 |
|
613 | 1403 |
\begin{figure}[H] |
557 | 1404 |
\begin{center} |
593 | 1405 |
\begin{mathpar} |
1406 |
\inferrule{}{[] \scfrewrites [] } |
|
613 | 1407 |
|
593 | 1408 |
\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} |
1409 |
\end{mathpar} |
|
557 | 1410 |
\end{center} |
613 | 1411 |
\caption{Rewrite rules for a list of r-regular expressions, applied elementwise ($\scfrewrites$)}\label{scfRewrite}
1412 |
\end{figure} |
|
555 | 1413 |
%frewrite |
593 | 1414 |
The one-step rewrite rules for flattening
a list of regular expressions ($\frewrite$) are:
|
613 | 1416 |
\begin{figure}[H] |
555 | 1417 |
\begin{center} |
593 | 1418 |
\begin{mathpar} |
1419 |
\inferrule{}{\RZERO :: rs \frewrite rs \\} |
|
555 | 1420 |
|
593 | 1421 |
\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} |
555 | 1422 |
|
593 | 1423 |
\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} |
1424 |
\end{mathpar} |
|
555 | 1425 |
\end{center} |
613 | 1426 |
\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites} |
1427 |
\end{figure} |
|
555 | 1428 |
|
593 | 1429 |
The one-step rewrite rules for flattening and de-duplicating
a list of regular expressions ($\grewrite$) are:
|
613 | 1431 |
\begin{figure}[H] |
555 | 1432 |
\begin{center} |
593 | 1433 |
\begin{mathpar} |
1434 |
\inferrule{}{\RZERO :: rs \grewrite rs \\} |
|
532 | 1435 |
|
593 | 1436 |
\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} |
555 | 1437 |
|
593 | 1438 |
\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} |
555 | 1439 |
|
593 | 1440 |
\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rs_b @ rs_c}
1441 |
\end{mathpar} |
|
555 | 1442 |
\end{center} |
613 | 1443 |
\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$ |
1444 |
operations}\label{gRewrite} |
|
1445 |
\end{figure} |
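
\noindent
The difference between $\frewrite$ and $\grewrite$ can be seen on a small list
(a sketch only: $r_a$ and $r_b$ are hypothetical placeholders for two distinct
r-regular expressions that are neither $\RZERO$ nor alternatives).
Both relations allow the two flattening steps
\begin{center}
\begin{tabular}{lcl}
$[\RZERO,\; \sum [r_a, r_b],\; r_a]$ & $\frewrite$ & $[\sum [r_a, r_b],\; r_a]$ \\
 & $\frewrite$ & $[r_a,\; r_b,\; r_a]$
\end{tabular}
\end{center}
\noindent
but only $\grewrite$ can continue with the de-duplication step
\begin{center}
$[r_a,\; r_b,\; r_a] \grewrite [r_a,\; r_b]$
\end{center}
\noindent
by the rule dB. No $\frewrite$ rule applies to $[r_a, r_b, r_a]$, since none of its
elements is $\RZERO$ or an alternative.
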
|
555 | 1446 |
\noindent |
618 | 1447 |
We define |
613 | 1448 |
two separate list rewriting relations $\frewrite$ and $\grewrite$. |
611 | 1449 |
The rewriting steps that take place during |
639 | 1450 |
flattening are characterised by $\frewrite$. |
618 | 1451 |
The rewrite relation $\grewrite$ characterises both flattening and de-duplicating. |
557 | 1452 |
Sometimes $\grewrites$ is slightly too powerful |
613 | 1453 |
so we would rather use $\frewrites$ to prove |
1454 |
%because we only |
|
1455 |
equalities related to $\rflts$. |
|
1456 |
%certain equivalence under the rewriting steps of $\frewrites$. |
|
556 | 1457 |
For example, when proving the closed-form for the alternative regular expression, |
613 | 1458 |
one of the equalities needed is: |
1459 |
\begin{center} |
|
557 | 1460 |
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
593 | 1461 |
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1462 |
$ |
|
613 | 1463 |
\end{center} |
556 | 1464 |
\noindent |
1465 |
This is proved by first showing
|
557 | 1466 |
\begin{lemma}\label{earlyLaterDerFrewrites} |
556 | 1467 |
$\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites |
557 | 1468 |
\rflts \; (\map \; (\_ \backslash x) \; rs)$ |
556 | 1469 |
\end{lemma} |
1470 |
\noindent |
|
613 | 1471 |
and then showing that two lists related by $\frewrites$
give rise to equivalent alternatives:
556 | 1473 |
\begin{lemma}\label{frewritesSimpeq} |
1474 |
If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal |
|
557 | 1475 |
\sum (\rDistinct \; rs_2 \; \varnothing)$. |
556 | 1476 |
\end{lemma} |
557 | 1477 |
\noindent |
618 | 1478 |
These two lemmas can both be proven using a straightforward induction (and |
1479 |
the proofs for them are therefore omitted). |
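
\noindent
To see why lemma \ref{earlyLaterDerFrewrites} is stated with $\frewrites$ rather than
with equality, consider the following instance (a sketch, assuming that $\rflts$ removes
$\RZERO$ elements and splices in the children of alternatives, and that $r_a$ is a
hypothetical r-regular expression which is neither $\RZERO$ nor an alternative but
satisfies $r_a \backslash x = \RZERO$, for example a character different from $x$).
For $rs = [r_a]$ we have
\begin{center}
\begin{tabular}{lcl}
$\map \; (\_ \backslash x) \; (\rflts \; [r_a])$ & $=$ & $[\RZERO]$ \\
$\rflts \; (\map \; (\_ \backslash x) \; [r_a])$ & $=$ & $[]$
\end{tabular}
\end{center}
\noindent
The two lists are different, but $[\RZERO] \frewrite []$ holds by the first
rule in Figure \ref{fRewrites}.
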
|
613 | 1480 |
|
639 | 1481 |
Now the above equality can be derived with ease:
613 | 1482 |
\begin{corollary} |
1483 |
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
|
1484 |
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
|
1485 |
$ |
|
1486 |
\end{corollary} |
|
618 | 1487 |
\begin{proof} |
1488 |
By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}. |
|
1489 |
\end{proof} |
|
557 | 1490 |
But this trick will not work for $\grewrites$. |
1491 |
For example, a rewriting step in proving |
|
1492 |
closed forms is: |
|
1493 |
\begin{center} |
|
593 | 1494 |
$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ |
1495 |
$=$ \\ |
|
1496 |
$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ |
|
1497 |
\noindent |
|
557 | 1498 |
\end{center} |
618 | 1499 |
For this, one would hope to have a rewriting relation between the two lists involved,
similar to lemma \ref{earlyLaterDerFrewrites}. However, it turns out that
556 | 1501 |
\begin{center} |
593 | 1502 |
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; |
1503 |
(\_ \backslash x) \; rs) \; ( rset \backslash x)$ |
|
556 | 1504 |
\end{center} |
1505 |
\noindent |
|
557 | 1506 |
does $\mathbf{not}$ hold in general.
For this rewriting step we will introduce a slightly more cumbersome
proof technique later.
The point is that $\frewrite$
allows us to prove equivalence in a straightforward way that is
not possible for $\grewrite$.
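
\noindent
The failure of the $\grewrites$ statement can be seen on a small instance
(a sketch, assuming the usual derivative clauses for characters and alternatives,
and reading $rset \backslash x$ as the set of derivatives of the elements of $rset$;
the names $r_x$, $r_d$ and $r_e$ are hypothetical and stand for the character regular
expressions of three pairwise distinct characters $x$, $d$ and $e$).
Take $rs = [\sum [r_x, r_d]]$ and $rset = \{\sum [r_x, r_e]\}$.
The two alternatives are different, but both have the derivative
$\sum [\RONE, \RZERO]$ w.r.t. $x$, so
\begin{center}
\begin{tabular}{lcl}
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset)$ & $=$ & $[\sum [\RONE, \RZERO]]$ \\
$\rDistinct \; (\map \; (\_ \backslash x) \; rs) \; (rset \backslash x)$ & $=$ & $[]$
\end{tabular}
\end{center}
\noindent
The only lists reachable from $[\sum [\RONE, \RZERO]]$ via $\grewrites$ are the list itself,
$[\RONE, \RZERO]$ and $[\RONE]$, none of which is the empty list.
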
|
555 | 1512 |
|
556 | 1513 |
|
557 | 1514 |
\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
613 | 1515 |
In this part, we present lemmas about
pairs of r-regular expressions and pairs of r-regular expression lists
where the first can be rewritten in many steps to the second.
Most of the proofs of these lemmas are straightforward inductions
on the corresponding rewriting relations,
and are omitted when this is the case.
The following lemma gives a few pairs of terms that are rewritable via
$\grewrites$:
1523 |
\begin{lemma}\label{gstarRdistinctGeneral} |
|
613 | 1524 |
\mbox{} |
557 | 1525 |
\begin{itemize} |
1526 |
\item |
|
1527 |
$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$ |
|
1528 |
\item |
|
1529 |
$rs \grewrites \rDistinct \; rs \; \varnothing$ |
|
1530 |
\item |
|
1531 |
$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; |
|
1532 |
rs \; (\{\RZERO\} \cup rs_a))$ |
|
1533 |
\item |
|
1534 |
$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \;
(rset \cup rs)$
|
1536 |
||
1537 |
\end{itemize} |
|
1538 |
\end{lemma} |
|
1539 |
\noindent |
|
1540 |
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other, |
|
1541 |
then they are equivalent under $\rsimp{}$: |
|
1542 |
\begin{lemma}\label{grewritesSimpalts} |
|
618 | 1543 |
\mbox{} |
557 | 1544 |
If $rs_1 \grewrites rs_2$, then |
613 | 1545 |
we have the following equivalence: |
557 | 1546 |
\begin{itemize} |
1547 |
\item |
|
1548 |
$\sum rs_1 \sequal \sum rs_2$ |
|
1549 |
\item |
|
1550 |
$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$ |
|
1551 |
\end{itemize} |
|
1552 |
\end{lemma} |
|
1553 |
\noindent |
|
1554 |
Here are a few connecting lemmas showing that |
|
1555 |
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or |
|
1556 |
$\scfrewrites$, |
|
1557 |
then an alternative constructor taking the list can also be rewritten using $\hrewrites$: |
|
1558 |
\begin{lemma} |
|
1559 |
\begin{itemize} |
|
1560 |
\item |
|
1561 |
If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$. |
|
1562 |
\item |
|
1563 |
If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$ |
|
1564 |
\item |
|
1565 |
If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$ |
|
1566 |
\item |
|
1567 |
If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$ |
|
1568 |
||
1569 |
\end{itemize} |
|
1570 |
\end{lemma} |
|
1571 |
\noindent |
|
618 | 1572 |
Now comes the core of the proof,
which says that once one regular expression can be rewritten to another,
then the two are equivalent under $\textit{rsimp}$:
557 | 1575 |
\begin{lemma} |
1576 |
If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$. |
|
1577 |
\end{lemma} |
|
1578 |
||
1579 |
\noindent |
|
639 | 1580 |
Similar to what we did in chapter \ref{Bitcoded2}, |
618 | 1581 |
we prove that if one can rewrite from one r-regular expression ($r$) |
639 | 1582 |
to the other ($r'$), after taking derivatives one can still rewrite |
618 | 1583 |
the first ($r\backslash c$) to the other ($r'\backslash c$). |
557 | 1584 |
\begin{lemma}\label{interleave} |
1585 |
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ |
|
1586 |
\end{lemma} |
|
1587 |
\noindent |
|
This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$. |
557 | 1589 |
\begin{lemma}\label{insideSimpRemoval} |
618 | 1590 |
$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $ |
557 | 1591 |
\end{lemma} |
1592 |
\noindent |
|
1593 |
\begin{proof} |
|
1594 |
By lemmas \ref{interleave} and \ref{rsimpIdem}.
|
1595 |
\end{proof} |
|
1596 |
\noindent |
|
1597 |
And this unlocks more equivalent terms: |
|
1598 |
\begin{lemma}\label{Simpders} |
|
1599 |
As corollaries of \ref{insideSimpRemoval}, we have |
|
1600 |
\begin{itemize} |
|
1601 |
\item |
|
620 | 1602 |
If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{( r \backslash_{rs} s)}$. |
557 | 1603 |
\item |
1604 |
$\rsimpalts \; (\map \; (\_ \backslash_r x) \; |
|
593 | 1605 |
(\rdistinct{rs}{\varnothing})) \sequal |
1606 |
\rsimpalts \; (\rDistinct \; |
|
1607 |
(\map \; (\_ \backslash_r x) rs) \;\varnothing )$ |
|
1608 |
\end{itemize} |
|
1609 |
\end{lemma} |
|
611 | 1610 |
\begin{proof} |
1611 |
Both parts follow from lemma \ref{insideSimpRemoval}.%and \ref{distinctDer}.
611 | 1613 |
\end{proof} |
557 | 1614 |
\noindent |
613 | 1615 |
|
1616 |
\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$} |
|
618 | 1617 |
Lemma \ref{Simpders} leads to our first closed form, |
1618 |
which is for the alternative regular expression: |
|
1619 |
\begin{theorem}\label{altsClosedForm} |
|
1620 |
\mbox{} |
|
593 | 1621 |
\begin{center} |
1622 |
$\rderssimp{(\sum rs)}{s} \sequal |
|
1623 |
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ |
|
1624 |
\end{center} |
|
618 | 1625 |
\end{theorem} |
556 | 1626 |
\noindent |
557 | 1627 |
\begin{proof} |
1628 |
By a reverse induction on the string $s$. |
|
1629 |
One rewriting step, as we mentioned earlier, |
|
1630 |
involves |
|
1631 |
\begin{center} |
|
1632 |
$\rsimpalts \; (\map \; (\_ \backslash x) \; |
|
1633 |
(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; |
|
1634 |
(\lambda r. \rderssimp{r}{xs}))))}{\varnothing})) |
|
1635 |
\sequal |
|
1636 |
\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; |
|
593 | 1637 |
(\rflts \; (\map \; (\rsimp{} \; \circ \; |
557 | 1638 |
(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $. |
1639 |
\end{center} |
|
1640 |
This can be proven by a combination of |
|
1641 |
\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and |
|
1642 |
\ref{insideSimpRemoval}. |
|
1643 |
\end{proof} |
|
1644 |
\noindent |
|
1645 |
This closed form has a variant which can be more convenient in later proofs: |
|
618 | 1646 |
\begin{corollary}\label{altsClosedForm1} |
557 | 1647 |
If $s \neq []$ then |
1648 |
$\rderssimp \; (\sum \; rs) \; s = |
|
1649 |
\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$. |
|
1650 |
\end{corollary} |
|
1651 |
\noindent |
|
1652 |
The harder closed forms are the sequence and star ones. |
|
613 | 1653 |
Before we obtain them, some preliminary definitions |
557 | 1654 |
are needed to make proof statements concise. |
556 | 1655 |
|
609 | 1656 |
|
1657 |
\subsubsection{Closed Form for Sequence Regular Expressions} |
|
618 | 1658 |
For the sequence regular expression, |
1659 |
let us first look at a series of derivative steps on it
|
1660 |
(assuming that each time when a derivative is taken, |
|
1661 |
the head of the sequence is always nullable): |
|
557 | 1662 |
\begin{center} |
618 | 1663 |
\begin{tabular}{llll} |
1664 |
$r_1 \cdot r_2$ & |
|
1665 |
$\longrightarrow_{\backslash c}$ & |
|
1666 |
$r_1\backslash c \cdot r_2 + r_2 \backslash c$ & |
|
1667 |
$ \longrightarrow_{\backslash c'} $ \\ |
|
1668 |
\\ |
|
1669 |
$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & |
|
1670 |
$\longrightarrow_{\backslash c''} $ & |
|
1671 |
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') |
|
1672 |
+ r_2 \backslash cc'c''$ & |
|
1673 |
$ \longrightarrow_{\backslash c'''} \quad \ldots$\\
|
1674 |
\end{tabular} |
|
557 | 1675 |
\end{center} |
Roughly speaking, $(r_1 \cdot r_2) \backslash s$ can be expressed as
a giant alternative taking a list of terms
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$,
where the head of the list is always the term
representing a match involving only $r_1$, and the tail of the list consists of
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$.
|
618 | 1682 |
This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, |
1683 |
where they gave |
|
557 | 1684 |
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
532 | 1685 |
\begin{center} |
618 | 1686 |
\begin{tabular}{lc} |
1687 |
$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\ |
|
1688 |
\\ |
|
1689 |
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + |
|
1690 |
(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r |
|
1691 |
(c_2 :: \ldots c_n) ]$ & |
|
1692 |
$=$\\ |
|
1693 |
\\ |
|
1694 |
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + |
|
1695 |
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
|
1696 |
$ & \\ |
|
1697 |
\\ |
|
1698 |
$\quad + (\delta \; (\nullable \; (r_1 \backslash_r c_1))\; (r_2 \backslash_r c_2) ))
|
1699 |
\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\ |
|
558 | 1700 |
\end{tabular} |
557 | 1701 |
\end{center} |
1702 |
\noindent |
|
618 | 1703 |
The $\delta$ function |
1704 |
returns $r$ when the boolean condition |
|
1705 |
$b$ evaluates to true and |
|
639 | 1706 |
$\ZERO_r$ otherwise: |
618 | 1707 |
\begin{center} |
1708 |
\begin{tabular}{lcl} |
|
1709 |
$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; \textit{is true}$\\
 & & $\ZERO_r \quad \textit{otherwise}$
618 | 1711 |
\end{tabular} |
1712 |
\end{center} |
|
1713 |
\noindent |
|
1714 |
Note that the term |
|
1715 |
\begin{center} |
|
1716 |
\begin{tabular}{lc} |
|
1717 |
\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + |
|
1718 |
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
|
1719 |
$ & \\ |
|
1720 |
\\ |
|
1721 |
$\quad + (\delta \; (\nullable \; (r_1 \backslash_r c_1))\; (r_2 \backslash_r c_2) ))
|
1722 |
\backslash_r (c_3 \ldots c_n)$ &\\ |
|
1723 |
\end{tabular} |
|
1724 |
\end{center} |
|
1725 |
\noindent |
|
558 | 1726 |
does not faithfully |
1727 |
represent what the intermediate derivatives would actually look like |
|
1728 |
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not |
|
1729 |
nullable in the head of the sequence. |
|
1730 |
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable,
the regular expression would look like
\[
(r_1 \backslash_r c_1c_2) \cdot r_2
\]
instead of
\[
((r_1 \backslash_r c_1c_2) \cdot r_2 + \ZERO_r ) + \ZERO_r.
\]
The redundant $\ZERO_r$s are not created in the
first place.
618 | 1741 |
A closed form needs to take this into account (because
closed forms require exact equality rather than language equivalence)
and only generate the
$r_2 \backslash_r s''$ terms satisfying the property
\begin{center}
$\exists s'. \; s'@s'' = s \;\; \land \;\;
(r_1 \backslash s') \; \textit{is nullable}$.
\end{center}
|
1749 |
Given the arguments $s$ and $r_1$, we denote the list of strings |
|
1750 |
$s''$ satisfying the above property as $\vsuf{s}{r_1}$. |
|
1751 |
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ |
|
Perhaps a better name for it would be ``NullablePrefixSuffix'',
to differentiate it from the list of \emph{all} prefixes of $s$, but
that is a bit too long for a function name, and we have yet to find
a more concise and easy-to-understand name.}
|
558 | 1756 |
\begin{center} |
593 | 1757 |
\begin{tabular}{lcl} |
1758 |
$\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
618 | 1759 |
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
593 | 1760 |
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
1761 |
\end{tabular} |
|
558 | 1762 |
\end{center} |
1763 |
\noindent |
|
618 | 1764 |
The list starts with shorter suffixes |
1765 |
and ends with longer ones (in other words, the string elements $s''$ |
|
1766 |
in the list $\vsuf{s}{r_1}$ are sorted |
|
1767 |
in the same order as that of the terms $r_2\backslash s''$ |
|
1768 |
appearing in $(r_1\cdot r_2)\backslash s$). |
|
558 | 1769 |
In essence, $\vsuf{\_}{\_}$ performs a
``virtual derivative'' of $r_1 \cdot r_2$: instead of producing
the entire result $(r_1 \cdot r_2) \backslash s$,
it only stores strings,
with each string $s''$ standing for a term $r_2 \backslash s''$
that occurs in $(r_1\cdot r_2)\backslash s$.
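
\noindent
To make the recursion concrete, here is a small unfolding of the definition
(a sketch under the assumption that both $r_1$ and $\rder{c}{r_1}$ are nullable,
for the two-character string $[c, c']$):
\begin{center}
\begin{tabular}{lcl}
$\vsuf{[c, c']}{r_1}$ & $=$ & $(\vsuf{[c']}{(\rder{c}{r_1})}) \; @ \; [[c, c']]$\\
 & $=$ & $((\vsuf{[]}{(\rder{c'}{(\rder{c}{r_1})})}) \; @ \; [[c']]) \; @ \; [[c, c']]$\\
 & $=$ & $[[c'],\; [c, c']]$
\end{tabular}
\end{center}
\noindent
The two suffixes correspond, in this order, to the terms $r_2 \backslash c'$ and
$r_2 \backslash cc'$ of the derivative
$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$
shown at the beginning of this subsection.
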
|
558 | 1775 |
|
618 | 1776 |
With $\textit{Suffix}$ in place, we are almost ready to express the
sequence regular expression's closed form,
but a few more definitions are needed first.
The first is the flattening function $\sflat{\_}$,
which takes an alternative regular expression and produces a flattened version
of it.
It is needed to convert
a left-associative nest of alternatives into
a flattened list:
|
558 | 1786 |
\[ |
618 | 1787 |
\sum(\ldots ((r_1 + r_2) + r_3) + \ldots) |
1788 |
\stackrel{\sflat{\_}}{\rightarrow} |
|
1789 |
\sum[r_1, r_2, r_3, \ldots] |
|
558 | 1790 |
\] |
1791 |
\noindent |
|
618 | 1792 |
The definitions of $\sflat{\_}$ and helper functions |
1793 |
$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below. |
|
593 | 1794 |
\begin{center} |
618 | 1795 |
\begin{tabular}{lcl} |
1796 |
$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\ |
|
1797 |
$\sflataux{\sum []}$ & $ \dn $ & $ []$\\ |
|
1798 |
$\sflataux r$ & $\dn$ & $ [r]$ |
|
593 | 1799 |
\end{tabular} |
532 | 1800 |
\end{center} |
1801 |
||
593 | 1802 |
\begin{center} |
618 | 1803 |
\begin{tabular}{lcl} |
1804 |
$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\ |
|
1805 |
$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\ |
|
1806 |
$\sflat r$ & $\dn$ & $ r$ |
|
1807 |
\end{tabular} |
|
1808 |
\end{center} |
|
1809 |
||
1810 |
\begin{center} |
|
1811 |
\begin{tabular}{lcl} |
|
1812 |
$\sflataux{[]}'$ & $ \dn $ & $ []$\\ |
|
1813 |
$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ |
|
639 | 1814 |
$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$ |
593 | 1815 |
\end{tabular} |
557 | 1816 |
\end{center} |
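
\noindent
As a small worked instance of these definitions (a sketch only: $r_a$, $r_b$ and $r_c$
are hypothetical placeholders for r-regular expressions, with $r_a$ not an alternative),
a left-nested alternative is flattened as follows:
\begin{center}
\begin{tabular}{lcl}
$\sflataux{\sum [\sum [r_a, r_b],\; r_c]}$ & $=$ & $\sflataux{\sum [r_a, r_b]} \; @ \; [r_c]$ \\
 & $=$ & $(\sflataux{r_a} \; @ \; [r_b]) \; @ \; [r_c]$ \\
 & $=$ & $[r_a,\; r_b,\; r_c]$
\end{tabular}
\end{center}
\noindent
Accordingly, $\sflat{(\sum [\sum [r_a, r_b],\; r_c])} = \sum [r_a,\; r_b,\; r_c]$.
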
558 | 1817 |
\noindent |
576 | 1818 |
$\sflataux{\_}$ breaks up nested alternative regular expressions
of the left-associated shape $(\ldots((r_1 + r_2) + r_3) + \ldots )$
into a ``balanced'' list $[r_1,\, r_2 ,\, r_3, \ldots]$.
It will return the singleton list $[r]$ otherwise.
1822 |
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
1823 |
the output type a regular expression, not a list. |
|
558 | 1824 |
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the |
1825 |
first element of the list. |
|
618 | 1826 |
$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs |
1827 |
a list of regular expressions. |
|
1828 |
The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have |
|
1829 |
$\textit{createdBySequence}$ defined: |
|
1830 |
\begin{center} |
|
1831 |
\begin{mathpar} |
|
1832 |
\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)} |
|
558 | 1833 |
|
618 | 1834 |
\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \; |
1835 |
(r_1 + r_2)} |
|
1836 |
\end{mathpar} |
|
1837 |
\end{center} |
|
1838 |
\noindent |
|
1839 |
The predicate $\textit{createdBySequence}$ is used to describe the shape of |
|
1840 |
the derivative regular expressions $(r_1\cdot r_2) \backslash s$: |
|
1841 |
\begin{lemma}\label{recursivelyDerseq} |
|
1842 |
It is always the case that |
|
1843 |
\begin{center} |
|
1844 |
$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $ |
|
1845 |
\end{center} |
|
1846 |
holds. |
|
1847 |
\end{lemma} |
|
1848 |
\begin{proof} |
|
1849 |
By a reverse induction on the string $s$, where the inductive cases are $[]$ |
|
1850 |
and $xs @ [x]$. |
|
1851 |
\end{proof} |
|
1852 |
\noindent |
|
639 | 1853 |
If we have a regular expression $r$ whose shape |
618 | 1854 |
fits into those described by $\textit{createdBySequence}$, |
639 | 1855 |
then we can convert between |
618 | 1856 |
$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with |
1857 |
$\sflataux{\_}'$: |
|
1858 |
\begin{lemma}\label{sfauIdemDer} |
|
1859 |
If $\textit{createdBySequence} \; r$, then |
|
1860 |
\begin{center} |
|
1861 |
$\sflataux{ r \backslash_r c} = |
|
1862 |
\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$ |
|
1863 |
\end{center} |
|
1864 |
holds. |
|
1865 |
\end{lemma} |
|
1866 |
\begin{proof} |
|
1867 |
By a simple induction on the inductive cases of $\textit{createdBySequence}$.
|
1869 |
\end{proof} |
|
1870 |
||
1871 |
Now we are ready to express
the shape of $(r_1 \cdot r_2) \backslash s$:
|
558 | 1873 |
\begin{lemma}\label{seqSfau0} |
618 | 1874 |
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
639 | 1875 |
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
558 | 1876 |
\end{lemma} |
1877 |
\begin{proof} |
|
618 | 1878 |
By a reverse induction on the string $s$, where the inductive cases |
1879 |
are $[]$ and $xs @ [x]$. |
|
1880 |
For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) |
|
1881 |
\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq}, |
|
1882 |
which can be used to prove |
|
558 | 1883 |
\[
\begin{array}{l}
\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\;
\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash_r \_) \; (\vsuf{xs}{r_1}))\\
\quad = \; \map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1})
\end{array}
\]
618 | 1891 |
using lemma \ref{sfauIdemDer}. |
1892 |
This equality enables the inductive case to go through. |
|
558 | 1893 |
\end{proof} |
1894 |
\noindent |
|
618 | 1895 |
This lemma says that $(r_1\cdot r_2)\backslash s$ |
1896 |
can be flattened into a list whose head and tail meet the description |
|
1897 |
we gave earlier. |
|
1898 |
%Note that this lemma does $\mathbf{not}$ depend on any |
|
1899 |
%specific definitions we used, |
|
1900 |
%allowing people investigating derivatives to get an alternative |
|
1901 |
%view of what $r_1 \cdot r_2$ is. |
|
532 | 1902 |
|
618 | 1903 |
We now use $\textit{createdBySequence}$ and |
1904 |
$\sflataux{\_}$ to describe an intuition |
|
639 | 1905 |
behind the sequence closed form. |
618 | 1906 |
If two regular expressions only differ in the way their |
1907 |
alternatives are nested, then we should be able to get the same result |
|
1908 |
once we apply simplification to both of them: |
|
1909 |
\begin{lemma}\label{sflatRsimpeq} |
|
1910 |
If $r$ is created from a sequence through
a series of derivatives
(i.e. if $\textit{createdBySequence} \; r$ holds),
and $\sflataux{r} = rs$,
then
|
1916 |
\begin{center} |
|
1917 |
$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$ |
|
1918 |
\end{center} |
|
1919 |
holds. |
|
1920 |
\end{lemma} |
|
1921 |
\begin{proof} |
|
1922 |
By an induction on the inductive cases of $\textit{createdBySequence}$. |
|
1923 |
\end{proof} |
|
1924 |
||
1925 |
Now we are ready for the closed form |
|
1926 |
for the sequence regular expressions (without the inner applications |
|
1927 |
of simplifications): |
|
1928 |
\begin{lemma}\label{seqClosedFormGeneral} |
|
558 | 1929 |
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } |
1930 |
=\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: |
|
593 | 1931 |
\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
558 | 1932 |
\end{lemma} |
1933 |
\begin{proof} |
|
618 | 1934 |
We know that |
1935 |
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
|
639 | 1936 |
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
618 | 1937 |
holds |
1938 |
by lemma \ref{seqSfau0}. |
|
1939 |
This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. |
|
558 | 1940 |
\end{proof} |
618 | 1941 |
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}),
it is possible to convert the above lemma to obtain the
proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$,
that is, for derivatives interleaved with simplification:
|
1945 |
\begin{theorem}\label{seqClosedForm} |
|
1946 |
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2
:: (\map \; (r_2 \backslash \_) \; (\vsuf{s}{r_1}))))}$
|
1948 |
\end{theorem} |
|
1949 |
\begin{proof} |
|
1950 |
By a case analysis of the string $s$. |
|
1951 |
When $s$ is an empty list, the rewrite is straightforward. |
|
1952 |
When $s$ is a non-empty list, the |
|
1953 |
lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply, |
|
1954 |
making the proof go through. |
|
1955 |
\end{proof} |
|
609 | 1956 |
\subsubsection{Closed Forms for Star Regular Expressions} |
618 | 1957 |
The closed form for the star regular expression involves tricks similar
to those used for the sequence regular expression.
|
1959 |
The $\textit{Suffix}$ function is now replaced by something |
|
1960 |
slightly more complex, because the growth pattern of star |
|
1961 |
regular expressions' derivatives is a bit different: |
|
564 | 1962 |
\begin{center} |
618 | 1963 |
\begin{tabular}{lclc} |
1964 |
$r^* $ & $\longrightarrow_{\backslash c}$ & |
|
1965 |
$(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\ |
|
1966 |
\\ |
|
1967 |
$r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ & |
|
1968 |
$\longrightarrow_{\backslash c''}$ & |
|
1969 |
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) +
|
1970 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & |
|
1971 |
$\longrightarrow_{\backslash c'''}$ \\ |
|
1972 |
\\ |
|
1973 |
$\ldots$\\ |
|
1974 |
\end{tabular} |
|
564 | 1975 |
\end{center} |
1976 |
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
1977 |
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
618 | 1978 |
the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly
as in the sequence case.
|
1980 |
The good news is that the function $\textit{rsimp}$ will again |
|
ignore the difference between different nesting patterns of alternatives, |
618 | 1982 |
and the exponentially growing star derivative like |
1983 |
\begin{center} |
|
1984 |
$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) +
|
1985 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
1986 |
\end{center} |
|
1987 |
can be treated as |
|
1988 |
\begin{center} |
|
1989 |
$\RALTS{[r \backslash cc'c'' \cdot r^*, r \backslash c'' \cdot r^*,
|
1990 |
r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
1991 |
\end{center} |
|
1992 |
which can be de-duplicated by $\rDistinct$ |
|
1993 |
and therefore bounded finitely. |
|
564 | 1994 |
|
639 | 1995 |
%and then de-duplicate terms of the form ($s'$ being a substring of $s$). |
1996 |
%This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
|
618 | 1997 |
|
1998 |
Now the crux of this section is finding a suitable description
for $rs$ such that
\begin{center}
	$\rderssimp{r^*}{s} = \rsimp{\sum rs}$
\end{center}
holds.
In addition, the list $rs$
shall be in the form of
$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$.
Here $Ss$ is a list of strings; in the sequence
closed form, for example, it is specified as $\textit{Suffix} \; s \; r_1$.
To get $Ss$ for the star regular expression,
we need to introduce $\starupdate$ and $\starupdates$:
|
558 | 2011 |
\begin{center} |
2012 |
\begin{tabular}{lcl} |
|
2013 |
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
|
2014 |
$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
|
2015 |
& & $\textit{if} \; |
|
620 | 2016 |
(\rnullable \; (r \backslash_{rs} s))$ \\ |
558 | 2017 |
& & $\textit{then} \;\; (s @ [c]) :: [c] :: ( |
2018 |
\starupdate \; c \; r \; Ss)$ \\ |
|
2019 |
& & $\textit{else} \;\; (s @ [c]) :: ( |
|
2020 |
\starupdate \; c \; r \; Ss)$ |
|
2021 |
\end{tabular} |
|
2022 |
\end{center} |
|
2023 |
\begin{center} |
|
2024 |
\begin{tabular}{lcl} |
|
2025 |
$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ |
|
2026 |
$\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( |
|
2027 |
\starupdate \; c \; r \; Ss)$ |
|
2028 |
\end{tabular} |
|
2029 |
\end{center} |
|
2030 |
\noindent |
|
618 | 2031 |
Assume that
\begin{center}
	$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$
\end{center}
holds.
The idea of $\starupdate$ and $\starupdates$
is to update $Ss$ when another
derivative is taken on $\rderssimp{r^*}{s}$
w.r.t.\ a character $c$ and a string $s'$
respectively.
Both $\starupdate$ and $\starupdates$ take three arguments as input:
the new character $c$ or string $s$ to take the derivative with,
the regular expression
$r$ under the star $r^*$, and the
list of strings $Ss$ for the derivative $r^* \backslash s$
up until this point,
such that
\begin{center}
$(r^*) \backslash s = \sum_{s' \in Ss} (r\backslash s') \cdot r^*$
\end{center}
is satisfied.
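\noindent
As a small illustration (a worked example of our own, not taken from the formalisation),
take $r = a$ and the string $aa$. After the first character the list of strings is $[[a]]$,
since $a^* \backslash a = (a \backslash a) \cdot a^*$. For the second character we compute
\begin{center}
	\begin{tabular}{lcl}
		$\starupdate \; a \; a \; [[a]]$ & $=$ & $([a] @ [a]) :: [a] :: (\starupdate \; a \; a \; [])$\\
		& $=$ & $[[a, a], [a]]$
	\end{tabular}
\end{center}
\noindent
where the $\textit{then}$-branch is taken because $a \backslash a = \RONE$ is nullable.
This agrees with the derivative computed directly:
$(a^*) \backslash aa = ((a \backslash a) \cdot a^*) \backslash a =
(a \backslash aa) \cdot a^* + (a \backslash a) \cdot a^*$.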
|
2052 |
||
2053 |
Functions $\starupdate$ and $\starupdates$ characterise what the
star derivatives will look like once ``straightened out'' into lists.
The helper functions for such operations will be similar to
$\sflat{\_}$ and $\sflataux{\_}$, which we defined for sequences.
We use similar symbols to denote them, with a $*$ subscript to mark the difference.
|
558 | 2058 |
\begin{center} |
2059 |
\begin{tabular}{lcl} |
|
2060 |
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
2061 |
$\hflataux{r}$ & $\dn$ & $[r]$ |
|
2062 |
\end{tabular} |
|
2063 |
\end{center} |
|
557 | 2064 |
|
2065 |
\begin{center} |
|
558 | 2066 |
\begin{tabular}{lcl} |
2067 |
$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
|
2068 |
$\hflat{r}$ & $\dn$ & $r$ |
|
2069 |
\end{tabular} |
|
2070 |
\end{center} |
|
2071 |
\noindent |
|
618 | 2072 |
These definitions are tailor-made for dealing with alternatives that have |
2073 |
originated from a star's derivatives. |
|
2074 |
A typical star derivative always has the structure of a balanced binary tree:
\begin{center}
	$(r \backslash cc'c'' \cdot r^* + r \backslash c'' \cdot r^*) +
	(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
\end{center}
618 | 2079 |
All of the nested structures of alternatives |
2080 |
generated from derivatives are binary, and therefore |
|
2081 |
$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. |
|
2082 |
$\hflat{\_}$ ``untangles'' like the following: |
|
2083 |
\[ |
|
2084 |
\sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; |
|
2085 |
\stackrel{\hflat{\_}}{\longrightarrow} \; |
|
2086 |
\RALTS{[r_1, r_2, \ldots, r_n]} |
|
2087 |
\] |
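\noindent
To illustrate with a sketch of our own, write $t_1, \ldots, t_4$ for four terms of the
shape $(r \backslash s_i) \cdot r^*$, none of which is an alternative. Unfolding the
definitions gives
\begin{center}
	\begin{tabular}{lcl}
		$\hflataux{(t_1 + t_2) + (t_3 + t_4)}$ & $=$ & $\hflataux{t_1 + t_2} \; @ \; \hflataux{t_3 + t_4}$\\
		& $=$ & $([t_1] \; @ \; [t_2]) \; @ \; ([t_3] \; @ \; [t_4])$\\
		& $=$ & $[t_1, t_2, t_3, t_4]$
	\end{tabular}
\end{center}
\noindent
and consequently $\hflat{(t_1 + t_2) + (t_3 + t_4)} = \sum \; [t_1, t_2, t_3, t_4]$.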
|
2088 |
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, |
|
2089 |
with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists |
|
2090 |
and merges each of the element lists to form a flattened list.}: |
|
2091 |
\begin{lemma}\label{stupdateInduct1} |
|
2092 |
	\mbox{}
|
2093 |
For a list of strings $Ss$, the following hold. |
|
2094 |
\begin{itemize} |
|
2095 |
\item |
|
2096 |
If we do a derivative on the terms |
|
2097 |
$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), |
|
2098 |
the result will be the same as if we apply $\starupdate$ to $Ss$. |
|
2099 |
\begin{center} |
|
2100 |
\begin{tabular}{c} |
|
2101 |
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) |
|
2102 |
\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; |
|
2103 |
$\\ |
|
2104 |
\\ |
|
2105 |
$=$ \\ |
|
2106 |
\\ |
|
2107 |
$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; |
|
2108 |
(\starupdate \; x \; r \; Ss)$ |
|
2109 |
\end{tabular} |
|
2110 |
\end{center} |
|
2111 |
\item |
|
2112 |
			$\starupdates$ is ``composable'' w.r.t.\ a derivative.
			It piggybacks the character $x$ to the tail of the string
			$xs$.
|
2115 |
\begin{center} |
|
2116 |
\begin{tabular}{c} |
|
2117 |
$\textit{concat} \; (\map \; \hflataux{\_} \; |
|
2118 |
(\map \; (\_\backslash_r x) \; |
|
2119 |
(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot |
|
2120 |
(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ |
|
2121 |
\\ |
|
2122 |
$=$\\ |
|
2123 |
\\ |
|
2124 |
$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; |
|
2125 |
(\starupdates \; (xs @ [x]) \; r \; Ss)$ |
|
2126 |
\end{tabular} |
|
2127 |
\end{center} |
|
2128 |
\end{itemize} |
|
2129 |
\end{lemma} |
|
2130 |
||
2131 |
\begin{proof} |
|
2132 |
Part 1 is by induction on $Ss$. |
|
2133 |
Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. |
|
2134 |
\end{proof} |
|
2135 |
||
593 | 2136 |
|
618 | 2137 |
Like $\textit{createdBySequence}$, we need |
2138 |
a predicate for ``star-created'' regular expressions: |
|
593 | 2139 |
\begin{center} |
618 | 2140 |
\begin{mathpar} |
2141 |
\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } |
|
2142 |
||
2143 |
\inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } |
|
2144 |
\end{mathpar} |
|
593 | 2145 |
\end{center} |
618 | 2146 |
\noindent |
2147 |
All regular expressions created by taking derivatives of |
|
2148 |
$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: |
|
2149 |
\begin{lemma}\label{starDersCbs} |
|
2150 |
$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. |
|
2151 |
\end{lemma} |
|
2152 |
\begin{proof} |
|
2153 |
By a reverse induction on $s$. |
|
2154 |
\end{proof} |
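\noindent
For instance (our own illustration), consider a single derivative step on $r_1 \cdot r_2^*$.
If $r_1$ is not nullable, then
$(r_1 \cdot r_2^*) \backslash_r c = (r_1 \backslash_r c) \cdot r_2^*$,
which satisfies $\textit{createdByStar}$ by the first rule.
If $r_1$ is nullable, then
\begin{center}
	$(r_1 \cdot r_2^*) \backslash_r c =
	(r_1 \backslash_r c) \cdot r_2^* + (r_2 \backslash_r c) \cdot r_2^*$,
\end{center}
where both summands satisfy the first rule, so the sum satisfies the second rule.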
|
2155 |
If a regular expression conforms to the shape of a star's derivative, |
|
2156 |
then we can push an application of $\hflataux{\_}$ inside a derivative of it: |
|
2157 |
\begin{lemma}\label{hfauPushin} |
|
2158 |
If $\textit{createdByStar} \; r$ holds, then |
|
2159 |
\begin{center} |
|
2160 |
$\hflataux{r \backslash_r c} = \textit{concat} \; ( |
|
2161 |
\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ |
|
2162 |
\end{center} |
|
2163 |
holds. |
|
2164 |
\end{lemma} |
|
2165 |
\begin{proof} |
|
By an induction on the inductive cases of $\textit{createdByStar}$. |
618 | 2167 |
\end{proof} |
2168 |
%This is not entirely true for annotated regular expressions: |
|
2169 |
%%TODO: bsimp bders \neq bderssimp |
|
2170 |
%\begin{center} |
|
2171 |
% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
2172 |
%\end{center} |
|
2173 |
%For bit-codes, the order in which simplification is applied |
|
2174 |
%might cause a difference in the location they are placed. |
|
2175 |
%If we want something like |
|
2176 |
%\begin{center} |
|
2177 |
% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
2178 |
%\end{center} |
|
2179 |
%Some "canonicalization" procedure is required, |
|
2180 |
%which either pushes all the common bitcodes to nodes |
|
2181 |
%as senior as possible: |
|
2182 |
%\begin{center} |
|
2183 |
% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
2184 |
%\end{center} |
|
2185 |
%or does the reverse. However bitcodes are not of interest if we are talking about |
|
2186 |
%the $\llbracket r \rrbracket$ size of a regex. |
|
2187 |
%Therefore for the ease and simplicity of producing a |
|
2188 |
%proof for a size bound, we are happy to restrict ourselves to |
|
2189 |
%unannotated regular expressions, and obtain such equalities as |
|
2190 |
%TODO: rsimp sflat |
|
2191 |
% The simplification of a flattened out regular expression, provided it comes |
|
2192 |
%from the derivative of a star, is the same as the one nested. |
|
2193 |
||
2194 |
||
2195 |
||
2196 |
Now we introduce an inductive property |
|
2197 |
for $\starupdate$ and $\hflataux{\_}$. |
|
2198 |
\begin{lemma}\label{starHfauInduct} |
|
2199 |
If we do derivatives of $r^*$ |
|
2200 |
with a string that starts with $c$, |
|
2201 |
then flatten it out, |
|
2202 |
we obtain a list |
|
2203 |
of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, |
|
2204 |
where $sS = \starupdates \; s \; r \; [[c]]$. Namely, |
|
2205 |
\begin{center} |
|
620 | 2206 |
$\hflataux{(( (\rder{c}{r_0})\cdot(r_0^*))\backslash_{rs} s)} = |
618 | 2207 |
\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
2208 |
(\starupdates \; s \; r_0 \; [[c]])$ |
|
2209 |
\end{center} |
|
2210 |
holds. |
|
2211 |
\end{lemma} |
|
2212 |
\begin{proof} |
|
2213 |
By an induction on $s$, the inductive cases |
|
2214 |
being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
|
2215 |
\end{proof} |
|
2216 |
\noindent |
|
2217 |
||
639 | 2218 |
The function $\hflataux{\_}$ has a similar effect to $\textit{flatten}$:
618 | 2219 |
\begin{lemma}\label{hflatauxGrewrites} |
2220 |
$a :: rs \grewrites \hflataux{a} @ rs$ |
|
2221 |
\end{lemma} |
|
2222 |
\begin{proof} |
|
2223 |
By induction on $a$. $rs$ is set to take arbitrary values. |
|
2224 |
\end{proof} |
|
638 | 2225 |
It is also not surprising that |
2226 |
two regular expressions differing only in terms |
|
2227 |
of the |
|
2228 |
nesting of parentheses are equivalent w.r.t. $\textit{rsimp}$: |
|
618 | 2229 |
\begin{lemma}\label{cbsHfauRsimpeq1} |
2230 |
$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ |
|
593 | 2231 |
\end{lemma} |
2232 |
||
2233 |
\begin{proof} |
|
2234 |
	By using the rewriting relation $\rightsquigarrow$.
|
2235 |
\end{proof} |
|
And from this we obtain the following fact: a
regular expression created by star
is the same as its flattened version, up to equivalence under $\textit{rsimp}$:
618 | 2240 |
\begin{lemma}\label{hfauRsimpeq2} |
639 | 2241 |
$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
593 | 2242 |
\end{lemma} |
2243 |
\begin{proof} |
|
618 | 2244 |
	By structural induction on $r$, where the induction rules
	are those of $\createdByStar{\_}$.
|
2246 |
Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
|
593 | 2247 |
\end{proof} |
564 | 2248 |
|
2249 |
||
618 | 2250 |
%Here is a corollary that states the lemma in |
2251 |
%a more intuitive way: |
|
2252 |
%\begin{corollary} |
|
2253 |
% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
|
2254 |
% (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
|
2255 |
%\end{corollary} |
|
2256 |
%\noindent |
|
2257 |
%Note that this is also agnostic of the simplification |
|
2258 |
%function we defined, and is therefore of more general interest. |
|
2259 |
||
2260 |
Together with the rewriting relation, we obtain the following lemma:
|
2261 |
\begin{lemma}\label{starClosedForm6Hrewrites} |
|
2262 |
We have the following set of rewriting relations or equalities: |
|
2263 |
\begin{itemize} |
|
2264 |
\item |
|
2265 |
		$\textit{rsimp} \; (r^* \backslash_r (c::s))
		\sequal
		\sum \; ( \map \; (\lambda s. (r\backslash_r s) \cdot r^*) \; (
		\starupdates \; s \; r \; [ c::[]] ) )$
	\item
		$r^* \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( (
		\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \cdot r^*) \;
		(\starupdates \;s \; r \; [ c::[] ])))))$
	\item
		$\sum ( (\map \; (\lambda s. (r\backslash s) \cdot r^*) \; Ss))
		\sequal
		\sum ( (\map \; (\lambda s. (\textit{rsimp} \; (r\backslash s)) \cdot
		r^*) \; Ss) )$
|
2278 |
\item |
|
2279 |
$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
|
2280 |
\scfrewrites |
|
2281 |
\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
|
2282 |
\item |
|
2283 |
$( ( \sum ( ( \map \ (\lambda s. \;\; |
|
2284 |
(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; |
|
2285 |
s \; r \; [ c::[] ])))))$\\ |
|
2286 |
$\sequal$\\ |
|
2287 |
$( ( \sum ( ( \map \ (\lambda s. \;\; |
|
620 | 2288 |
( r \backslash_{rsimps} s)) \cdot r^*) \; (\starupdates \; |
618 | 2289 |
s \; r \; [ c::[] ]))))$\\ |
2290 |
\end{itemize} |
|
558 | 2291 |
\end{lemma} |
2292 |
\begin{proof} |
|
618 | 2293 |
Part 1 leads to part 2. |
2294 |
The rest of them are routine. |
|
558 | 2295 |
\end{proof} |
2296 |
\noindent |
|
618 | 2297 |
Next the closed form for star regular expressions can be derived: |
2298 |
\begin{theorem}\label{starClosedForm} |
|
558 | 2299 |
	$\rderssimp{r^*}{c::s} =
	\rsimp{
		(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
			(\starupdates \; s\; r \; [[c]])
		      )
		)
	      }
	$
618 | 2307 |
\end{theorem} |
558 | 2308 |
\begin{proof} |
2309 |
By an induction on $s$. |
|
618 | 2310 |
The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} |
2311 |
and \ref{hfauRsimpeq2} |
|
558 | 2312 |
are used. |
618 | 2313 |
In \ref{starClosedForm6Hrewrites}, the equalities are |
2314 |
used to link the LHS and RHS. |
|
558 | 2315 |
\end{proof} |
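\noindent
As a sanity check (a worked instance of our own, assuming the usual simplification
rules $\RZERO \cdot r \mapsto \RZERO$ and $\RONE \cdot r \mapsto r$, the removal of
$\RZERO$ by $\rflts$, and the collapsing of singleton alternatives), instantiate the
theorem with $r = a$, $c = a$ and $s = [a]$.
We have $\starupdates \; [a] \; a \; [[a]] = [[a, a], [a]]$, as well as
$\rderssimp{a}{[a]} = \RONE$ and $\rderssimp{a}{[a, a]} = \RZERO$, so the right-hand side becomes
\begin{center}
	$\rsimp{(\sum \; [\RZERO \cdot a^*, \; \RONE \cdot a^*])} = a^*$.
\end{center}
\noindent
The left-hand side gives the same result:
$\rderssimp{a^*}{[a]} = \rsimp{((a \backslash_r a) \cdot a^*)} = \rsimp{(\RONE \cdot a^*)} = a^*$,
and taking one more simplified derivative of $a^*$ w.r.t.\ $a$ again yields $a^*$.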
%---------------------------------------------------------------------------------------- |
2323 |
% SECTION ?? |
|
2324 |
%---------------------------------------------------------------------------------------- |
|
2325 |
||
2326 |
%----------------------------------- |
|
2327 |
% SECTION syntactic equivalence under simp |
|
2328 |
%----------------------------------- |
|
2329 |
||
2330 |
||
2331 |
%---------------------------------------------------------------------------------------- |
|
2332 |
% SECTION ALTS CLOSED FORM |
|
2333 |
%---------------------------------------------------------------------------------------- |
|
2334 |
%\section{A Closed Form for \textit{ALTS}} |
|
2335 |
%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
2336 |
% |
|
2337 |
% |
|
2338 |
%There are a few key steps, one of these steps is |
|
2339 |
% |
|
2340 |
% |
|
2341 |
% |
|
2342 |
%One might want to prove this by something a simple statement like: |
|
2343 |
% |
|
2344 |
%For this to hold we want the $\textit{distinct}$ function to pick up |
|
2345 |
%the elements before and after derivatives correctly: |
|
2346 |
%$r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
2347 |
%which essentially requires that the function $\backslash$ is an injective mapping. |
|
2348 |
% |
|
2349 |
%Unfortunately the function $\backslash c$ is not an injective mapping. |
|
2350 |
% |
|
2351 |
%\subsection{function $\backslash c$ is not injective (1-to-1)} |
|
2352 |
%\begin{center} |
|
2353 |
% The derivative $w.r.t$ character $c$ is not one-to-one. |
|
2354 |
% Formally, |
|
2355 |
% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
2356 |
%\end{center} |
|
2357 |
%This property is trivially true for the |
|
2358 |
%character regex example: |
|
2359 |
%\begin{center} |
|
2360 |
% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
2361 |
%\end{center} |
|
2362 |
%But apart from the cases where the derivative |
|
2363 |
%output is $\ZERO$, are there non-trivial results |
|
2364 |
%of derivatives which contain strings? |
|
2365 |
%The answer is yes. |
|
2366 |
%For example, |
|
2367 |
%\begin{center} |
|
2368 |
% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
2369 |
% where $a$ is not nullable.\\ |
|
2370 |
% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
2371 |
% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
2372 |
%\end{center} |
|
2373 |
%We start with two syntactically different regular expressions, |
|
2374 |
%and end up with the same derivative result. |
|
2375 |
%This is not surprising as we have such |
|
2376 |
%equality as below in the style of Arden's lemma:\\ |
|
2377 |
%\begin{center} |
|
2378 |
% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
2379 |
%\end{center} |
|
2380 |
\section{Bounding Closed Forms} |
|
2381 |
||
2382 |
In this section, we describe how we formalised the bound
on closed forms.
We first show that in general the number of regular expressions up to a certain
size is finite.
Then we prove that functions such as $\rflts$
will not cause the size of r-regular expressions to grow.
Putting this together with the finiteness of the set of
distinct regular expressions
up to a specific size, we obtain a bound on
the closed forms.
2392 |
||
618 | 2393 |
\subsection{Finiteness of Distinct Regular Expressions} |
We define the set of regular expressions whose size is no more than
$N$ as $\textit{sizeNregex} \; N$:
2396 |
\[ |
|
2397 |
\textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \} |
|
2398 |
\] |
|
2399 |
We have that $\textit{sizeNregex} \; N$ is always a finite set: |
|
2400 |
\begin{lemma}\label{finiteSizeN} |
|
2401 |
$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. |
|
2402 |
\end{lemma} |
|
2403 |
\begin{proof} |
|
2404 |
	By induction on $N$, splitting the set $\textit{sizeNregex} \; (N + 1)$ into
	subsets by their categories:
	$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$,
	and so on. Each of these subsets is finite.
618 | 2408 |
\end{proof} |
2409 |
\noindent |
|
2410 |
From this we get a corollary that
if for all $r \in rs$, $\rsize{r} \leq N$, then the output of
$\rdistinct{rs}{\varnothing}$ is a list of regular
expressions whose total size is bounded by a constant depending on $N$ only.
|
2414 |
\begin{corollary}\label{finiteSizeNCorollary} |
|
2415 |
$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds, |
|
2416 |
where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \; |
|
2417 |
N)$. |
|
2418 |
\end{corollary} |
|
2419 |
\begin{proof} |
|
2420 |
For all $r$ in |
|
2421 |
$\textit{set} \; (\rdistinct{rs}{\varnothing})$, |
|
2422 |
it is always the case that $\rsize{r} \leq N$. |
|
2423 |
In addition, the list length is bounded by |
|
2424 |
$c_N$, yielding the desired bound. |
|
2425 |
\end{proof} |
|
2426 |
\noindent |
|
2427 |
This fact will be handy in estimating the closed form sizes. |
|
2428 |
%We have proven that the size of the |
|
2429 |
%output of $\textit{rdistinct} \; rs' \; \varnothing$ |
|
2430 |
%is bounded by a constant $N * c_N$ depending only on $N$, |
|
2431 |
%provided that each of $rs'$'s element |
|
2432 |
%is bounded by $N$. |
|
2433 |
||
639 | 2434 |
\subsection{$\textit{rsimp}$ Does Not Increase the Size} |
613 | 2435 |
Although it seems evident, we need a series |
2436 |
of non-trivial lemmas to establish that functions such as $\rflts$ |
|
2437 |
do not cause the regular expressions to grow. |
|
2438 |
\begin{lemma}\label{rsimpMonoLemmas} |
|
2439 |
\mbox{} |
|
2440 |
\begin{itemize} |
|
2441 |
\item |
|
2442 |
\[ |
|
2443 |
\llbracket \rsimpalts \; rs \rrbracket_r \leq |
|
2444 |
\llbracket \sum \; rs \rrbracket_r |
|
2445 |
\] |
|
2446 |
\item |
|
2447 |
\[ |
|
2448 |
\llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq |
|
2449 |
\llbracket r_1 \cdot r_2 \rrbracket_r |
|
2450 |
\] |
|
2451 |
\item |
|
2452 |
\[ |
|
2453 |
\llbracket \rflts \; rs \rrbracket_r \leq |
|
2454 |
\llbracket rs \rrbracket_r |
|
2455 |
\] |
|
2456 |
\item |
|
2457 |
\[ |
|
2458 |
\llbracket \rDistinct \; rs \; ss \rrbracket_r \leq |
|
2459 |
\llbracket rs \rrbracket_r |
|
2460 |
\] |
|
2461 |
\item |
|
2462 |
			If all elements $a$ in the list $as$ satisfy the property
			that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq
			\llbracket a \rrbracket_r$, then we have
			\[
			\llbracket \; \rsimpalts \; (\textit{rdistinct} \;
			(\textit{rflts} \; (\textit{map}\;\textit{rsimp} \; as)) \{\})
			\rrbracket_r \leq
			\llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \;
			\textit{rsimp} \; as))\; \{ \} ) \rrbracket_r
|
2471 |
\] |
|
2472 |
\end{itemize} |
|
2473 |
\end{lemma} |
|
2474 |
\begin{proof} |
|
Points 1, 3, and 4 can be proven by an induction on $rs$. |
613 | 2476 |
Point 2 is by case analysis on $r_1$ and $r_2$. |
2477 |
The last part is a corollary of the previous ones. |
|
2478 |
\end{proof} |
|
2479 |
\noindent |
|
2480 |
With the lemmas for each inductive case in place, we are ready to get |
|
2481 |
the non-increasing property as a corollary: |
|
2482 |
\begin{corollary}\label{rsimpMono} |
|
2483 |
$\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$ |
|
2484 |
\end{corollary} |
|
2485 |
\begin{proof} |
|
2486 |
	By induction on $r$, using lemma \ref{rsimpMonoLemmas}.
|
2487 |
\end{proof} |
|
2488 |
||
609 | 2489 |
\subsection{Estimating the Closed Forms' Sizes}
618 | 2490 |
We recap the closed forms we obtained |
639 | 2491 |
earlier: |
558 | 2492 |
\begin{itemize} |
2493 |
\item |
|
593 | 2494 |
$\rderssimp{(\sum rs)}{s} \sequal |
2495 |
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ |
|
558 | 2496 |
\item |
593 | 2497 |
		$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum \; ((r_1 \backslash s) \cdot r_2
		:: (\map \; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))$
|
558 | 2499 |
\item |
2500 |
||
593 | 2501 |
$\rderssimp{r^*}{c::s} = |
2502 |
\rsimp{ |
|
2503 |
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
558 | 2504 |
(\starupdates \; s\; r \; [[c]]) |
593 | 2505 |
) |
2506 |
) |
|
2507 |
} |
|
2508 |
$ |
|
558 | 2509 |
\end{itemize} |
2510 |
\noindent |
|
2511 |
The closed forms on the right-hand side
are all of the same shape: $\rsimp{ (\sum rs)} $.
Such a regular expression will be bounded by the size of $\sum rs'$,
where every element in $rs'$ is distinct, and each element
can be described by some inductive sub-structures
(for example when $r = r_1 \cdot r_2$ then $rs'$
will be solely comprised of $r_1 \backslash s'$
and $r_2 \backslash s''$, $s'$ and $s''$ being
sub-strings of $s$),
each of which has a size upper bound
according to the inductive hypothesis, which controls $r \backslash s$.
557 | 2522 |
|
558 | 2523 |
We elaborate the above reasoning by a series of lemmas |
2524 |
below, where straightforward proofs are omitted. |
|
639 | 2525 |
%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
2526 |
We show that $\textit{rdistinct}$ and $\rflts$
working together are at least as
good as $\textit{rdistinct}$ alone, which can be written as
609 | 2529 |
\begin{center} |
2530 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2531 |
\leq |
|
2532 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
2533 |
\end{center} |
|
2534 |
We need this so that we know the outcome of our real |
|
2535 |
simplification is better than or equal to a rough estimate, |
|
2536 |
and therefore can be bounded by that estimate. |
|
This is a bit harder to establish compared to proving
that $\textit{rflts}$ does not make a list larger (which can
be proven using routine induction):
|
2540 |
\begin{center} |
|
2541 |
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
2542 |
\llbracket \textit{rs} \rrbracket_r$ |
|
2543 |
\end{center} |
|
2544 |
We cannot simply prove how each helper function |
|
2545 |
reduces the size and then put them together: |
|
2546 |
From |
|
2547 |
\begin{center} |
|
2548 |
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
618 | 2549 |
\llbracket \textit{rs} \rrbracket_r$ |
609 | 2550 |
\end{center} |
2551 |
and |
|
2552 |
\begin{center} |
|
2553 |
	$\llbracket \textit{rdistinct} \; rs \; \varnothing \rrbracket_r \leq
	\llbracket rs \rrbracket_r$
|
2555 |
\end{center} |
|
618 | 2556 |
one cannot infer |
609 | 2557 |
\begin{center} |
558 | 2558 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
2559 |
\leq |
|
2560 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
609 | 2561 |
\end{center} |
618 | 2562 |
What we can infer is that |
609 | 2563 |
\begin{center} |
2564 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2565 |
\leq |
|
2566 |
\llbracket rs \rrbracket_r$ |
|
2567 |
\end{center} |
|
2568 |
but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. |
|
2569 |
The way we |
|
618 | 2570 |
get around this is by first proving a more general lemma |
609 | 2571 |
(so that the inductive case goes through): |
2572 |
\begin{lemma}\label{fltsSizeReductionAlts} |
|
2573 |
If we have three accumulator sets: |
|
2574 |
$noalts\_set$, $alts\_set$ and $corr\_set$, |
|
2575 |
satisfying: |
|
2576 |
\begin{itemize} |
|
2577 |
\item |
|
2578 |
$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$ |
|
2579 |
\item |
|
2580 |
$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs |
|
2581 |
\; \textit{and} \; set \; xs \subseteq corr\_set$ |
|
2582 |
\end{itemize} |
|
2583 |
then we have that |
|
2584 |
\begin{center} |
|
2585 |
\begin{tabular}{lcl} |
|
2586 |
$\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; |
|
2587 |
(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\ |
|
2588 |
$\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup |
|
639 | 2589 |
\{ \ZERO_r \} )) \rrbracket_r$ & & \\ |
609 | 2590 |
\end{tabular} |
2591 |
\end{center} |
|
2592 |
holds. |
|
532 | 2593 |
\end{lemma} |
558 | 2594 |
\noindent |
618 | 2595 |
We split the accumulator into two parts: the part
which contains alternative regular expressions ($alts\_set$), and
the part without any of them ($noalts\_set$).
|
618 | 2598 |
This is because $\rflts$ opens up the alternatives in $as$, |
2599 |
causing the accumulators on both sides of the inequality |
|
2600 |
to diverge slightly. |
|
2601 |
If we want to compare the accumulators that are not |
|
2602 |
perfectly in sync, we need to consider the alternatives and non-alternatives |
|
2603 |
separately. |
|
609 | 2604 |
The set $corr\_set$ is the corresponding set |
618 | 2605 |
of $alts\_set$ with all elements under the alternative constructor |
609 | 2606 |
spilled out. |
2607 |
\begin{proof} |
|
2608 |
By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. |
|
2609 |
\end{proof} |
|
2610 |
By setting all three sets to the empty set, one gets the desired size estimate: |
|
2611 |
\begin{corollary}\label{interactionFltsDB} |
|
2612 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2613 |
\leq |
|
2614 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
2615 |
\end{corollary} |
|
2616 |
\begin{proof} |
|
2617 |
By using the lemma \ref{fltsSizeReductionAlts}. |
|
2618 |
\end{proof} |
|
2619 |
\noindent |
|
618 | 2620 |
The intuition for why this is true
is that whenever a duplicate is removed from the list $\textit{rs}$ in the $\textit{RHS}$,
at least the same amount of duplicates is removed in the $\textit{LHS}$.
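\noindent
A small example of our own makes this concrete, counting one node per character and
one per alternative constructor. Let $rs = [a + b, \; a + b]$. Then
\begin{center}
	\begin{tabular}{lcl}
		$\rdistinct{rs}{\varnothing}$ & $=$ & $[a + b]$\\
		$\rdistinct{(\rflts \; rs)}{\varnothing}$ & $=$ & $\rdistinct{[a, b, a, b]}{\varnothing} = [a, b]$
	\end{tabular}
\end{center}
\noindent
with sizes $3$ and $2$ respectively.
The duplicate $a + b$ dropped in the first line corresponds to the duplicates $a$ and $b$
dropped in the second, and $\rflts$ additionally removes the alternative node.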
2623 |
||
2624 |
Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: |
|
2625 |
\begin{lemma}\label{altsSimpControl} |
|
2626 |
$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$ |
|
532 | 2627 |
\end{lemma} |
558 | 2628 |
\begin{proof} |
618 | 2629 |
By using corollary \ref{interactionFltsDB}. |
558 | 2630 |
\end{proof} |
2631 |
\noindent |
|
This is a key lemma in establishing the bounds of all the |
609 | 2633 |
closed forms. |
2634 |
With this we are now ready to control the sizes of |
|
618 | 2635 |
$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$. |
2636 |
\begin{theorem}\label{rBound} |
|
593 | 2637 |
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ |
558 | 2638 |
\end{theorem} |
2639 |
\noindent |
|
2640 |
\begin{proof} |
|
593 | 2641 |
We prove this by induction on $r$. The base cases for $\RZERO$, |
2642 |
$\RONE $ and $\RCHAR{c}$ are straightforward. |
|
2643 |
In the sequence $r_1 \cdot r_2$ case, |
|
2644 |
the inductive hypotheses state |
|
2645 |
	$\exists N_1. \forall s. \; \llbracket \rderssimp{r_1}{s} \rrbracket \leq N_1$ and
|
2646 |
$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. |
|
562 | 2647 |
|
593 | 2648 |
When the string $s$ is not empty, we can reason as follows |
2649 |
% |
|
2650 |
\begin{center} |
|
2651 |
\begin{tabular}{lcll} |
|
558 | 2652 |
		& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\
		& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \;
		\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r_1})))} \rrbracket_r $ & (1) \\
		& $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \;
		\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r_1}))}{\varnothing} \rrbracket_r + 1$ & (2) \\
		& $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\
558 | 2658 |
\end{tabular} |
2659 |
\end{center} |
|
561 | 2660 |
\noindent |
618 | 2661 |
(1) is by theorem \ref{seqClosedForm}. |
561 | 2662 |
(2) is by \ref{altsSimpControl}. |
2663 |
(3) is by \ref{finiteSizeNCorollary}. |
|
562 | 2664 |
|
2665 |
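The step from (2) to (3) can be unpacked as follows (a sketch only, assuming as in the size definition that a sequence node contributes one on top of the sizes of its two components).
The head of the list satisfies
\begin{center}
$\llbracket r_1 \backslash_{rsimps} s \cdot r_2 \rrbracket_r = 1 + \llbracket r_1 \backslash_{rsimps} s \rrbracket_r + \rsize{r_2} \leq 1 + N_1 + \rsize{r_2}$,
\end{center}
while every remaining element is a simplified derivative of $r_2$ and hence has size at most $N_2$;
after deduplication there are at most $card\;(\sizeNregex \; N_2)$ of them.
Adding the extra $1$ from (2) gives
\begin{center}
$1 + (1 + N_1 + \rsize{r_2}) + N_2 * (card\;(\sizeNregex \; N_2)) = 2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$.
\end{center}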
||
2666 |
Combining the cases when $s = []$ and $s \neq []$, we get (4):
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{\rderssimp{(r_1 \cdot r_2)}{s}}$ & $\leq$ &
		$max \; (2 + N_1 +
		\llbracket r_2 \rrbracket_r +
		N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4)
	\end{tabular}
\end{center}
|
558 | 2675 |
|
We reason similarly for $\STAR$.
The inductive hypothesis is
$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket_r \leq N$.
Let $n_r = \llbracket r^* \rrbracket_r$.
When $s = c :: cs$ is not empty,
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{r^* }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\;
	cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\
& $\leq$ & $\llbracket
\rdistinct{
	(\map \;
	(\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \;
	(\starupdates\; cs \; r \; [[c]] )
)}
	{\varnothing} \rrbracket_r + 1$ & (6) \\
& $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r)))
* (1 + (N + n_r)) $ & (7)\\
\end{tabular}
\end{center}
\noindent
(5) is by theorem \ref{starClosedForm}.
(6) is by lemma \ref{altsSimpControl}.
(7) is by corollary \ref{finiteSizeNCorollary}.
Combining with the case when $s = []$, one obtains
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{\rderssimp{r^*}{s}}$ & $\leq$ & $max \; n_r \; (1 + (\textit{card} (\sizeNregex \; (N + n_r)))
		* (1 + (N + n_r))) $ & (8)\\
	\end{tabular}
\end{center}
\noindent

The alternative case is slightly less involved.
The inductive hypothesis
is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_{rsimps} s) \; rs). \rsize{r} \leq N$.
In the case when $s = c::cs$, we have
\begin{center}
	\begin{tabular}{lcll}
& & $ \llbracket   \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimps} s)  \; rs) )} \rrbracket_r $ & (9) \\
& $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimps} s)  \; rs) ) \rrbracket_r $  & (10) \\
& $\leq$ & $1 + N * (length \; rs) $ & (11)\\
	\end{tabular}
\end{center}
\noindent
(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by the inductive hypothesis.

Combining with the case when $s = []$, we obtain
\begin{center}
	\begin{tabular}{lcll}
		$\rsize{\rderssimp{\sum rs}{s}}$ & $\leq$ & $max \; \rsize{\sum rs} \; (1+N*(length \; rs))$
		& (12)\\
	\end{tabular}
\end{center}
We have all the inductive cases proven.
558 | 2733 |
\end{proof} |
2734 |
||
618 | 2735 |
This leads to our main result on the size bound: |
\begin{corollary}
	For any annotated regular expression $a$, $\exists N_a. \forall s. \; \llbracket \bderssimp{a}{s} \rrbracket \leq N_a$
\end{corollary}
2739 |
\begin{proof} |
|
618 | 2740 |
By lemma \ref{sizeRelations} and theorem \ref{rBound}. |
564 | 2741 |
\end{proof} |
558 | 2742 |
\noindent |
2743 |
||
609 | 2744 |
|
2745 |
||
2746 |
||
2747 |
||
558 | 2748 |
%----------------------------------- |
2749 |
% SECTION 2 |
|
2750 |
%----------------------------------- |
|
2751 |
||
625 | 2752 |
\section{Bounded Repetitions} |
We promised in chapter \ref{Introduction}
that our lexing algorithm can potentially be extended
to handle bounded repetitions
in natural and elegant ways.
We now fulfil this promise by adding support for
the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
We add clauses to the derivative-based lexing algorithms (with simplifications)
introduced in chapter \ref{Bitcoded2}.
|
2761 |
||
2762 |
\subsection{Augmented Definitions} |
|
There are a number of definitions that need to be augmented.
The most notable one would be the POSIX rules for $r^{\{n\}}$:
\begin{center}
	\begin{mathpar}
		\inferrule{\forall v \in vs_1. \vdash v:r \land
		|v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\
		\textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \;
		(vs_1 @ vs_2) : r^{\{n\}} }
	\end{mathpar}
\end{center}
As Ausaf pointed out \cite{Ausaf},
sometimes empty iterations have to be taken to get
a match with exactly $n$ repetitions,
hence the $vs_2$ part.
|
2777 |
||
2778 |
Another important definition would be the size:
\begin{center}
	\begin{tabular}{lcl}
		$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ &
		$\llbracket r \rrbracket_r + n$\\
	\end{tabular}
\end{center}
\noindent
Arguably we should use $\log \; n$ for the size because
the number of digits increases logarithmically w.r.t.\ $n$.
For simplicity we choose to add the counter directly to the size.
2789 |
||
2790 |
The derivative of a bounded regular expression w.r.t.\ a character
is given as
\begin{center}
	\begin{tabular}{lcl}
		$r^{\{n\}} \backslash_r c$ & $\dn$ &
		$r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\
		& & $\RZERO \;\quad \quad\quad \quad
		\textit{otherwise}$\\
	\end{tabular}
\end{center}
\noindent
For brevity, we sometimes use NTIMES to refer to bounded
regular expressions.
|
2803 |
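As a small worked instance of the derivative clause for NTIMES (a sketch only: it assumes the usual derivative clause for sequences, where a nullable first component contributes a second summand, and it writes that alternative with $+$), consider $a^{\{2\}}$ and the string $aa$:
\begin{center}
	\begin{tabular}{lcl}
		$a^{\{2\}} \backslash_r a$ & $=$ & $(a \backslash_r a) \cdot a^{\{1\}} \;=\; \RONE \cdot a^{\{1\}}$\\
		$(\RONE \cdot a^{\{1\}}) \backslash_r a$ & $=$ & $(\RONE \backslash_r a) \cdot a^{\{1\}} + a^{\{1\}} \backslash_r a$\\
		& $=$ & $\RZERO \cdot a^{\{1\}} + \RONE \cdot a^{\{0\}}$\\
	\end{tabular}
\end{center}
Assuming that $r^{\{0\}}$ is nullable, the last summand is nullable, which matches the expectation that $aa$ is accepted by $a^{\{2\}}$.
A further derivative of $a^{\{0\}}$ falls into the \textit{otherwise} clause and yields $\RZERO$.
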
The $\mkeps$ function clause for NTIMES would be |
|
2804 |
\begin{center} |
|
2805 |
\begin{tabular}{lcl} |
|
2806 |
$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; |
|
2807 |
(\textit{replicate} \; n\; (\mkeps \; r))$ |
|
2808 |
\end{tabular} |
|
2809 |
\end{center} |
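
For instance (a sketch, assuming the standard clause $\mkeps \; (r^*) = \Stars \; []$ of the basic algorithm), the clause above gives
\begin{center}
	$\mkeps \; ((a^*)^{\{2\}}) = \Stars \; (\textit{replicate} \; 2 \; (\Stars \; [])) = \Stars \; [\Stars \; [], \; \Stars \; []]$
\end{center}
so each of the two iterations is an empty match, which is the situation covered by the $vs_2$ part of the rule for $r^{\{n\}}$.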
|
2810 |
\noindent |
|
2811 |
The injection looks like |
|
2812 |
\begin{center} |
|
2813 |
\begin{tabular}{lcl} |
|
2814 |
$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & |
|
2815 |
$\dn$ & $\Stars \; |
|
2816 |
((\inj \; r \;c \;v ) :: vs)$ |
|
2817 |
\end{tabular} |
|
2818 |
\end{center} |
|
2819 |
\noindent |
|
2820 |
||
2821 |
||
2822 |
\subsection{Proofs for the Augmented Lexing Algorithm} |
|
2823 |
We need to maintain two proofs with the additional $r^{\{n\}}$ |
|
2824 |
construct: the |
|
2825 |
correctness proof in chapter \ref{Bitcoded2}, |
|
2826 |
and the finiteness proof in chapter \ref{Finite}. |
|
2827 |
||
2828 |
\subsubsection{Correctness Proof Augmentation} |
|
2829 |
The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
has been proven by Ausaf and Urban~\cite{AusafDyckhoffUrban2016}.
As they have commented, once the definitions are in place,
the proofs given for the basic regular expressions extend to
bounded regular expressions, and there are no ``surprises''.
We confirm this point: the correctness theorem also
extends without surprise to $\blexersimp$.
The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
do not need to be changed,
and only a few lemmas such as lemma \ref{fltsPreserves} need
one more line to cover the $r^{\{n\}}$ inductive case,
which the Sledgehammer tool can solve.
2841 |
||
2842 |
||
2843 |
\subsubsection{Finiteness Proof Augmentation} |
|
2844 |
The bounded repetitions are
very similar to stars, and therefore the treatment
is similar, with minor changes to handle some slight complications
when the counter reaches 0.
The exponential growth is similar:
\begin{center}
\begin{tabular}{ll}
$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
$(r\backslash c)  \cdot
r^{\{n - 1\}}$ & $\longrightarrow_{\backslash c'}$\\
\\
$r \backslash cc'  \cdot r^{\{n - 1\}} +
r \backslash c' \cdot r^{\{n - 2\}}$ &
$\longrightarrow_{\backslash c''}$\\
\\
$(r \backslash cc'c'' \cdot r^{\{n-1\}} +
r \backslash c''\cdot r^{\{n-2\}}) +
(r \backslash c'c'' \cdot r^{\{n-2\}} +
r \backslash c'' \cdot r^{\{n-3\}})$ &
$\longrightarrow_{\backslash c'''}$ \\
\\
$\ldots$\\
\end{tabular}
\end{center}
Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
are all nullable.
The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
\begin{center}
$[r \backslash cc'c'' \cdot r^{\{n-1\}},\;
r \backslash c''\cdot r^{\{n-2\}}, \;
r \backslash c'c'' \cdot r^{\{n-2\}}, \;
r \backslash c'' \cdot r^{\{n-3\}},\; \ldots ]$
\end{center}
that comes from
\begin{center}
$(r \backslash cc'c'' \cdot r^{\{n-1\}} +
r \backslash c''\cdot r^{\{n-2\}}) +
(r \backslash c'c'' \cdot r^{\{n-2\}} +
r \backslash c'' \cdot r^{\{n-3\}})+ \ldots$
\end{center}
is made of sequences with different tails, where the counters
might differ.
The observation for maintaining the bound is that
these counters never exceed $n$, the original
counter. With the number of counters staying finite,
$\rDistinct$ will deduplicate and keep the list finite.
We introduce this idea as a lemma once we describe all
the necessary helper functions.
|
2892 |
||
2893 |
Similar to the star case, we want
\begin{center}
	$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$,
\end{center}
where $rs$
shall be in the form of
$\map \; f \; Ss$, where $f$ is a function and
$Ss$ a list of objects to act on.
For star, the object's datatype is string.
The list of strings $Ss$
is generated using functions
$\starupdate$ and $\starupdates$.
The function that takes a string and returns a regular expression
is the anonymous function $
(\lambda s'. \; r\backslash s' \cdot r^{*})$.
In the NTIMES setting,
the $\starupdate$ and $\starupdates$ functions are replaced by
$\textit{nupdate}$ and $\textit{nupdates}$:
|
2911 |
\begin{center} |
|
2912 |
\begin{tabular}{lcl} |
|
2913 |
$\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
|
2914 |
$\nupdate \; c \; r \; |
|
2915 |
(\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\ |
|
2916 |
$\textit{if} \; |
|
2917 |
(\rnullable \; (r \backslash_{rs} s))$ \\ |
|
2918 |
& & $\;\;\textit{then} |
|
2919 |
\;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: ( |
|
2920 |
\nupdate \; c \; r \; Ss)$ \\ |
|
2921 |
& & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: ( |
|
2922 |
\nupdate \; c \; r \; Ss)$\\ |
|
2923 |
$\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ & |
|
2924 |
$(\None :: \nupdate \; c \; r \; Ss)$\\ |
|
2925 |
& & \\ |
|
2926 |
%\end{tabular} |
|
2927 |
%\end{center} |
|
2928 |
%\begin{center} |
|
2929 |
%\begin{tabular}{lcl} |
|
2930 |
$\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\ |
|
2931 |
$\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; ( |
|
2932 |
\nupdate \; c \; r \; Ss)$ |
|
2933 |
\end{tabular} |
|
2934 |
\end{center} |
|
2935 |
\noindent
These functions take into account the case where the counter $n$ of a subterm
\begin{center}
	$r \backslash_{rs} s \cdot r^{\{n\}}$
\end{center}
is 0, so that the subterm expands to
\begin{center}
$r \backslash_{rs} (s@[c]) \cdot r^{\{n\}} \;+
\; \ZERO$
\end{center}
after taking a derivative.
The objects now have type
\begin{center}
$\textit{option} \;(\textit{string}, \textit{nat})$
\end{center}
and the function for converting such an option into
a regular expression term is called $\opterm$:
|
2953 |
||
2954 |
\begin{center} |
|
2955 |
\begin{tabular}{lcl} |
|
2956 |
$\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2957 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2958 |
(r\backslash_{rs} s)\cdot r^{\{n\}}$\\ |
|
2959 |
& & $\;\;\None \Rightarrow |
|
2960 |
\ZERO$\\ |
|
2961 |
\end{tabular} |
|
2962 |
\end{center} |
|
2963 |
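To illustrate these definitions on a concrete instance (our own example, with $r = a$ and $c = a$): the head of the list is $\Some \; ([a], 1)$, so the clause for $\Some \; (s, n + 1)$ applies with $s = [a]$ and $n = 0$, and $a \backslash_{rs} [a] = \RONE$ is nullable.
Hence
\begin{center}
	\begin{tabular}{lcl}
		$\nupdate \; a \; a \; [\Some \; ([a], 1)]$ & $=$ & $[\Some \; ([a, a], 1), \; \Some \; ([a], 0)]$\\
		$\map \; (\opterm \; a) \; [\Some \; ([a, a], 1), \; \Some \; ([a], 0)]$ & $=$ &
		$[(a \backslash_{rs} [a, a]) \cdot a^{\{1\}}, \; (a \backslash_{rs} [a]) \cdot a^{\{0\}}]$\\
	\end{tabular}
\end{center}
These are exactly the two summands of $((a \backslash_r a) \cdot a^{\{1\}}) \backslash_r a$, assuming the usual derivative clause for a sequence with a nullable first component.
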
\noindent |
|
2964 |
Put together, the list $\map \; f \; Ss$ is instantiated as |
|
2965 |
\begin{center} |
|
2966 |
$\map \; (\opterm \; r) \; (\nupdates \; s \; r \; |
|
2967 |
[\Some \; ([c], n)])$. |
|
2968 |
\end{center} |
|
2969 |
For the closed form to be bounded, we would like
simplification to be applied to each term in the list.
Therefore we introduce some variants of $\opterm$,
which help conveniently express the rewriting steps
needed in the closed form proof.
We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$,
which differ only in where simplification is applied; keeping them separate helps the proof go through:
|
625 | 2976 |
\begin{center} |
2977 |
\begin{tabular}{lcl} |
|
2978 |
$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2979 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2980 |
\textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ |
|
2981 |
& & $\;\;\None \Rightarrow |
|
2982 |
\ZERO$\\ |
|
2983 |
\\ |
|
2984 |
$\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2985 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2986 |
(\textit{rsimp} \; (r\backslash_{rs} s)) |
|
2987 |
\cdot r^{\{n\}}$\\ |
|
2988 |
& & $\;\;\None \Rightarrow |
|
2989 |
\ZERO$\\ |
|
2990 |
\\ |
|
2991 |
$\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2992 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2993 |
(r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\ |
|
2994 |
& & $\;\;\None \Rightarrow |
|
2995 |
\ZERO$\\ |
|
2996 |
\end{tabular} |
|
2997 |
\end{center} |
|
2998 |
||
2999 |
||
3000 |
For a list of |
|
3001 |
$\textit{option} \;(\textit{string}, \textit{nat})$ elements, |
|
3002 |
we define the highest power for it recursively: |
|
3003 |
\begin{center} |
|
3004 |
\begin{tabular}{lcl} |
|
3005 |
$\hpa \; [] \; n $ & $\dn$ & $n$\\ |
|
3006 |
$\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\ |
|
3007 |
$\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ & |
|
3008 |
$\hpa \;os \; (\textit{max} \; n\; m)$\\ |
|
3009 |
\\ |
|
3010 |
$\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\ |
|
3011 |
\end{tabular} |
|
3012 |
\end{center} |
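
As a quick sanity check of this definition (on an illustrative list of our own choosing), unfolding the clauses gives
\begin{center}
	\begin{tabular}{lcl}
		$\hpower \; [\Some \; ([a,a], 1), \; \None, \; \Some \; ([a], 0)]$ & $=$ & $\hpa \; [\Some \; ([a,a], 1), \; \None, \; \Some \; ([a], 0)] \; 0$\\
		& $=$ & $\hpa \; [\None, \; \Some \; ([a], 0)] \; (\textit{max} \; 1 \; 0)$\\
		& $=$ & $\hpa \; [\Some \; ([a], 0)] \; 1$\\
		& $=$ & $\hpa \; [] \; (\textit{max} \; 0 \; 1)$\\
		& $=$ & $1$\\
	\end{tabular}
\end{center}
so $\hpower$ returns the largest counter occurring in the list and ignores $\None$ entries.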
|
3013 |
||
3014 |
Now the intuition that an NTIMES regular expression's power |
|
3015 |
does not increase can be easily expressed as |
|
3016 |
\begin{lemma}\label{nupdatesMono2} |
|
3017 |
$\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$ |
|
3018 |
\end{lemma} |
|
3019 |
\begin{proof} |
|
3020 |
Note that the power is non-increasing after a $\nupdate$ application: |
|
3021 |
\begin{center} |
|
3022 |
$\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq |
|
3023 |
\hpa\; \; Ss \; m$. |
|
3024 |
\end{center} |
|
3025 |
This is also the case for $\nupdates$: |
|
3026 |
\begin{center} |
|
3027 |
$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq |
|
3028 |
\hpa\; \; Ss \; m$. |
|
3029 |
\end{center} |
|
3030 |
Therefore we have that |
|
3031 |
\begin{center} |
|
3032 |
$\hpower \;\; (\nupdates \; s \; r \; Ss) \leq |
|
3033 |
\hpower \;\; Ss$ |
|
3034 |
\end{center} |
|
3035 |
which leads to the lemma being proven. |
|
3036 |
||
3037 |
\end{proof} |
|
3038 |
||
3039 |
||
3040 |
We also define the inductive rules for |
|
3041 |
the shape of derivatives of the NTIMES regular expressions:\\[-3em] |
|
3042 |
\begin{center} |
|
3043 |
\begin{mathpar} |
|
3044 |
\inferrule{\mbox{}}{\cbn \;\ZERO} |
|
3045 |
||
3046 |
\inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})} |
|
3047 |
||
3048 |
\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2} |
|
3049 |
||
3050 |
\inferrule{\cbn \; r}{\cbn \; r + \ZERO} |
|
3051 |
\end{mathpar} |
|
3052 |
\end{center} |
|
3053 |
\noindent |
|
3054 |
A derivative of NTIMES fits into the shape described by $\cbn$: |
|
3055 |
\begin{lemma}\label{ntimesDersCbn} |
|
3056 |
$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds. |
|
3057 |
\end{lemma} |
|
3058 |
\begin{proof} |
|
3059 |
By a reverse induction on $s$. |
|
3060 |
For the inductive case, note that if $\cbn \; r$ holds, |
|
3061 |
then $\cbn \; (r\backslash_r c)$ holds. |
|
3062 |
\end{proof} |
|
3063 |
\noindent |
|
In addition, for $\cbn$-shaped regular expressions, one can flatten |
625 | 3065 |
them: |
3066 |
\begin{lemma}\label{ntimesHfauPushin} |
|
3067 |
	If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
	\textit{concat} \; (\map \; \hflataux{\_} \; (\map \; (\_\backslash_r c) \;
	(\hflataux{r})))$
|
3070 |
\end{lemma} |
|
3071 |
\begin{proof} |
|
3072 |
By an induction on the inductive cases of $\cbn$. |
|
3073 |
\end{proof} |
|
3074 |
\noindent |
|
3075 |
This time we do not need to define the flattening functions for NTIMES only, |
|
3076 |
because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. |
|
3077 |
\begin{lemma}\label{ntimesHfauInduct} |
|
3078 |
	$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rs} s} =
	\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
|
3080 |
\end{lemma} |
|
3081 |
\begin{proof} |
|
3082 |
By a reverse induction on $s$. |
|
3083 |
The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. |
|
3084 |
\end{proof} |
|
3085 |
\noindent |
|
3086 |
We have a recursive property for NTIMES with $\nupdate$ |
|
3087 |
similar to that for STAR, |
|
3088 |
and one for $\nupdates $ as well: |
|
3089 |
\begin{lemma}\label{nupdateInduct1} |
|
3090 |
\mbox{} |
|
3091 |
\begin{itemize}
\item
\begin{center}
	$\textit{concat} \; (\map \; (\hflataux{\_} \circ (\_\backslash_r c) \circ (
	\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
	c \; r \; Ss)$
\end{center}
	holds.
\item
\begin{center}
	$\textit{concat} \; (\map \; \hflataux{\_}\;
	(\map \; (\_\backslash_r x) \;
		(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))))$\\
	$=$\\
	$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$
\end{center}
	holds.
\end{itemize}
|
3109 |
\end{lemma} |
|
3110 |
\begin{proof} |
|
3111 |
(i) is by an induction on $Ss$. |
|
3112 |
(ii) is by an induction on $xs$. |
|
3113 |
\end{proof} |
|
3114 |
\noindent |
|
3115 |
The $\nString$ predicate is defined for conveniently |
|
3116 |
expressing that there are no empty strings in the |
|
3117 |
$\Some \;(s, n)$ elements generated by $\nupdate$: |
|
3118 |
\begin{center} |
|
3119 |
\begin{tabular}{lcl} |
|
3120 |
$\nString \; \None$ & $\dn$ & $ \textit{true}$\\ |
|
3121 |
$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\ |
|
3122 |
$\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\ |
|
3123 |
\end{tabular} |
|
3124 |
\end{center} |
|
3125 |
\begin{lemma}\label{nupdatesNonempty} |
|
3126 |
If for all elements $o \in \textit{set} \; Ss$, |
|
$\nString \; o$ holds, then we have that |
625 | 3128 |
for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, |
3129 |
$\nString \; o'$ holds. |
|
3130 |
\end{lemma} |
|
3131 |
\begin{proof} |
|
3132 |
By an induction on $s$, where $Ss$ is set to vary over all possible values. |
|
3133 |
\end{proof} |
|
3134 |
||
3135 |
\noindent |
|
3136 |
||
3137 |
\begin{lemma}\label{ntimesClosedFormsSteps} |
|
3138 |
The following list of equalities or rewriting relations hold:\\ |
|
3139 |
(i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) = |
|
3140 |
\textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \; |
|
3141 |
s \; r \; [\Some \; ([c], n)])))$\\ |
|
3142 |
(ii) |
|
3143 |
\begin{center} |
|
3144 |
$\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [ |
|
3145 |
\Some \; ([c], n)]))$ \\ $ \sequal$\\ |
|
3146 |
$\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \; |
|
3147 |
s\;r \; [\Some \; ([c], n)]))$\\ |
|
3148 |
\end{center} |
|
3149 |
(iii) |
|
3150 |
\begin{center} |
|
3151 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; |
|
3152 |
([c], n)]))$\\ |
|
3153 |
$\sequal$\\ |
|
3154 |
$\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3155 |
([c], n)])) $\\ |
|
3156 |
\end{center} |
|
3157 |
(iv) |
|
3158 |
\begin{center} |
|
3159 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; |
|
3160 |
([c], n)])) $ \\ $\sequal$\\ |
|
3161 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3162 |
([c], n)])) $\\ |
|
3163 |
\end{center} |
|
3164 |
(v) |
|
3165 |
\begin{center} |
|
3166 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3167 |
([c], n)])) $ \\ $\sequal$\\ |
|
3168 |
$\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \; |
|
3169 |
(\nupdates \; s \; r \; [\Some \; ([c], n)]))$ |
|
3170 |
\end{center} |
|
3171 |
\end{lemma} |
|
3172 |
\begin{proof} |
|
3173 |
	Routine.
	(iii) and (iv) make use of the fact that all the strings $s'$
	inside $\Some \; (s', m)$ which are elements of the list
	$\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
	which is from lemma \ref{nupdatesNonempty}.
	Once the string in $o = \Some \; (s', m)$ is
	nonempty, $\optermsimp \; r \;o$,
	$\optermosimp \; r \; o$ and $\optermOsimp \; r \; o$ are guaranteed
	to be equal.
	(v) uses lemma \ref{nupdateInduct1}.
|
3183 |
\end{proof} |
|
3184 |
\noindent |
|
3185 |
Now we are ready to present the closed form for NTIMES: |
|
3186 |
\begin{theorem}\label{ntimesClosedForm} |
|
3187 |
The derivative of $r^{\{n+1\}}$ can be described as an alternative |
|
3188 |
containing a list |
|
3189 |
of terms:\\ |
|
3190 |
$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( |
|
3191 |
\sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; |
|
3192 |
[\Some \; ([c], n)])))$ |
|
3193 |
\end{theorem} |
|
3194 |
\begin{proof} |
|
3195 |
By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}. |
|
3196 |
\end{proof} |
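
As an illustration of the theorem (a sketch for the instance $r = a$, $n = 1$, $c = a$ and $s = [a]$, using $\nupdates \; [a] \; a \; [\Some \; ([a], 1)] = [\Some \; ([a,a], 1), \; \Some \; ([a], 0)]$, which follows by unfolding the definitions of $\nupdates$ and $\nupdate$), the closed form reads
\begin{center}
	$a^{\{2\}} \backslash_{rsimps} [a, a] = \textit{rsimp} \; (\sum [\, (a \backslash_{rsimps} [a,a]) \cdot a^{\{1\}}, \; (a \backslash_{rsimps} [a]) \cdot a^{\{0\}} \,])$
\end{center}
Here $a \backslash_{rsimps} [a,a] = \RZERO$ and $a \backslash_{rsimps} [a] = \RONE$.
Assuming the usual simplification rules, which remove sequences headed by $\RZERO$ from alternatives and drop a leading $\RONE$ from a sequence, the right-hand side collapses to $a^{\{0\}}$.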
|
3197 |
\noindent |
|
3198 |
The key observation for bounding this closed form |
|
3199 |
is that the counter on $r^{\{n\}}$ will |
|
3200 |
only decrement during derivatives: |
|
3201 |
\begin{lemma}\label{nupdatesNLeqN} |
|
3202 |
For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; |
|
3203 |
[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some |
|
3204 |
\; (s', m)$ for some string $s'$ and number $m \leq n$. |
|
3205 |
\end{lemma} |
|
3206 |
\noindent |
|
3207 |
The proof is routine and therefore omitted.
This allows us to say what kind of terms
are in the set $\textit{set} \; (\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)]))$:
each is either $\ZERO$ or a sequence whose tail is $r^{\{m\}}$
with $m \leq n$:
|
3213 |
\begin{lemma}\label{ntimesClosedFormListElemShape} |
|
3214 |
For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
|
3215 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$, |
|
3216 |
we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot |
|
3217 |
r^{\{m\}}$ for some string $s'$ and number $m \leq n$. |
|
3218 |
\end{lemma} |
|
3219 |
\begin{proof} |
|
3220 |
Using lemma \ref{nupdatesNLeqN}. |
|
3221 |
\end{proof} |
|
3222 |
||
3223 |
\begin{theorem}\label{ntimesClosedFormBounded}
	Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
	\rrbracket_r \leq N$ holds, then we have that\\
	$\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
	(c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket_r+1)$,
	where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
	N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
\end{theorem}
\begin{proof}
If the string is empty, the size is
$\llbracket r^{\{n+1\}} \rrbracket_r = \llbracket r^{\{n\}} \rrbracket_r + 1$,
which is within the bound.
Otherwise the string has the form $c::s$ and the closed form of
theorem \ref{ntimesClosedForm} applies.
We have that for all regular expressions $r'$ in
\begin{center}
$\textit{set} \; (\map \; (\optermsimp \; r) \; (
	\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
\end{center}
$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
\rrbracket_r + 1$
because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
r^{\{m\}}$ for some string $s'$ and number
$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
In addition, we know that after deduplication the list
$\map \; (\optermsimp \; r) \; (
\nupdates \; s \; r \; [\Some \; ([c], n)])$ has at most
$c_N = \textit{card} \;
(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket_r) + 1))$ elements.
Together with lemma \ref{altsSimpControl}, this gives us
$\llbracket r^{\{n+1\}} \backslash_{rsimps} (c::s) \rrbracket_r
\leq 1 + c_N * (N + \llbracket r^{\{n\}} \rrbracket_r + 1)
\leq (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket_r+1)$.
\end{proof}
|
3250 |
||
3251 |
We aim to formalise the correctness and size bound
for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$
and so on; this is still work in progress.
They should more or less follow the same recipe described in this section.
Once we know how to deal with them recursively using suitable auxiliary
definitions, we can routinely establish the proofs.
625 | 3257 |
|
532 | 3258 |
|
557 | 3259 |
%---------------------------------------------------------------------------------------- |
3260 |
% SECTION 3 |
|
3261 |
%---------------------------------------------------------------------------------------- |
|
3262 |
||
532 | 3263 |
|
618 | 3264 |
\section{Comments and Future Improvements} |
3265 |
\subsection{Some Experimental Results} |
|
532 | 3266 |
What guarantee does this bound give us?
It states that the derivatives of a regular expression, whatever that regular expression is, will not grow indefinitely.
Take our previous example $(a + aa)^*$:
3269 |
\begin{center} |
|
593 | 3270 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
3271 |
\begin{tikzpicture} |
|
3272 |
\begin{axis}[ |
|
3273 |
xlabel={number of $a$'s}, |
|
3274 |
x label style={at={(1.05,-0.05)}}, |
|
3275 |
ylabel={regex size}, |
|
3276 |
enlargelimits=false, |
|
3277 |
xtick={0,5,...,30}, |
|
3278 |
xmax=33, |
|
3279 |
ymax= 40, |
|
3280 |
ytick={0,10,...,40}, |
|
3281 |
scaled ticks=false, |
|
3282 |
axis lines=left, |
|
3283 |
width=5cm, |
|
3284 |
height=4cm, |
|
3285 |
legend entries={$(a + aa)^*$}, |
|
618 | 3286 |
legend pos=south east, |
593 | 3287 |
legend cell align=left] |
3288 |
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
3289 |
\end{axis} |
|
3290 |
\end{tikzpicture} |
|
3291 |
\end{tabular} |
|
532 | 3292 |
\end{center} |
3293 |
We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
593 | 3294 |
with our simplification |
532 | 3295 |
rules very effectively. |
3296 |
||
3297 |
||
3298 |
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$
and $l_{N_2} = card\;(\sizeNregex \; N_2)$.
Given that $l_{N_2}$ is roughly of size $4^{N_2}$, the size bound for $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function
$f(x) = x * 4^x$.
This means the bound we have will surge up at least
tower-exponentially with a linear increase of the depth.
|
3305 |
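To get a feeling for how quickly this blows up (a back-of-the-envelope illustration only, taking the estimate $l_{N_2} \approx 4^{N_2}$ at face value and ignoring the smaller summands), suppose the bound for an inner regular expression is $N_2 = 3$.
One level of sequencing then gives a dominant term of
\begin{center}
	$3 * 4^{3} = 192$,
\end{center}
and wrapping the result in one more sequence gives roughly $192 * 4^{192}$, a number with more than a hundred decimal digits.
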
||
One might be pretty skeptical about what this non-elementary
bound can bring us.
It turns out that the giant bounds are far from being hit.
Here we have some test data from randomly generated regular expressions:
|
3310 |
\begin{figure}[H] |
|
3311 |
\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}} |
|
593 | 3312 |
\begin{tikzpicture} |
3313 |
\begin{axis}[ |
|
618 | 3314 |
xlabel={$n$}, |
593 | 3315 |
x label style={at={(1.05,-0.05)}}, |
3316 |
ylabel={regex size}, |
|
3317 |
enlargelimits=false, |
|
3318 |
xtick={0,5,...,30}, |
|
3319 |
xmax=33, |
|
611 | 3320 |
%ymax=1000, |
3321 |
%ytick={0,100,...,1000}, |
|
593 | 3322 |
scaled ticks=false, |
3323 |
axis lines=left, |
|
618 | 3324 |
width=4.75cm, |
3325 |
height=3.8cm, |
|
593 | 3326 |
legend entries={regex1}, |
618 | 3327 |
legend pos=north east, |
593 | 3328 |
legend cell align=left] |
3329 |
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
3330 |
\end{axis} |
|
3331 |
\end{tikzpicture} |
|
618 | 3332 |
& |
593 | 3333 |
\begin{tikzpicture} |
3334 |
\begin{axis}[ |
|
3335 |
xlabel={$n$}, |
|
3336 |
x label style={at={(1.05,-0.05)}}, |
|
3337 |
%ylabel={time in secs}, |
|
3338 |
enlargelimits=false, |
|
3339 |
xtick={0,5,...,30}, |
|
3340 |
xmax=33, |
|
611 | 3341 |
%ymax=1000, |
3342 |
%ytick={0,100,...,1000}, |
|
593 | 3343 |
scaled ticks=false, |
3344 |
axis lines=left, |
|
618 | 3345 |
width=4.75cm, |
3346 |
height=3.8cm, |
|
593 | 3347 |
legend entries={regex2}, |
618 | 3348 |
legend pos=south east, |
593 | 3349 |
legend cell align=left] |
3350 |
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
3351 |
\end{axis} |
|
3352 |
\end{tikzpicture} |
|
618 | 3353 |
  &
\begin{tikzpicture}
\begin{axis}[
    xlabel={$n$},
    x label style={at={(1.05,-0.05)}},
    %ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=33,
    %ymax=1000,
    %ytick={0,100,...,1000},
    scaled ticks=false,
    axis lines=left,
    width=4.75cm,
    height=3.8cm,
    legend entries={regex3},
    legend pos=south east,
    legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data};
\end{axis}
\end{tikzpicture}\\
\multicolumn{3}{c}{}
\end{tabular}
\caption{Graphs: size change of 3 randomly generated
regular expressions with respect to the input string length.
The x-axis represents the length of the input.}
\end{figure}
\noindent
The sizes of most of these regular expressions seem to stay within a polynomial bound
with respect to their original size.
We will discuss improvements to this bound in the next chapter.



\subsection{Possible Further Improvements}
There are two problems with this finiteness result, though:\\
(i)
First, it is not yet a direct formalisation of our lexer's complexity,
as a complexity proof would require looking into
the time it takes to execute {\bf all} the operations
involved in the lexer (simp, collect, decode), not just the derivative.\\
(ii)
Second, the bound is not yet tight, and we seek to improve $N_a$ so that
it is polynomial in $\llbracket a \rrbracket$.\\
Still, we believe this contribution is useful,
because
\begin{itemize}
	\item
		The size proof can serve as a starting point for a complexity
		formalisation.
		Taking derivatives is the most important phase of our lexer algorithm.
		Size properties of derivatives cover the majority of the algorithm
		and are therefore a good indication of the complexity of the entire program.
	\item
		The bound is already a strong indication that catastrophic
		backtracking is much less likely to occur in our $\blexersimp$
		algorithm.
		We refine $\blexersimp$ with $\blexerStrong$ in the next chapter,
		for which we conjecture that the bound becomes polynomial.
\end{itemize}

%----------------------------------------------------------------------------------------
%	SECTION 4
%----------------------------------------------------------------------------------------



One might wonder what the actual bound is, as opposed to the loose bound we gave
for the convenience of a more straightforward proof.
How much can the regex $r^* \backslash s$ grow?
As earlier graphs have shown,
%TODO: reference that graph where size grows quickly
such derivatives can grow at a rate that is at most
exponential with respect to the number of input characters,
but will eventually level off when the string $s$ is long enough.
If they grow to a size exponential with respect to the original regex, our algorithm
would still be slow.
Unfortunately, we have concrete examples
where such regular expressions grow exponentially large before levelling off.
For instance, the derivatives of
\begin{center}
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$
\end{center}
\noindent
already reach a maximum
size that is exponential in the number $n$
under our current simplification rules:
%TODO: graph of a regex whose size increases exponentially.
\begin{center}
\begin{tikzpicture}
    \begin{axis}[
        height=0.5\textwidth,
        width=\textwidth,
        xlabel=number of a's,
        xtick={0,...,9},
        ylabel=maximum size,
        ymode=log,
        log basis y={2}
]
        \addplot[mark=*,blue] table {re-chengsong.data};
    \end{axis}
\end{tikzpicture}
\end{center}

For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots +
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ in the discussion below.
The exponential size is caused by the regex
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$
inside the $(\ldots) ^*$ having exponentially many
different derivatives, even though the differences between them are minor.
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$
will therefore contain the following terms (after flattening out all nested
alternatives):
\begin{center}
$(\sum_{i = 1}^{n}  (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\[1mm]
	$(1 \leq m' \leq m )$
\end{center}
There are at least exponentially
many such terms.\footnote{To be exact, these terms are pairwise
distinct for $1 \leq m' \leq \text{lcm}(1, \ldots, n)$. The details are omitted,
but the point is that this number is exponential in $n$.}
As the derivative of the intermediate result is taken with respect to each new input character,
more and more such distinct terms accumulate.
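
\noindent
As a small worked instance of the residue formula above (an illustration only,
for the case $n = 3$), write $k_i = (i - (m' \% i)) \% i$ for the number of $a$'s
standing in front of $(\underbrace{a \ldots a}_{\text{i a's}})^*$.
Tabulating $k_1$, $k_2$ and $k_3$ for the first few values of $m'$ gives
\begin{center}
\begin{tabular}{c|cccccc}
$m'$  & $1$ & $2$ & $3$ & $4$ & $5$ & $6$\\
\hline
$k_1$ & $0$ & $0$ & $0$ & $0$ & $0$ & $0$\\
$k_2$ & $1$ & $0$ & $1$ & $0$ & $1$ & $0$\\
$k_3$ & $2$ & $1$ & $0$ & $2$ & $1$ & $0$\\
\end{tabular}
\end{center}
\noindent
The six columns are pairwise different, so these six values of $m'$ give six
different terms. From $m' = 7$ onwards the columns repeat, which agrees with
the least common multiple of $1$, $2$ and $3$ being $6$.
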
The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms
\begin{center}
$(\sum_{i = 1}^{n}
(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\
$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$
\end{center}
\noindent
where $m' \neq m''$,
as they are slightly different and hence not syntactically equal.
This means that with our current simplification methods,
we will not be able to control the derivative so that
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$
These terms are similar in the sense that their heads
all consist of sub-terms of the form
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot  (\underbrace{a \ldots a}_{\text{i a's}})^* $.
For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most
$n (n + 1) / 2$ such sub-terms.
For example, the derivatives of $(a^* + (aa)^* + (aaa)^*) ^*$
can be described by 6 such sub-terms:
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$,
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$.
The total number of different ``head terms'', $n (n + 1) / 2$,
equals the number of occurrences of the character $a$ in the regex
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$.
If we can improve our de-duplication process so that it becomes smarter
and only keeps track of these $n (n+1) /2$ terms, then we can keep
the size growth polynomial again.
This example also suggests a slightly different notion of size, known as the
alphabetic width:
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\
$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\
$\textit{awidth} \; c$ & $\dn$ & $1$\\
$\textit{awidth} \; (r_1 + r_2)$ & $\dn$ & $\textit{awidth} \;
r_1 + \textit{awidth} \; r_2$\\
$\textit{awidth} \; (r_1 \cdot r_2)$ & $\dn$ & $\textit{awidth} \;
r_1 + \textit{awidth} \; r_2$\\
$\textit{awidth} \; (r^*)$ & $\dn$ & $\textit{awidth} \; r$\\
\end{tabular}
\end{center}
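
\noindent
As a quick check of this definition on the running example (a worked
calculation obtained by unfolding the clauses above, given here only as an illustration),
we have
\begin{center}
\begin{tabular}{lcl}
$\textit{awidth} \; ((a^* + (aa)^* + (aaa)^*)^*)$ & $=$ &
$\textit{awidth} \; (a^*) + \textit{awidth} \; ((aa)^*) + \textit{awidth} \; ((aaa)^*)$\\
& $=$ & $1 + 2 + 3$\\
& $=$ & $6$\\
\end{tabular}
\end{center}
\noindent
More generally,
$\textit{awidth} \; ((\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*)
= \sum_{i=1}^{n} i = n (n + 1) / 2$,
which is the same number as the count of head terms given earlier.
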


Antimirov~\parencite{Antimirov95} has proven that
$|\textit{PDER}_{UNIV}(r)| \leq \textit{awidth}(r)$,
where $\textit{PDER}_{UNIV}(r)$ is the set of all possible subterms
created by taking derivatives of $r$ with respect to all possible strings.
If we can make sure that at any moment in our lexing algorithm our
intermediate result holds at most one copy of each of these
subterms, then we can get the same bound as Antimirov's.
This leads to the algorithm in the next chapter.



%----------------------------------------------------------------------------------------
%	SECTION 1
%----------------------------------------------------------------------------------------


%-----------------------------------
%	SUBSECTION 1
%-----------------------------------
%\subsection{Syntactic Equivalence Under $\simp$}
%We prove that minor differences can be annihilated
%by $\simp$.
%For example,
%\begin{center}
%	$\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) =
%	\simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$
%\end{center}