author | Chengsong |
Mon, 10 Jul 2023 00:44:45 +0100 | |
changeset 659 | 2e05f04ed6b3 |
parent 640 | bd1354127574 |
child 660 | eddc4eaba7c4 |
permissions | -rwxr-xr-x |
532 | 1 |
% Chapter Template |
2 |
||
3 |
\chapter{Finiteness Bound} % Main chapter title |
|
4 |
||
5 |
\label{Finite} |
|
6 |
% In Chapter 4 \ref{Chapter4} we give the second guarantee |
|
7 |
%of our bitcoded algorithm, that is a finite bound on the size of any |
|
8 |
%regex's derivatives. |
|
9 |
||
618 | 10 |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
11 |
In this chapter we give a bound in terms of the size of |
624 | 12 |
the calculated derivatives: |
590 | 13 |
given an annotated regular expression $a$, for any string $s$ |
624 | 14 |
our algorithm $\blexersimp$'s derivatives |
15 |
are finitely bounded |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
16 |
by a constant that only depends on $a$. |
618 | 17 |
Formally we show that there exists an $N_a$ such that |
576 | 18 |
\begin{center} |
593 | 19 |
$\llbracket \bderssimp{a}{s} \rrbracket \leq N_a$ |
576 | 20 |
\end{center} |
21 |
\noindent |
|
618 | 22 |
where the size ($\llbracket \_ \rrbracket$) of |
23 |
an annotated regular expression is defined |
|
24 |
in terms of the number of nodes in its |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
25 |
tree structure (its recursive definition is given in the next page). |
613 | 26 |
We believe this size bound |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
27 |
is important in the context of POSIX lexing because |
590 | 28 |
\begin{itemize} |
29 |
\item |
|
618 | 30 |
It is a stepping stone towards the goal |
31 |
of eliminating ``catastrophic backtracking''. |
|
32 |
If the internal data structures used by our algorithm |
|
33 |
grows beyond a finite bound, then clearly |
|
34 |
the algorithm (which traverses these structures) will |
|
35 |
be slow. |
|
624 | 36 |
The next step is to refine the bound $N_a$ so that it |
37 |
is not just finite but polynomial in $\llbracket a\rrbracket$. |
|
590 | 38 |
\item |
618 | 39 |
Having the finite bound formalised |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
40 |
gives us higher confidence that |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
41 |
our simplification algorithm $\simp$ does not ``misbehave'' |
639 | 42 |
like $\textit{simpSL}$ does. |
618 | 43 |
The bound is universal for a given regular expression, |
44 |
which is an advantage over work which |
|
624 | 45 |
only gives empirical evidence on |
639 | 46 |
some test cases (see for example Verbatim work \cite{Verbatimpp}). |
590 | 47 |
\end{itemize} |
625 | 48 |
\noindent |
49 |
We then extend our $\blexersimp$ |
|
50 |
to support bounded repetitions ($r^{\{n\}}$). |
|
51 |
We update our formalisation of |
|
52 |
the correctness and finiteness properties to |
|
53 |
include this new construct. |
|
639 | 54 |
We show that we can out-compete other verified lexers such as |
625 | 55 |
Verbatim++ on bounded regular expressions. |
56 |
||
618 | 57 |
In the next section we describe in more detail |
58 |
what the finite bound means in our algorithm |
|
59 |
and why the size of the internal data structures of |
|
60 |
a typical derivative-based lexer such as |
|
639 | 61 |
Sulzmann and Lu's needs formal treatment. |
618 | 62 |
|
625 | 63 |
|
64 |
||
65 |
||
639 | 66 |
\section{Formalising Size Bound of Derivatives} |
577 | 67 |
\noindent |
613 | 68 |
In our lexer ($\blexersimp$), |
69 |
we take an annotated regular expression as input, |
|
618 | 70 |
and repeately take derivative of and simplify it. |
71 |
\begin{figure} |
|
72 |
\begin{center} |
|
73 |
\begin{tabular}{lcl} |
|
74 |
$\llbracket _{bs}\ONE \rrbracket$ & $\dn$ & $1$\\ |
|
75 |
$\llbracket \ZERO \rrbracket$ & $\dn$ & $1$ \\ |
|
76 |
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket$ & $\dn$ & $\llbracket r_1 \rrbracket + \llbracket r_2 \rrbracket + 1$\\ |
|
77 |
$\llbracket _{bs}\mathbf{c} \rrbracket $ & $\dn$ & $1$\\ |
|
78 |
$\llbracket _{bs}\sum as \rrbracket $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket)\; as + 1$\\ |
|
79 |
$\llbracket _{bs} a^* \rrbracket $ & $\dn$ & $\llbracket a \rrbracket + 1$. |
|
80 |
\end{tabular} |
|
81 |
\end{center} |
|
82 |
\caption{The size function of bitcoded regular expressions}\label{brexpSize} |
|
83 |
\end{figure} |
|
84 |
||
85 |
\begin{figure} |
|
593 | 86 |
\begin{tikzpicture}[scale=2, |
87 |
every node/.style={minimum size=11mm}, |
|
88 |
->,>=stealth',shorten >=1pt,auto,thick |
|
89 |
] |
|
90 |
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
|
91 |
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
|
92 |
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
|
590 | 93 |
|
593 | 94 |
\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=6mm]{$a_{1s}$}; |
95 |
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp$}; |
|
590 | 96 |
|
593 | 97 |
\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 12mm]{$a_2$}; |
98 |
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
|
590 | 99 |
|
593 | 100 |
\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=6mm]{$a_{2s}$}; |
101 |
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp$}; |
|
590 | 102 |
|
593 | 103 |
\node (rns) [rectangle, draw = blue, thick, right=of r2s,minimum size=6mm]{$a_{ns}$}; |
104 |
\draw[->,line width=0.2mm, dashed](r2s)--(rns) node[above,midway] {$\backslash \ldots$}; |
|
590 | 105 |
|
593 | 106 |
\node (v) [circle, thick, draw, right=of rns, minimum size=6mm, right=1.7cm]{$v$}; |
107 |
\draw[->, line width=0.2mm](rns)--(v) node[above, midway] {\bmkeps} node [below, midway] {\decode}; |
|
108 |
\end{tikzpicture} |
|
109 |
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{simpShrinks} |
|
590 | 110 |
\end{figure} |
618 | 111 |
|
576 | 112 |
\noindent |
590 | 113 |
Each time |
613 | 114 |
a derivative is taken, the regular expression might grow. |
639 | 115 |
However, the simplification that is immediately afterwards will often shrink it so that |
116 |
the overall size of the derivatives stays relatively small. |
|
577 | 117 |
This intuition is depicted by the relative size |
590 | 118 |
change between the black and blue nodes: |
639 | 119 |
After $\simp$ the node shrinks. |
618 | 120 |
Our proof states that all the blue nodes |
613 | 121 |
stay below a size bound $N_a$ determined by the input $a$. |
576 | 122 |
|
590 | 123 |
\noindent |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
124 |
Sulzmann and Lu's assumed a similar picture of their algorithm, |
590 | 125 |
though in fact their algorithm's size might be better depicted by the following graph: |
126 |
\begin{figure}[H] |
|
593 | 127 |
\begin{tikzpicture}[scale=2, |
128 |
every node/.style={minimum size=11mm}, |
|
129 |
->,>=stealth',shorten >=1pt,auto,thick |
|
130 |
] |
|
131 |
\node (r0) [rectangle, draw=black, thick, minimum size = 5mm, draw=blue] {$a$}; |
|
132 |
\node (r1) [rectangle, draw=black, thick, right=of r0, minimum size = 7mm]{$a_1$}; |
|
133 |
\draw[->,line width=0.2mm](r0)--(r1) node[above,midway] {$\backslash c_1$}; |
|
590 | 134 |
|
593 | 135 |
\node (r1s) [rectangle, draw=blue, thick, right=of r1, minimum size=7mm]{$a_{1s}$}; |
136 |
\draw[->, line width=0.2mm](r1)--(r1s) node[above, midway] {$\simp'$}; |
|
590 | 137 |
|
593 | 138 |
\node (r2) [rectangle, draw=black, thick, right=of r1s, minimum size = 17mm]{$a_2$}; |
139 |
\draw[->,line width=0.2mm](r1s)--(r2) node[above,midway] {$\backslash c_2$}; |
|
590 | 140 |
|
593 | 141 |
\node (r2s) [rectangle, draw = blue, thick, right=of r2,minimum size=14mm]{$a_{2s}$}; |
142 |
\draw[->,line width=0.2mm](r2)--(r2s) node[above,midway] {$\simp'$}; |
|
590 | 143 |
|
593 | 144 |
\node (r3) [rectangle, draw = black, thick, right= of r2s, minimum size = 22mm]{$a_3$}; |
145 |
\draw[->,line width=0.2mm](r2s)--(r3) node[above,midway] {$\backslash c_3$}; |
|
590 | 146 |
|
593 | 147 |
\node (rns) [right = of r3, draw=blue, minimum size = 20mm]{$a_{3s}$}; |
148 |
\draw[->,line width=0.2mm] (r3)--(rns) node [above, midway] {$\simp'$}; |
|
590 | 149 |
|
593 | 150 |
\node (rnn) [right = of rns, minimum size = 1mm]{}; |
151 |
\draw[->, dashed] (rns)--(rnn) node [above, midway] {$\ldots$}; |
|
590 | 152 |
|
593 | 153 |
\end{tikzpicture} |
154 |
\caption{Regular expression size change during our $\blexersimp$ algorithm}\label{sulzShrinks} |
|
590 | 155 |
\end{figure} |
156 |
\noindent |
|
639 | 157 |
The picture means that in some cases their lexer (where they use $\simpsulz$ |
613 | 158 |
as the simplification function) |
618 | 159 |
will have a size explosion, causing the running time |
613 | 160 |
of each derivative step to grow continuously (for example |
590 | 161 |
in \ref{SulzmannLuLexerTime}). |
613 | 162 |
They tested out the run time of their |
590 | 163 |
lexer on particular examples such as $(a+b+ab)^*$ |
613 | 164 |
and claimed that their algorithm is linear w.r.t to the input. |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
165 |
With our mechanised proof, we avoid this type of unintentional |
639 | 166 |
generalisation. |
613 | 167 |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
168 |
Before delving into the details of the formalisation, |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
169 |
we are going to provide an overview of it in the following subsection. |
613 | 170 |
|
590 | 171 |
|
577 | 172 |
\subsection{Overview of the Proof} |
618 | 173 |
A high-level overview of the main components of the finiteness proof |
174 |
is as follows: |
|
593 | 175 |
\begin{figure}[H] |
176 |
\begin{tikzpicture}[scale=1,font=\bf, |
|
177 |
node/.style={ |
|
178 |
rectangle,rounded corners=3mm, |
|
179 |
ultra thick,draw=black!50,minimum height=18mm, |
|
180 |
minimum width=20mm, |
|
181 |
top color=white,bottom color=black!20}] |
|
543 | 182 |
|
183 |
||
593 | 184 |
\node (0) at (-5,0) |
185 |
[node, text width=1.8cm, text centered] |
|
186 |
{$\llbracket \bderssimp{a}{s} \rrbracket$}; |
|
187 |
\node (A) at (0,0) |
|
188 |
[node,text width=1.6cm, text centered] |
|
189 |
{$\llbracket \rderssimp{r}{s} \rrbracket_r$}; |
|
190 |
\node (B) at (3,0) |
|
191 |
[node,text width=3.0cm, anchor=west, minimum width = 40mm] |
|
192 |
{$\llbracket \textit{ClosedForm}(r, s)\rrbracket_r$}; |
|
193 |
\node (C) at (9.5,0) [node, minimum width=10mm] {$N_r$}; |
|
194 |
||
195 |
\draw [->,line width=0.5mm] (0) -- |
|
196 |
node [above,pos=0.45] {=} (A) node [below, pos = 0.45] {$(r = a \downarrow_r)$} (A); |
|
197 |
\draw [->,line width=0.5mm] (A) -- |
|
198 |
node [above,pos=0.35] {$\quad =\ldots=$} (B); |
|
199 |
\draw [->,line width=0.5mm] (B) -- |
|
200 |
node [above,pos=0.35] {$\quad \leq \ldots \leq$} (C); |
|
201 |
\end{tikzpicture} |
|
202 |
%\caption{ |
|
203 |
\end{figure} |
|
576 | 204 |
\noindent |
577 | 205 |
We explain the steps one by one: |
532 | 206 |
\begin{itemize} |
590 | 207 |
\item |
208 |
We first introduce the operations such as |
|
209 |
derivatives, simplification, size calculation, etc. |
|
618 | 210 |
associated with $\rrexp$s, which we have introduced |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
211 |
in chapter \ref{Bitcoded2}. As promised we will discuss |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
212 |
why they are needed in \ref{whyRerase}. |
593 | 213 |
The operations on $\rrexp$s are identical to those on |
618 | 214 |
annotated regular expressions except that they dispense with |
215 |
bitcodes. This means that all proofs about size of $\rrexp$s will apply to |
|
216 |
annotated regular expressions, because the size of a regular |
|
217 |
expression is independent of the bitcodes. |
|
590 | 218 |
\item |
593 | 219 |
We prove that $\rderssimp{r}{s} = \textit{ClosedForm}(r, s)$, |
220 |
where $\textit{ClosedForm}(r, s)$ is entirely |
|
618 | 221 |
given as the derivatives of their children regular |
593 | 222 |
expressions. |
223 |
We call the right-hand-side the \emph{Closed Form} |
|
224 |
of the derivative $\rderssimp{r}{s}$. |
|
590 | 225 |
\item |
618 | 226 |
Formally we give an estimate of |
227 |
$\llbracket \textit{ClosedForm}(r, s) \rrbracket_r$. |
|
593 | 228 |
The key observation is that $\distinctBy$'s output is |
229 |
a list with a constant length bound. |
|
532 | 230 |
\end{itemize} |
594 | 231 |
We will expand on these steps in the next sections.\\ |
532 | 232 |
|
613 | 233 |
\section{The $\textit{Rrexp}$ Datatype} |
594 | 234 |
The first step is to define |
235 |
$\textit{rrexp}$s. |
|
618 | 236 |
They are annotated regular expressions without bitcodes, |
639 | 237 |
allowing a more convenient size bound proof. |
618 | 238 |
%Of course, the bits which encode the lexing information |
239 |
%would grow linearly with respect |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
240 |
%to the input, which should be taken into accounte when we wish to tackle the runtime complexity. |
618 | 241 |
%But for the sake of the structural size |
242 |
%we can safely ignore them.\\ |
|
243 |
The datatype |
|
594 | 244 |
definition of the $\rrexp$, called |
593 | 245 |
\emph{r-regular expressions}, |
594 | 246 |
was initially defined in \ref{rrexpDef}. |
247 |
The reason for the prefix $r$ is |
|
593 | 248 |
to make a distinction |
594 | 249 |
with basic regular expressions. |
639 | 250 |
We give here again the definition of $\rrexp$. |
576 | 251 |
\[ \rrexp ::= \RZERO \mid \RONE |
593 | 252 |
\mid \RCHAR{c} |
253 |
\mid \RSEQ{r_1}{r_2} |
|
254 |
\mid \RALTS{rs} |
|
255 |
\mid \RSTAR{r} |
|
576 | 256 |
\] |
593 | 257 |
The size of an r-regular expression is |
258 |
written $\llbracket r\rrbracket_r$, |
|
259 |
whose definition mirrors that of an annotated regular expression. |
|
576 | 260 |
\begin{center} |
618 | 261 |
\begin{tabular}{lcl} |
593 | 262 |
$\llbracket _{bs}\ONE \rrbracket_r$ & $\dn$ & $1$\\ |
263 |
$\llbracket \ZERO \rrbracket_r$ & $\dn$ & $1$ \\ |
|
264 |
$\llbracket _{bs} r_1 \cdot r_2 \rrbracket_r$ & $\dn$ & $\llbracket r_1 \rrbracket_r + \llbracket r_2 \rrbracket_r + 1$\\ |
|
265 |
$\llbracket _{bs}\mathbf{c} \rrbracket_r $ & $\dn$ & $1$\\ |
|
266 |
$\llbracket _{bs}\sum as \rrbracket_r $ & $\dn$ & $\map \; (\llbracket \_ \rrbracket_r)\; as + 1$\\ |
|
267 |
$\llbracket _{bs} a^* \rrbracket_r $ & $\dn$ & $\llbracket a \rrbracket_r + 1$. |
|
268 |
\end{tabular} |
|
576 | 269 |
\end{center} |
270 |
\noindent |
|
593 | 271 |
The $r$ in the subscript of $\llbracket \rrbracket_r$ is to |
272 |
differentiate with the same operation for annotated regular expressions. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
273 |
Similar subscripts will be added for operations like $\rerase{}$: |
593 | 274 |
\begin{center} |
275 |
\begin{tabular}{lcl} |
|
276 |
$\rerase{\ZERO}$ & $\dn$ & $\RZERO$\\ |
|
277 |
$\rerase{_{bs}\ONE}$ & $\dn$ & $\RONE$\\ |
|
278 |
$\rerase{_{bs}\mathbf{c}}$ & $\dn$ & $\RCHAR{c}$\\ |
|
279 |
$\rerase{_{bs}r_1\cdot r_2}$ & $\dn$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\ |
|
280 |
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
|
281 |
$\rerase{_{bs} a ^*}$ & $\dn$ & $\rerase{a} ^*$ |
|
282 |
\end{tabular} |
|
283 |
\end{center} |
|
594 | 284 |
|
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
285 |
\subsection{Why a New Datatype?}\label{whyRerase} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
286 |
\marginpar{\em added label so this section can be referenced by other parts of the thesis |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
287 |
so that interested readers can jump to/be reassured that there will explanations.} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
288 |
Originally the erase operation $(\_)_\downarrow$ was |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
289 |
used by Ausaf et al. in their proofs related to $\blexer$. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
290 |
This function was not part of the lexing algorithm, and the sole purpose was to |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
291 |
bridge the gap between the $r$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
292 |
%$\textit{rexp}$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
293 |
(un-annotated) and $\textit{arexp}$ (annotated) |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
294 |
regular expression datatypes so as to leverage the correctness |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
295 |
theorem of $\lexer$.%to establish the correctness of $\blexer$. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
296 |
For example, lemma \ref{retrieveStepwise} %and \ref{bmkepsRetrieve} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
297 |
uses $\erase$ to convert an annotated regular expression $a$ into |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
298 |
a plain one so that it can be used by $\inj$ to create the desired value |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
299 |
$\inj\; (a)_\downarrow \; c \; v$. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
300 |
|
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
301 |
Ideally $\erase$ should only remove the auxiliary information not related to the |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
302 |
structure--the |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
303 |
bitcodes. However there exists a complication |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
304 |
where the alternative constructors have different arity for $\textit{arexp}$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
305 |
and $\textit{r}$: |
576 | 306 |
\begin{center} |
618 | 307 |
\begin{tabular}{lcl} |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
308 |
$\textit{r}$ & $::=$ & $\ldots \;|\; (\_ + \_) \; ::\; "\textit{r} \Rightarrow \textit{r} \Rightarrow \textit{r}" | \ldots$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
309 |
$\textit{arexp}$ & $::=$ & $\ldots\; |\; (\Sigma \_ ) \; ::\; "\textit{arexp} \; list \Rightarrow \textit{arexp}" | \ldots$ |
594 | 310 |
\end{tabular} |
576 | 311 |
\end{center} |
594 | 312 |
\noindent |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
313 |
To convert between the two |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
314 |
$\erase$ has to recursively disassemble a list into nested binary applications of the |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
315 |
$(\_ + \_)$ operator, |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
316 |
handling corner cases like empty or |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
317 |
singleton alternative lists: |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
318 |
%becomes $r$ during the |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
319 |
%$\erase$ function. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
320 |
%The annotated regular expression $\sum[a, b, c]$ would turn into |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
321 |
%$(a+(b+c))$. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
322 |
\begin{center} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
323 |
\begin{tabular}{lcl} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
324 |
$ (_{bs}\sum [])_\downarrow $ & $\dn$ & $\ZERO$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
325 |
$ (_{bs}\sum [a])_\downarrow$ & $\dn$ & $a$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
326 |
$ (_{bs}\sum a_1 :: a_2)_\downarrow$ & $\dn$ & $(a_1)_\downarrow + (a_2)_\downarrow)$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
327 |
$ (_{bs}\sum a :: as)_\downarrow$ & $\dn$ & $a_\downarrow + (\erase \; _{[]} \sum as)$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
328 |
\end{tabular} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
329 |
\end{center} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
330 |
\noindent |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
331 |
These operations inevitably change the structure and size of |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
332 |
an annotated regular expression. For example, |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
333 |
$a_1 = \sum _{Z}[x]$ has size 2, but $(a_1)_\downarrow = x$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
334 |
only has size 1. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
335 |
%adding unnecessary |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
336 |
%complexities to the size bound proof. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
337 |
%The reason we |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
338 |
%define a new datatype is that |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
339 |
%the $\erase$ function |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
340 |
%does not preserve the structure of annotated |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
341 |
%regular expressions. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
342 |
%We initially started by using |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
343 |
%plain regular expressions and tried to prove |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
344 |
%lemma \ref{rsizeAsize}, |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
345 |
%however the $\erase$ function messes with the structure of the |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
346 |
%annotated regular expression. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
347 |
%The $+$ constructor |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
348 |
%of basic regular expressions is only binary, whereas $\sum$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
349 |
%takes a list. Therefore we need to convert between |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
350 |
%annotated and normal regular expressions as follows: |
613 | 351 |
For example, if we define the size of a basic plain regular expression |
594 | 352 |
in the usual way, |
543 | 353 |
\begin{center} |
618 | 354 |
\begin{tabular}{lcl} |
593 | 355 |
$\llbracket \ONE \rrbracket_p$ & $\dn$ & $1$\\ |
356 |
$\llbracket \ZERO \rrbracket_p$ & $\dn$ & $1$ \\ |
|
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
357 |
$\llbracket r_1 + r_2 \rrbracket_p$ & $\dn$ & $\llbracket r_1 \rrbracket_p + \llbracket r_2 \rrbracket_p + 1$\\ |
593 | 358 |
$\llbracket \mathbf{c} \rrbracket_p $ & $\dn$ & $1$\\ |
359 |
$\llbracket r_1 \cdot r_2 \rrbracket_p $ & $\dn$ & $\llbracket r_1 \rrbracket_p \; + \llbracket r_2 \rrbracket_p + 1$\\ |
|
360 |
$\llbracket a^* \rrbracket_p $ & $\dn$ & $\llbracket a \rrbracket_p + 1$ |
|
361 |
\end{tabular} |
|
532 | 362 |
\end{center} |
543 | 363 |
\noindent |
594 | 364 |
Then the property |
532 | 365 |
\begin{center} |
613 | 366 |
$\llbracket a \rrbracket \stackrel{?}{=} \llbracket a_\downarrow \rrbracket_p$ |
532 | 367 |
\end{center} |
594 | 368 |
does not hold. |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
369 |
%With $\textit{rerase}$, however, |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
370 |
%only the bitcodes are thrown away. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
371 |
That leads to us defining the new regular expression datatype without |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
372 |
bitcodes but with a list alternative constructor, and defining a new erase function |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
373 |
in a strictly structure-preserving manner: |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
374 |
\begin{center} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
375 |
\begin{tabular}{lcl} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
376 |
$\textit{rrexp}$ & $::=$ & $\ldots\; |\; (\sum \_ ) \; ::\; "\textit{rrexp} \; list \Rightarrow \textit{rrexp}" | \ldots$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
377 |
$\rerase{_{bs}\sum as}$ & $\dn$ & $\RALTS{\map \; \rerase{\_} \; as}$\\ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
378 |
\end{tabular} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
379 |
\end{center} |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
380 |
\noindent |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
381 |
%But |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
382 |
%Everything about the structure remains intact. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
383 |
%Therefore it does not change the size |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
384 |
%of an annotated regular expression and we have: |
613 | 385 |
\noindent |
594 | 386 |
One might be able to prove an inequality such as |
387 |
$\llbracket a \rrbracket \leq \llbracket a_\downarrow \rrbracket_p $ |
|
388 |
and then estimate $\llbracket a_\downarrow \rrbracket_p$, |
|
389 |
but we found our approach more straightforward.\\ |
|
532 | 390 |
|
613 | 391 |
\subsection{Functions for R-regular Expressions} |
639 | 392 |
The downside of our approach is that we need to redefine |
393 |
several functions for $\rrexp$. |
|
618 | 394 |
In this section we shall define the r-regular expression version |
639 | 395 |
of $\bder$, and $\textit{bsimp}$ related functions. |
613 | 396 |
We use $r$ as the prefix or subscript to differentiate |
397 |
with the bitcoded version. |
|
618 | 398 |
%For example,$\backslash_r$, $\rdistincts$, and $\rsimp$ |
399 |
%as opposed to $\backslash$, $\distinctBy$, and $\bsimp$. |
|
400 |
%As promised, they are much simpler than their bitcoded counterparts. |
|
613 | 401 |
%The operations on r-regular expressions are |
402 |
%almost identical to those of the annotated regular expressions, |
|
403 |
%except that no bitcodes are used. For example, |
|
618 | 404 |
The derivative operation for an r-regular expression is\\ |
543 | 405 |
\begin{center} |
593 | 406 |
\begin{tabular}{@{}lcl@{}} |
407 |
$(\ZERO)\,\backslash_r c$ & $\dn$ & $\ZERO$\\ |
|
408 |
$(\ONE)\,\backslash_r c$ & $\dn$ & |
|
409 |
$\textit{if}\;c=d\; \;\textit{then}\; |
|
410 |
\ONE\;\textit{else}\;\ZERO$\\ |
|
411 |
$(\sum \;\textit{rs})\,\backslash_r c$ & $\dn$ & |
|
412 |
$\sum\;(\textit{map} \; (\_\backslash_r c) \; rs )$\\ |
|
413 |
$(r_1\cdot r_2)\,\backslash_r c$ & $\dn$ & |
|
594 | 414 |
$\textit{if}\;(\textit{rnullable}\,r_1)$\\ |
593 | 415 |
& &$\textit{then}\;\sum\,[(r_1\,\backslash_r c)\cdot\,r_2,$\\ |
416 |
& &$\phantom{\textit{then},\;\sum\,}((r_2\,\backslash_r c))]$\\ |
|
417 |
& &$\textit{else}\;\,(r_1\,\backslash_r c)\cdot r_2$\\ |
|
418 |
$(r^*)\,\backslash_r c$ & $\dn$ & |
|
419 |
$( r\,\backslash_r c)\cdot |
|
420 |
(_{[]}r^*))$ |
|
421 |
\end{tabular} |
|
543 | 422 |
\end{center} |
423 |
\noindent |
|
618 | 424 |
where we omit the definition of $\textit{rnullable}$. |
639 | 425 |
The generalisation from the derivatives w.r.t a character to |
426 |
derivatives w.r.t strings is given as |
|
620 | 427 |
\begin{center} |
428 |
\begin{tabular}{lcl} |
|
429 |
$r \backslash_{rs} []$ & $\dn$ & $r$\\ |
|
430 |
$r \backslash_{rs} c::s$ & $\dn$ & $(r\backslash_r c) \backslash_{rs} s$ |
|
431 |
\end{tabular} |
|
432 |
\end{center} |
|
618 | 433 |
|
434 |
The function $\distinctBy$ for r-regular expressions does not need |
|
594 | 435 |
a function checking equivalence because |
618 | 436 |
there are no bit annotations. |
437 |
Therefore we have |
|
532 | 438 |
\begin{center} |
593 | 439 |
\begin{tabular}{lcl} |
440 |
$\rdistinct{[]}{rset} $ & $\dn$ & $[]$\\ |
|
594 | 441 |
$\rdistinct{r :: rs}{rset}$ & $\dn$ & |
442 |
$\textit{if}(r \in \textit{rset}) \; \textit{then} \; \rdistinct{rs}{rset}$\\ |
|
443 |
& & $\textit{else}\; \; |
|
444 |
r::\rdistinct{rs}{(rset \cup \{r\})}$ |
|
593 | 445 |
\end{tabular} |
532 | 446 |
\end{center} |
447 |
%TODO: definition of rsimp (maybe only the alternative clause) |
|
543 | 448 |
\noindent |
618 | 449 |
%We would like to make clear |
450 |
%a difference between our $\rdistincts$ and |
|
451 |
%the Isabelle $\textit {distinct}$ predicate. |
|
452 |
%In Isabelle $\textit{distinct}$ is a function that returns a boolean |
|
453 |
%rather than a list. |
|
454 |
%It tests if all the elements of a list are unique.\\ |
|
455 |
With $\textit{rdistinct}$ in place, |
|
456 |
the flatten function for $\rrexp$ is as follows: |
|
595 | 457 |
\begin{center} |
458 |
\begin{tabular}{@{}lcl@{}} |
|
596 | 459 |
$\textit{rflts} \; (\sum \textit{as}) :: \textit{as'}$ & $\dn$ & $as \; @ \; \textit{rflts} \; as' $ \\ |
595 | 460 |
$\textit{rflts} \; \ZERO :: as'$ & $\dn$ & $ \textit{rflts} \; \textit{as'} $ \\ |
461 |
$\textit{rflts} \; a :: as'$ & $\dn$ & $a :: \textit{rflts} \; \textit{as'}$ \quad(otherwise) |
|
462 |
\end{tabular} |
|
463 |
\end{center} |
|
464 |
\noindent |
|
618 | 465 |
The function |
466 |
$\rsimpalts$ corresponds to $\textit{bsimp}_{ALTS}$: |
|
596 | 467 |
\begin{center} |
468 |
\begin{tabular}{@{}lcl@{}} |
|
469 |
$\rsimpalts \;\; nil$ & $\dn$ & $\RZERO$\\ |
|
470 |
$\rsimpalts \;\; r::nil$ & $\dn$ & $r$\\ |
|
471 |
$\rsimpalts \;\; rs$ & $\dn$ & $\sum rs$\\ |
|
472 |
\end{tabular} |
|
473 |
\end{center} |
|
474 |
\noindent |
|
618 | 475 |
Similarly, we have $\rsimpseq$ which corresponds to $\textit{bsimp}_{SEQ}$: |
596 | 476 |
\begin{center} |
477 |
\begin{tabular}{@{}lcl@{}} |
|
478 |
$\rsimpseq \;\; \RZERO \; \_ $ & $=$ & $\RZERO$\\ |
|
479 |
$\rsimpseq \;\; \_ \; \RZERO $ & $=$ & $\RZERO$\\ |
|
480 |
$\rsimpseq \;\; \RONE \cdot r_2$ & $\dn$ & $r_2$\\ |
|
481 |
$\rsimpseq \;\; r_1 r_2$ & $\dn$ & $r_1 \cdot r_2$\\ |
|
482 |
\end{tabular} |
|
483 |
\end{center} |
|
484 |
and get $\textit{rsimp}$ and $\rderssimp{\_}{\_}$: |
|
595 | 485 |
\begin{center} |
486 |
\begin{tabular}{@{}lcl@{}} |
|
487 |
||
596 | 488 |
$\textit{rsimp} \; (r_1\cdot r_2)$ & $\dn$ & $ \textit{rsimp}_{SEQ} \; bs \;(\textit{rsimp} \; r_1) \; (\textit{rsimp} \; r_2) $ \\ |
489 |
$\textit{rsimp} \; (_{bs}\sum \textit{rs})$ & $\dn$ & $\textit{rsimp}_{ALTS} \; \textit{bs} \; (\textit{rdistinct} \; ( \textit{rflts} ( \textit{map} \; rsimp \; rs)) \; \rerases \; \varnothing) $ \\ |
|
490 |
$\textit{rsimp} \; r$ & $\dn$ & $\textit{r} \qquad \textit{otherwise}$ |
|
595 | 491 |
\end{tabular} |
492 |
\end{center} |
|
596 | 493 |
\begin{center} |
494 |
\begin{tabular}{@{}lcl@{}} |
|
495 |
$r\backslash_{rsimp} \, c$ & $\dn$ & $\rsimp \; (r\backslash_r \, c)$ |
|
496 |
\end{tabular} |
|
497 |
\end{center} |
|
498 |
||
499 |
\begin{center} |
|
500 |
\begin{tabular}{@{}lcl@{}} |
|
601 | 501 |
$r \backslash_{rsimps} \; \; c\!::\!s $ & $\dn$ & $(r \backslash_{rsimp}\, c) \backslash_{rsimps}\, s$ \\ |
596 | 502 |
$r \backslash_{rsimps} [\,] $ & $\dn$ & $r$ |
503 |
\end{tabular} |
|
504 |
\end{center} |
|
505 |
\noindent |
|
601 | 506 |
We do not define an r-regular expression version of $\blexersimp$, |
618 | 507 |
as our proof does not depend on it. |
613 | 508 |
Now we are ready to introduce how r-regular expressions allow |
509 |
us to prove the size bound on bitcoded regular expressions. |
|
510 |
||
511 |
\subsection{Using R-regular Expressions to Bound Bit-coded Regular Expressions} |
|
512 |
Everything about the size of annotated regular expressions after the application |
|
513 |
of function $\bsimp$ and $\backslash_{simps}$ |
|
514 |
can be calculated via the size of r-regular expressions after the application |
|
515 |
of $\rsimp$ and $\backslash_{rsimps}$: |
|
564 | 516 |
\begin{lemma}\label{sizeRelations} |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
517 |
The following equalities hold: |
543 | 518 |
\begin{itemize} |
519 |
\item |
|
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
520 |
$\rsize{\rerase a} = \asize a$ |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
521 |
\item |
601 | 522 |
$\asize{\bsimps \; a} = \rsize{\rsimp{ \rerase{a}}}$ |
554 | 523 |
\item |
596 | 524 |
$\asize{\bderssimp{a}{s}} = \rsize{\rderssimp{\rerase{a}}{s}}$ |
554 | 525 |
\end{itemize} |
532 | 526 |
\end{lemma} |
601 | 527 |
\begin{proof} |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
528 |
First part follows from the definition of $(\_)_{\downarrow_r}$. |
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
529 |
The second part is by induction on the inductive cases |
601 | 530 |
of $\textit{bsimp}$. |
659
2e05f04ed6b3
Addressed Gerog "can't understand 'erase messes with structure'" comment
Chengsong
parents:
640
diff
changeset
|
531 |
The third part is by induction on the string $s$, |
601 | 532 |
where the inductive step follows from part one. |
533 |
\end{proof} |
|
543 | 534 |
\noindent |
596 | 535 |
With lemma \ref{sizeRelations}, |
601 | 536 |
we will be able to focus on |
537 |
estimating only |
|
538 |
$\rsize{\rderssimp{\rerase{a}}{s}}$ |
|
539 |
in later parts because |
|
540 |
\begin{center} |
|
541 |
$\rsize{\rderssimp{\rerase{a}}{s}} \leq N_r \quad$ |
|
542 |
implies |
|
543 |
$\quad \llbracket a \backslash_{bsimps} s \rrbracket \leq N_r$. |
|
544 |
\end{center} |
|
618 | 545 |
%From now on we |
546 |
%Unless stated otherwise in the rest of this |
|
547 |
%chapter all regular expressions without |
|
548 |
%bitcodes are seen as r-regular expressions ($\rrexp$s). |
|
549 |
%For the binary alternative r-regular expression $\RALTS{[r_1, r_2]}$, |
|
550 |
%we use the notation $r_1 + r_2$ |
|
551 |
%for brevity. |
|
532 | 552 |
|
553 |
||
554 |
%----------------------------------- |
|
596 | 555 |
% SUB SECTION ROADMAP RREXP BOUND |
532 | 556 |
%----------------------------------- |
553 | 557 |
|
596 | 558 |
%\subsection{Roadmap to a Bound for $\textit{Rrexp}$} |
553 | 559 |
|
596 | 560 |
%The way we obtain the bound for $\rrexp$s is by two steps: |
561 |
%\begin{itemize} |
|
562 |
% \item |
|
563 |
% First, we rewrite $r\backslash s$ into something else that is easier |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
564 |
% to bound. This step is crucial for the inductive case |
596 | 565 |
% $r_1 \cdot r_2$ and $r^*$, where the derivative can grow and bloat in a wild way, |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
566 |
% but after simplification, they will always be equal or smaller to a form consisting of an alternative |
596 | 567 |
% list of regular expressions $f \; (g\; (\sum rs))$ with some functions applied to it, where each element will be distinct after the function application. |
568 |
% \item |
|
569 |
% Then, for such a sum list of regular expressions $f\; (g\; (\sum rs))$, we can control its size |
|
570 |
% by estimation, since $\distinctBy$ and $\flts$ are well-behaved and working together would only |
|
571 |
% reduce the size of a regular expression, not adding to it. |
|
572 |
%\end{itemize} |
|
573 |
% |
|
574 |
%\section{Step One: Closed Forms} |
|
575 |
%We transform the function application $\rderssimp{r}{s}$ |
|
576 |
%into an equivalent |
|
577 |
%form $f\; (g \; (\sum rs))$. |
|
578 |
%The functions $f$ and $g$ can be anything from $\flts$, $\distinctBy$ and other helper functions from $\bsimp{\_}$. |
|
579 |
%This way we get a different but equivalent way of expressing : $r\backslash s = f \; (g\; (\sum rs))$, we call the |
|
580 |
%right hand side the "closed form" of $r\backslash s$. |
|
581 |
% |
|
582 |
%\begin{quote}\it |
|
583 |
% Claim: For regular expressions $r_1 \cdot r_2$, we claim that |
|
584 |
%\end{quote} |
|
585 |
%\noindent |
|
586 |
%We explain in detail how we reached those claims. |
|
601 | 587 |
If we attempt to prove |
588 |
\begin{center} |
|
609 | 589 |
$\forall r. \; \exists N_r.\;\; s.t. \llbracket r\backslash_{rsimps} s \rrbracket_r \leq N_r$ |
601 | 590 |
\end{center} |
591 |
using a naive induction on the structure of $r$, |
|
592 |
then we are stuck at the inductive cases such as |
|
593 |
$r_1\cdot r_2$. |
|
594 |
The inductive hypotheses are: |
|
595 |
\begin{center} |
|
596 |
1: $\text{for } r_1, \text{there exists } N_{r_1}.\;\; s.t. |
|
609 | 597 |
\;\;\forall s. \llbracket r_1 \backslash_{rsimps} s \rrbracket_r \leq N_{r_1}. $\\ |
601 | 598 |
2: $\text{for } r_2, \text{there exists } N_{r_2}.\;\; s.t. |
609 | 599 |
\;\; \forall s. \llbracket r_2 \backslash_{rsimps} s \rrbracket_r \leq N_{r_2}. $ |
601 | 600 |
\end{center} |
601 |
The inductive step to prove would be |
|
602 |
\begin{center} |
|
603 |
$\text{there exists } N_{r_1\cdot r_2}. \;\; s.t. \forall s. |
|
609 | 604 |
\llbracket (r_1 \cdot r_2) \backslash_{rsimps} s \rrbracket_r \leq N_{r_1\cdot r_2}.$ |
601 | 605 |
\end{center} |
606 |
The problem is that it is not clear what |
|
609 | 607 |
$(r_1\cdot r_2) \backslash_{rsimps} s$ looks like, |
601 | 608 |
and therefore $N_{r_1}$ and $N_{r_2}$ in the |
609 |
inductive hypotheses cannot be directly used. |
|
618 | 610 |
%We have already seen that $(r_1 \cdot r_2)\backslash s$ |
611 |
%and $(r^*)\backslash s$ can grow in a wild way. |
|
613 | 612 |
|
618 | 613 |
The point however, is that they will be equivalent to a list of |
609 | 614 |
terms $\sum rs$, where each term in $rs$ will |
615 |
be made of $r_1 \backslash s' $, $r_2\backslash s'$, |
|
618 | 616 |
and $r \backslash s'$ with $s' \in \textit{SubString} \; s$ (which stands |
617 |
for the set of substrings of $s$). |
|
609 | 618 |
The list $\sum rs$ will then be de-duplicated by $\textit{rdistinct}$ |
618 | 619 |
in the simplification, which prevents the $rs$ from growing indefinitely. |
609 | 620 |
|
613 | 621 |
Based on this idea, we develop a proof in two steps. |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
622 |
First, we show the below equality (where |
609 | 623 |
$f$ and $g$ are functions that do not increase the size of the input) |
624 |
\begin{center} |
|
613 | 625 |
$r\backslash_{rsimps} s = f\; (\textit{rdistinct} \; (g\; \sum rs))$, |
609 | 626 |
\end{center} |
613 | 627 |
where $r = r_1 \cdot r_2$ or $r = r_0^*$ and so on. |
628 |
For example, for $r_1 \cdot r_2$ we have the equality as |
|
629 |
\begin{center} |
|
630 |
$ \rderssimp{r_1 \cdot r_2}{s} = |
|
639 | 631 |
\rsimp{(\sum (r_1 \backslash s \cdot r_2 ) \; :: \;(\map \; \rderssimp{r_2}{\_} \;(\vsuf{s}{r_1})))}$ |
613 | 632 |
\end{center} |
609 | 633 |
We call the right-hand-side the |
634 |
\emph{Closed Form} of $(r_1 \cdot r_2)\backslash_{rsimps} s$. |
|
613 | 635 |
Second, we will bound the closed form of r-regular expressions |
636 |
using some estimation techniques |
|
618 | 637 |
and then apply |
638 |
lemma \ref{sizeRelations} to show that the bitcoded regular expressions |
|
613 | 639 |
in our $\blexersimp$ are finitely bounded. |
609 | 640 |
|
618 | 641 |
We will describe in detail the first step of the proof |
642 |
in the next section. |
|
613 | 643 |
|
644 |
\section{Closed Forms} |
|
609 | 645 |
In this section we introduce in detail |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
646 |
how to express the string derivatives |
618 | 647 |
of regular expressions (i.e. $r \backslash_r s$ where $s$ is a string |
648 |
rather than a single character) in a different way than |
|
649 |
our previous definition. |
|
650 |
In previous chapters, the derivative of a |
|
651 |
regular expression $r$ w.r.t a string $s$ |
|
652 |
was recursively defined on the string: |
|
653 |
\begin{center} |
|
654 |
$r \backslash_s (c::s) \dn (r \backslash c) \backslash_s s$ |
|
655 |
\end{center} |
|
656 |
The problem is that |
|
657 |
this definition does not provide much information |
|
658 |
on what $r \backslash_s s$ looks like. |
|
659 |
If we are interested in the size of a derivative like |
|
660 |
$(r_1 \cdot r_2)\backslash s$, |
|
661 |
we have to somehow get a more concrete form to begin. |
|
662 |
We call such more concrete representations the ``closed forms'' of |
|
663 |
string derivatives as opposed to their original definitions. |
|
664 |
The terminology ``closed form'' is borrowed from mathematics, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
665 |
which usually describe expressions that are solely comprised of finitely many |
618 | 666 |
well-known and easy-to-compute operations such as |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
667 |
additions, multiplications, and exponential functions. |
618 | 668 |
|
613 | 669 |
We start by proving some basic identities |
609 | 670 |
involving the simplification functions for r-regular expressions. |
613 | 671 |
After that we introduce the rewrite relations |
672 |
$\rightsquigarrow_h$, $\rightsquigarrow^*_{scf}$ |
|
673 |
$\rightsquigarrow_f$ and $\rightsquigarrow_g$. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
674 |
These relations involve similar techniques as in chapter \ref{Bitcoded2} |
618 | 675 |
for annotated regular expressions. |
613 | 676 |
Finally, we use these identities to establish the |
677 |
closed forms of the alternative regular expression, |
|
678 |
the sequence regular expression, and the star regular expression. |
|
609 | 679 |
%$r_1\cdot r_2$, $r^*$ and $\sum rs$. |
601 | 680 |
|
681 |
||
609 | 682 |
|
613 | 683 |
\subsection{Some Basic Identities} |
609 | 684 |
|
618 | 685 |
In what follows we will often convert between lists |
686 |
and sets. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
687 |
We use Isabelle's $set$ to refer to the |
611 | 688 |
function that converts a list $rs$ to the set |
689 |
containing all the elements in $rs$. |
|
690 |
\subsubsection{$\textit{rdistinct}$'s Does the Job of De-duplication} |
|
543 | 691 |
The $\textit{rdistinct}$ function, as its name suggests, will |
613 | 692 |
de-duplicate an r-regular expression list. |
693 |
It will also remove any elements that |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
694 |
are already in the accumulator set. |
555 | 695 |
\begin{lemma}\label{rdistinctDoesTheJob} |
609 | 696 |
%The function $\textit{rdistinct}$ satisfies the following |
697 |
%properties: |
|
698 |
Assume we have the predicate $\textit{isDistinct}$\footnote{We omit its |
|
618 | 699 |
recursive definition here. Its Isabelle counterpart would be $\textit{distinct}$.} |
609 | 700 |
for testing |
613 | 701 |
whether a list's elements are unique. Then the following |
609 | 702 |
properties about $\textit{rdistinct}$ hold: |
543 | 703 |
\begin{itemize} |
704 |
\item |
|
705 |
If $a \in acc$ then $a \notin (\rdistinct{rs}{acc})$. |
|
706 |
\item |
|
609 | 707 |
%If list $rs'$ is the result of $\rdistinct{rs}{acc}$, |
708 |
$\textit{isDistinct} \;\;\; (\rdistinct{rs}{acc})$. |
|
555 | 709 |
\item |
609 | 710 |
$\textit{set} \; (\rdistinct{rs}{acc}) |
639 | 711 |
= (\textit{set} \; rs) - acc$ |
543 | 712 |
\end{itemize} |
713 |
\end{lemma} |
|
555 | 714 |
\noindent |
543 | 715 |
\begin{proof} |
716 |
The first part is by an induction on $rs$. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
717 |
The second and third parts can be proven by using the |
609 | 718 |
inductive cases of $\textit{rdistinct}$. |
593 | 719 |
|
543 | 720 |
\end{proof} |
721 |
||
722 |
\noindent |
|
613 | 723 |
%$\textit{rdistinct}$ will out all regular expression terms |
724 |
%that are in the accumulator, therefore |
|
725 |
Concatenating a list $rs_a$ at the front of another |
|
726 |
list $rs$ whose elements are all from the accumulator, and then calling $\textit{rdistinct}$ |
|
727 |
on the merged list, the output will be as if we had called $\textit{rdistinct}$ |
|
543 | 728 |
without the prepending of $rs$: |
609 | 729 |
\begin{lemma}\label{rdistinctConcat} |
554 | 730 |
The elements appearing in the accumulator will always be removed. |
731 |
More precisely, |
|
732 |
\begin{itemize} |
|
733 |
\item |
|
734 |
If $rs \subseteq rset$, then |
|
735 |
$\rdistinct{rs@rsa }{acc} = \rdistinct{rsa }{acc}$. |
|
736 |
\item |
|
609 | 737 |
More generally, if $a \in rset$ and $\rdistinct{rs}{\{a\}} = []$, |
554 | 738 |
then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{rs'}{rset}$ |
739 |
\end{itemize} |
|
543 | 740 |
\end{lemma} |
554 | 741 |
|
543 | 742 |
\begin{proof} |
609 | 743 |
By induction on $rs$ and using \ref{rdistinctDoesTheJob}. |
543 | 744 |
\end{proof} |
745 |
\noindent |
|
746 |
On the other hand, if an element $r$ does not appear in the input list waiting to be deduplicated, |
|
747 |
then expanding the accumulator to include that element will not cause the output list to change: |
|
611 | 748 |
\begin{lemma}\label{rdistinctOnDistinct} |
543 | 749 |
The accumulator can be augmented to include elements not appearing in the input list, |
750 |
and the output will not change. |
|
751 |
\begin{itemize} |
|
752 |
\item |
|
611 | 753 |
If $r \notin rs$, then $\rdistinct{rs}{acc} = \rdistinct{rs}{(\{r\} \cup acc)}$. |
543 | 754 |
\item |
611 | 755 |
Particularly, if $\;\;\textit{isDistinct} \; rs$, then we have\\ |
543 | 756 |
\[ \rdistinct{rs}{\varnothing} = rs \] |
757 |
\end{itemize} |
|
758 |
\end{lemma} |
|
759 |
\begin{proof} |
|
760 |
The first half is by induction on $rs$. The second half is a corollary of the first. |
|
761 |
\end{proof} |
|
762 |
\noindent |
|
611 | 763 |
The function $\textit{rdistinct}$ removes duplicates from anywhere in a list. |
764 |
Despite being seemingly obvious, |
|
765 |
the induction technique is not as straightforward. |
|
554 | 766 |
\begin{lemma}\label{distinctRemovesMiddle} |
767 |
The two properties hold if $r \in rs$: |
|
768 |
\begin{itemize} |
|
769 |
\item |
|
555 | 770 |
$\rdistinct{rs}{rset} = \rdistinct{(rs @ [r])}{rset}$\\ |
771 |
and\\ |
|
554 | 772 |
$\rdistinct{(ab :: rs @ [ab])}{rset'} = \rdistinct{(ab :: rs)}{rset'}$ |
773 |
\item |
|
555 | 774 |
$\rdistinct{ (rs @ rs') }{rset} = \rdistinct{rs @ [r] @ rs'}{rset}$\\ |
775 |
and\\ |
|
554 | 776 |
$\rdistinct{(ab :: rs @ [ab] @ rs'')}{rset'} = |
593 | 777 |
\rdistinct{(ab :: rs @ rs'')}{rset'}$ |
554 | 778 |
\end{itemize} |
779 |
\end{lemma} |
|
780 |
\noindent |
|
781 |
\begin{proof} |
|
593 | 782 |
By induction on $rs$. All other variables are allowed to be arbitrary. |
611 | 783 |
The second part of the lemma requires the first. |
618 | 784 |
Note that for each part, the two sub-propositions need to be proven |
785 |
at the same time, |
|
593 | 786 |
so that the induction goes through. |
554 | 787 |
\end{proof} |
555 | 788 |
\noindent |
611 | 789 |
This allows us to prove a few more equivalence relations involving |
618 | 790 |
$\textit{rdistinct}$ (they will be useful later): |
555 | 791 |
\begin{lemma}\label{rdistinctConcatGeneral} |
611 | 792 |
\mbox{} |
555 | 793 |
\begin{itemize} |
794 |
\item |
|
795 |
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{((\rdistinct{rs}{\varnothing})@ rs')}{\varnothing}$ |
|
796 |
\item |
|
797 |
$\rdistinct{(rs @ rs')}{\varnothing} = \rdistinct{(\rdistinct{rs}{\varnothing} @ rs')}{\varnothing}$ |
|
798 |
\item |
|
799 |
If $rset' \subseteq rset$, then $\rdistinct{rs}{rset} = |
|
800 |
\rdistinct{(\rdistinct{rs}{rset'})}{rset}$. As a corollary |
|
801 |
of this, |
|
802 |
\item |
|
803 |
$\rdistinct{(rs @ rs')}{rset} = \rdistinct{ |
|
804 |
(\rdistinct{rs}{\varnothing}) @ rs')}{rset}$. This |
|
805 |
gives another corollary use later: |
|
806 |
\item |
|
807 |
If $a \in rset$, then $\rdistinct{(rs @ rs')}{rset} = \rdistinct{ |
|
808 |
(\rdistinct{(a :: rs)}{\varnothing} @ rs')}{rset} $, |
|
809 |
||
810 |
\end{itemize} |
|
811 |
\end{lemma} |
|
812 |
\begin{proof} |
|
813 |
By \ref{rdistinctDoesTheJob} and \ref{distinctRemovesMiddle}. |
|
814 |
\end{proof} |
|
611 | 815 |
\noindent |
639 | 816 |
The next lemma is a more general form of \ref{rdistinctConcat}; |
817 |
It says that |
|
611 | 818 |
$\textit{rdistinct}$ is composable w.r.t list concatenation: |
819 |
\begin{lemma}\label{distinctRdistinctAppend} |
|
820 |
If $\;\; \textit{isDistinct} \; rs_1$, |
|
821 |
and $(set \; rs_1) \cap acc = \varnothing$, |
|
822 |
then applying $\textit{rdistinct}$ on $rs_1 @ rs_a$ does not |
|
823 |
have an effect on $rs_1$: |
|
824 |
\[\textit{rdistinct}\; (rs_1 @ rsa)\;\, acc |
|
825 |
= rs_1@(\textit{rdistinct} rsa \; (acc \cup rs_1))\] |
|
826 |
\end{lemma} |
|
827 |
\begin{proof} |
|
828 |
By an induction on |
|
829 |
$rs_1$, where $rsa$ and $acc$ are allowed to be arbitrary. |
|
830 |
\end{proof} |
|
831 |
\noindent |
|
832 |
$\textit{rdistinct}$ needs to be applied only once, and |
|
618 | 833 |
applying it multiple times does not make any difference: |
611 | 834 |
\begin{corollary}\label{distinctOnceEnough} |
639 | 835 |
$\textit{rdistinct} \; (rs @ rsa) {} = \textit{rdistinct} \; ( (rdistinct \; |
836 |
rs \; \{ \}) @ (\textit{rdistinct} \; rs_a \; (set \; rs)))$ |
|
611 | 837 |
\end{corollary} |
838 |
\begin{proof} |
|
839 |
By lemma \ref{distinctRdistinctAppend}. |
|
840 |
\end{proof} |
|
555 | 841 |
|
611 | 842 |
\subsubsection{The Properties of $\textit{Rflts}$} |
843 |
We give in this subsection some properties |
|
620 | 844 |
involving $\backslash_r$, $\backslash_{rsimps}$, $\textit{rflts}$ and |
611 | 845 |
$\textit{rsimp}_{ALTS} $, together with any non-trivial lemmas that lead to them. |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
846 |
These will be helpful in later closed-form proofs, when |
611 | 847 |
we want to transform derivative terms which have |
848 |
%the ways in which multiple functions involving |
|
849 |
%those are composed together |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
850 |
interleaving derivatives and simplifications applied to them. |
543 | 851 |
|
611 | 852 |
\noindent |
853 |
%When the function $\textit{Rflts}$ |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
854 |
%is applied to the concatenation of two lists; the output can be calculated by first applying the |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
855 |
%functions on two lists separately and then concatenating them together. |
611 | 856 |
$\textit{Rflts}$ is composable in terms of concatenation: |
554 | 857 |
\begin{lemma}\label{rfltsProps} |
618 | 858 |
The function $\rflts$ has the properties below:\\ |
543 | 859 |
\begin{itemize} |
860 |
\item |
|
554 | 861 |
$\rflts \; (rs_1 @ rs_2) = \rflts \; rs_1 @ \rflts \; rs_2$ |
862 |
\item |
|
863 |
If $r \neq \RZERO$ and $\nexists rs_1. r = \RALTS{rs}_1$, then $\rflts \; (r::rs) = r :: \rflts \; rs$ |
|
864 |
\item |
|
865 |
$\rflts \; (rs @ [\RZERO]) = \rflts \; rs$ |
|
866 |
\item |
|
867 |
$\rflts \; (rs' @ [\RALTS{rs}]) = \rflts \; rs'@rs$ |
|
868 |
\item |
|
869 |
$\rflts \; (rs @ [\RONE]) = \rflts \; rs @ [\RONE]$ |
|
870 |
\item |
|
871 |
If $r \neq \RZERO$ and $\nexists rs'. r = \RALTS{rs'}$ then $\rflts \; (rs @ [r]) |
|
872 |
= (\rflts \; rs) @ [r]$ |
|
555 | 873 |
\item |
874 |
If $r = \RALTS{rs}$ and $r \in rs'$ then for all $r_1 \in rs. |
|
875 |
r_1 \in \rflts \; rs'$. |
|
876 |
\item |
|
877 |
$\rflts \; (rs_a @ \RZERO :: rs_b) = \rflts \; (rs_a @ rs_b)$ |
|
543 | 878 |
\end{itemize} |
879 |
\end{lemma} |
|
880 |
\noindent |
|
881 |
\begin{proof} |
|
555 | 882 |
By induction on $rs_1$ in the first sub-lemma, and induction on $r$ in the second part, |
883 |
and induction on $rs$, $rs'$, $rs$, $rs'$, $rs_a$ in the third, fourth, fifth, sixth and |
|
884 |
last sub-lemma. |
|
543 | 885 |
\end{proof} |
611 | 886 |
\noindent |
887 |
Now we introduce the property that the operations |
|
888 |
derivative and $\rsimpalts$ |
|
639 | 889 |
commute, this will be used later on when deriving the closed form for |
611 | 890 |
the alternative regular expression: |
891 |
\begin{lemma}\label{rderRsimpAltsCommute} |
|
892 |
$\rder{x}{(\rsimpalts \; rs)} = \rsimpalts \; (\map \; (\rder{x}{\_}) \; rs)$ |
|
893 |
\end{lemma} |
|
618 | 894 |
\begin{proof} |
895 |
By induction on $rs$. |
|
896 |
\end{proof} |
|
611 | 897 |
\noindent |
614 | 898 |
|
639 | 899 |
\subsubsection{The $RL$ Function: Language Interpretation for $\textit{Rrexp}$s} |
618 | 900 |
Much like the definition of $L$ on plain regular expressions, one can also |
639 | 901 |
define the language interpretation for $\rrexp$s. |
614 | 902 |
\begin{center} |
903 |
\begin{tabular}{lcl} |
|
639 | 904 |
$RL \; (\ZERO_r)$ & $\dn$ & $\phi$\\ |
905 |
$RL \; (\ONE_r)$ & $\dn$ & $\{[]\}$\\ |
|
614 | 906 |
$RL \; (c)$ & $\dn$ & $\{[c]\}$\\ |
907 |
$RL \; \sum rs$ & $\dn$ & $ \bigcup_{r \in rs} (RL \; r)$\\ |
|
908 |
$RL \; (r_1 \cdot r_2)$ & $\dn$ & $ RL \; (r_1) @ RL \; (r_2)$\\ |
|
909 |
$RL \; (r^*)$ & $\dn$ & $ (RL(r))^*$ |
|
910 |
\end{tabular} |
|
911 |
\end{center} |
|
912 |
\noindent |
|
913 |
The main use of $RL$ is to establish some connections between $\rsimp{}$ |
|
914 |
and $\rnullable{}$: |
|
915 |
\begin{lemma} |
|
916 |
The following properties hold: |
|
917 |
\begin{itemize} |
|
918 |
\item |
|
919 |
If $\rnullable{r}$, then $\rsimp{r} \neq \RZERO$. |
|
920 |
\item |
|
921 |
$\rnullable{r \backslash s} \quad $ if and only if $\quad \rnullable{\rderssimp{r}{s}}$. |
|
922 |
\end{itemize} |
|
923 |
\end{lemma} |
|
924 |
\begin{proof} |
|
925 |
The first part is by induction on $r$. |
|
926 |
The second part is true because property |
|
927 |
\[ RL \; r = RL \; (\rsimp{r})\] holds. |
|
928 |
\end{proof} |
|
929 |
||
930 |
\subsubsection{Simplified $\textit{Rrexp}$s are Good} |
|
931 |
We formalise the notion of ``good" regular expressions, |
|
932 |
which means regular expressions that |
|
618 | 933 |
are fully simplified in terms of our $\textit{rsimp}$ function. |
934 |
For alternative regular expressions that means they |
|
935 |
do not contain any nested alternatives, un-eliminated $\RZERO$s |
|
936 |
or duplicate elements (for example, |
|
937 |
$r_1 + (r_2 + r_3)$, $\RZERO + r$ and $ \sum [r, r, \ldots]$). |
|
938 |
The clauses for $\good$ are: |
|
614 | 939 |
\begin{center} |
940 |
\begin{tabular}{@{}lcl@{}} |
|
941 |
$\good\; \RZERO$ & $\dn$ & $\textit{false}$\\ |
|
942 |
$\good\; \RONE$ & $\dn$ & $\textit{true}$\\ |
|
943 |
$\good\; \RCHAR{c}$ & $\dn$ & $\btrue$\\ |
|
944 |
$\good\; \RALTS{[]}$ & $\dn$ & $\bfalse$\\ |
|
945 |
$\good\; \RALTS{[r]}$ & $\dn$ & $\bfalse$\\ |
|
946 |
$\good\; \RALTS{r_1 :: r_2 :: rs}$ & $\dn$ & |
|
947 |
$\textit{isDistinct} \; (r_1 :: r_2 :: rs) \;$\\ |
|
618 | 948 |
& & $\land \; (\forall r' \in (r_1 :: r_2 :: rs).\; \good \; r'\; \, \land \; \, \textit{nonAlt}\; r')$\\ |
614 | 949 |
$\good \; \RSEQ{\RZERO}{r}$ & $\dn$ & $\bfalse$\\ |
950 |
$\good \; \RSEQ{\RONE}{r}$ & $\dn$ & $\bfalse$\\ |
|
951 |
$\good \; \RSEQ{r}{\RZERO}$ & $\dn$ & $\bfalse$\\ |
|
952 |
$\good \; \RSEQ{r_1}{r_2}$ & $\dn$ & $\good \; r_1 \;\, \textit{and} \;\, \good \; r_2$\\ |
|
953 |
$\good \; \RSTAR{r}$ & $\dn$ & $\btrue$\\ |
|
954 |
\end{tabular} |
|
955 |
\end{center} |
|
956 |
\noindent |
|
618 | 957 |
We omit the recursive definition of the predicate $\textit{nonAlt}$, |
958 |
which evaluates to true when the regular expression is not an |
|
614 | 959 |
alternative, and false otherwise. |
960 |
The $\good$ property is preserved under $\rsimp_{ALTS}$, provided that |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
961 |
its non-empty argument list of expressions are all good themselves, and $\textit{nonAlt}$, |
614 | 962 |
and unique: |
963 |
\begin{lemma}\label{rsimpaltsGood} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
964 |
If $rs \neq []$ and for all $r \in rs. \textit{nonAlt} \; r$ and $\textit{isDistinct} \; rs$, |
614 | 965 |
then $\good \; (\rsimpalts \; rs)$ if and only if forall $r \in rs. \; \good \; r$. |
966 |
\end{lemma} |
|
967 |
\noindent |
|
968 |
We also note that |
|
969 |
if a regular expression $r$ is good, then $\rflts$ on the singleton |
|
970 |
list $[r]$ will not break goodness: |
|
971 |
\begin{lemma}\label{flts2} |
|
972 |
If $\good \; r$, then forall $r' \in \rflts \; [r]. \; \good \; r'$ and $\textit{nonAlt} \; r'$. |
|
973 |
\end{lemma} |
|
974 |
\begin{proof} |
|
975 |
By an induction on $r$. |
|
976 |
\end{proof} |
|
977 |
\noindent |
|
978 |
The other observation we make about $\rsimp{r}$ is that it never |
|
979 |
comes with nested alternatives, which we describe as the $\nonnested$ |
|
980 |
property: |
|
981 |
\begin{center} |
|
982 |
\begin{tabular}{lcl} |
|
983 |
$\nonnested \; \, \sum []$ & $\dn$ & $\btrue$\\ |
|
984 |
$\nonnested \; \, \sum ((\sum rs_1) :: rs_2)$ & $\dn$ & $\bfalse$\\ |
|
985 |
$\nonnested \; \, \sum (r :: rs)$ & $\dn$ & $\nonnested (\sum rs)$\\ |
|
986 |
$\nonnested \; \, r $ & $\dn$ & $\btrue$ |
|
987 |
\end{tabular} |
|
988 |
\end{center} |
|
989 |
\noindent |
|
990 |
The $\rflts$ function |
|
991 |
always opens up nested alternatives, |
|
992 |
which enables $\rsimp$ to be non-nested: |
|
993 |
||
994 |
\begin{lemma}\label{nonnestedRsimp} |
|
618 | 995 |
It is always the case that |
996 |
\begin{center} |
|
997 |
$\nonnested \; (\rsimp{r})$ |
|
998 |
\end{center} |
|
614 | 999 |
\end{lemma} |
1000 |
\begin{proof} |
|
618 | 1001 |
By induction on $r$. |
614 | 1002 |
\end{proof} |
1003 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1004 |
With this we can prove that a regular expression |
614 | 1005 |
after simplification and flattening and de-duplication, |
1006 |
will not contain any alternative regular expression directly: |
|
1007 |
\begin{lemma}\label{nonaltFltsRd} |
|
1008 |
If $x \in \rdistinct{\rflts\; (\map \; \rsimp{} \; rs)}{\varnothing}$ |
|
1009 |
then $\textit{nonAlt} \; x$. |
|
1010 |
\end{lemma} |
|
1011 |
\begin{proof} |
|
1012 |
By \ref{nonnestedRsimp}. |
|
1013 |
\end{proof} |
|
1014 |
\noindent |
|
618 | 1015 |
The other fact we know is that once $\rsimp{}$ has finished |
614 | 1016 |
processing an alternative regular expression, it will not |
618 | 1017 |
contain any $\RZERO$s. This is because all the recursive |
614 | 1018 |
calls to the simplification on the children regular expressions |
618 | 1019 |
make the children good, and $\rflts$ will not delete |
614 | 1020 |
any $\RZERO$s out of a good regular expression list, |
618 | 1021 |
and $\rdistinct{}$ will not ``mess'' with the result. |
614 | 1022 |
\begin{lemma}\label{flts3Obv} |
1023 |
The following are true: |
|
1024 |
\begin{itemize} |
|
1025 |
\item |
|
1026 |
If for all $r \in rs. \, \good \; r $ or $r = \RZERO$, |
|
1027 |
then for all $r \in \rflts\; rs. \, \good \; r$. |
|
1028 |
\item |
|
1029 |
If $x \in \rdistinct{\rflts\; (\map \; rsimp{}\; rs)}{\varnothing}$ |
|
1030 |
and for all $y$ such that $\llbracket y \rrbracket_r$ less than |
|
1031 |
$\llbracket rs \rrbracket_r + 1$, either |
|
1032 |
$\good \; (\rsimp{y})$ or $\rsimp{y} = \RZERO$, |
|
1033 |
then $\good \; x$. |
|
1034 |
\end{itemize} |
|
1035 |
\end{lemma} |
|
1036 |
\begin{proof} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1037 |
The first part is by induction, where the inductive cases |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1038 |
are the inductive cases of $\rflts$. |
614 | 1039 |
The second part is a corollary from the first part. |
1040 |
\end{proof} |
|
1041 |
||
639 | 1042 |
This leads to good structural property of $\rsimp{}$, |
614 | 1043 |
that after simplification, a regular expression is |
1044 |
either good or $\RZERO$: |
|
1045 |
\begin{lemma}\label{good1} |
|
1046 |
For any r-regular expression $r$, $\good \; \rsimp{r}$ or $\rsimp{r} = \RZERO$. |
|
1047 |
\end{lemma} |
|
1048 |
\begin{proof} |
|
1049 |
By an induction on $r$. The inductive measure is the size $\llbracket \rrbracket_r$. |
|
618 | 1050 |
Lemma \ref{rsimpMono} says that |
614 | 1051 |
$\llbracket \rsimp{r}\rrbracket_r$ is smaller than or equal to |
1052 |
$\llbracket r \rrbracket_r$. |
|
1053 |
Therefore, in the $r_1 \cdot r_2$ and $\sum rs$ case, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1054 |
The inductive hypothesis applies to the children regular expressions |
614 | 1055 |
$r_1$, $r_2$, etc. The lemma \ref{flts3Obv}'s precondition is satisfied |
1056 |
by that as well. |
|
1057 |
The lemmas \ref{nonnestedRsimp} and \ref{nonaltFltsRd} are used |
|
1058 |
to ensure that goodness is preserved at the topmost level. |
|
1059 |
\end{proof} |
|
1060 |
We shall prove that any good regular expression is |
|
639 | 1061 |
a fixed-point for $\textit{rsimp}$. |
614 | 1062 |
First we prove an auxiliary lemma: |
1063 |
\begin{lemma}\label{goodaltsNonalt} |
|
1064 |
If $\good \; \sum rs$, then $\rflts\; rs = rs$. |
|
1065 |
\end{lemma} |
|
1066 |
\begin{proof} |
|
1067 |
By an induction on $\sum rs$. The inductive rules are the cases |
|
1068 |
for $\good$. |
|
1069 |
\end{proof} |
|
1070 |
\noindent |
|
1071 |
Now we are ready to prove that good regular expressions are invariant |
|
618 | 1072 |
with respect to $\rsimp{}$: |
614 | 1073 |
\begin{lemma}\label{test} |
1074 |
If $\good \;r$ then $\rsimp{r} = r$. |
|
1075 |
\end{lemma} |
|
1076 |
\begin{proof} |
|
1077 |
By an induction on the inductive cases of $\good$, using lemmas |
|
1078 |
\ref{goodaltsNonalt} and \ref{rdistinctOnDistinct}. |
|
1079 |
The lemma \ref{goodaltsNonalt} is used in the alternative |
|
1080 |
case where 2 or more elements are present in the list. |
|
1081 |
\end{proof} |
|
1082 |
\noindent |
|
639 | 1083 |
Below we show a property involving $\rflts$, $\textit{rdistinct}$, |
618 | 1084 |
$\rsimp{}$ and $\rsimp_{ALTS}$, |
1085 |
which requires $\ref{good1}$ to go through smoothly: |
|
614 | 1086 |
\begin{lemma}\label{flattenRsimpalts} |
618 | 1087 |
An application of $\rsimp_{ALTS}$ can be ``absorbed'', |
1088 |
if its output is concatenated with a list and then applied to $\rflts$. |
|
1089 |
\begin{center} |
|
614 | 1090 |
$\rflts \; ( (\rsimp_{ALTS} \; |
1091 |
(\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing})) :: |
|
1092 |
\map \; \rsimp{} \; rs' ) = |
|
1093 |
\rflts \; ( (\rdistinct{(\rflts \; (\map \; \rsimp{}\; rs))}{\varnothing}) @ ( |
|
1094 |
\map \; \rsimp{rs'}))$ |
|
618 | 1095 |
\end{center} |
614 | 1096 |
|
1097 |
||
1098 |
\end{lemma} |
|
1099 |
\begin{proof} |
|
1100 |
By \ref{good1}. |
|
1101 |
\end{proof} |
|
1102 |
\noindent |
|
1103 |
||
1104 |
||
1105 |
||
1106 |
||
1107 |
||
1108 |
We are also ready to prove that $\textit{rsimp}$ is idempotent. |
|
1109 |
\subsubsection{$\rsimp$ is Idempotent} |
|
1110 |
The idempotency of $\rsimp$ is very useful in |
|
1111 |
manipulating regular expression terms into desired |
|
1112 |
forms so that key steps allowing further rewriting to closed forms |
|
1113 |
are possible. |
|
1114 |
\begin{lemma}\label{rsimpIdem} |
|
618 | 1115 |
$\rsimp{r} = \rsimp{(\rsimp{r})}$ |
614 | 1116 |
\end{lemma} |
1117 |
||
1118 |
\begin{proof} |
|
1119 |
By \ref{test} and \ref{good1}. |
|
1120 |
\end{proof} |
|
1121 |
\noindent |
|
1122 |
This property means we do not have to repeatedly |
|
1123 |
apply simplification in each step, which justifies |
|
1124 |
our definition of $\blexersimp$. |
|
639 | 1125 |
This is in contrast to the work of Sulzmann and Lu where |
1126 |
the simplification is applied in a fixpoint manner. |
|
614 | 1127 |
|
1128 |
||
618 | 1129 |
On the other hand, we can repeat the same $\rsimp{}$ applications |
614 | 1130 |
on regular expressions as many times as we want, if we have at least |
639 | 1131 |
one simplification applied to it, and apply it wherever we need to: |
614 | 1132 |
\begin{corollary}\label{headOneMoreSimp} |
1133 |
The following properties hold, directly from \ref{rsimpIdem}: |
|
1134 |
||
1135 |
\begin{itemize} |
|
1136 |
\item |
|
1137 |
$\map \; \rsimp{(r :: rs)} = \map \; \rsimp{} \; (\rsimp{r} :: rs)$ |
|
1138 |
\item |
|
1139 |
$\rsimp{(\RALTS{rs})} = \rsimp{(\RALTS{\map \; \rsimp{} \; rs})}$ |
|
1140 |
\end{itemize} |
|
1141 |
\end{corollary} |
|
1142 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1143 |
This will be useful in the later closed-form proof's rewriting steps. |
618 | 1144 |
Similarly, we state the following useful facts below: |
614 | 1145 |
\begin{lemma} |
1146 |
The following equalities hold if $r = \rsimp{r'}$ for some $r'$: |
|
1147 |
\begin{itemize} |
|
1148 |
\item |
|
1149 |
If $r = \sum rs$ then $\rsimpalts \; rs = \sum rs$. |
|
1150 |
\item |
|
1151 |
If $r = \sum rs$ then $\rdistinct{rs}{\varnothing} = rs$. |
|
1152 |
\item |
|
1153 |
$\rsimpalts \; (\rdistinct{\rflts \; [r]}{\varnothing}) = r$. |
|
1154 |
\end{itemize} |
|
1155 |
\end{lemma} |
|
1156 |
\begin{proof} |
|
1157 |
By application of lemmas \ref{rsimpIdem} and \ref{good1}. |
|
1158 |
\end{proof} |
|
1159 |
||
1160 |
\noindent |
|
639 | 1161 |
With the idempotency of $\textit{rsimp}$ and its corollaries, |
614 | 1162 |
we can start proving some key equalities leading to the |
1163 |
closed forms. |
|
639 | 1164 |
Next we present a few equivalent terms under $\textit{rsimp}$. |
618 | 1165 |
To make the notation more concise |
639 | 1166 |
We use $r_1 \sequal r_2 $ to denote that $\rsimp{r_1} = \rsimp{r_2}$. |
1167 |
%\begin{center} |
|
1168 |
%\begin{tabular}{lcl} |
|
1169 |
% $a \sequal b$ & $ \dn$ & $ \textit{rsimp} \; a = \textit{rsimp} \; b$ |
|
1170 |
%\end{tabular} |
|
1171 |
%\end{center} |
|
1172 |
%\noindent |
|
618 | 1173 |
%\vspace{0em} |
614 | 1174 |
\begin{lemma} |
618 | 1175 |
The following equivalence hold: |
614 | 1176 |
\begin{itemize} |
1177 |
\item |
|
1178 |
$\rsimpalts \; (\RZERO :: rs) \sequal \rsimpalts\; rs$ |
|
1179 |
\item |
|
1180 |
$\rsimpalts \; rs \sequal \rsimpalts (\map \; \rsimp{} \; rs)$ |
|
1181 |
\item |
|
1182 |
$\RALTS{\RALTS{rs}} \sequal \RALTS{rs}$ |
|
1183 |
\item |
|
1184 |
$\sum ((\sum rs_a) :: rs_b) \sequal \sum rs_a @ rs_b$ |
|
1185 |
\item |
|
639 | 1186 |
$\RALTS{rs} \sequal \RALTS{\map \; \rsimp{} \; rs}$ |
614 | 1187 |
\end{itemize} |
1188 |
\end{lemma} |
|
1189 |
\begin{proof} |
|
1190 |
By induction on the lists involved. |
|
1191 |
\end{proof} |
|
1192 |
\noindent |
|
1193 |
The above allows us to prove |
|
1194 |
two similar equalities (which are a bit more involved). |
|
618 | 1195 |
It says that we could flatten the elements |
614 | 1196 |
before simplification and still get the same result. |
1197 |
\begin{lemma}\label{simpFlatten3} |
|
1198 |
One can flatten the inside $\sum$ of a $\sum$ if it is being |
|
1199 |
simplified. Concretely, |
|
1200 |
\begin{itemize} |
|
1201 |
\item |
|
1202 |
If for all $r \in rs, rs', rs''$, we have $\good \; r $ |
|
1203 |
or $r = \RZERO$, then $\sum (rs' @ rs @ rs'') \sequal |
|
1204 |
\sum (rs' @ [\sum rs] @ rs'')$ holds. As a corollary, |
|
1205 |
\item |
|
1206 |
$\sum (rs' @ [\sum rs] @ rs'') \sequal \sum (rs' @ rs @ rs'')$ |
|
1207 |
\end{itemize} |
|
1208 |
\end{lemma} |
|
1209 |
\begin{proof} |
|
1210 |
By rewriting steps involving the use of \ref{test} and \ref{rdistinctConcatGeneral}. |
|
1211 |
The second sub-lemma is a corollary of the previous. |
|
1212 |
\end{proof} |
|
1213 |
%Rewriting steps not put in--too long and complicated------------------------------- |
|
1214 |
\begin{comment} |
|
1215 |
\begin{center} |
|
1216 |
$\rsimp{\sum (rs' @ rs @ rs'')} \stackrel{def of bsimp}{=}$ \\ |
|
1217 |
$\rsimpalts \; (\rdistinct{\rflts \; ((\map \; \rsimp{}\; rs') @ (\map \; \rsimp{} \; rs ) @ (\map \; \rsimp{} \; rs''))}{\varnothing})$ \\ |
|
1218 |
$\stackrel{by \ref{test}}{=} |
|
1219 |
\rsimpalts \; (\rdistinct{(\rflts \; rs' @ \rflts \; rs @ \rflts \; rs'')}{ |
|
1220 |
\varnothing})$\\ |
|
1221 |
$\stackrel{by \ref{rdistinctConcatGeneral}}{=} |
|
1222 |
\rsimpalts \; (\rdistinct{\rflts \; rs'}{\varnothing} @ \rdistinct{( |
|
1223 |
\rflts\; rs @ \rflts \; rs'')}{\rflts \; rs'})$\\ |
|
1224 |
||
1225 |
\end{center} |
|
1226 |
\end{comment} |
|
1227 |
%Rewriting steps not put in--too long and complicated------------------------------- |
|
1228 |
\noindent |
|
1229 |
||
1230 |
||
618 | 1231 |
We need more equalities like the above to enable a closed form lemma, |
613 | 1232 |
for which we need to introduce a few rewrite relations |
1233 |
to help |
|
1234 |
us obtain them. |
|
554 | 1235 |
|
610 | 1236 |
\subsection{The rewrite relation $\hrewrite$ , $\scfrewrites$ , $\frewrite$ and $\grewrite$} |
613 | 1237 |
Inspired by the success we had in the correctness proof |
1238 |
in \ref{Bitcoded2}, |
|
1239 |
we follow suit here, defining atomic simplification |
|
1240 |
steps as ``small-step'' rewriting steps. This allows capturing |
|
555 | 1241 |
similarities between terms that would be otherwise |
1242 |
hard to express. |
|
1243 |
||
557 | 1244 |
We use $\hrewrite$ for one-step atomic rewrite of |
1245 |
regular expression simplification, |
|
555 | 1246 |
$\frewrite$ for rewrite of list of regular expressions that |
1247 |
include all operations carried out in $\rflts$, and $\grewrite$ for |
|
613 | 1248 |
rewriting a list of regular expressions possible in both $\rflts$ and $\textit{rdistinct}$. |
555 | 1249 |
Their reflexive transitive closures are used to denote zero or many steps, |
1250 |
as was the case in the previous chapter. |
|
613 | 1251 |
As we have already |
1252 |
done something similar, the presentation about |
|
1253 |
these rewriting rules will be more concise than that in \ref{Bitcoded2}. |
|
554 | 1254 |
To differentiate between the rewriting steps for annotated regular expressions |
1255 |
and $\rrexp$s, we add characters $h$ and $g$ below the squig arrow symbol |
|
1256 |
to mean atomic simplification transitions |
|
1257 |
of $\rrexp$s and $\rrexp$ lists, respectively. |
|
1258 |
||
555 | 1259 |
|
1260 |
||
1261 |
||
613 | 1262 |
\begin{figure}[H] |
554 | 1263 |
\begin{center} |
593 | 1264 |
\begin{mathpar} |
1265 |
\inferrule[RSEQ0L]{}{\RZERO \cdot r_2 \hrewrite \RZERO\\} |
|
555 | 1266 |
|
593 | 1267 |
\inferrule[RSEQ0R]{}{r_1 \cdot \RZERO \hrewrite \RZERO\\} |
555 | 1268 |
|
593 | 1269 |
\inferrule[RSEQ1]{}{(\RONE \cdot r) \hrewrite r\\}\\ |
555 | 1270 |
|
593 | 1271 |
\inferrule[RSEQL]{ r_1 \hrewrite r_2}{r_1 \cdot r_3 \hrewrite r_2 \cdot r_3\\} |
1272 |
||
1273 |
\inferrule[RSEQR]{ r_3 \hrewrite r_4}{r_1 \cdot r_3 \hrewrite r_1 \cdot r_4\\}\\ |
|
555 | 1274 |
|
593 | 1275 |
\inferrule[RALTSChild]{r \hrewrite r'}{\sum (rs_1 @ [r] @ rs_2) \hrewrite \sum (rs_1 @ [r'] @ rs_2)\\} |
555 | 1276 |
|
593 | 1277 |
\inferrule[RALTS0]{}{\sum (rs_a @ [\RZERO] @ rs_b) \hrewrite \sum (rs_a @ rs_b)} |
555 | 1278 |
|
593 | 1279 |
\inferrule[RALTSNested]{}{\sum (rs_a @ [\sum rs_1] @ rs_b) \hrewrite \sum (rs_a @ rs_1 @ rs_b)} |
555 | 1280 |
|
593 | 1281 |
\inferrule[RALTSNil]{}{ \sum [] \hrewrite \RZERO\\} |
555 | 1282 |
|
593 | 1283 |
\inferrule[RALTSSingle]{}{ \sum [r] \hrewrite r\\} |
555 | 1284 |
|
593 | 1285 |
\inferrule[RALTSDelete]{\\ r_1 = r_2}{\sum rs_a @ [r_1] @ rs_b @ [r_2] @ rsc \hrewrite \sum rs_a @ [r_1] @ rs_b @ rs_c} |
555 | 1286 |
|
593 | 1287 |
\end{mathpar} |
555 | 1288 |
\end{center} |
613 | 1289 |
\caption{List of one-step rewrite rules for r-regular expressions ($\hrewrite$)}\label{hRewrite} |
1290 |
\end{figure} |
|
554 | 1291 |
|
557 | 1292 |
|
613 | 1293 |
Like $\rightsquigarrow_s$, it is |
1294 |
convenient to define rewrite rules for a list of regular expressions, |
|
593 | 1295 |
where each element can rewrite in many steps to the other (scf stands for |
1296 |
li\emph{s}t \emph{c}losed \emph{f}orm). This relation is similar to the |
|
1297 |
$\stackrel{s*}{\rightsquigarrow}$ for annotated regular expressions. |
|
557 | 1298 |
|
613 | 1299 |
\begin{figure}[H] |
557 | 1300 |
\begin{center} |
593 | 1301 |
\begin{mathpar} |
1302 |
\inferrule{}{[] \scfrewrites [] } |
|
613 | 1303 |
|
593 | 1304 |
\inferrule{r \hrewrites r' \\ rs \scfrewrites rs'}{r :: rs \scfrewrites r' :: rs'} |
1305 |
\end{mathpar} |
|
557 | 1306 |
\end{center} |
613 | 1307 |
\caption{List of one-step rewrite rules for a list of r-regular expressions}\label{scfRewrite} |
1308 |
\end{figure} |
|
555 | 1309 |
%frewrite |
593 | 1310 |
List of one-step rewrite rules for flattening |
1311 |
a list of regular expressions($\frewrite$): |
|
613 | 1312 |
\begin{figure}[H] |
555 | 1313 |
\begin{center} |
593 | 1314 |
\begin{mathpar} |
1315 |
\inferrule{}{\RZERO :: rs \frewrite rs \\} |
|
555 | 1316 |
|
593 | 1317 |
\inferrule{}{(\sum rs) :: rs_a \frewrite rs @ rs_a \\} |
555 | 1318 |
|
593 | 1319 |
\inferrule{rs_1 \frewrite rs_2}{r :: rs_1 \frewrite r :: rs_2} |
1320 |
\end{mathpar} |
|
555 | 1321 |
\end{center} |
613 | 1322 |
\caption{List of one-step rewrite rules characterising the $\rflts$ operation on a list}\label{fRewrites} |
1323 |
\end{figure} |
|
555 | 1324 |
|
593 | 1325 |
Lists of one-step rewrite rules for flattening and de-duplicating |
1326 |
a list of regular expressions ($\grewrite$): |
|
613 | 1327 |
\begin{figure}[H] |
555 | 1328 |
\begin{center} |
593 | 1329 |
\begin{mathpar} |
1330 |
\inferrule{}{\RZERO :: rs \grewrite rs \\} |
|
532 | 1331 |
|
593 | 1332 |
\inferrule{}{(\sum rs) :: rs_a \grewrite rs @ rs_a \\} |
555 | 1333 |
|
593 | 1334 |
\inferrule{rs_1 \grewrite rs_2}{r :: rs_1 \grewrite r :: rs_2} |
555 | 1335 |
|
593 | 1336 |
\inferrule[dB]{}{rs_a @ [a] @ rs_b @[a] @ rs_c \grewrite rs_a @ [a] @ rsb @ rsc} |
1337 |
\end{mathpar} |
|
555 | 1338 |
\end{center} |
613 | 1339 |
\caption{List of one-step rewrite rules characterising the $\rflts$ and $\textit{rdistinct}$ |
1340 |
operations}\label{gRewrite} |
|
1341 |
\end{figure} |
|
555 | 1342 |
\noindent |
618 | 1343 |
We define |
613 | 1344 |
two separate list rewriting relations $\frewrite$ and $\grewrite$. |
611 | 1345 |
The rewriting steps that take place during |
639 | 1346 |
flattening are characterised by $\frewrite$. |
618 | 1347 |
The rewrite relation $\grewrite$ characterises both flattening and de-duplicating. |
557 | 1348 |
Sometimes $\grewrites$ is slightly too powerful |
613 | 1349 |
so we would rather use $\frewrites$ to prove |
1350 |
%because we only |
|
1351 |
equalities related to $\rflts$. |
|
1352 |
%certain equivalence under the rewriting steps of $\frewrites$. |
|
556 | 1353 |
For example, when proving the closed-form for the alternative regular expression, |
613 | 1354 |
one of the equalities needed is: |
1355 |
\begin{center} |
|
557 | 1356 |
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
593 | 1357 |
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
1358 |
$ |
|
613 | 1359 |
\end{center} |
556 | 1360 |
\noindent |
1361 |
Proving this is by first showing |
|
557 | 1362 |
\begin{lemma}\label{earlyLaterDerFrewrites} |
556 | 1363 |
$\map \; (\_ \backslash x) \; (\rflts \; rs) \frewrites |
557 | 1364 |
\rflts \; (\map \; (\_ \backslash x) \; rs)$ |
556 | 1365 |
\end{lemma} |
1366 |
\noindent |
|
613 | 1367 |
and then the equivalence between two terms |
618 | 1368 |
that can reduce in many steps to each other: |
556 | 1369 |
\begin{lemma}\label{frewritesSimpeq} |
1370 |
If $rs_1 \frewrites rs_2 $, then $\sum (\rDistinct \; rs_1 \; \varnothing) \sequal |
|
557 | 1371 |
\sum (\rDistinct \; rs_2 \; \varnothing)$. |
556 | 1372 |
\end{lemma} |
557 | 1373 |
\noindent |
618 | 1374 |
These two lemmas can both be proven using a straightforward induction (and |
1375 |
the proofs for them are therefore omitted). |
|
613 | 1376 |
|
639 | 1377 |
Now the above equalities can be derived with ease: |
613 | 1378 |
\begin{corollary} |
1379 |
$\sum (\rDistinct \;\; (\map \; (\_ \backslash x) \; (\rflts \; rs)) \;\; \varnothing) \sequal |
|
1380 |
\sum (\rDistinct \;\; (\rflts \; (\map \; (\_ \backslash x) \; rs)) \;\; \varnothing) |
|
1381 |
$ |
|
1382 |
\end{corollary} |
|
618 | 1383 |
\begin{proof} |
1384 |
By lemmas \ref{earlyLaterDerFrewrites} and \ref{frewritesSimpeq}. |
|
1385 |
\end{proof} |
|
557 | 1386 |
But this trick will not work for $\grewrites$. |
1387 |
For example, a rewriting step in proving |
|
1388 |
closed forms is: |
|
1389 |
\begin{center} |
|
593 | 1390 |
$\rsimp{(\rsimpalts \; (\map \; (\_ \backslash x) \; (\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs}))))}{\varnothing})))}$\\ |
1391 |
$=$ \\ |
|
1392 |
$\rsimp{(\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; (\rflts \; (\map \; (\rsimp{} \; \circ \; (\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}))} $ |
|
1393 |
\noindent |
|
557 | 1394 |
\end{center} |
618 | 1395 |
For this, one would hope to have a rewriting relation between the two lists involved, |
557 | 1396 |
similar to \ref{earlyLaterDerFrewrites}. However, it turns out that |
556 | 1397 |
\begin{center} |
593 | 1398 |
$\map \; (\_ \backslash x) \; (\rDistinct \; rs \; rset) \grewrites \rDistinct \; (\map \; |
1399 |
(\_ \backslash x) \; rs) \; ( rset \backslash x)$ |
|
556 | 1400 |
\end{center} |
1401 |
\noindent |
|
557 | 1402 |
does $\mathbf{not}$ hold in general. |
1403 |
For this rewriting step we will introduce some slightly more cumbersome |
|
618 | 1404 |
proof technique later. |
557 | 1405 |
The point is that $\frewrite$ |
613 | 1406 |
allows us to prove equivalence in a straightforward way that is |
1407 |
not possible for $\grewrite$. |
|
555 | 1408 |
|
556 | 1409 |
|
557 | 1410 |
\subsubsection{Terms That Can Be Rewritten Using $\hrewrites$, $\grewrites$, and $\frewrites$} |
613 | 1411 |
In this part, we present lemmas stating |
618 | 1412 |
pairs of r-regular expressions and r-regular expression lists |
639 | 1413 |
where one can rewrite from one in many steps to the other. |
613 | 1414 |
Most of the proofs to these lemmas are straightforward, using |
618 | 1415 |
an induction on the corresponding rewriting relations. |
613 | 1416 |
These proofs will therefore be omitted when this is the case. |
618 | 1417 |
We present in the following lemma a few pairs of terms that are rewritable via |
557 | 1418 |
$\grewrites$: |
1419 |
\begin{lemma}\label{gstarRdistinctGeneral} |
|
613 | 1420 |
\mbox{} |
557 | 1421 |
\begin{itemize} |
1422 |
\item |
|
1423 |
$rs_1 @ rs \grewrites rs_1 @ (\rDistinct \; rs \; rs_1)$ |
|
1424 |
\item |
|
1425 |
$rs \grewrites \rDistinct \; rs \; \varnothing$ |
|
1426 |
\item |
|
1427 |
$rs_a @ (\rDistinct \; rs \; rs_a) \grewrites rs_a @ (\rDistinct \; |
|
1428 |
rs \; (\{\RZERO\} \cup rs_a))$ |
|
1429 |
\item |
|
1430 |
$rs \;\; @ \;\; \rDistinct \; rs_a \; rset \grewrites rs @ \rDistinct \; rs_a \; |
|
1431 |
(rest \cup rs)$ |
|
1432 |
||
1433 |
\end{itemize} |
|
1434 |
\end{lemma} |
|
1435 |
\noindent |
|
1436 |
If a pair of terms $rs_1, rs_2$ are rewritable via $\grewrites$ to each other, |
|
1437 |
then they are equivalent under $\rsimp{}$: |
|
1438 |
\begin{lemma}\label{grewritesSimpalts} |
|
618 | 1439 |
\mbox{} |
557 | 1440 |
If $rs_1 \grewrites rs_2$, then |
613 | 1441 |
we have the following equivalence: |
557 | 1442 |
\begin{itemize} |
1443 |
\item |
|
1444 |
$\sum rs_1 \sequal \sum rs_2$ |
|
1445 |
\item |
|
1446 |
$\rsimpalts \; rs_1 \sequal \rsimpalts \; rs_2$ |
|
1447 |
\end{itemize} |
|
1448 |
\end{lemma} |
|
1449 |
\noindent |
|
1450 |
Here are a few connecting lemmas showing that |
|
1451 |
if a list of regular expressions can be rewritten using $\grewrites$ or $\frewrites $ or |
|
1452 |
$\scfrewrites$, |
|
1453 |
then an alternative constructor taking the list can also be rewritten using $\hrewrites$: |
|
1454 |
\begin{lemma} |
|
1455 |
\begin{itemize} |
|
1456 |
\item |
|
1457 |
If $rs \grewrites rs'$ then $\sum rs \hrewrites \sum rs'$. |
|
1458 |
\item |
|
1459 |
If $rs \grewrites rs'$ then $\sum rs \hrewrites \rsimpalts \; rs'$ |
|
1460 |
\item |
|
1461 |
If $rs_1 \scfrewrites rs_2$ then $\sum (rs @ rs_1) \hrewrites \sum (rs @ rs_2)$ |
|
1462 |
\item |
|
1463 |
If $rs_1 \scfrewrites rs_2$ then $\sum rs_1 \hrewrites \sum rs_2$ |
|
1464 |
||
1465 |
\end{itemize} |
|
1466 |
\end{lemma} |
|
1467 |
\noindent |
|
618 | 1468 |
Now comes the core of the proof, |
557 | 1469 |
which says that once two lists are rewritable to each other, |
639 | 1470 |
then they are equivalent under $\textit{rsimp}$: |
557 | 1471 |
\begin{lemma} |
1472 |
If $r_1 \hrewrites r_2$ then $r_1 \sequal r_2$. |
|
1473 |
\end{lemma} |
|
1474 |
||
1475 |
\noindent |
|
639 | 1476 |
Similar to what we did in chapter \ref{Bitcoded2}, |
618 | 1477 |
we prove that if one can rewrite from one r-regular expression ($r$) |
639 | 1478 |
to the other ($r'$), after taking derivatives one can still rewrite |
618 | 1479 |
the first ($r\backslash c$) to the other ($r'\backslash c$). |
557 | 1480 |
\begin{lemma}\label{interleave} |
1481 |
If $r \hrewrites r' $ then $\rder{c}{r} \hrewrites \rder{c}{r'}$ |
|
1482 |
\end{lemma} |
|
1483 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1484 |
This allows us to prove more $\mathbf{rsimp}$-equivalent terms, involving $\backslash_r$. |
557 | 1485 |
\begin{lemma}\label{insideSimpRemoval} |
618 | 1486 |
$\rsimp{(\rder{c}{(\rsimp{r})})} = \rsimp{(\rder{c}{r})} $ |
557 | 1487 |
\end{lemma} |
1488 |
\noindent |
|
1489 |
\begin{proof} |
|
1490 |
By \ref{interleave} and \ref{rsimpIdem}. |
|
1491 |
\end{proof} |
|
1492 |
\noindent |
|
1493 |
And this unlocks more equivalent terms: |
|
1494 |
\begin{lemma}\label{Simpders} |
|
1495 |
As corollaries of \ref{insideSimpRemoval}, we have |
|
1496 |
\begin{itemize} |
|
1497 |
\item |
|
620 | 1498 |
If $s \neq []$ then $\rderssimp{r}{s} = \rsimp{( r \backslash_{rs} s)}$. |
557 | 1499 |
\item |
1500 |
$\rsimpalts \; (\map \; (\_ \backslash_r x) \; |
|
593 | 1501 |
(\rdistinct{rs}{\varnothing})) \sequal |
1502 |
\rsimpalts \; (\rDistinct \; |
|
1503 |
(\map \; (\_ \backslash_r x) rs) \;\varnothing )$ |
|
1504 |
\end{itemize} |
|
1505 |
\end{lemma} |
|
611 | 1506 |
\begin{proof} |
1507 |
Part 1 is by lemma \ref{insideSimpRemoval}, |
|
613 | 1508 |
part 2 is by lemma \ref{insideSimpRemoval} .%and \ref{distinctDer}. |
611 | 1509 |
\end{proof} |
557 | 1510 |
\noindent |
613 | 1511 |
|
1512 |
\subsection{Closed Forms for $\sum rs$, $r_1\cdot r_2$ and $r^*$} |
|
618 | 1513 |
Lemma \ref{Simpders} leads to our first closed form, |
1514 |
which is for the alternative regular expression: |
|
1515 |
\begin{theorem}\label{altsClosedForm} |
|
1516 |
\mbox{} |
|
593 | 1517 |
\begin{center} |
1518 |
$\rderssimp{(\sum rs)}{s} \sequal |
|
1519 |
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ |
|
1520 |
\end{center} |
|
618 | 1521 |
\end{theorem} |
556 | 1522 |
\noindent |
557 | 1523 |
\begin{proof} |
1524 |
By a reverse induction on the string $s$. |
|
1525 |
One rewriting step, as we mentioned earlier, |
|
1526 |
involves |
|
1527 |
\begin{center} |
|
1528 |
$\rsimpalts \; (\map \; (\_ \backslash x) \; |
|
1529 |
(\rdistinct{(\rflts \; (\map \; (\rsimp{} \; \circ \; |
|
1530 |
(\lambda r. \rderssimp{r}{xs}))))}{\varnothing})) |
|
1531 |
\sequal |
|
1532 |
\rsimpalts \; (\rdistinct{(\map \; (\_ \backslash x) \; |
|
593 | 1533 |
(\rflts \; (\map \; (\rsimp{} \; \circ \; |
557 | 1534 |
(\lambda r. \rderssimp{r}{xs})))) ) }{\varnothing}) $. |
1535 |
\end{center} |
|
1536 |
This can be proven by a combination of |
|
1537 |
\ref{grewritesSimpalts}, \ref{gstarRdistinctGeneral}, \ref{rderRsimpAltsCommute}, and |
|
1538 |
\ref{insideSimpRemoval}. |
|
1539 |
\end{proof} |
|
1540 |
\noindent |
|
1541 |
This closed form has a variant which can be more convenient in later proofs: |
|
618 | 1542 |
\begin{corollary}\label{altsClosedForm1} |
557 | 1543 |
If $s \neq []$ then |
1544 |
$\rderssimp \; (\sum \; rs) \; s = |
|
1545 |
\rsimp{(\sum \; (\map \; \rderssimp{\_}{s} \; rs))}$. |
|
1546 |
\end{corollary} |
|
1547 |
\noindent |
|
1548 |
The harder closed forms are the sequence and star ones. |
|
613 | 1549 |
Before we obtain them, some preliminary definitions |
557 | 1550 |
are needed to make proof statements concise. |
556 | 1551 |
|
609 | 1552 |
|
1553 |
\subsubsection{Closed Form for Sequence Regular Expressions} |
|
618 | 1554 |
For the sequence regular expression, |
1555 |
let's first look at a series of derivative steps on it |
|
1556 |
(assuming that each time when a derivative is taken, |
|
1557 |
the head of the sequence is always nullable): |
|
557 | 1558 |
\begin{center} |
618 | 1559 |
\begin{tabular}{llll} |
1560 |
$r_1 \cdot r_2$ & |
|
1561 |
$\longrightarrow_{\backslash c}$ & |
|
1562 |
$r_1\backslash c \cdot r_2 + r_2 \backslash c$ & |
|
1563 |
$ \longrightarrow_{\backslash c'} $ \\ |
|
1564 |
\\ |
|
1565 |
$(r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc'$ & |
|
1566 |
$\longrightarrow_{\backslash c''} $ & |
|
1567 |
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') |
|
1568 |
+ r_2 \backslash cc'c''$ & |
|
1569 |
$ \longrightarrow_{\backslash c''} \quad \ldots$\\ |
|
1570 |
\end{tabular} |
|
557 | 1571 |
\end{center} |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1572 |
Roughly speaking $r_1 \cdot r_2 \backslash s$ can be expressed as |
558 | 1573 |
a giant alternative taking a list of terms |
1574 |
$[r_1 \backslash_r s \cdot r_2, r_2 \backslash_r s'', r_2 \backslash_r s_1'', \ldots]$, |
|
1575 |
where the head of the list is always the term |
|
1576 |
representing a match involving only $r_1$, and the tail of the list consisting of |
|
1577 |
terms of the shape $r_2 \backslash_r s''$, $s''$ being a suffix of $s$. |
|
618 | 1578 |
This intuition is also echoed by Murugesan and Sundaram \cite{Murugesan2014}, |
1579 |
where they gave |
|
557 | 1580 |
a pencil-and-paper derivation of $(r_1 \cdot r_2)\backslash s$: |
532 | 1581 |
\begin{center} |
618 | 1582 |
\begin{tabular}{lc} |
1583 |
$L \; [ (r_1 \cdot r_2) \backslash_r (c_1 :: c_2 :: \ldots c_n) ]$ & $ =$\\ |
|
1584 |
\\ |
|
1585 |
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1) \cdot r_2 + |
|
1586 |
(\delta\; (\nullable \; r_1) \; (r_2 \backslash_r c_1) )) \backslash_r |
|
1587 |
(c_2 :: \ldots c_n) ]$ & |
|
1588 |
$=$\\ |
|
1589 |
\\ |
|
1590 |
\rule{0pt}{3ex} $L \; [ ((r_1 \backslash_r c_1c_2 \cdot r_2 + |
|
1591 |
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
|
1592 |
$ & \\ |
|
1593 |
\\ |
|
1594 |
$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) |
|
1595 |
\backslash_r (c_3 \ldots c_n) ]$ & $\ldots$ \\ |
|
558 | 1596 |
\end{tabular} |
557 | 1597 |
\end{center} |
1598 |
\noindent |
|
618 | 1599 |
The $\delta$ function |
1600 |
returns $r$ when the boolean condition |
|
1601 |
$b$ evaluates to true and |
|
639 | 1602 |
$\ZERO_r$ otherwise: |
618 | 1603 |
\begin{center} |
1604 |
\begin{tabular}{lcl} |
|
1605 |
$\delta \; b\; r$ & $\dn$ & $r \quad \textit{if} \; b \; is \;\textit{true}$\\ |
|
639 | 1606 |
& $\dn$ & $\ZERO_r \quad otherwise$ |
618 | 1607 |
\end{tabular} |
1608 |
\end{center} |
|
1609 |
\noindent |
|
1610 |
Note that the term |
|
1611 |
\begin{center} |
|
1612 |
\begin{tabular}{lc} |
|
1613 |
\rule{0pt}{3ex} $((r_1 \backslash_r c_1c_2 \cdot r_2 + |
|
1614 |
(\delta \; (\nullable \; r_1) \; (r_2 \backslash_r c_1c_2))) |
|
1615 |
$ & \\ |
|
1616 |
\\ |
|
1617 |
$\quad + (\delta \ (\nullable \; r_1 \backslash_r c)\; (r_2 \backslash_r c_2) )) |
|
1618 |
\backslash_r (c_3 \ldots c_n)$ &\\ |
|
1619 |
\end{tabular} |
|
1620 |
\end{center} |
|
1621 |
\noindent |
|
558 | 1622 |
does not faithfully |
1623 |
represent what the intermediate derivatives would actually look like |
|
1624 |
when one or more intermediate results $r_1 \backslash s' \cdot r_2$ are not |
|
1625 |
nullable in the head of the sequence. |
|
1626 |
For example, when $r_1$ and $r_1 \backslash_r c_1$ are not nullable, |
|
1627 |
the regular expression would not look like |
|
1628 |
\[ |
|
618 | 1629 |
r_1 \backslash_r c_1c_2 |
558 | 1630 |
\] |
618 | 1631 |
instead of |
1632 |
\[ |
|
639 | 1633 |
(r_1 \backslash_r c_1c_2 + \ZERO_r ) + \ZERO_r. |
618 | 1634 |
\] |
639 | 1635 |
The redundant $\ZERO_r$s will not be created in the |
558 | 1636 |
first place. |
618 | 1637 |
In a closed-form one needs to take into account this (because |
1638 |
closed forms require exact equality rather than language equivalence) |
|
1639 |
and only generate the |
|
1640 |
$r_2 \backslash_r s''$ terms satisfying the property |
|
1641 |
\begin{center} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1642 |
$\exists s'. such \; that \; s'@s'' = s \;\; \land \;\; |
618 | 1643 |
r_1 \backslash s' \; is \; nullable$. |
1644 |
\end{center} |
|
1645 |
Given the arguments $s$ and $r_1$, we denote the list of strings |
|
1646 |
$s''$ satisfying the above property as $\vsuf{s}{r_1}$. |
|
1647 |
The function $\vsuf{\_}{\_}$ is defined recursively on the structure of the string\footnote{ |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1648 |
Perhaps a better name for it would be ``NullablePrefixSuffix'' |
618 | 1649 |
to differentiate with the list of \emph{all} prefixes of $s$, but |
1650 |
that is a bit too long for a function name and we are yet to find |
|
1651 |
a more concise and easy-to-understand name.} |
|
558 | 1652 |
\begin{center} |
593 | 1653 |
\begin{tabular}{lcl} |
1654 |
$\vsuf{[]}{\_} $ & $=$ & $[]$\\ |
|
618 | 1655 |
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} \; (\rnullable{r_1}) \; \textit{then} \; (\vsuf{cs}{(\rder{c}{r_1})}) @ [c :: cs]$\\ |
593 | 1656 |
&& $\textit{else} \; (\vsuf{cs}{(\rder{c}{r_1}) }) $ |
1657 |
\end{tabular} |
|
558 | 1658 |
\end{center} |
1659 |
\noindent |
|
618 | 1660 |
The list starts with shorter suffixes |
1661 |
and ends with longer ones (in other words, the string elements $s''$ |
|
1662 |
in the list $\vsuf{s}{r_1}$ are sorted |
|
1663 |
in the same order as that of the terms $r_2\backslash s''$ |
|
1664 |
appearing in $(r_1\cdot r_2)\backslash s$). |
|
558 | 1665 |
In essence, $\vsuf{\_}{\_}$ is doing a |
1666 |
"virtual derivative" of $r_1 \cdot r_2$, but instead of producing |
|
1667 |
the entire result $(r_1 \cdot r_2) \backslash s$, |
|
618 | 1668 |
it only stores strings, |
1669 |
with each string $s''$ representing a term such that $r_2 \backslash s''$ |
|
1670 |
is occurring in $(r_1\cdot r_2)\backslash s$. |
|
558 | 1671 |
|
618 | 1672 |
With $\textit{Suffix}$ we are ready to express the |
1673 |
sequence regular expression's closed form, |
|
1674 |
but before doing so |
|
639 | 1675 |
more definitions are needed. |
618 | 1676 |
The first thing is the flattening function $\sflat{\_}$, |
1677 |
which takes an alternative regular expression and produces a flattened version |
|
1678 |
of that alternative regular expression. |
|
1679 |
It is needed to convert |
|
557 | 1680 |
a left-associative nested sequence of alternatives into |
1681 |
a flattened list: |
|
558 | 1682 |
\[ |
618 | 1683 |
\sum(\ldots ((r_1 + r_2) + r_3) + \ldots) |
1684 |
\stackrel{\sflat{\_}}{\rightarrow} |
|
1685 |
\sum[r_1, r_2, r_3, \ldots] |
|
558 | 1686 |
\] |
1687 |
\noindent |
|
618 | 1688 |
The definitions of $\sflat{\_}$ and helper functions |
1689 |
$\sflataux{\_}$ and $\llparenthesis \_ \rrparenthesis''$ are given below. |
|
593 | 1690 |
\begin{center} |
618 | 1691 |
\begin{tabular}{lcl} |
1692 |
$\sflataux{\sum r :: rs}$ & $\dn$ & $\sflataux{r} @ rs$\\ |
|
1693 |
$\sflataux{\sum []}$ & $ \dn $ & $ []$\\ |
|
1694 |
$\sflataux r$ & $\dn$ & $ [r]$ |
|
593 | 1695 |
\end{tabular} |
532 | 1696 |
\end{center} |
1697 |
||
593 | 1698 |
\begin{center} |
618 | 1699 |
\begin{tabular}{lcl} |
1700 |
$\sflat{(\sum r :: rs)}$ & $\dn$ & $\sum (\sflataux{r} @ rs)$\\ |
|
1701 |
$\sflat{\sum []}$ & $ \dn $ & $ \sum []$\\ |
|
1702 |
$\sflat r$ & $\dn$ & $ r$ |
|
1703 |
\end{tabular} |
|
1704 |
\end{center} |
|
1705 |
||
1706 |
\begin{center} |
|
1707 |
\begin{tabular}{lcl} |
|
1708 |
$\sflataux{[]}'$ & $ \dn $ & $ []$\\ |
|
1709 |
$\sflataux{ (r_1 + r_2) :: rs }'$ & $\dn$ & $r_1 :: r_2 :: rs$\\ |
|
639 | 1710 |
$\sflataux{r :: rs}'$ & $\dn$ & $ r::rs$ |
593 | 1711 |
\end{tabular} |
557 | 1712 |
\end{center} |
558 | 1713 |
\noindent |
576 | 1714 |
$\sflataux{\_}$ breaks up nested alternative regular expressions |
557 | 1715 |
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape |
558 | 1716 |
into a "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$. |
557 | 1717 |
It will return the singleton list $[r]$ otherwise. |
1718 |
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps |
|
1719 |
the output type a regular expression, not a list. |
|
558 | 1720 |
$\sflataux{\_}$ and $\sflat{\_}$ are only recursive on the |
1721 |
first element of the list. |
|
618 | 1722 |
$\sflataux{\_}'$ takes a list of regular expressions as input, and outputs |
1723 |
a list of regular expressions. |
|
1724 |
The use of $\sflataux{\_}$ and $\sflataux{\_}'$ is clear once we have |
|
1725 |
$\textit{createdBySequence}$ defined: |
|
1726 |
\begin{center} |
|
1727 |
\begin{mathpar} |
|
1728 |
\inferrule{\mbox{}}{\textit{createdBySequence}\; (r_1 \cdot r_2)} |
|
558 | 1729 |
|
618 | 1730 |
\inferrule{\textit{createdBySequence} \; r_1}{\textit{createdBySequence} \; |
1731 |
(r_1 + r_2)} |
|
1732 |
\end{mathpar} |
|
1733 |
\end{center} |
|
1734 |
\noindent |
|
1735 |
The predicate $\textit{createdBySequence}$ is used to describe the shape of |
|
1736 |
the derivative regular expressions $(r_1\cdot r_2) \backslash s$: |
|
1737 |
\begin{lemma}\label{recursivelyDerseq} |
|
1738 |
It is always the case that |
|
1739 |
\begin{center} |
|
1740 |
$\textit{createdBySequence} \; ( (r_1\cdot r_2) \backslash_r s) $ |
|
1741 |
\end{center} |
|
1742 |
holds. |
|
1743 |
\end{lemma} |
|
1744 |
\begin{proof} |
|
1745 |
By a reverse induction on the string $s$, where the inductive cases are $[]$ |
|
1746 |
and $xs @ [x]$. |
|
1747 |
\end{proof} |
|
1748 |
\noindent |
|
639 | 1749 |
If we have a regular expression $r$ whose shape |
618 | 1750 |
fits into those described by $\textit{createdBySequence}$, |
639 | 1751 |
then we can convert between |
618 | 1752 |
$r \backslash_r c$ and $(\sflataux{r}) \backslash_r c$ with |
1753 |
$\sflataux{\_}'$: |
|
1754 |
\begin{lemma}\label{sfauIdemDer} |
|
1755 |
If $\textit{createdBySequence} \; r$, then |
|
1756 |
\begin{center} |
|
1757 |
$\sflataux{ r \backslash_r c} = |
|
1758 |
\llparenthesis (\map \; (\_ \backslash_r c) \; (\sflataux{r}) ) \rrparenthesis''$ |
|
1759 |
\end{center} |
|
1760 |
holds. |
|
1761 |
\end{lemma} |
|
1762 |
\begin{proof} |
|
1763 |
By a simple induction on the inductive cases of $\textit{createdBySequence}. |
|
1764 |
$ |
|
1765 |
\end{proof} |
|
1766 |
||
1767 |
Now we are ready to express |
|
1768 |
the shape of $r_1 \cdot r_2 \backslash s$ |
|
558 | 1769 |
\begin{lemma}\label{seqSfau0} |
618 | 1770 |
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
639 | 1771 |
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
558 | 1772 |
\end{lemma} |
1773 |
\begin{proof} |
|
618 | 1774 |
By a reverse induction on the string $s$, where the inductive cases |
1775 |
are $[]$ and $xs @ [x]$. |
|
1776 |
For the inductive case, we know that $\textit{createdBySequence} \; ((r_1 \cdot r_2) |
|
1777 |
\backslash_r xs)$ holds from lemma \ref{recursivelyDerseq}, |
|
1778 |
which can be used to prove |
|
558 | 1779 |
\[ |
1780 |
\map \; (r_2 \backslash_r \_) \; (\vsuf{[x]}{(r_1 \backslash_r xs)}) \;\; @ \;\; |
|
1781 |
\map \; (\_ \backslash_r x) \; (\map \; (r_2 \backslash \_) \; (\vsuf{xs}{r_1})) |
|
1782 |
\] |
|
593 | 1783 |
= |
558 | 1784 |
\[ |
1785 |
\map \; (r_2 \backslash_r \_) \; (\vsuf{xs @ [x]}{r_1}) |
|
1786 |
\] |
|
618 | 1787 |
using lemma \ref{sfauIdemDer}. |
1788 |
This equality enables the inductive case to go through. |
|
558 | 1789 |
\end{proof} |
1790 |
\noindent |
|
618 | 1791 |
This lemma says that $(r_1\cdot r_2)\backslash s$ |
1792 |
can be flattened into a list whose head and tail meet the description |
|
1793 |
we gave earlier. |
|
1794 |
%Note that this lemma does $\mathbf{not}$ depend on any |
|
1795 |
%specific definitions we used, |
|
1796 |
%allowing people investigating derivatives to get an alternative |
|
1797 |
%view of what $r_1 \cdot r_2$ is. |
|
532 | 1798 |
|
618 | 1799 |
We now use $\textit{createdBySequence}$ and |
1800 |
$\sflataux{\_}$ to describe an intuition |
|
639 | 1801 |
behind the sequence closed form. |
618 | 1802 |
If two regular expressions only differ in the way their |
1803 |
alternatives are nested, then we should be able to get the same result |
|
1804 |
once we apply simplification to both of them: |
|
1805 |
\begin{lemma}\label{sflatRsimpeq} |
|
1806 |
If $r$ is created from a sequence through |
|
1807 |
a series of derivatives |
|
1808 |
(i.e. if $\textit{createdBySequence} \; r$ holds), |
|
1809 |
and that $\sflataux{r} = rs$, |
|
1810 |
then we have |
|
1811 |
that |
|
1812 |
\begin{center} |
|
1813 |
$\textit{rsimp} \; r = \textit{rsimp} \; (\sum \; rs)$ |
|
1814 |
\end{center} |
|
1815 |
holds. |
|
1816 |
\end{lemma} |
|
1817 |
\begin{proof} |
|
1818 |
By an induction on the inductive cases of $\textit{createdBySequence}$. |
|
1819 |
\end{proof} |
|
1820 |
||
1821 |
Now we are ready for the closed form |
|
1822 |
for the sequence regular expressions (without the inner applications |
|
1823 |
of simplifications): |
|
1824 |
\begin{lemma}\label{seqClosedFormGeneral} |
|
558 | 1825 |
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} } |
1826 |
=\rsimp{(\sum ( (r_1 \backslash s) \cdot r_2 :: |
|
593 | 1827 |
\map\; (r_2 \backslash \_) \; (\vsuf{s}{r_1})))}$ |
558 | 1828 |
\end{lemma} |
1829 |
\begin{proof} |
|
618 | 1830 |
We know that |
1831 |
$\sflataux{(r_1 \cdot r_2) \backslash_r s} = (r_1 \backslash_r s) \cdot r_2 |
|
639 | 1832 |
:: (\map \; (r_2 \backslash_r \_) \; (\textit{Suffix} \; s \; r_1))$ |
618 | 1833 |
holds |
1834 |
by lemma \ref{seqSfau0}. |
|
1835 |
This allows the theorem to go through because of lemma \ref{sflatRsimpeq}. |
|
558 | 1836 |
\end{proof} |
618 | 1837 |
Together with the idempotency property of $\rsimp{}$ (lemma \ref{rsimpIdem}), |
1838 |
it is possible to convert the above lemma to obtain the |
|
1839 |
proper closed form for $\backslash_{rsimps}$ rather than $\backslash_r$: |
|
1840 |
for derivatives nested with simplification: |
|
1841 |
\begin{theorem}\label{seqClosedForm} |
|
1842 |
$\rderssimp{(r_1 \cdot r_2)}{s} = \rsimp{(\sum ((r_1 \backslash s) \cdot r_2 ) |
|
1843 |
:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1})))}$ |
|
1844 |
\end{theorem} |
|
1845 |
\begin{proof} |
|
1846 |
By a case analysis of the string $s$. |
|
1847 |
When $s$ is an empty list, the rewrite is straightforward. |
|
1848 |
When $s$ is a non-empty list, the |
|
1849 |
lemmas \ref{seqClosedFormGeneral} and \ref{Simpders} apply, |
|
1850 |
making the proof go through. |
|
1851 |
\end{proof} |
|
609 | 1852 |
\subsubsection{Closed Forms for Star Regular Expressions} |
618 | 1853 |
The closed form for the star regular expression involves similar tricks |
1854 |
for the sequence regular expression. |
|
1855 |
The $\textit{Suffix}$ function is now replaced by something |
|
1856 |
slightly more complex, because the growth pattern of star |
|
1857 |
regular expressions' derivatives is a bit different: |
|
564 | 1858 |
\begin{center} |
618 | 1859 |
\begin{tabular}{lclc} |
1860 |
$r^* $ & $\longrightarrow_{\backslash c}$ & |
|
1861 |
$(r\backslash c) \cdot r^*$ & $\longrightarrow_{\backslash c'}$\\ |
|
1862 |
\\ |
|
1863 |
$r \backslash cc' \cdot r^* + r \backslash c' \cdot r^*$ & |
|
1864 |
$\longrightarrow_{\backslash c''}$ & |
|
1865 |
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
|
1866 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*)$ & |
|
1867 |
$\longrightarrow_{\backslash c'''}$ \\ |
|
1868 |
\\ |
|
1869 |
$\ldots$\\ |
|
1870 |
\end{tabular} |
|
564 | 1871 |
\end{center} |
1872 |
When we have a string $s = c :: c' :: c'' \ldots$ such that $r \backslash c$, $r \backslash cc'$, $r \backslash c'$, |
|
1873 |
$r \backslash cc'c''$, $r \backslash c'c''$, $r\backslash c''$ etc. are all nullable, |
|
618 | 1874 |
the number of terms in $r^* \backslash s$ will grow exponentially rather than linearly |
1875 |
in the sequence case. |
|
1876 |
The good news is that the function $\textit{rsimp}$ will again |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
1877 |
ignore the difference between different nesting patterns of alternatives, |
618 | 1878 |
and the exponentially growing star derivative like |
1879 |
\begin{center} |
|
1880 |
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
|
1881 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
1882 |
\end{center} |
|
1883 |
can be treated as |
|
1884 |
\begin{center} |
|
1885 |
$\RALTS{[r_1 \backslash cc'c'' \cdot r^*, r \backslash c'', |
|
1886 |
r \backslash c'c'' \cdot r^*, r \backslash c'' \cdot r^*]}$ |
|
1887 |
\end{center} |
|
1888 |
which can be de-duplicated by $\rDistinct$ |
|
1889 |
and therefore bounded finitely. |
|
564 | 1890 |
|
639 | 1891 |
%and then de-duplicate terms of the form ($s'$ being a substring of $s$). |
1892 |
%This allows us to use a similar technique as $r_1 \cdot r_2$ case, |
|
618 | 1893 |
|
1894 |
Now the crux of this section is finding a suitable description |
|
1895 |
for $rs$ where |
|
1896 |
\begin{center} |
|
1897 |
$\rderssimp{r^*}{s} = \rsimp{\sum rs}$. |
|
1898 |
\end{center} |
|
1899 |
holds. |
|
1900 |
In addition, the list $rs$ |
|
1901 |
shall be in the form of |
|
1902 |
$\map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss$. |
|
1903 |
The $Ss$ is a list of strings, and for example in the sequence |
|
1904 |
closed form it is specified as $\textit{Suffix} \; s \; r_1$. |
|
1905 |
To get $Ss$ for the star regular expression, |
|
1906 |
we need to introduce $\starupdate$ and $\starupdates$: |
|
558 | 1907 |
\begin{center} |
1908 |
\begin{tabular}{lcl} |
|
1909 |
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
|
1910 |
$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\ |
|
1911 |
& & $\textit{if} \; |
|
620 | 1912 |
(\rnullable \; (r \backslash_{rs} s))$ \\ |
558 | 1913 |
& & $\textit{then} \;\; (s @ [c]) :: [c] :: ( |
1914 |
\starupdate \; c \; r \; Ss)$ \\ |
|
1915 |
& & $\textit{else} \;\; (s @ [c]) :: ( |
|
1916 |
\starupdate \; c \; r \; Ss)$ |
|
1917 |
\end{tabular} |
|
1918 |
\end{center} |
|
1919 |
\begin{center} |
|
1920 |
\begin{tabular}{lcl} |
|
1921 |
$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\ |
|
1922 |
$\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; ( |
|
1923 |
\starupdate \; c \; r \; Ss)$ |
|
1924 |
\end{tabular} |
|
1925 |
\end{center} |
|
1926 |
\noindent |
|
618 | 1927 |
Assuming we have that |
1928 |
\begin{center} |
|
1929 |
$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$. |
|
1930 |
\end{center} |
|
1931 |
holds. |
|
1932 |
The idea of $\starupdate$ and $\starupdates$ |
|
1933 |
is to update $Ss$ when another |
|
1934 |
derivative is taken on $\rderssimp{r^*}{s}$ |
|
1935 |
w.r.t a character $c$ and a string $s'$ |
|
1936 |
respectively. |
|
1937 |
Both $\starupdate$ and $\starupdates$ take three arguments as input: |
|
1938 |
the new character $c$ or string $s$ to take derivative with, |
|
1939 |
the regular expression |
|
1940 |
$r$ under the star $r^*$, and the |
|
1941 |
list of strings $Ss$ for the derivative $r^* \backslash s$ |
|
1942 |
up until this point |
|
1943 |
such that |
|
1944 |
\begin{center} |
|
1945 |
$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$ |
|
1946 |
\end{center} |
|
1947 |
is satisfied. |
|
1948 |
||
1949 |
Functions $\starupdate$ and $\starupdates$ characterise what the |
|
1950 |
star derivatives will look like once ``straightened out'' into lists. |
|
1951 |
The helper functions for such operations will be similar to |
|
1952 |
$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence. |
|
1953 |
We use similar symbols to denote them, with a $*$ subscript to mark the difference. |
|
558 | 1954 |
\begin{center} |
1955 |
\begin{tabular}{lcl} |
|
1956 |
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\ |
|
1957 |
$\hflataux{r}$ & $\dn$ & $[r]$ |
|
1958 |
\end{tabular} |
|
1959 |
\end{center} |
|
557 | 1960 |
|
1961 |
\begin{center} |
|
558 | 1962 |
\begin{tabular}{lcl} |
1963 |
$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\ |
|
1964 |
$\hflat{r}$ & $\dn$ & $r$ |
|
1965 |
\end{tabular} |
|
1966 |
\end{center} |
|
1967 |
\noindent |
|
618 | 1968 |
These definitions are tailor-made for dealing with alternatives that have |
1969 |
originated from a star's derivatives. |
|
1970 |
A typical star derivative always has the structure of a balanced binary tree: |
|
564 | 1971 |
\begin{center} |
618 | 1972 |
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') + |
1973 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $ |
|
593 | 1974 |
\end{center} |
618 | 1975 |
All of the nested structures of alternatives |
1976 |
generated from derivatives are binary, and therefore |
|
1977 |
$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives. |
|
1978 |
$\hflat{\_}$ ``untangles'' like the following: |
|
1979 |
\[ |
|
1980 |
\sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \; |
|
1981 |
\stackrel{\hflat{\_}}{\longrightarrow} \; |
|
1982 |
\RALTS{[r_1, r_2, \ldots, r_n]} |
|
1983 |
\] |
|
1984 |
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$, |
|
1985 |
with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists |
|
1986 |
and merges each of the element lists to form a flattened list.}: |
|
1987 |
\begin{lemma}\label{stupdateInduct1} |
|
1988 |
\mbox |
|
1989 |
For a list of strings $Ss$, the following hold. |
|
1990 |
\begin{itemize} |
|
1991 |
\item |
|
1992 |
If we do a derivative on the terms |
|
1993 |
$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$), |
|
1994 |
the result will be the same as if we apply $\starupdate$ to $Ss$. |
|
1995 |
\begin{center} |
|
1996 |
\begin{tabular}{c} |
|
1997 |
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x) |
|
1998 |
\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\; |
|
1999 |
$\\ |
|
2000 |
\\ |
|
2001 |
$=$ \\ |
|
2002 |
\\ |
|
2003 |
$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \; |
|
2004 |
(\starupdate \; x \; r \; Ss)$ |
|
2005 |
\end{tabular} |
|
2006 |
\end{center} |
|
2007 |
\item |
|
2008 |
$\starupdates$ is ``composable'' w.r.t a derivative. |
|
2009 |
It piggybacks the character $x$ to the tail of the string |
|
2010 |
$xs$. |
|
2011 |
\begin{center} |
|
2012 |
\begin{tabular}{c} |
|
2013 |
$\textit{concat} \; (\map \; \hflataux{\_} \; |
|
2014 |
(\map \; (\_\backslash_r x) \; |
|
2015 |
(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot |
|
2016 |
(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\ |
|
2017 |
\\ |
|
2018 |
$=$\\ |
|
2019 |
\\ |
|
2020 |
$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \; |
|
2021 |
(\starupdates \; (xs @ [x]) \; r \; Ss)$ |
|
2022 |
\end{tabular} |
|
2023 |
\end{center} |
|
2024 |
\end{itemize} |
|
2025 |
\end{lemma} |
|
2026 |
||
2027 |
\begin{proof} |
|
2028 |
Part 1 is by induction on $Ss$. |
|
2029 |
Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values. |
|
2030 |
\end{proof} |
|
2031 |
||
593 | 2032 |
|
618 | 2033 |
Like $\textit{createdBySequence}$, we need |
2034 |
a predicate for ``star-created'' regular expressions: |
|
593 | 2035 |
\begin{center} |
618 | 2036 |
\begin{mathpar} |
2037 |
\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} } |
|
2038 |
||
2039 |
\inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) } |
|
2040 |
\end{mathpar} |
|
593 | 2041 |
\end{center} |
618 | 2042 |
\noindent |
2043 |
All regular expressions created by taking derivatives of |
|
2044 |
$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate: |
|
2045 |
\begin{lemma}\label{starDersCbs} |
|
2046 |
$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds. |
|
2047 |
\end{lemma} |
|
2048 |
\begin{proof} |
|
2049 |
By a reverse induction on $s$. |
|
2050 |
\end{proof} |
|
2051 |
If a regular expression conforms to the shape of a star's derivative, |
|
2052 |
then we can push an application of $\hflataux{\_}$ inside a derivative of it: |
|
2053 |
\begin{lemma}\label{hfauPushin} |
|
2054 |
If $\textit{createdByStar} \; r$ holds, then |
|
2055 |
\begin{center} |
|
2056 |
$\hflataux{r \backslash_r c} = \textit{concat} \; ( |
|
2057 |
\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$ |
|
2058 |
\end{center} |
|
2059 |
holds. |
|
2060 |
\end{lemma} |
|
2061 |
\begin{proof} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2062 |
By an induction on the inductive cases of $\textit{createdByStar}$. |
618 | 2063 |
\end{proof} |
2064 |
%This is not entirely true for annotated regular expressions: |
|
2065 |
%%TODO: bsimp bders \neq bderssimp |
|
2066 |
%\begin{center} |
|
2067 |
% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$ |
|
2068 |
%\end{center} |
|
2069 |
%For bit-codes, the order in which simplification is applied |
|
2070 |
%might cause a difference in the location they are placed. |
|
2071 |
%If we want something like |
|
2072 |
%\begin{center} |
|
2073 |
% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$ |
|
2074 |
%\end{center} |
|
2075 |
%Some "canonicalization" procedure is required, |
|
2076 |
%which either pushes all the common bitcodes to nodes |
|
2077 |
%as senior as possible: |
|
2078 |
%\begin{center} |
|
2079 |
% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $ |
|
2080 |
%\end{center} |
|
2081 |
%or does the reverse. However bitcodes are not of interest if we are talking about |
|
2082 |
%the $\llbracket r \rrbracket$ size of a regex. |
|
2083 |
%Therefore for the ease and simplicity of producing a |
|
2084 |
%proof for a size bound, we are happy to restrict ourselves to |
|
2085 |
%unannotated regular expressions, and obtain such equalities as |
|
2086 |
%TODO: rsimp sflat |
|
2087 |
% The simplification of a flattened out regular expression, provided it comes |
|
2088 |
%from the derivative of a star, is the same as the one nested. |
|
2089 |
||
2090 |
||
2091 |
||
2092 |
Now we introduce an inductive property |
|
2093 |
for $\starupdate$ and $\hflataux{\_}$. |
|
2094 |
\begin{lemma}\label{starHfauInduct} |
|
2095 |
If we do derivatives of $r^*$ |
|
2096 |
with a string that starts with $c$, |
|
2097 |
then flatten it out, |
|
2098 |
we obtain a list |
|
2099 |
of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$, |
|
2100 |
where $sS = \starupdates \; s \; r \; [[c]]$. Namely, |
|
2101 |
\begin{center} |
|
620 | 2102 |
$\hflataux{(( (\rder{c}{r_0})\cdot(r_0^*))\backslash_{rs} s)} = |
618 | 2103 |
\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \; |
2104 |
(\starupdates \; s \; r_0 \; [[c]])$ |
|
2105 |
\end{center} |
|
2106 |
holds. |
|
2107 |
\end{lemma} |
|
2108 |
\begin{proof} |
|
2109 |
By an induction on $s$, the inductive cases |
|
2110 |
being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used. |
|
2111 |
\end{proof} |
|
2112 |
\noindent |
|
2113 |
||
639 | 2114 |
The function $\hflataux{\_}$ has a similar effect as $\textit{flatten}$: |
618 | 2115 |
\begin{lemma}\label{hflatauxGrewrites} |
2116 |
$a :: rs \grewrites \hflataux{a} @ rs$ |
|
2117 |
\end{lemma} |
|
2118 |
\begin{proof} |
|
2119 |
By induction on $a$. $rs$ is set to take arbitrary values. |
|
2120 |
\end{proof} |
|
638 | 2121 |
It is also not surprising that |
2122 |
two regular expressions differing only in terms |
|
2123 |
of the |
|
2124 |
nesting of parentheses are equivalent w.r.t. $\textit{rsimp}$: |
|
618 | 2125 |
\begin{lemma}\label{cbsHfauRsimpeq1} |
2126 |
$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$ |
|
593 | 2127 |
\end{lemma} |
2128 |
||
2129 |
\begin{proof} |
|
2130 |
By using the rewriting relation $\rightsquigarrow$ |
|
2131 |
\end{proof} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2132 |
And from this we obtain the following fact: a |
639 | 2133 |
regular expression created by star |
2134 |
is the same as its flattened version, up to equivalence under $\textit{bsimp}$. |
|
593 | 2135 |
For example, |
618 | 2136 |
\begin{lemma}\label{hfauRsimpeq2} |
639 | 2137 |
$\textit{createdByStar} \; r \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$ |
593 | 2138 |
\end{lemma} |
2139 |
\begin{proof} |
|
618 | 2140 |
By structural induction on $r$, where the induction rules |
2141 |
are these of $\createdByStar{\_}$. |
|
2142 |
Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case. |
|
593 | 2143 |
\end{proof} |
564 | 2144 |
|
2145 |
||
618 | 2146 |
%Here is a corollary that states the lemma in |
2147 |
%a more intuitive way: |
|
2148 |
%\begin{corollary} |
|
2149 |
% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot |
|
2150 |
% (r^*))\; (\starupdates \; c\; r\; [[c]])$ |
|
2151 |
%\end{corollary} |
|
2152 |
%\noindent |
|
2153 |
%Note that this is also agnostic of the simplification |
|
2154 |
%function we defined, and is therefore of more general interest. |
|
2155 |
||
2156 |
Together with the rewriting relation |
|
2157 |
\begin{lemma}\label{starClosedForm6Hrewrites} |
|
2158 |
We have the following set of rewriting relations or equalities: |
|
2159 |
\begin{itemize} |
|
2160 |
\item |
|
2161 |
$\textit{rsimp} \; (r^* \backslash_r (c::s)) |
|
2162 |
\sequal |
|
2163 |
\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; ( |
|
2164 |
\starupdates \; s \; r \; [ c::[]] ) ) )$ |
|
2165 |
\item |
|
620 | 2166 |
$r \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( ( |
618 | 2167 |
\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \; |
2168 |
(\starupdates \;s \; r \; [ c::[] ])))))$ |
|
2169 |
\item |
|
2170 |
$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss)) |
|
2171 |
\sequal |
|
2172 |
\sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \; |
|
2173 |
r^*) \; Ss) )$ |
|
2174 |
\item |
|
2175 |
$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss |
|
2176 |
\scfrewrites |
|
2177 |
\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$ |
|
2178 |
\item |
|
2179 |
$( ( \sum ( ( \map \ (\lambda s. \;\; |
|
2180 |
(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \; |
|
2181 |
s \; r \; [ c::[] ])))))$\\ |
|
2182 |
$\sequal$\\ |
|
2183 |
$( ( \sum ( ( \map \ (\lambda s. \;\; |
|
620 | 2184 |
( r \backslash_{rsimps} s)) \cdot r^*) \; (\starupdates \; |
618 | 2185 |
s \; r \; [ c::[] ]))))$\\ |
2186 |
\end{itemize} |
|
558 | 2187 |
\end{lemma} |
2188 |
\begin{proof} |
|
618 | 2189 |
Part 1 leads to part 2. |
2190 |
The rest of them are routine. |
|
558 | 2191 |
\end{proof} |
2192 |
\noindent |
|
618 | 2193 |
Next the closed form for star regular expressions can be derived: |
2194 |
\begin{theorem}\label{starClosedForm} |
|
558 | 2195 |
$\rderssimp{r^*}{c::s} = |
2196 |
\rsimp{ |
|
2197 |
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
593 | 2198 |
(\starupdates \; s\; r \; [[c]]) |
558 | 2199 |
) |
593 | 2200 |
) |
2201 |
} |
|
558 | 2202 |
$ |
618 | 2203 |
\end{theorem} |
558 | 2204 |
\begin{proof} |
2205 |
By an induction on $s$. |
|
618 | 2206 |
The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites} |
2207 |
and \ref{hfauRsimpeq2} |
|
558 | 2208 |
are used. |
618 | 2209 |
In \ref{starClosedForm6Hrewrites}, the equalities are |
2210 |
used to link the LHS and RHS. |
|
558 | 2211 |
\end{proof} |
609 | 2212 |
|
2213 |
||
2214 |
||
2215 |
||
2216 |
||
2217 |
||
613 | 2218 |
%---------------------------------------------------------------------------------------- |
2219 |
% SECTION ?? |
|
2220 |
%---------------------------------------------------------------------------------------- |
|
2221 |
||
2222 |
%----------------------------------- |
|
2223 |
% SECTION syntactic equivalence under simp |
|
2224 |
%----------------------------------- |
|
2225 |
||
2226 |
||
2227 |
%---------------------------------------------------------------------------------------- |
|
2228 |
% SECTION ALTS CLOSED FORM |
|
2229 |
%---------------------------------------------------------------------------------------- |
|
2230 |
%\section{A Closed Form for \textit{ALTS}} |
|
2231 |
%Now we prove that $rsimp (rders\_simp (RALTS rs) s) = rsimp (RALTS (map (\lambda r. rders\_simp r s) rs))$. |
|
2232 |
% |
|
2233 |
% |
|
2234 |
%There are a few key steps, one of these steps is |
|
2235 |
% |
|
2236 |
% |
|
2237 |
% |
|
2238 |
%One might want to prove this by something a simple statement like: |
|
2239 |
% |
|
2240 |
%For this to hold we want the $\textit{distinct}$ function to pick up |
|
2241 |
%the elements before and after derivatives correctly: |
|
2242 |
%$r \in rset \equiv (rder x r) \in (rder x rset)$. |
|
2243 |
%which essentially requires that the function $\backslash$ is an injective mapping. |
|
2244 |
% |
|
2245 |
%Unfortunately the function $\backslash c$ is not an injective mapping. |
|
2246 |
% |
|
2247 |
%\subsection{function $\backslash c$ is not injective (1-to-1)} |
|
2248 |
%\begin{center} |
|
2249 |
% The derivative $w.r.t$ character $c$ is not one-to-one. |
|
2250 |
% Formally, |
|
2251 |
% $\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$ |
|
2252 |
%\end{center} |
|
2253 |
%This property is trivially true for the |
|
2254 |
%character regex example: |
|
2255 |
%\begin{center} |
|
2256 |
% $r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$ |
|
2257 |
%\end{center} |
|
2258 |
%But apart from the cases where the derivative |
|
2259 |
%output is $\ZERO$, are there non-trivial results |
|
2260 |
%of derivatives which contain strings? |
|
2261 |
%The answer is yes. |
|
2262 |
%For example, |
|
2263 |
%\begin{center} |
|
2264 |
% Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\ |
|
2265 |
% where $a$ is not nullable.\\ |
|
2266 |
% $r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\ |
|
2267 |
% $r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$ |
|
2268 |
%\end{center} |
|
2269 |
%We start with two syntactically different regular expressions, |
|
2270 |
%and end up with the same derivative result. |
|
2271 |
%This is not surprising as we have such |
|
2272 |
%equality as below in the style of Arden's lemma:\\ |
|
2273 |
%\begin{center} |
|
2274 |
% $L(A^*B) = L(A\cdot A^* \cdot B + B)$ |
|
2275 |
%\end{center} |
|
2276 |
\section{Bounding Closed Forms} |
|
2277 |
||
2278 |
In this section, we introduce how we formalised the bound |
|
2279 |
on closed forms. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2280 |
We first show that in general the number of regular expressions up to a certain |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2281 |
size is finite. |
618 | 2282 |
Then we prove that functions such as $\rflts$ |
613 | 2283 |
will not cause the size of r-regular expressions to grow. |
2284 |
Putting this together with a general bound |
|
2285 |
on the finiteness of distinct regular expressions |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2286 |
up to a specific size, we obtain a bound on |
613 | 2287 |
the closed forms. |
2288 |
||
618 | 2289 |
\subsection{Finiteness of Distinct Regular Expressions} |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2290 |
We define the set of regular expressions whose size is no more than |
618 | 2291 |
a certain size $N$ as $\textit{sizeNregex} \; N$: |
2292 |
\[ |
|
2293 |
\textit{sizeNregex} \; N \dn \{r\; \mid \; \llbracket r \rrbracket_r \leq N \} |
|
2294 |
\] |
|
2295 |
We have that $\textit{sizeNregex} \; N$ is always a finite set: |
|
2296 |
\begin{lemma}\label{finiteSizeN} |
|
2297 |
$\textit{finite} \; (\textit{sizeNregex} \; N)$ holds. |
|
2298 |
\end{lemma} |
|
2299 |
\begin{proof} |
|
2300 |
By splitting the set $\textit{sizeNregex} \; (N + 1)$ into |
|
2301 |
subsets by their categories: |
|
639 | 2302 |
$\{\ZERO_r, \ONE_r, c\}$, $\{r^* \mid r \in \textit{sizeNregex} \; N\}$, |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2303 |
and so on. Each of these subsets is finitely bounded. |
618 | 2304 |
\end{proof} |
2305 |
\noindent |
|
2306 |
From this we get a corollary that |
|
2307 |
if forall $r \in rs$, $\rsize{r} \leq N$, then the output of |
|
2308 |
$\rdistinct{rs}{\varnothing}$ is a list of regular |
|
2309 |
expressions of finite size depending on $N$ only. |
|
2310 |
\begin{corollary}\label{finiteSizeNCorollary} |
|
2311 |
$\rsize{\rdistinct{rs}{\varnothing}} \leq c_N * N$ holds, |
|
2312 |
where the constant $c_N$ is equal to $\textit{card} \; (\textit{sizeNregex} \; |
|
2313 |
N)$. |
|
2314 |
\end{corollary} |
|
2315 |
\begin{proof} |
|
2316 |
For all $r$ in |
|
2317 |
$\textit{set} \; (\rdistinct{rs}{\varnothing})$, |
|
2318 |
it is always the case that $\rsize{r} \leq N$. |
|
2319 |
In addition, the list length is bounded by |
|
2320 |
$c_N$, yielding the desired bound. |
|
2321 |
\end{proof} |
|
2322 |
\noindent |
|
2323 |
This fact will be handy in estimating the closed form sizes. |
|
2324 |
%We have proven that the size of the |
|
2325 |
%output of $\textit{rdistinct} \; rs' \; \varnothing$ |
|
2326 |
%is bounded by a constant $N * c_N$ depending only on $N$, |
|
2327 |
%provided that each of $rs'$'s element |
|
2328 |
%is bounded by $N$. |
|
2329 |
||
639 | 2330 |
\subsection{$\textit{rsimp}$ Does Not Increase the Size} |
613 | 2331 |
Although it seems evident, we need a series |
2332 |
of non-trivial lemmas to establish that functions such as $\rflts$ |
|
2333 |
do not cause the regular expressions to grow. |
|
2334 |
\begin{lemma}\label{rsimpMonoLemmas} |
|
2335 |
\mbox{} |
|
2336 |
\begin{itemize} |
|
2337 |
\item |
|
2338 |
\[ |
|
2339 |
\llbracket \rsimpalts \; rs \rrbracket_r \leq |
|
2340 |
\llbracket \sum \; rs \rrbracket_r |
|
2341 |
\] |
|
2342 |
\item |
|
2343 |
\[ |
|
2344 |
\llbracket \rsimpseq \; r_1 \; r_2 \rrbracket_r \leq |
|
2345 |
\llbracket r_1 \cdot r_2 \rrbracket_r |
|
2346 |
\] |
|
2347 |
\item |
|
2348 |
\[ |
|
2349 |
\llbracket \rflts \; rs \rrbracket_r \leq |
|
2350 |
\llbracket rs \rrbracket_r |
|
2351 |
\] |
|
2352 |
\item |
|
2353 |
\[ |
|
2354 |
\llbracket \rDistinct \; rs \; ss \rrbracket_r \leq |
|
2355 |
\llbracket rs \rrbracket_r |
|
2356 |
\] |
|
2357 |
\item |
|
2358 |
If all elements $a$ in the set $as$ satisfy the property |
|
2359 |
that $\llbracket \textit{rsimp} \; a \rrbracket_r \leq |
|
2360 |
\llbracket a \rrbracket_r$, then we have |
|
2361 |
\[ |
|
2362 |
\llbracket \; \rsimpalts \; (\textit{rdistinct} \; |
|
2363 |
(\textit{rflts} \; (\textit{map}\;\textit{rsimp} as)) \{\}) |
|
2364 |
\rrbracket \leq |
|
2365 |
\llbracket \; \sum \; (\rDistinct \; (\rflts \;(\map \; |
|
2366 |
\textit{rsimp} \; x))\; \{ \} ) \rrbracket_r |
|
2367 |
\] |
|
2368 |
\end{itemize} |
|
2369 |
\end{lemma} |
|
2370 |
\begin{proof} |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2371 |
Points 1, 3, and 4 can be proven by an induction on $rs$. |
613 | 2372 |
Point 2 is by case analysis on $r_1$ and $r_2$. |
2373 |
The last part is a corollary of the previous ones. |
|
2374 |
\end{proof} |
|
2375 |
\noindent |
|
2376 |
With the lemmas for each inductive case in place, we are ready to get |
|
2377 |
the non-increasing property as a corollary: |
|
2378 |
\begin{corollary}\label{rsimpMono} |
|
2379 |
$\llbracket \textit{rsimp} \; r \rrbracket_r \leq \llbracket r \rrbracket_r$ |
|
2380 |
\end{corollary} |
|
2381 |
\begin{proof} |
|
2382 |
By \ref{rsimpMonoLemmas}. |
|
2383 |
\end{proof} |
|
2384 |
||
609 | 2385 |
\subsection{Estimating the Closed Forms' sizes} |
618 | 2386 |
We recap the closed forms we obtained |
639 | 2387 |
earlier: |
558 | 2388 |
\begin{itemize} |
2389 |
\item |
|
593 | 2390 |
$\rderssimp{(\sum rs)}{s} \sequal |
2391 |
\sum \; (\map \; (\rderssimp{\_}{s}) \; rs)$ |
|
558 | 2392 |
\item |
593 | 2393 |
$\rderssimp{(r_1 \cdot r_2)}{s} \sequal \sum ((r_1 \backslash s) \cdot r_2 ) |
2394 |
:: (\map \; (r_2 \backslash \_) (\vsuf{s}{r_1}))$ |
|
558 | 2395 |
\item |
2396 |
||
593 | 2397 |
$\rderssimp{r^*}{c::s} = |
2398 |
\rsimp{ |
|
2399 |
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \; |
|
558 | 2400 |
(\starupdates \; s\; r \; [[c]]) |
593 | 2401 |
) |
2402 |
) |
|
2403 |
} |
|
2404 |
$ |
|
558 | 2405 |
\end{itemize} |
2406 |
\noindent |
|
2407 |
The closed forms on the left-hand-side |
|
2408 |
are all of the same shape: $\rsimp{ (\sum rs)} $. |
|
2409 |
Such regular expression will be bounded by the size of $\sum rs'$, |
|
2410 |
where every element in $rs'$ is distinct, and each element |
|
2411 |
can be described by some inductive sub-structures |
|
2412 |
(for example when $r = r_1 \cdot r_2$ then $rs'$ |
|
2413 |
will be solely comprised of $r_1 \backslash s'$ |
|
2414 |
and $r_2 \backslash s''$, $s'$ and $s''$ being |
|
2415 |
sub-strings of $s$). |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2416 |
which will each have a size upper bound |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2417 |
according to the inductive hypothesis, which controls $r \backslash s$. |
557 | 2418 |
|
558 | 2419 |
We elaborate the above reasoning by a series of lemmas |
2420 |
below, where straightforward proofs are omitted. |
|
639 | 2421 |
%We want to apply it to our setting $\rsize{\rsimp{\sum rs}}$. |
2422 |
We show that $\textit{rdistinct}$ and $\rflts$ |
|
609 | 2423 |
working together is at least as |
639 | 2424 |
good as $\textit{rdistinct}$ alone, which can be written as |
609 | 2425 |
\begin{center} |
2426 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2427 |
\leq |
|
2428 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
2429 |
\end{center} |
|
2430 |
We need this so that we know the outcome of our real |
|
2431 |
simplification is better than or equal to a rough estimate, |
|
2432 |
and therefore can be bounded by that estimate. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2433 |
This is a bit harder to establish compared to proving |
609 | 2434 |
$\textit{flts}$ does not make a list larger (which can |
2435 |
be proven using routine induction): |
|
2436 |
\begin{center} |
|
2437 |
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
2438 |
\llbracket \textit{rs} \rrbracket_r$ |
|
2439 |
\end{center} |
|
2440 |
We cannot simply prove how each helper function |
|
2441 |
reduces the size and then put them together: |
|
2442 |
From |
|
2443 |
\begin{center} |
|
2444 |
$\llbracket \textit{rflts}\; rs \rrbracket_r \leq |
|
618 | 2445 |
\llbracket \textit{rs} \rrbracket_r$ |
609 | 2446 |
\end{center} |
2447 |
and |
|
2448 |
\begin{center} |
|
2449 |
$\llbracket \textit{rdistinct} \; rs \; \varnothing \leq |
|
2450 |
\llbracket rs \rrbracket_r$ |
|
2451 |
\end{center} |
|
618 | 2452 |
one cannot infer |
609 | 2453 |
\begin{center} |
558 | 2454 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
2455 |
\leq |
|
2456 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
609 | 2457 |
\end{center} |
618 | 2458 |
What we can infer is that |
609 | 2459 |
\begin{center} |
2460 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2461 |
\leq |
|
2462 |
\llbracket rs \rrbracket_r$ |
|
2463 |
\end{center} |
|
2464 |
but this estimate is too rough and $\llbracket rs \rrbracket_r$ is unbounded. |
|
2465 |
The way we |
|
618 | 2466 |
get around this is by first proving a more general lemma |
609 | 2467 |
(so that the inductive case goes through): |
2468 |
\begin{lemma}\label{fltsSizeReductionAlts} |
|
2469 |
If we have three accumulator sets: |
|
2470 |
$noalts\_set$, $alts\_set$ and $corr\_set$, |
|
2471 |
satisfying: |
|
2472 |
\begin{itemize} |
|
2473 |
\item |
|
2474 |
$\forall r \in noalts\_set. \; \nexists xs.\; r = \sum xs$ |
|
2475 |
\item |
|
2476 |
$\forall r \in alts\_set. \; \exists xs. \; r = \sum xs |
|
2477 |
\; \textit{and} \; set \; xs \subseteq corr\_set$ |
|
2478 |
\end{itemize} |
|
2479 |
then we have that |
|
2480 |
\begin{center} |
|
2481 |
\begin{tabular}{lcl} |
|
2482 |
$\llbracket (\textit{rdistinct} \; (\textit{rflts} \; as) \; |
|
2483 |
(noalts\_set \cup corr\_set)) \rrbracket_r$ & $\leq$ &\\ |
|
2484 |
$\llbracket (\textit{rdistinct} \; as \; (noalts\_set \cup alts\_set \cup |
|
639 | 2485 |
\{ \ZERO_r \} )) \rrbracket_r$ & & \\ |
609 | 2486 |
\end{tabular} |
2487 |
\end{center} |
|
2488 |
holds. |
|
532 | 2489 |
\end{lemma} |
558 | 2490 |
\noindent |
618 | 2491 |
We split the accumulator into two parts: the part |
609 | 2492 |
which contains alternative regular expressions ($alts\_set$), and |
2493 |
the part without any of them($noalts\_set$). |
|
618 | 2494 |
This is because $\rflts$ opens up the alternatives in $as$, |
2495 |
causing the accumulators on both sides of the inequality |
|
2496 |
to diverge slightly. |
|
2497 |
If we want to compare the accumulators that are not |
|
2498 |
perfectly in sync, we need to consider the alternatives and non-alternatives |
|
2499 |
separately. |
|
609 | 2500 |
The set $corr\_set$ is the corresponding set |
618 | 2501 |
of $alts\_set$ with all elements under the alternative constructor |
609 | 2502 |
spilled out. |
2503 |
\begin{proof} |
|
2504 |
By induction on the list $as$. We make use of lemma \ref{rdistinctConcat}. |
|
2505 |
\end{proof} |
|
2506 |
By setting all three sets to the empty set, one gets the desired size estimate: |
|
2507 |
\begin{corollary}\label{interactionFltsDB} |
|
2508 |
$\llbracket \rdistinct{(\rflts \; \textit{rs})}{\varnothing} \rrbracket_r |
|
2509 |
\leq |
|
2510 |
\llbracket \rdistinct{rs}{\varnothing} \rrbracket_r $. |
|
2511 |
\end{corollary} |
|
2512 |
\begin{proof} |
|
2513 |
By using the lemma \ref{fltsSizeReductionAlts}. |
|
2514 |
\end{proof} |
|
2515 |
\noindent |
|
618 | 2516 |
The intuition for why this is true |
2517 |
is that if we remove duplicates from the $\textit{LHS}$, at least the same amount of |
|
558 | 2518 |
duplicates will be removed from the list $\textit{rs}$ in the $\textit{RHS}$. |
2519 |
||
2520 |
Now this $\rsimp{\sum rs}$ can be estimated using $\rdistinct{rs}{\varnothing}$: |
|
2521 |
\begin{lemma}\label{altsSimpControl} |
|
2522 |
$\rsize{\rsimp{\sum rs}} \leq \rsize{\rdistinct{rs}{\varnothing}}+ 1$ |
|
532 | 2523 |
\end{lemma} |
558 | 2524 |
\begin{proof} |
618 | 2525 |
By using corollary \ref{interactionFltsDB}. |
558 | 2526 |
\end{proof} |
2527 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2528 |
This is a key lemma in establishing the bounds of all the |
609 | 2529 |
closed forms. |
2530 |
With this we are now ready to control the sizes of |
|
618 | 2531 |
$(r_1 \cdot r_2 )\backslash s$ and $r^* \backslash s$. |
2532 |
\begin{theorem}\label{rBound} |
|
593 | 2533 |
For any regex $r$, $\exists N_r. \forall s. \; \rsize{\rderssimp{r}{s}} \leq N_r$ |
558 | 2534 |
\end{theorem} |
2535 |
\noindent |
|
2536 |
\begin{proof} |
|
593 | 2537 |
We prove this by induction on $r$. The base cases for $\RZERO$, |
2538 |
$\RONE $ and $\RCHAR{c}$ are straightforward. |
|
2539 |
In the sequence $r_1 \cdot r_2$ case, |
|
2540 |
the inductive hypotheses state |
|
2541 |
$\exists N_1. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N_1$ and |
|
2542 |
$\exists N_2. \forall s. \; \llbracket \rderssimp{r_2}{s} \rrbracket \leq N_2$. |
|
562 | 2543 |
|
593 | 2544 |
When the string $s$ is not empty, we can reason as follows |
2545 |
% |
|
2546 |
\begin{center} |
|
2547 |
\begin{tabular}{lcll} |
|
558 | 2548 |
& & $ \llbracket \rderssimp{r_1\cdot r_2 }{s} \rrbracket_r $\\ |
620 | 2549 |
& $ = $ & $\llbracket \rsimp{(\sum(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \; |
2550 |
\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r})))} \rrbracket_r $ & (1) \\ |
|
2551 |
& $\leq$ & $\llbracket \rdistinct{(r_1 \backslash_{rsimps} s \cdot r_2 \; \; :: \; \; |
|
2552 |
\map \; (r_2\backslash_{rsimps} \_)\; (\vsuf{s}{r}))}{\varnothing} \rrbracket_r + 1$ & (2) \\ |
|
593 | 2553 |
& $\leq$ & $2 + N_1 + \rsize{r_2} + (N_2 * (card\;(\sizeNregex \; N_2)))$ & (3)\\ |
558 | 2554 |
\end{tabular} |
2555 |
\end{center} |
|
561 | 2556 |
\noindent |
618 | 2557 |
(1) is by theorem \ref{seqClosedForm}. |
561 | 2558 |
(2) is by \ref{altsSimpControl}. |
2559 |
(3) is by \ref{finiteSizeNCorollary}. |
|
562 | 2560 |
|
2561 |
||
2562 |
Combining the cases when $s = []$ and $s \neq []$, we get (4): |
|
2563 |
\begin{center} |
|
2564 |
\begin{tabular}{lcll} |
|
2565 |
$\rsize{(r_1 \cdot r_2) \backslash_r s}$ & $\leq$ & |
|
2566 |
$max \; (2 + N_1 + |
|
2567 |
\llbracket r_2 \rrbracket_r + |
|
2568 |
N_2 * (card\; (\sizeNregex \; N_2))) \; \rsize{r_1\cdot r_2}$ & (4) |
|
2569 |
\end{tabular} |
|
2570 |
\end{center} |
|
558 | 2571 |
|
562 | 2572 |
We reason similarly for $\STAR$. |
2573 |
The inductive hypothesis is |
|
2574 |
$\exists N. \forall s. \; \llbracket \rderssimp{r}{s} \rrbracket \leq N$. |
|
564 | 2575 |
Let $n_r = \llbracket r^* \rrbracket_r$. |
562 | 2576 |
When $s = c :: cs$ is not empty, |
2577 |
\begin{center} |
|
593 | 2578 |
\begin{tabular}{lcll} |
562 | 2579 |
& & $ \llbracket \rderssimp{r^* }{c::cs} \rrbracket_r $\\ |
620 | 2580 |
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; (\starupdates\; |
593 | 2581 |
cs \; r \; [[c]] )) )} \rrbracket_r $ & (5) \\ |
2582 |
& $\leq$ & $\llbracket |
|
2583 |
\rdistinct{ |
|
2584 |
(\map \; |
|
620 | 2585 |
(\lambda s. (r \backslash_{rsimps} s) \cdot r^*) \; |
593 | 2586 |
(\starupdates\; cs \; r \; [[c]] ) |
2587 |
)} |
|
562 | 2588 |
{\varnothing} \rrbracket_r + 1$ & (6) \\ |
2589 |
& $\leq$ & $1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
|
2590 |
* (1 + (N + n_r)) $ & (7)\\ |
|
2591 |
\end{tabular} |
|
2592 |
\end{center} |
|
2593 |
\noindent |
|
618 | 2594 |
(5) is by theorem \ref{starClosedForm}. |
562 | 2595 |
(6) is by \ref{altsSimpControl}. |
618 | 2596 |
(7) is by corollary \ref{finiteSizeNCorollary}. |
639 | 2597 |
Combining with the case when $s = []$, one obtains |
562 | 2598 |
\begin{center} |
593 | 2599 |
\begin{tabular}{lcll} |
2600 |
$\rsize{r^* \backslash_r s}$ & $\leq$ & $max \; n_r \; 1 + (\textit{card} (\sizeNregex \; (N + n_r))) |
|
2601 |
* (1 + (N + n_r)) $ & (8)\\ |
|
2602 |
\end{tabular} |
|
562 | 2603 |
\end{center} |
2604 |
\noindent |
|
2605 |
||
2606 |
The alternative case is slightly less involved. |
|
2607 |
The inductive hypothesis |
|
2608 |
is equivalent to $\exists N. \forall r \in (\map \; (\_ \backslash_r s) \; rs). \rsize{r} \leq N$. |
|
2609 |
In the case when $s = c::cs$, we have |
|
2610 |
\begin{center} |
|
593 | 2611 |
\begin{tabular}{lcll} |
562 | 2612 |
& & $ \llbracket \rderssimp{\sum rs }{c::cs} \rrbracket_r $\\ |
620 | 2613 |
& $ = $ & $\llbracket \rsimp{(\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) )} \rrbracket_r $ & (9) \\ |
2614 |
& $\leq$ & $\llbracket (\sum (\map \; (\_ \backslash_{rsimps} s) \; rs) ) \rrbracket_r $ & (10) \\ |
|
562 | 2615 |
& $\leq$ & $1 + N * (length \; rs) $ & (11)\\ |
593 | 2616 |
\end{tabular} |
562 | 2617 |
\end{center} |
2618 |
\noindent |
|
618 | 2619 |
(9) is by theorem \ref{altsClosedForm}, (10) by lemma \ref{rsimpMono} and (11) by inductive hypothesis. |
562 | 2620 |
|
639 | 2621 |
Combining with the case when $s = []$, we obtain |
562 | 2622 |
\begin{center} |
593 | 2623 |
\begin{tabular}{lcll} |
2624 |
$\rsize{\sum rs \backslash_r s}$ & $\leq$ & $max \; \rsize{\sum rs} \; 1+N*(length \; rs)$ |
|
2625 |
& (12)\\ |
|
2626 |
\end{tabular} |
|
562 | 2627 |
\end{center} |
618 | 2628 |
We have all the inductive cases proven. |
558 | 2629 |
\end{proof} |
2630 |
||
618 | 2631 |
This leads to our main result on the size bound: |
564 | 2632 |
\begin{corollary} |
618 | 2633 |
For any annotated regular expression $a$, $\exists N_r. \forall s. \; \rsize{\bderssimp{a}{s}} \leq N_r$ |
564 | 2634 |
\end{corollary} |
2635 |
\begin{proof} |
|
618 | 2636 |
By lemma \ref{sizeRelations} and theorem \ref{rBound}. |
564 | 2637 |
\end{proof} |
558 | 2638 |
\noindent |
2639 |
||
609 | 2640 |
|
2641 |
||
2642 |
||
2643 |
||
558 | 2644 |
%----------------------------------- |
2645 |
% SECTION 2 |
|
2646 |
%----------------------------------- |
|
2647 |
||
625 | 2648 |
\section{Bounded Repetitions} |
2649 |
We have promised in chapter \ref{Introduction} |
|
2650 |
that our lexing algorithm can potentially be extended |
|
2651 |
to handle bounded repetitions |
|
2652 |
in natural and elegant ways. |
|
2653 |
Now we fulfill our promise by adding support for |
|
2654 |
the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$. |
|
2655 |
We add clauses in our derivatives-based lexing algorithms (with simplifications) |
|
2656 |
introduced in chapter \ref{Bitcoded2}. |
|
2657 |
||
2658 |
\subsection{Augmented Definitions} |
|
2659 |
There are a number of definitions that need to be augmented. |
|
2660 |
The most notable one would be the POSIX rules for $r^{\{n\}}$: |
|
2661 |
\begin{center} |
|
2662 |
\begin{mathpar} |
|
2663 |
\inferrule{\forall v \in vs_1. \vdash v:r \land |
|
2664 |
|v| \neq []\\ \forall v \in vs_2. \vdash v:r \land |v| = []\\ |
|
2665 |
\textit{length} \; (vs_1 @ vs_2) = n}{\textit{Stars} \; |
|
2666 |
(vs_1 @ vs_2) : r^{\{n\}} } |
|
2667 |
\end{mathpar} |
|
2668 |
\end{center} |
|
2669 |
As Ausaf had pointed out \cite{Ausaf}, |
|
2670 |
sometimes empty iterations have to be taken to get |
|
2671 |
a match with exactly $n$ repetitions, |
|
2672 |
and hence the $vs_2$ part. |
|
2673 |
||
2674 |
Another important definition would be the size: |
|
2675 |
\begin{center} |
|
2676 |
\begin{tabular}{lcl} |
|
2677 |
$\llbracket r^{\{n\}} \rrbracket_r$ & $\dn$ & |
|
2678 |
$\llbracket r \rrbracket_r + n$\\ |
|
2679 |
\end{tabular} |
|
2680 |
\end{center} |
|
2681 |
\noindent |
|
2682 |
Arguably we should use $\log \; n$ for the size because |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2683 |
the number of digits increases logarithmically w.r.t $n$. |
625 | 2684 |
For simplicity we choose to add the counter directly to the size. |
2685 |
||
2686 |
The derivative w.r.t a bounded regular expression |
|
2687 |
is given as |
|
2688 |
\begin{center} |
|
2689 |
\begin{tabular}{lcl} |
|
2690 |
$r^{\{n\}} \backslash_r c$ & $\dn$ & |
|
2691 |
$r\backslash_r c \cdot r^{\{n-1\}} \;\; \textit{if} \; n \geq 1$\\ |
|
2692 |
& & $\RZERO \;\quad \quad\quad \quad |
|
2693 |
\textit{otherwise}$\\ |
|
2694 |
\end{tabular} |
|
2695 |
\end{center} |
|
2696 |
\noindent |
|
2697 |
For brevity, we sometimes use NTIMES to refer to bounded |
|
2698 |
regular expressions. |
|
2699 |
The $\mkeps$ function clause for NTIMES would be |
|
2700 |
\begin{center} |
|
2701 |
\begin{tabular}{lcl} |
|
2702 |
$\mkeps \; r^{\{n\}} $ & $\dn$ & $\Stars \; |
|
2703 |
(\textit{replicate} \; n\; (\mkeps \; r))$ |
|
2704 |
\end{tabular} |
|
2705 |
\end{center} |
|
2706 |
\noindent |
|
2707 |
The injection looks like |
|
2708 |
\begin{center} |
|
2709 |
\begin{tabular}{lcl} |
|
2710 |
$\inj \; r^{\{n\}} \; c\; (\Seq \;v \; (\Stars \; vs)) $ & |
|
2711 |
$\dn$ & $\Stars \; |
|
2712 |
((\inj \; r \;c \;v ) :: vs)$ |
|
2713 |
\end{tabular} |
|
2714 |
\end{center} |
|
2715 |
\noindent |
|
2716 |
||
2717 |
||
2718 |
\subsection{Proofs for the Augmented Lexing Algorithm} |
|
2719 |
We need to maintain two proofs with the additional $r^{\{n\}}$ |
|
2720 |
construct: the |
|
2721 |
correctness proof in chapter \ref{Bitcoded2}, |
|
2722 |
and the finiteness proof in chapter \ref{Finite}. |
|
2723 |
||
2724 |
\subsubsection{Correctness Proof Augmentation} |
|
2725 |
The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions |
|
2726 |
have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}. |
|
2727 |
As they have commented, once the definitions are in place, |
|
2728 |
the proofs given for the basic regular expressions will extend to |
|
2729 |
bounded regular expressions, and there are no ``surprises''. |
|
2730 |
We confirm this point because the correctness theorem would also |
|
2731 |
extend without surprise to $\blexersimp$. |
|
2732 |
The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on |
|
2733 |
do not need to be changed, |
|
2734 |
and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2735 |
add one more line which can be solved by the Sledgehammer tool |
625 | 2736 |
to solve the $r^{\{n\}}$ inductive case. |
2737 |
||
2738 |
||
2739 |
\subsubsection{Finiteness Proof Augmentation} |
|
2740 |
The bounded repetitions are |
|
2741 |
very similar to stars, and therefore the treatment |
|
2742 |
is similar, with minor changes to handle some slight complications |
|
2743 |
when the counter reaches 0. |
|
2744 |
The exponential growth is similar: |
|
2745 |
\begin{center} |
|
2746 |
\begin{tabular}{ll} |
|
2747 |
$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\ |
|
2748 |
$(r\backslash c) \cdot |
|
2749 |
r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\ |
|
2750 |
\\ |
|
2751 |
$r \backslash cc' \cdot r^{\{n - 2\}}* + |
|
2752 |
r \backslash c' \cdot r^{\{n - 1\}}*$ & |
|
2753 |
$\longrightarrow_{\backslash c''}$\\ |
|
2754 |
\\ |
|
2755 |
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + |
|
2756 |
r \backslash c''\cdot r^{\{n-1\}}) + |
|
2757 |
(r \backslash c'c'' \cdot r^{\{n-2\}}* + |
|
2758 |
r \backslash c'' \cdot r^{\{n-1\}}*)$ & |
|
2759 |
$\longrightarrow_{\backslash c'''}$ \\ |
|
2760 |
\\ |
|
2761 |
$\ldots$\\ |
|
2762 |
\end{tabular} |
|
2763 |
\end{center} |
|
2764 |
Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on |
|
2765 |
are all nullable. |
|
2766 |
The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$ |
|
2767 |
\begin{center} |
|
2768 |
$[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\; |
|
2769 |
r \backslash c''\cdot r^{\{n-1\}}, \; |
|
2770 |
r \backslash c'c'' \cdot r^{\{n-2\}}*, \; |
|
2771 |
r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$ |
|
2772 |
\end{center} |
|
2773 |
that comes from |
|
2774 |
\begin{center} |
|
2775 |
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* + |
|
2776 |
r \backslash c''\cdot r^{\{n-1\}}) + |
|
2777 |
(r \backslash c'c'' \cdot r^{\{n-2\}}* + |
|
2778 |
r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$ |
|
2779 |
\end{center} |
|
2780 |
are made of sequences with different tails, where the counters |
|
2781 |
might differ. |
|
2782 |
The observation for maintaining the bound is that |
|
2783 |
these counters never exceed $n$, the original |
|
2784 |
counter. With the number of counters staying finite, |
|
2785 |
$\rDistinct$ will deduplicate and keep the list finite. |
|
2786 |
We introduce this idea as a lemma once we describe all |
|
2787 |
the necessary helper functions. |
|
2788 |
||
2789 |
Similar to the star case, we want |
|
2790 |
\begin{center} |
|
2791 |
$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$. |
|
2792 |
\end{center} |
|
2793 |
where $rs$ |
|
2794 |
shall be in the form of |
|
2795 |
$\map \; f \; Ss$, where $f$ is a function and |
|
2796 |
$Ss$ a list of objects to act on. |
|
2797 |
For star, the object's datatype is string. |
|
2798 |
The list of strings $Ss$ |
|
2799 |
is generated using functions |
|
2800 |
$\starupdate$ and $\starupdates$. |
|
2801 |
The function that takes a string and returns a regular expression |
|
2802 |
is the anonymous function $ |
|
2803 |
(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$. |
|
2804 |
In the NTIMES setting, |
|
2805 |
the $\starupdate$ and $\starupdates$ functions are replaced by |
|
2806 |
$\textit{nupdate}$ and $\textit{nupdates}$: |
|
2807 |
\begin{center} |
|
2808 |
\begin{tabular}{lcl} |
|
2809 |
$\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\ |
|
2810 |
$\nupdate \; c \; r \; |
|
2811 |
(\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\ |
|
2812 |
$\textit{if} \; |
|
2813 |
(\rnullable \; (r \backslash_{rs} s))$ \\ |
|
2814 |
& & $\;\;\textit{then} |
|
2815 |
\;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: ( |
|
2816 |
\nupdate \; c \; r \; Ss)$ \\ |
|
2817 |
& & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: ( |
|
2818 |
\nupdate \; c \; r \; Ss)$\\ |
|
2819 |
$\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ & |
|
2820 |
$(\None :: \nupdate \; c \; r \; Ss)$\\ |
|
2821 |
& & \\ |
|
2822 |
%\end{tabular} |
|
2823 |
%\end{center} |
|
2824 |
%\begin{center} |
|
2825 |
%\begin{tabular}{lcl} |
|
2826 |
$\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\ |
|
2827 |
$\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; ( |
|
2828 |
\nupdate \; c \; r \; Ss)$ |
|
2829 |
\end{tabular} |
|
2830 |
\end{center} |
|
2831 |
\noindent |
|
2832 |
which take into account when a subterm |
|
2833 |
\begin{center} |
|
2834 |
$r \backslash_s s \cdot r^{\{n\}}$ |
|
2835 |
\end{center} |
|
2836 |
counter $n$ |
|
2837 |
is 0, and therefore expands to |
|
2838 |
\begin{center} |
|
2839 |
$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+ |
|
2840 |
\; \ZERO$ |
|
2841 |
\end{center} |
|
2842 |
after taking a derivative. |
|
2843 |
The object now has type |
|
2844 |
\begin{center} |
|
2845 |
$\textit{option} \;(\textit{string}, \textit{nat})$ |
|
2846 |
\end{center} |
|
2847 |
and therefore the function for converting such an option into |
|
2848 |
a regular expression term is called $\opterm$: |
|
2849 |
||
2850 |
\begin{center} |
|
2851 |
\begin{tabular}{lcl} |
|
2852 |
$\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2853 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2854 |
(r\backslash_{rs} s)\cdot r^{\{n\}}$\\ |
|
2855 |
& & $\;\;\None \Rightarrow |
|
2856 |
\ZERO$\\ |
|
2857 |
\end{tabular} |
|
2858 |
\end{center} |
|
2859 |
\noindent |
|
2860 |
Put together, the list $\map \; f \; Ss$ is instantiated as |
|
2861 |
\begin{center} |
|
2862 |
$\map \; (\opterm \; r) \; (\nupdates \; s \; r \; |
|
2863 |
[\Some \; ([c], n)])$. |
|
2864 |
\end{center} |
|
2865 |
For the closed form to be bounded, we would like |
|
2866 |
simplification to be applied to each term in the list. |
|
2867 |
Therefore we introduce some variants of $\opterm$, |
|
2868 |
which help conveniently express the rewriting steps |
|
2869 |
needed in the closed form proof. |
|
639 | 2870 |
We have $\optermOsimp$, $\optermosimp$ and $\optermsimp$ |
2871 |
with slightly different spellings because they help the proof to go through: |
|
625 | 2872 |
\begin{center} |
2873 |
\begin{tabular}{lcl} |
|
2874 |
$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2875 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2876 |
\textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\ |
|
2877 |
& & $\;\;\None \Rightarrow |
|
2878 |
\ZERO$\\ |
|
2879 |
\\ |
|
2880 |
$\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2881 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2882 |
(\textit{rsimp} \; (r\backslash_{rs} s)) |
|
2883 |
\cdot r^{\{n\}}$\\ |
|
2884 |
& & $\;\;\None \Rightarrow |
|
2885 |
\ZERO$\\ |
|
2886 |
\\ |
|
2887 |
$\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\ |
|
2888 |
& & $\;\;\Some \; (s, n) \Rightarrow |
|
2889 |
(r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\ |
|
2890 |
& & $\;\;\None \Rightarrow |
|
2891 |
\ZERO$\\ |
|
2892 |
\end{tabular} |
|
2893 |
\end{center} |
|
2894 |
||
2895 |
||
2896 |
For a list of |
|
2897 |
$\textit{option} \;(\textit{string}, \textit{nat})$ elements, |
|
2898 |
we define the highest power for it recursively: |
|
2899 |
\begin{center} |
|
2900 |
\begin{tabular}{lcl} |
|
2901 |
$\hpa \; [] \; n $ & $\dn$ & $n$\\ |
|
2902 |
$\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\ |
|
2903 |
$\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ & |
|
2904 |
$\hpa \;os \; (\textit{max} \; n\; m)$\\ |
|
2905 |
\\ |
|
2906 |
$\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\ |
|
2907 |
\end{tabular} |
|
2908 |
\end{center} |
|
2909 |
||
2910 |
Now the intuition that an NTIMES regular expression's power |
|
2911 |
does not increase can be easily expressed as |
|
2912 |
\begin{lemma}\label{nupdatesMono2} |
|
2913 |
$\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$ |
|
2914 |
\end{lemma} |
|
2915 |
\begin{proof} |
|
2916 |
Note that the power is non-increasing after a $\nupdate$ application: |
|
2917 |
\begin{center} |
|
2918 |
$\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq |
|
2919 |
\hpa\; \; Ss \; m$. |
|
2920 |
\end{center} |
|
2921 |
This is also the case for $\nupdates$: |
|
2922 |
\begin{center} |
|
2923 |
$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq |
|
2924 |
\hpa\; \; Ss \; m$. |
|
2925 |
\end{center} |
|
2926 |
Therefore we have that |
|
2927 |
\begin{center} |
|
2928 |
$\hpower \;\; (\nupdates \; s \; r \; Ss) \leq |
|
2929 |
\hpower \;\; Ss$ |
|
2930 |
\end{center} |
|
2931 |
which leads to the lemma being proven. |
|
2932 |
||
2933 |
\end{proof} |
|
2934 |
||
2935 |
||
2936 |
We also define the inductive rules for |
|
2937 |
the shape of derivatives of the NTIMES regular expressions:\\[-3em] |
|
2938 |
\begin{center} |
|
2939 |
\begin{mathpar} |
|
2940 |
\inferrule{\mbox{}}{\cbn \;\ZERO} |
|
2941 |
||
2942 |
\inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})} |
|
2943 |
||
2944 |
\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2} |
|
2945 |
||
2946 |
\inferrule{\cbn \; r}{\cbn \; r + \ZERO} |
|
2947 |
\end{mathpar} |
|
2948 |
\end{center} |
|
2949 |
\noindent |
|
2950 |
A derivative of NTIMES fits into the shape described by $\cbn$: |
|
2951 |
\begin{lemma}\label{ntimesDersCbn} |
|
2952 |
$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds. |
|
2953 |
\end{lemma} |
|
2954 |
\begin{proof} |
|
2955 |
By a reverse induction on $s$. |
|
2956 |
For the inductive case, note that if $\cbn \; r$ holds, |
|
2957 |
then $\cbn \; (r\backslash_r c)$ holds. |
|
2958 |
\end{proof} |
|
2959 |
\noindent |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
2960 |
In addition, for $\cbn$-shaped regular expressions, one can flatten |
625 | 2961 |
them: |
2962 |
\begin{lemma}\label{ntimesHfauPushin} |
|
2963 |
If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} = |
|
2964 |
\textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \; |
|
2965 |
(\hflataux{r})})$ |
|
2966 |
\end{lemma} |
|
2967 |
\begin{proof} |
|
2968 |
By an induction on the inductive cases of $\cbn$. |
|
2969 |
\end{proof} |
|
2970 |
\noindent |
|
2971 |
This time we do not need to define the flattening functions for NTIMES only, |
|
2972 |
because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already. |
|
2973 |
\begin{lemma}\label{ntimesHfauInduct} |
|
2974 |
$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} = |
|
2975 |
\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$ |
|
2976 |
\end{lemma} |
|
2977 |
\begin{proof} |
|
2978 |
By a reverse induction on $s$. |
|
2979 |
The lemmas \ref{ntimesHfauPushin} and \ref{ntimesDersCbn} are used. |
|
2980 |
\end{proof} |
|
2981 |
\noindent |
|
2982 |
We have a recursive property for NTIMES with $\nupdate$ |
|
2983 |
similar to that for STAR, |
|
2984 |
and one for $\nupdates $ as well: |
|
2985 |
\begin{lemma}\label{nupdateInduct1} |
|
2986 |
\mbox{} |
|
2987 |
\begin{itemize} |
|
2988 |
\item |
|
2989 |
\begin{center} |
|
2990 |
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( |
|
2991 |
\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \; |
|
2992 |
c \; r \; Ss)$\\ |
|
2993 |
\end{center} |
|
2994 |
holds. |
|
2995 |
\item |
|
2996 |
\begin{center} |
|
2997 |
$\textit{concat} \; (\map \; \hflataux{\_}\; |
|
2998 |
\map \; (\_\backslash_r x) \; |
|
2999 |
(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss)))$\\ |
|
3000 |
$=$\\ |
|
3001 |
$\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ |
|
3002 |
\end{center} |
|
3003 |
holds. |
|
3004 |
\end{itemize} |
|
3005 |
\end{lemma} |
|
3006 |
\begin{proof} |
|
3007 |
(i) is by an induction on $Ss$. |
|
3008 |
(ii) is by an induction on $xs$. |
|
3009 |
\end{proof} |
|
3010 |
\noindent |
|
3011 |
The $\nString$ predicate is defined for conveniently |
|
3012 |
expressing that there are no empty strings in the |
|
3013 |
$\Some \;(s, n)$ elements generated by $\nupdate$: |
|
3014 |
\begin{center} |
|
3015 |
\begin{tabular}{lcl} |
|
3016 |
$\nString \; \None$ & $\dn$ & $ \textit{true}$\\ |
|
3017 |
$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\ |
|
3018 |
$\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\ |
|
3019 |
\end{tabular} |
|
3020 |
\end{center} |
|
3021 |
\begin{lemma}\label{nupdatesNonempty} |
|
3022 |
If for all elements $o \in \textit{set} \; Ss$, |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3023 |
$\nString \; o$ holds, then we have that |
625 | 3024 |
for all elements $o' \in \textit{set} \; (\nupdates \; s \; r \; Ss)$, |
3025 |
$\nString \; o'$ holds. |
|
3026 |
\end{lemma} |
|
3027 |
\begin{proof} |
|
3028 |
By an induction on $s$, where $Ss$ is set to vary over all possible values. |
|
3029 |
\end{proof} |
|
3030 |
||
3031 |
\noindent |
|
3032 |
||
3033 |
\begin{lemma}\label{ntimesClosedFormsSteps} |
|
3034 |
The following list of equalities or rewriting relations hold:\\ |
|
3035 |
(i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) = |
|
3036 |
\textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \; |
|
3037 |
s \; r \; [\Some \; ([c], n)])))$\\ |
|
3038 |
(ii) |
|
3039 |
\begin{center} |
|
3040 |
$\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [ |
|
3041 |
\Some \; ([c], n)]))$ \\ $ \sequal$\\ |
|
3042 |
$\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \; |
|
3043 |
s\;r \; [\Some \; ([c], n)]))$\\ |
|
3044 |
\end{center} |
|
3045 |
(iii) |
|
3046 |
\begin{center} |
|
3047 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; |
|
3048 |
([c], n)]))$\\ |
|
3049 |
$\sequal$\\ |
|
3050 |
$\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3051 |
([c], n)])) $\\ |
|
3052 |
\end{center} |
|
3053 |
(iv) |
|
3054 |
\begin{center} |
|
3055 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \; |
|
3056 |
([c], n)])) $ \\ $\sequal$\\ |
|
3057 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3058 |
([c], n)])) $\\ |
|
3059 |
\end{center} |
|
3060 |
(v) |
|
3061 |
\begin{center} |
|
3062 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \; |
|
3063 |
([c], n)])) $ \\ $\sequal$\\ |
|
3064 |
$\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \; |
|
3065 |
(\nupdates \; s \; r \; [\Some \; ([c], n)]))$ |
|
3066 |
\end{center} |
|
3067 |
\end{lemma} |
|
3068 |
\begin{proof} |
|
3069 |
Routine. |
|
3070 |
(iii) and (iv) make use of the fact that all the strings $s$ |
|
3071 |
inside $\Some \; (s, m)$ which are elements of the list |
|
3072 |
$\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty, |
|
3073 |
which is from lemma \ref{nupdatesNonempty}. |
|
3074 |
Once the string in $o = \Some \; (s, n)$ is |
|
3075 |
nonempty, $\optermsimp \; r \;o$, |
|
3076 |
$\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed |
|
3077 |
to be equal. |
|
3078 |
(v) uses \ref{nupdateInduct1}. |
|
3079 |
\end{proof} |
|
3080 |
\noindent |
|
3081 |
Now we are ready to present the closed form for NTIMES: |
|
3082 |
\begin{theorem}\label{ntimesClosedForm} |
|
3083 |
The derivative of $r^{\{n+1\}}$ can be described as an alternative |
|
3084 |
containing a list |
|
3085 |
of terms:\\ |
|
3086 |
$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; ( |
|
3087 |
\sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \; |
|
3088 |
[\Some \; ([c], n)])))$ |
|
3089 |
\end{theorem} |
|
3090 |
\begin{proof} |
|
3091 |
By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}. |
|
3092 |
\end{proof} |
|
3093 |
\noindent |
|
3094 |
The key observation for bounding this closed form |
|
3095 |
is that the counter on $r^{\{n\}}$ will |
|
3096 |
only decrement during derivatives: |
|
3097 |
\begin{lemma}\label{nupdatesNLeqN} |
|
3098 |
For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \; |
|
3099 |
[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some |
|
3100 |
\; (s', m)$ for some string $s'$ and number $m \leq n$. |
|
3101 |
\end{lemma} |
|
3102 |
\noindent |
|
3103 |
The proof is routine and therefore omitted. |
|
3104 |
This allows us to say what kind of terms |
|
3105 |
are in the list $\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
|
3106 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$: |
|
3107 |
only $\ZERO_r$s or a sequence with the tail an $r^{\{m\}}$ |
|
3108 |
with a small $m$: |
|
3109 |
\begin{lemma}\label{ntimesClosedFormListElemShape} |
|
3110 |
For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
|
3111 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$, |
|
3112 |
we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot |
|
3113 |
r^{\{m\}}$ for some string $s'$ and number $m \leq n$. |
|
3114 |
\end{lemma} |
|
3115 |
\begin{proof} |
|
3116 |
Using lemma \ref{nupdatesNLeqN}. |
|
3117 |
\end{proof} |
|
3118 |
||
3119 |
\begin{theorem}\label{ntimesClosedFormBounded} |
|
3120 |
Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s |
|
3121 |
\rrbracket_r \leq N$ holds, then we have that\\ |
|
3122 |
$\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq |
|
3123 |
\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$, |
|
3124 |
where $c_N = \textit{card} \; (\textit{sizeNregex} \; ( |
|
3125 |
N + \llbracket r^{\{n\}} \rrbracket_r+1))$. |
|
3126 |
\end{theorem} |
|
3127 |
\begin{proof} |
|
639 | 3128 |
We have that for all regular expressions $r'$ in |
3129 |
\begin{center} |
|
3130 |
$\textit{set} \; (\map \; (\optermsimp \; r) \; ( |
|
625 | 3131 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$, |
639 | 3132 |
\end{center} |
625 | 3133 |
$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}} |
3134 |
\rrbracket_r + 1$ |
|
3135 |
because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot |
|
3136 |
r^{\{m\}}$ for some string $s'$ and number |
|
3137 |
$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}). |
|
3138 |
In addition, we know that the list |
|
3139 |
$\map \; (\optermsimp \; r) \; ( |
|
3140 |
\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most |
|
3141 |
$c_N = \textit{card} \; |
|
3142 |
(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$. |
|
3143 |
This gives us $\llbracket r \backslash_{rsimps} \;s \rrbracket_r |
|
3144 |
\leq N * c_N$. |
|
3145 |
\end{proof} |
|
3146 |
||
3147 |
We aim to formalise the correctness and size bound |
|
3148 |
for constructs like $r^{\{\ldots n\}}$, $r^{\{n \ldots\}}$ |
|
3149 |
and so on, which is still work in progress. |
|
3150 |
They should more or less follow the same recipe described in this section. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3151 |
Once we know how to deal with them recursively using suitable auxiliary |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3152 |
definitions, we can routinely establish the proofs. |
625 | 3153 |
|
532 | 3154 |
|
557 | 3155 |
%---------------------------------------------------------------------------------------- |
3156 |
% SECTION 3 |
|
3157 |
%---------------------------------------------------------------------------------------- |
|
3158 |
||
532 | 3159 |
|
618 | 3160 |
\section{Comments and Future Improvements} |
3161 |
\subsection{Some Experimental Results} |
|
532 | 3162 |
What guarantee does this bound give us? |
639 | 3163 |
It states that whatever the regex is, it will not grow indefinitely. |
532 | 3164 |
Take our previous example $(a + aa)^*$ as an example: |
3165 |
\begin{center} |
|
593 | 3166 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}} |
3167 |
\begin{tikzpicture} |
|
3168 |
\begin{axis}[ |
|
3169 |
xlabel={number of $a$'s}, |
|
3170 |
x label style={at={(1.05,-0.05)}}, |
|
3171 |
ylabel={regex size}, |
|
3172 |
enlargelimits=false, |
|
3173 |
xtick={0,5,...,30}, |
|
3174 |
xmax=33, |
|
3175 |
ymax= 40, |
|
3176 |
ytick={0,10,...,40}, |
|
3177 |
scaled ticks=false, |
|
3178 |
axis lines=left, |
|
3179 |
width=5cm, |
|
3180 |
height=4cm, |
|
3181 |
legend entries={$(a + aa)^*$}, |
|
618 | 3182 |
legend pos=south east, |
593 | 3183 |
legend cell align=left] |
3184 |
\addplot[red,mark=*, mark options={fill=white}] table {a_aa_star.data}; |
|
3185 |
\end{axis} |
|
3186 |
\end{tikzpicture} |
|
3187 |
\end{tabular} |
|
532 | 3188 |
\end{center} |
3189 |
We are able to limit the size of the regex $(a + aa)^*$'s derivatives |
|
593 | 3190 |
with our simplification |
532 | 3191 |
rules very effectively. |
3192 |
||
3193 |
||
3194 |
In our proof for the inductive case $r_1 \cdot r_2$, the dominant term in the bound |
|
3195 |
is $l_{N_2} * N_2$, where $N_2$ is the bound we have for $\llbracket \bderssimp{r_2}{s} \rrbracket$. |
|
3196 |
Given that $l_{N_2}$ is roughly the size $4^{N_2}$, the size bound $\llbracket \bderssimp{r_1 \cdot r_2}{s} \rrbracket$ |
|
3197 |
inflates the size bound of $\llbracket \bderssimp{r_2}{s} \rrbracket$ with the function |
|
3198 |
$f(x) = x * 2^x$. |
|
3199 |
This means the bound we have will surge up at least |
|
3200 |
tower-exponentially with a linear increase of the depth. |
|
3201 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3202 |
One might be pretty skepticafl about what this non-elementary |
618 | 3203 |
bound can bring us. |
3204 |
It turns out that the giant bounds are far from being hit. |
|
3205 |
Here we have some test data from randomly generated regular expressions: |
|
3206 |
\begin{figure}[H] |
|
3207 |
\begin{tabular}{@{}c@{\hspace{2mm}}c@{\hspace{0mm}}c@{}} |
|
593 | 3208 |
\begin{tikzpicture} |
3209 |
\begin{axis}[ |
|
618 | 3210 |
xlabel={$n$}, |
593 | 3211 |
x label style={at={(1.05,-0.05)}}, |
3212 |
ylabel={regex size}, |
|
3213 |
enlargelimits=false, |
|
3214 |
xtick={0,5,...,30}, |
|
3215 |
xmax=33, |
|
611 | 3216 |
%ymax=1000, |
3217 |
%ytick={0,100,...,1000}, |
|
593 | 3218 |
scaled ticks=false, |
3219 |
axis lines=left, |
|
618 | 3220 |
width=4.75cm, |
3221 |
height=3.8cm, |
|
593 | 3222 |
legend entries={regex1}, |
618 | 3223 |
legend pos=north east, |
593 | 3224 |
legend cell align=left] |
3225 |
\addplot[red,mark=*, mark options={fill=white}] table {regex1_size_change.data}; |
|
3226 |
\end{axis} |
|
3227 |
\end{tikzpicture} |
|
618 | 3228 |
& |
593 | 3229 |
\begin{tikzpicture} |
3230 |
\begin{axis}[ |
|
3231 |
xlabel={$n$}, |
|
3232 |
x label style={at={(1.05,-0.05)}}, |
|
3233 |
%ylabel={time in secs}, |
|
3234 |
enlargelimits=false, |
|
3235 |
xtick={0,5,...,30}, |
|
3236 |
xmax=33, |
|
611 | 3237 |
%ymax=1000, |
3238 |
%ytick={0,100,...,1000}, |
|
593 | 3239 |
scaled ticks=false, |
3240 |
axis lines=left, |
|
618 | 3241 |
width=4.75cm, |
3242 |
height=3.8cm, |
|
593 | 3243 |
legend entries={regex2}, |
618 | 3244 |
legend pos=south east, |
593 | 3245 |
legend cell align=left] |
3246 |
\addplot[blue,mark=*, mark options={fill=white}] table {regex2_size_change.data}; |
|
3247 |
\end{axis} |
|
3248 |
\end{tikzpicture} |
|
618 | 3249 |
& |
593 | 3250 |
\begin{tikzpicture} |
3251 |
\begin{axis}[ |
|
3252 |
xlabel={$n$}, |
|
3253 |
x label style={at={(1.05,-0.05)}}, |
|
3254 |
%ylabel={time in secs}, |
|
3255 |
enlargelimits=false, |
|
3256 |
xtick={0,5,...,30}, |
|
3257 |
xmax=33, |
|
611 | 3258 |
%ymax=1000, |
3259 |
%ytick={0,100,...,1000}, |
|
593 | 3260 |
scaled ticks=false, |
3261 |
axis lines=left, |
|
618 | 3262 |
width=4.75cm, |
3263 |
height=3.8cm, |
|
593 | 3264 |
legend entries={regex3}, |
618 | 3265 |
legend pos=south east, |
593 | 3266 |
legend cell align=left] |
3267 |
\addplot[cyan,mark=*, mark options={fill=white}] table {regex3_size_change.data}; |
|
3268 |
\end{axis} |
|
3269 |
\end{tikzpicture}\\ |
|
618 | 3270 |
\multicolumn{3}{c}{} |
593 | 3271 |
\end{tabular} |
618 | 3272 |
\caption{Graphs: size change of 3 randomly generated |
3273 |
regular expressions $w.r.t.$ input string length. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3274 |
The x-axis represents the length of the input.} |
611 | 3275 |
\end{figure} |
532 | 3276 |
\noindent |
3277 |
Most of the regex's sizes seem to stay within a polynomial bound $w.r.t$ the |
|
3278 |
original size. |
|
591 | 3279 |
We will discuss improvements to this bound in the next chapter. |
532 | 3280 |
|
3281 |
||
3282 |
||
618 | 3283 |
\subsection{Possible Further Improvements} |
639 | 3284 |
There are two problems with this finiteness result, though:\\ |
618 | 3285 |
(i) |
3286 |
First, it is not yet a direct formalisation of our lexer's complexity, |
|
593 | 3287 |
as a complexity proof would require looking into |
3288 |
the time it takes to execute {\bf all} the operations |
|
618 | 3289 |
involved in the lexer (simp, collect, decode), not just the derivative.\\ |
3290 |
(ii) |
|
593 | 3291 |
Second, the bound is not yet tight, and we seek to improve $N_a$ so that |
618 | 3292 |
it is polynomial on $\llbracket a \rrbracket$.\\ |
3293 |
Still, we believe this contribution is useful, |
|
590 | 3294 |
because |
3295 |
\begin{itemize} |
|
3296 |
\item |
|
3297 |
||
618 | 3298 |
The size proof can serve as a starting point for a complexity |
590 | 3299 |
formalisation. |
3300 |
Derivatives are the most important phases of our lexer algorithm. |
|
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3301 |
Size properties about derivatives cover the majority of the algorithm |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3302 |
and is therefore a good indication of the complexity of the entire program. |
590 | 3303 |
\item |
3304 |
The bound is already a strong indication that catastrophic |
|
3305 |
backtracking is much less likely to occur in our $\blexersimp$ |
|
3306 |
algorithm. |
|
3307 |
We refine $\blexersimp$ with $\blexerStrong$ in the next chapter |
|
639 | 3308 |
so that we conjecture the bound becomes polynomial. |
590 | 3309 |
\end{itemize} |
593 | 3310 |
|
532 | 3311 |
%---------------------------------------------------------------------------------------- |
3312 |
% SECTION 4 |
|
3313 |
%---------------------------------------------------------------------------------------- |
|
593 | 3314 |
|
3315 |
||
3316 |
||
3317 |
||
3318 |
||
3319 |
||
3320 |
||
3321 |
||
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3322 |
One might wonder about the actual bound rather than the loose bound we gave |
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3323 |
for the convenience of a more straightforward proof. |
532 | 3324 |
How much can the regex $r^* \backslash s$ grow? |
3325 |
As earlier graphs have shown, |
|
3326 |
%TODO: reference that graph where size grows quickly |
|
593 | 3327 |
they can grow at a maximum speed |
3328 |
exponential $w.r.t$ the number of characters, |
|
532 | 3329 |
but will eventually level off when the string $s$ is long enough. |
3330 |
If they grow to a size exponential $w.r.t$ the original regex, our algorithm |
|
3331 |
would still be slow. |
|
3332 |
And unfortunately, we have concrete examples |
|
576 | 3333 |
where such regular expressions grew exponentially large before levelling off: |
618 | 3334 |
\begin{center} |
532 | 3335 |
$(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
618 | 3336 |
(\underbrace{a \ldots a}_{\text{n a's}})^*)^*$ |
3337 |
\end{center} |
|
3338 |
will already have a maximum |
|
593 | 3339 |
size that is exponential on the number $n$ |
532 | 3340 |
under our current simplification rules: |
3341 |
%TODO: graph of a regex whose size increases exponentially. |
|
3342 |
\begin{center} |
|
593 | 3343 |
\begin{tikzpicture} |
3344 |
\begin{axis}[ |
|
3345 |
height=0.5\textwidth, |
|
3346 |
width=\textwidth, |
|
3347 |
xlabel=number of a's, |
|
3348 |
xtick={0,...,9}, |
|
3349 |
ylabel=maximum size, |
|
3350 |
ymode=log, |
|
3351 |
log basis y={2} |
|
3352 |
] |
|
3353 |
\addplot[mark=*,blue] table {re-chengsong.data}; |
|
3354 |
\end{axis} |
|
3355 |
\end{tikzpicture} |
|
532 | 3356 |
\end{center} |
3357 |
||
618 | 3358 |
For convenience we use $(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
532 | 3359 |
to express $(a ^ * + (aa) ^ * + (aaa) ^ * + \ldots + |
3360 |
(\underbrace{a \ldots a}_{\text{n a's}})^*$ in the below discussion. |
|
3361 |
The exponential size is triggered by that the regex |
|
618 | 3362 |
$\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$ |
532 | 3363 |
inside the $(\ldots) ^*$ having exponentially many |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3364 |
different derivatives, despite those differences being minor. |
618 | 3365 |
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*\backslash \underbrace{a \ldots a}_{\text{m a's}}$ |
532 | 3366 |
will therefore contain the following terms (after flattening out all nested |
3367 |
alternatives): |
|
3368 |
\begin{center} |
|
618 | 3369 |
$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot (\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)$\\ |
3370 |
[1mm] |
|
593 | 3371 |
$(1 \leq m' \leq m )$ |
532 | 3372 |
\end{center} |
639 | 3373 |
There are at least exponentially |
618 | 3374 |
many such terms.\footnote{To be exact, these terms are |
3375 |
distinct for $m' \leq L.C.M.(1, \ldots, n)$, the details are omitted, |
|
3376 |
but the point is that the number is exponential. |
|
3377 |
} |
|
593 | 3378 |
With each new input character taking the derivative against the intermediate result, more and more such distinct |
618 | 3379 |
terms will accumulate. |
3380 |
The function $\textit{distinctBy}$ will not be able to de-duplicate any two of these terms |
|
3381 |
\begin{center} |
|
3382 |
$(\sum_{i = 1}^{n} |
|
3383 |
(\underbrace{a \ldots a}_{\text{((i - (m' \% i))\%i) a's}})\cdot |
|
3384 |
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot |
|
3385 |
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$\\ |
|
3386 |
$(\sum_{i = 1}^{n} (\underbrace{a \ldots a}_{\text{((i - (m'' \% i))\%i) a's}})\cdot |
|
3387 |
(\underbrace{a \ldots a}_{\text{i a's}})^* )\cdot |
|
3388 |
(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$ |
|
3389 |
\end{center} |
|
3390 |
\noindent |
|
3391 |
where $m' \neq m''$ |
|
593 | 3392 |
as they are slightly different. |
3393 |
This means that with our current simplification methods, |
|
3394 |
we will not be able to control the derivative so that |
|
618 | 3395 |
$\llbracket \bderssimp{r}{s} \rrbracket$ stays polynomial. %\leq O((\llbracket r\rrbacket)^c)$ |
593 | 3396 |
These terms are similar in the sense that the head of those terms |
3397 |
are all consisted of sub-terms of the form: |
|
3398 |
$(\underbrace{a \ldots a}_{\text{j a's}})\cdot (\underbrace{a \ldots a}_{\text{i a's}})^* $. |
|
618 | 3399 |
For $\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*$, there will be at most |
593 | 3400 |
$n * (n + 1) / 2$ such terms. |
3401 |
For example, $(a^* + (aa)^* + (aaa)^*) ^*$'s derivatives |
|
3402 |
can be described by 6 terms: |
|
3403 |
$a^*$, $a\cdot (aa)^*$, $ (aa)^*$, |
|
3404 |
$aa \cdot (aaa)^*$, $a \cdot (aaa)^*$, and $(aaa)^*$. |
|
532 | 3405 |
The total number of different "head terms", $n * (n + 1) / 2$, |
593 | 3406 |
is proportional to the number of characters in the regex |
618 | 3407 |
$(\sum_{i=1}^{n} (\underbrace{a \ldots a}_{\text{i a's}})^*)^*$. |
3408 |
If we can improve our deduplication process so that it becomes smarter |
|
3409 |
and only keep track of these $n * (n+1) /2$ terms, then we can keep |
|
3410 |
the size growth polynomial again. |
|
3411 |
This example also suggests a slightly different notion of size, which we call the |
|
532 | 3412 |
alphabetic width: |
618 | 3413 |
\begin{center} |
3414 |
\begin{tabular}{lcl} |
|
3415 |
$\textit{awidth} \; \ZERO$ & $\dn$ & $0$\\ |
|
3416 |
$\textit{awidth} \; \ONE$ & $\dn$ & $0$\\ |
|
3417 |
$\textit{awidth} \; c$ & $\dn$ & $1$\\ |
|
3418 |
$\textit{awidth} \; r_1 + r_2$ & $\dn$ & $\textit{awidth} \; |
|
3419 |
r_1 + \textit{awidth} \; r_2$\\ |
|
3420 |
$\textit{awidth} \; r_1 \cdot r_2$ & $\dn$ & $\textit{awidth} \; |
|
3421 |
r_1 + \textit{awidth} \; r_2$\\ |
|
3422 |
$\textit{awidth} \; r^*$ & $\dn$ & $\textit{awidth} \; r$\\ |
|
3423 |
\end{tabular} |
|
3424 |
\end{center} |
|
3425 |
||
532 | 3426 |
|
593 | 3427 |
|
532 | 3428 |
Antimirov\parencite{Antimirov95} has proven that |
618 | 3429 |
$\textit{PDER}_{UNIV}(r) \leq \textit{awidth}(r)$, |
532 | 3430 |
where $\textit{PDER}_{UNIV}(r)$ is a set of all possible subterms |
3431 |
created by doing derivatives of $r$ against all possible strings. |
|
3432 |
If we can make sure that at any moment in our lexing algorithm our |
|
3433 |
intermediate result hold at most one copy of each of the |
|
3434 |
subterms then we can get the same bound as Antimirov's. |
|
3435 |
This leads to the algorithm in the next chapter. |
|
3436 |
||
3437 |
||
3438 |
||
3439 |
||
3440 |
||
3441 |
%---------------------------------------------------------------------------------------- |
|
3442 |
% SECTION 1 |
|
3443 |
%---------------------------------------------------------------------------------------- |
|
3444 |
||
3445 |
||
3446 |
%----------------------------------- |
|
3447 |
% SUBSECTION 1 |
|
3448 |
%----------------------------------- |
|
618 | 3449 |
%\subsection{Syntactic Equivalence Under $\simp$} |
640
bd1354127574
more proofreading done, last version before submission
Chengsong
parents:
639
diff
changeset
|
3450 |
%We prove that minor differences can be annihilated |
618 | 3451 |
%by $\simp$. |
3452 |
%For example, |
|
3453 |
%\begin{center} |
|
3454 |
% $\simp \;(\simpALTs\; (\map \;(\_\backslash \; x)\; (\distinct \; \mathit{rs}\; \phi))) = |
|
3455 |
% \simp \;(\simpALTs \;(\distinct \;(\map \;(\_ \backslash\; x) \; \mathit{rs}) \; \phi))$ |
|
3456 |
%\end{center} |
|
532 | 3457 |