468
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{Chapter Title Here} % Main chapter title
|
|
4 |
|
|
5 |
\label{ChapterX} % Change X to a consecutive number; for referencing this chapter elsewhere, use \ref{ChapterX}
|
|
6 |
|
500
|
7 |
|
|
8 |
|
|
9 |
|
468
|
10 |
%----------------------------------------------------------------------------------------
|
|
11 |
% SECTION 1
|
|
12 |
%----------------------------------------------------------------------------------------
|
|
13 |
|
|
14 |
\section{Properties of $\backslash c$}
|
|
15 |
|
|
16 |
To have a clear idea of what we can and
|
|
17 |
need to prove about the algorithms involving
|
|
18 |
Brzozowski's derivatives, there are a few
|
|
19 |
properties we need to be clear about
|
|
20 |
it.
|
|
21 |
\subsection{function $\backslash c$ is not 1-to-1}
|
|
22 |
\begin{center}
|
|
23 |
The derivative $w.r.t$ character $c$ is not one-to-one.
|
|
24 |
Formally,
|
|
25 |
$\exists r_1 \;r_2. r_1 \neq r_2 \mathit{and} r_1 \backslash c = r_2 \backslash c$
|
|
26 |
\end{center}
|
|
27 |
This property is trivially true for the
|
|
28 |
character regex example:
|
|
29 |
\begin{center}
|
|
30 |
$r_1 = e; \; r_2 = d;\; r_1 \backslash c = \ZERO = r_2 \backslash c$
|
|
31 |
\end{center}
|
|
32 |
But apart from the cases where the derivative
|
|
33 |
output is $\ZERO$, are there non-trivial results
|
|
34 |
of derivatives which contain strings?
|
|
35 |
The answer is yes.
|
|
36 |
For example,
|
|
37 |
\begin{center}
|
|
38 |
Let $r_1 = a^*b\;\quad r_2 = (a\cdot a^*)\cdot b + b$.\\
|
|
39 |
where $a$ is not nullable.\\
|
|
40 |
$r_1 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$\\
|
|
41 |
$r_2 \backslash c = ((a \backslash c)\cdot a^*)\cdot c + b \backslash c$
|
|
42 |
\end{center}
|
|
43 |
We start with two syntactically different regexes,
|
|
44 |
and end up with the same derivative result, which is
|
|
45 |
a "meaningful" regex because it contains strings.
|
|
46 |
We have rediscovered Arden's lemma:\\
|
|
47 |
\begin{center}
|
|
48 |
$A^*B = A\cdot A^* \cdot B + B$
|
|
49 |
\end{center}
|
|
50 |
|
|
51 |
|
|
52 |
%-----------------------------------
|
|
53 |
% SUBSECTION 1
|
|
54 |
%-----------------------------------
|
|
55 |
\subsection{Subsection 1}
|
|
56 |
|
|
57 |
Nunc posuere quam at lectus tristique eu ultrices augue venenatis. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Aliquam erat volutpat. Vivamus sodales tortor eget quam adipiscing in vulputate ante ullamcorper. Sed eros ante, lacinia et sollicitudin et, aliquam sit amet augue. In hac habitasse platea dictumst.
|
|
58 |
|
500
|
59 |
|
|
60 |
|
|
61 |
|
|
62 |
|
|
63 |
|
|
64 |
|
|
65 |
|
|
66 |
|
468
|
67 |
%-----------------------------------
|
500
|
68 |
% SECTION 2
|
468
|
69 |
%-----------------------------------
|
|
70 |
|
500
|
71 |
\section{Finiteness Property}
|
|
72 |
In this section let us sketch our argument for why the size of the simplified
|
|
73 |
derivatives with the aggressive simplification function can be finitely bounded.
|
503
|
74 |
|
|
75 |
For convenience, we use a new datatype which we call $\rrexp$ to denote
|
500
|
76 |
the difference between it and annotated regular expressions.
|
503
|
77 |
\[ \rrexp ::= \RZERO \mid \RONE
|
|
78 |
\mid \RCHAR{c}
|
|
79 |
\mid \RSEQ{r_1}{r_2}
|
|
80 |
\mid \RALTS{rs}
|
|
81 |
\mid \RSTAR{r}
|
|
82 |
\]
|
|
83 |
For $\rrexp$ we throw away the bitcodes on the annotated regular expressions,
|
|
84 |
but keep everything else intact.
|
|
85 |
It is similar to annotated regular expressions being $\erase$-ed, but with all its structure preserved
|
500
|
86 |
(the $\erase$ function unfortunately does not preserve structure in the case of empty and singleton
|
503
|
87 |
$\ALTS$).
|
500
|
88 |
We denote the operation of erasing the bits and turning an annotated regular expression
|
503
|
89 |
into an $\rrexp{}$ as $\rerase{}$.
|
|
90 |
\begin{center}
|
|
91 |
\begin{tabular}{lcr}
|
|
92 |
$\rerase{\AZERO}$ & $=$ & $\RZERO$\\
|
|
93 |
$\rerase{\ASEQ{bs}{r_1}{r_2}}$ & $=$ & $\RSEQ{\rerase{r_1}}{\rerase{r_2}}$\\
|
|
94 |
$\rerase{\AALTS{bs}{rs}}$ & $ =$ & $ \RALTS{rs}$
|
|
95 |
\end{tabular}
|
|
96 |
\end{center}
|
|
97 |
%TODO: FINISH definition of rerase
|
|
98 |
Similarly we could define the derivative and simplification on
|
|
99 |
$\rrexp$, which would be identical to those we defined for plain $\rexp$s in chapter1,
|
|
100 |
except that now they can operate on alternatives taking multiple arguments.
|
|
101 |
%TODO: definition of rder rsimp (maybe only the alternative clause)
|
|
102 |
The reason why these definitions mirror precisely the corresponding operations
|
|
103 |
on annotated regualar expressions is that we can calculate sizes more easily:
|
|
104 |
|
|
105 |
\begin{lemma}
|
|
106 |
$\rsize{\rerase a} = \asize a$
|
|
107 |
\end{lemma}
|
|
108 |
A similar equation holds for annotated regular expressions' simplification
|
|
109 |
and the plain regular expressions' simplification:
|
|
110 |
\begin{lemma}
|
|
111 |
$\asize{\bsimp{a}} = \rsize{\rsimp{\rerase{a}}}$
|
|
112 |
\end{lemma}
|
|
113 |
Putting these two together we get a property that allows us to estimate the
|
|
114 |
size of an annotated regular expression derivative using its un-annotated counterpart:
|
|
115 |
\begin{lemma}
|
|
116 |
$\asize{\bderssimp{r}{s}} = \rsize{\rderssimp{\rerase{r}}{s}}$
|
|
117 |
\end{lemma}
|
|
118 |
Unless stated otherwise in this chapter all $\textit{rexp}$s without
|
|
119 |
bitcodes are seen as $\rrexp$s.
|
500
|
120 |
|
|
121 |
Suppose
|
|
122 |
we have a size function for bitcoded regular expressions, written
|
|
123 |
$\llbracket r\rrbracket$, which counts the number of nodes if we regard $r$ as a tree
|
|
124 |
(we omit the precise definition; ditto for lists $\llbracket r\!s\rrbracket$). For this we show that for every $r$
|
|
125 |
there exists a bound $N$
|
|
126 |
such that
|
|
127 |
|
|
128 |
\begin{center}
|
503
|
129 |
$\forall s. \; \llbracket{\bderssimp{r}{s}}\rrbracket \leq N$
|
500
|
130 |
\end{center}
|
|
131 |
|
|
132 |
\noindent
|
|
133 |
We prove this by induction on $r$. The base cases for $\AZERO$,
|
|
134 |
$\AONE \textit{bs}$ and $\ACHAR \textit{bs} c$ are straightforward. The interesting case is
|
503
|
135 |
for sequences of the form $\ASEQ{bs}{r_1}{r_2}$. In this case our induction
|
|
136 |
hypotheses state $\exists N_1. \forall s. \; \llbracket \bderssimp{r}{s} \rrbracket \leq N_1$ and
|
|
137 |
$\exists N_2. \forall s. \; \llbracket \bderssimp{r_2}{s} \rrbracket \leq N_2$. We can reason as follows
|
500
|
138 |
%
|
|
139 |
\begin{center}
|
|
140 |
\begin{tabular}{lcll}
|
503
|
141 |
& & $ \llbracket \bderssimp{\ASEQ{bs}{r_1}{r_2} }{s} \rrbracket$\\
|
|
142 |
& $ = $ & $\llbracket bsimp\,(\textit{ALTs}\;bs\;(\ASEQ{nil}{\bderssimp{ r_1}{s}}{ r_2} ::
|
|
143 |
[\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]))\rrbracket $ & (1) \\
|
500
|
144 |
& $\leq$ &
|
503
|
145 |
$\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
|
|
146 |
[$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ & (2) \\
|
|
147 |
& $\leq$ & $\llbracket\ASEQ{bs}{\bderssimp{ r_1}{s}}{r_2}\rrbracket +
|
|
148 |
\llbracket\textit{distinctWith}\,[\bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)]\,\approx\;{}\rrbracket + 1 $ & (3) \\
|
500
|
149 |
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 +
|
503
|
150 |
\llbracket \distinctWith\,[ \bderssimp{r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)] \,\approx\;\rrbracket$ & (4)\\
|
500
|
151 |
& $\leq$ & $N_1 + \llbracket r_2\rrbracket + 2 + l_{N_{2}} * N_{2}$ & (5)
|
|
152 |
\end{tabular}
|
|
153 |
\end{center}
|
|
154 |
|
|
155 |
% tell Chengsong about Indian paper of closed forms of derivatives
|
|
156 |
|
|
157 |
\noindent
|
503
|
158 |
where in (1) the $\textit{Suffix}( r_1, s)$ are the all the suffixes of $s$ where $\bderssimp{ r_1}{s'}$ is nullable ($s'$ being a suffix of $s$).
|
500
|
159 |
The reason why we could write the derivatives of a sequence this way is described in section 2.
|
503
|
160 |
The term (2) is used to control (1).
|
|
161 |
That is because one can obtain an overall
|
500
|
162 |
smaller regex list
|
|
163 |
by flattening it and removing $\ZERO$s first before applying $\distinctWith$ on it.
|
|
164 |
Section 3 is dedicated to its proof.
|
503
|
165 |
In (3) we know that $\llbracket \ASEQ{bs}{(\bderssimp{ r_1}{s}}{r_2}\rrbracket$ is
|
500
|
166 |
bounded by $N_1 + \llbracket{}r_2\rrbracket + 1$. In (5) we know the list comprehension contains only regular expressions of size smaller
|
|
167 |
than $N_2$. The list length after $\distinctWith$ is bounded by a number, which we call $l_{N_2}$. It stands
|
|
168 |
for the number of distinct regular expressions smaller than $N_2$ (there can only be finitely many of them).
|
|
169 |
We reason similarly for $\STAR$.\medskip
|
|
170 |
|
|
171 |
\noindent
|
|
172 |
Clearly we give in this finiteness argument (Step (5)) a very loose bound that is
|
|
173 |
far from the actual bound we can expect. We can do better than this, but this does not improve
|
|
174 |
the finiteness property we are proving. If we are interested in a polynomial bound,
|
|
175 |
one would hope to obtain a similar tight bound as for partial
|
|
176 |
derivatives introduced by Antimirov \cite{Antimirov95}. After all the idea with
|
|
177 |
$\distinctWith$ is to maintain a ``set'' of alternatives (like the sets in
|
|
178 |
partial derivatives). Unfortunately to obtain the exact same bound would mean
|
|
179 |
we need to introduce simplifications, such as
|
|
180 |
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
|
|
181 |
which exist for partial derivatives. However, if we introduce them in our
|
|
182 |
setting we would lose the POSIX property of our calculated values. We leave better
|
|
183 |
bounds for future work.\\[-6.5mm]
|
|
184 |
|
468
|
185 |
|
|
186 |
%----------------------------------------------------------------------------------------
|
|
187 |
% SECTION 2
|
|
188 |
%----------------------------------------------------------------------------------------
|
|
189 |
|
500
|
190 |
\section{"Closed Forms" of regular expressions' derivatives w.r.t strings}
|
503
|
191 |
To embark on getting the "closed forms" of regexes, we first
|
|
192 |
define a few auxiliary definitions to make expressing them smoothly.
|
|
193 |
|
|
194 |
\begin{center}
|
|
195 |
\begin{tabular}{ccc}
|
|
196 |
$\sflataux{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
|
|
197 |
$\sflataux{\AALTS{ }{[]}}$ & $ = $ & $ []$\\
|
|
198 |
$\sflataux r$ & $=$ & $ [r]$
|
|
199 |
\end{tabular}
|
|
200 |
\end{center}
|
|
201 |
The intuition behind $\sflataux{\_}$ is to break up nested regexes
|
|
202 |
of the $(\ldots((r_1 + r_2) + r_3) + \ldots )$(left-associated) shape
|
|
203 |
into a more "balanced" list: $\AALTS{\_}{[r_1,\, r_2 ,\, r_3, \ldots]}$.
|
|
204 |
It will return the singleton list $[r]$ otherwise.
|
|
205 |
|
|
206 |
$\sflat{\_}$ works the same as $\sflataux{\_}$, except that it keeps
|
|
207 |
the output type a regular expression, not a list:
|
|
208 |
\begin{center}
|
|
209 |
\begin{tabular}{ccc}
|
|
210 |
$\sflat{\AALTS{ }{r :: rs}}$ & $=$ & $\sflataux{r} @ rs$\\
|
|
211 |
$\sflat{\AALTS{ }{[]}}$ & $ = $ & $ \AALTS{ }{[]}$\\
|
|
212 |
$\sflat r$ & $=$ & $ [r]$
|
|
213 |
\end{tabular}
|
|
214 |
\end{center}
|
|
215 |
$\sflataux{\_}$ and $\sflat{\_}$ is only recursive in terms of the
|
|
216 |
first element of the list of children of
|
|
217 |
an alternative regex ($\AALTS{ }{rs}$), and is purposefully built for dealing with the regular expression
|
|
218 |
$r_1 \cdot r_2 \backslash s$.
|
|
219 |
|
|
220 |
With $\sflat{\_}$ and $\sflataux{\_}$ we make
|
|
221 |
precise what "closed forms" we have for the sequence derivatives and their simplifications,
|
|
222 |
in other words, how can we construct $(r_1 \cdot r_2) \backslash s$
|
|
223 |
and $\bderssimp{(r_1\cdot r_2)}{s}$,
|
|
224 |
if we are only allowed to use a combination of $r_1 \backslash s'$ ($\bderssimp{r_1}{s'}$)
|
|
225 |
and $r_2 \backslash s'$ ($\bderssimp{r_2}{s'}$), where $s'$ ranges over
|
|
226 |
the substring of $s$?
|
|
227 |
First let's look at a series of derivatives steps on a sequence
|
|
228 |
regular expression, (assuming) that each time the first
|
|
229 |
component of the sequence is always nullable):
|
|
230 |
\begin{center}
|
|
231 |
|
|
232 |
$r_1 \cdot r_2 \quad \longrightarrow_{\backslash c} \quad r_1 \backslash c \cdot r_2 + r_2 \backslash c \quad \longrightarrow_{\backslash c'} \quad (r_1 \backslash cc' \cdot r_2 + r_2 \backslash c') + r_2 \backslash cc' \longrightarrow_{\backslash c''} \quad$\\
|
|
233 |
$((r_1 \backslash cc'c'' \cdot r_2 + r_2 \backslash c'') + r_2 \backslash c'c'') + r_2 \backslash cc'c'' \longrightarrow_{\backslash c''} \quad
|
|
234 |
\ldots$
|
|
235 |
|
|
236 |
\end{center}
|
|
237 |
%TODO: cite indian paper
|
|
238 |
Indianpaper have come up with a slightly more formal way of putting the above process:
|
|
239 |
\begin{center}
|
|
240 |
$r_1 \cdot r_2 \backslash (c_1 :: c_2 ::\ldots c_n) \myequiv r_1 \backslash (c_1 :: c_2:: \ldots c_n) +
|
|
241 |
\delta(\nullable(r_1 \backslash (c_1 :: c_2 \ldots c_{n-1}) ), r_2 \backslash (c_n)) + \ldots
|
|
242 |
+ \delta (\nullable(r_1), r_2\backslash (c_1 :: c_2 ::\ldots c_n))$
|
|
243 |
\end{center}
|
|
244 |
where $\delta(b, r)$ will produce $r$
|
|
245 |
if $b$ evaluates to true,
|
|
246 |
and $\ZERO$ otherwise.
|
|
247 |
|
|
248 |
But the $\myequiv$ sign in the above equation means language equivalence instead of syntactical
|
|
249 |
equivalence. To make this intuition useful
|
|
250 |
for a formal proof, we need something
|
|
251 |
stronger than language equivalence.
|
|
252 |
With the help of $\sflat{\_}$ we can state the equation in Indianpaper
|
|
253 |
more rigorously:
|
500
|
254 |
\begin{lemma}
|
503
|
255 |
$\sflat{(r_1 \cdot r_2) \backslash s} = \AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}$
|
500
|
256 |
\end{lemma}
|
503
|
257 |
With this property we can prove the finiteness of the size of the regex $(r_1 \cdot r_2) \backslash s$,
|
|
258 |
much like a recursive function's termination proof.
|
|
259 |
The function $\vsuf{\_}{\_}$ is defined this way:
|
|
260 |
%TODO: DEF of vsuf
|
|
261 |
\begin{center}
|
|
262 |
\begin{tabular}{lcl}
|
|
263 |
$\vsuf{[]}{\_} $ & $=$ & $[]$\\
|
|
264 |
$\vsuf{c::cs}{r_1}$ & $ =$ & $ \textit{if} (\rnullable{r_1}) \textit{then} (\vsuf{cs}{(rder c r1)}) @ [c :: cs]$\\
|
|
265 |
&& $\textit{else} (\vsuf{cs}{rder c r1}) $
|
|
266 |
\end{tabular}
|
|
267 |
\end{center}
|
|
268 |
It takes into account which prefixes $s'$ of $s$ would make $r \backslash s'$ nullable,
|
|
269 |
and keep a list of suffixes $s''$ such that $s' @ s'' = s$. The list is sorted in
|
|
270 |
the order $r_2\backslash s''$ appears in $(r_1\cdot r_2)\backslash s$.
|
|
271 |
In essence, $\vsuf{\_}{\_}$ is doing a "virtual derivative" of $r$, but instead of producing
|
|
272 |
the entire result of $(r_1 \cdot r_2) \backslash s$,
|
|
273 |
it only stores all the $r_2 \backslash s''$ terms.
|
500
|
274 |
|
503
|
275 |
With this we can also add simplifications to both sides and get
|
|
276 |
\begin{lemma}
|
|
277 |
$\rsimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
|
|
278 |
\end{lemma}
|
|
279 |
Together with the idempotency property of $\rsimp$,
|
|
280 |
%TODO: state the idempotency property of rsimp
|
|
281 |
it is possible to convert the above lemma to obtain a "closed form"
|
|
282 |
for our lexer's intermediate result without bitcodes:
|
|
283 |
\begin{lemma}
|
|
284 |
$\rderssimp{\sflat{(r_1 \cdot r_2) \backslash s} }= \rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}$
|
|
285 |
\end{lemma}
|
|
286 |
And now the reason we have (2) in section 1 is clear:
|
|
287 |
$\rsize{\rsimp{\AALTS{[[]}{ (r_1 \backslash s) \cdot r_2 :: (\map (r_2 \backslash \_) (\vsuf{s}{r_1}))}}}$
|
|
288 |
is bounded by $\llbracket\textit{\distinctWith}\,((\ASEQ{nil}{\bderssimp{r_1}{s}}{r_2}$) ::
|
|
289 |
[$\bderssimp{ r_2}{s'} \;|\; s' \in \textit{Suffix}( r_1, s)])\,\approx\;{}\rrbracket + 1 $ ,
|
|
290 |
as $\vsuf{s}{r_1}$ is a subset of s $\textit{Suffix}( r_1, s)])$.
|
500
|
291 |
|
|
292 |
%----------------------------------------------------------------------------------------
|
|
293 |
% SECTION 3
|
|
294 |
%----------------------------------------------------------------------------------------
|
|
295 |
|
|
296 |
\section{interaction between $\distinctWith$ and $\flts$}
|
|
297 |
Note that it is not immediately obvious that
|
|
298 |
$\llbracket \distinctWith (\flts \textit{rs}) = \phi \rrbracket \leq \llbracket \distinctWith \textit{rs} = \phi \rrbracket $
|
468
|
299 |
|
|
300 |
Sed ullamcorper quam eu nisl interdum at interdum enim egestas. Aliquam placerat justo sed lectus lobortis ut porta nisl porttitor. Vestibulum mi dolor, lacinia molestie gravida at, tempus vitae ligula. Donec eget quam sapien, in viverra eros. Donec pellentesque justo a massa fringilla non vestibulum metus vestibulum. Vestibulum in orci quis felis tempor lacinia. Vivamus ornare ultrices facilisis. Ut hendrerit volutpat vulputate. Morbi condimentum venenatis augue, id porta ipsum vulputate in. Curabitur luctus tempus justo. Vestibulum risus lectus, adipiscing nec condimentum quis, condimentum nec nisl. Aliquam dictum sagittis velit sed iaculis. Morbi tristique augue sit amet nulla pulvinar id facilisis ligula mollis. Nam elit libero, tincidunt ut aliquam at, molestie in quam. Aenean rhoncus vehicula hendrerit.
|