532
|
1 |
% Chapter Template
|
|
2 |
|
630
|
3 |
%We also present the idempotency property proof
|
|
4 |
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
|
|
5 |
%This reinforces our claim that the fixpoint construction
|
|
6 |
%originally required by Sulzmann and Lu can be removed in $\blexersimp$.
|
|
7 |
%Last but not least, we present our efforts and challenges we met
|
|
8 |
%in further improving the algorithm by data
|
|
9 |
%structures such as zippers.
|
|
10 |
%----------------------------------------------------------------------------------------
|
|
11 |
% SECTION strongsimp
|
|
12 |
%----------------------------------------------------------------------------------------
|
|
13 |
%TODO: search for isabelle proofs of algorithms that check equivalence
|
|
14 |
|
|
15 |
\chapter{A Better Size Bound for Derivatives} % Main chapter title
|
532
|
16 |
|
|
17 |
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
|
|
18 |
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
|
|
19 |
%algorithm to include constructs such as bounded repetitions and negations.
|
590
|
20 |
\lstset{style=myScalastyle}
|
|
21 |
|
|
22 |
|
625
|
23 |
This chapter is a ``work-in-progress''
|
|
24 |
chapter which records
|
|
25 |
extensions to our $\blexersimp$.
|
630
|
26 |
We make a conjecture that the finite
|
|
27 |
size bound from the previous chapter can be
|
|
28 |
improved to a cubic bound.
|
|
29 |
We implemented our conjecture in Scala.
|
|
30 |
We intend to formalise this part in Isabelle/HOL at a
|
|
31 |
later stage.
|
|
32 |
%we have not been able to finish due to time constraints of the PhD.
|
625
|
33 |
Nevertheless, we outline the ideas we intend to use for the proof.
|
|
34 |
|
630
|
35 |
\section{A Stronger Version of Simplification}
|
|
36 |
|
625
|
37 |
We present further improvements
|
628
|
38 |
for our lexer algorithm $\blexersimp$.
|
590
|
39 |
We devise a stronger simplification algorithm,
|
|
40 |
called $\bsimpStrong$, which can prune away
|
|
41 |
similar components in two regular expressions at the same
|
|
42 |
alternative level,
|
|
43 |
even if these regular expressions are not exactly the same.
|
|
44 |
We call the lexer that uses this stronger simplification function
|
|
45 |
$\blexerStrong$.
|
630
|
46 |
%Unfortunately we did not have time to
|
|
47 |
%work out the proofs, like in
|
|
48 |
%the previous chapters.
|
590
|
49 |
We conjecture that both
|
|
50 |
\begin{center}
|
|
51 |
$\blexerStrong \;r \; s = \blexer\; r\;s$
|
|
52 |
\end{center}
|
630
|
53 |
and
|
590
|
54 |
\begin{center}
|
|
55 |
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
|
|
56 |
\end{center}
|
630
|
57 |
hold.
|
|
58 |
%but a formalisation
|
|
59 |
%is still future work.
|
628
|
60 |
We give an informal justification
|
|
61 |
why the correctness and cubic size bound proofs
|
|
62 |
can be achieved
|
590
|
63 |
by exploring the connection between the internal
|
|
64 |
data structure of our $\blexerStrong$ and
|
630
|
65 |
Animirov's partial derivatives.
|
621
|
66 |
|
590
|
67 |
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
|
591
|
68 |
For example, the regular expression
|
|
69 |
\[
|
|
70 |
aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
|
|
71 |
\]
|
|
72 |
contains three terms,
|
628
|
73 |
expressing three possibilities for how it can match some input.
|
591
|
74 |
The first and the third terms are identical, which means we can eliminate
|
628
|
75 |
the latter as it will not contribute to a POSIX value.
|
|
76 |
In $\bsimps$, the $\distinctBy$ function takes care of
|
|
77 |
such instances.
|
591
|
78 |
The criteria $\distinctBy$ uses for removing a duplicate
|
|
79 |
$a_2$ in the list
|
|
80 |
\begin{center}
|
|
81 |
$rs_a@[a_1]@rs_b@[a_2]@rs_c$
|
|
82 |
\end{center}
|
|
83 |
is that
|
630
|
84 |
the two erased regular expressions are equal
|
533
|
85 |
\begin{center}
|
591
|
86 |
$\rerase{a_1} = \rerase{a_2}$.
|
|
87 |
\end{center}
|
630
|
88 |
This is characterised as the $LD$
|
|
89 |
rewrite rule in figure \ref{rrewriteRules}.
|
|
90 |
The problem, however, is that identical components
|
|
91 |
in two slightly different regular expressions cannot be removed
|
|
92 |
by the $LD$ rule.
|
|
93 |
Consider the stronger simplification
|
628
|
94 |
\begin{equation}
|
|
95 |
\label{eqn:partialDedup}
|
591
|
96 |
(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
|
628
|
97 |
\end{equation}
|
630
|
98 |
where the $(\underline{a}+c+e)\cdot r_1$ is deleted in the right alternative
|
|
99 |
$a+c+e$.
|
628
|
100 |
This is permissible because we have $(a+\ldots)\cdot r_1$ in the left
|
|
101 |
alternative.
|
|
102 |
The difficulty is that such ``buried''
|
|
103 |
alternatives-sequences are not easily recognised.
|
|
104 |
But simplification like this actually
|
630
|
105 |
cannot be omitted, if we want to have a better bound.
|
|
106 |
For example, the size of derivatives can still
|
628
|
107 |
blow up even with our $\textit{bsimp}$
|
|
108 |
function:
|
|
109 |
consider again the example
|
621
|
110 |
$\protect((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$,
|
630
|
111 |
and set $n$ to a relatively small number like $n=5$, then we get the following
|
|
112 |
exponential growth:
|
591
|
113 |
\begin{figure}[H]
|
|
114 |
\centering
|
|
115 |
\begin{tikzpicture}
|
|
116 |
\begin{axis}[
|
|
117 |
%xlabel={$n$},
|
|
118 |
myplotstyle,
|
|
119 |
xlabel={input length},
|
|
120 |
ylabel={size},
|
|
121 |
]
|
|
122 |
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
|
|
123 |
\end{axis}
|
533
|
124 |
\end{tikzpicture}
|
630
|
125 |
\caption{Size of derivatives of $\blexersimp$ from chapter 5 for matching
|
591
|
126 |
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$
|
|
127 |
with strings
|
|
128 |
of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
|
|
129 |
\end{figure}
|
|
130 |
\noindent
|
628
|
131 |
One possible approach would be to apply the rewriting
|
|
132 |
rule
|
591
|
133 |
\[
|
|
134 |
(a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
|
|
135 |
\]
|
533
|
136 |
\noindent
|
630
|
137 |
which pushes the sequence into the alternatives
|
|
138 |
in our $\simp$ function.
|
|
139 |
This would then make the simplification shown in \eqref{eqn:partialDedup} possible.
|
|
140 |
Translating this rule into our $\textit{bsimp}$ function would simply
|
|
141 |
involve adding a new clause to the $\textit{bsimp}_{ASEQ}$ function:
|
621
|
142 |
\begin{center}
|
|
143 |
\begin{tabular}{@{}lcl@{}}
|
|
144 |
$\textit{bsimp}_{ASEQ} \; bs\; a \; b$ & $\dn$ & $ (a,\; b) \textit{match}$\\
|
|
145 |
&& $\ldots$\\
|
|
146 |
&&$\quad\textit{case} \; (_{bs1}\sum as, a_2') \Rightarrow _{bs1}\sum (
|
|
147 |
\map \; (_{[]}\textit{ASEQ} \; \_ \; a_2') \; as)$\\
|
|
148 |
&&$\quad\textit{case} \; (a_1', a_2') \Rightarrow _{bs}a_1' \cdot a_2'$ \\
|
|
149 |
\end{tabular}
|
|
150 |
\end{center}
|
628
|
151 |
\noindent
|
591
|
152 |
Unfortunately,
|
630
|
153 |
if we introduce this clause in our
|
591
|
154 |
setting we would lose the POSIX property of our calculated values.
|
|
155 |
For example given the regular expression
|
|
156 |
\begin{center}
|
|
157 |
$(a + ab)(bc + c)$
|
|
158 |
\end{center}
|
628
|
159 |
and the string $ab$,
|
591
|
160 |
then our algorithm generates the following
|
|
161 |
correct POSIX value
|
|
162 |
\begin{center}
|
|
163 |
$\Seq \; (\Right \; ab) \; (\Right \; c)$.
|
533
|
164 |
\end{center}
|
591
|
165 |
Essentially it matches the string with the longer Right-alternative
|
|
166 |
in the first sequence (and
|
|
167 |
then the 'rest' with the character regular expression $c$ from the second sequence).
|
630
|
168 |
If we add the simplification above, however,
|
|
169 |
then we would obtain the following value
|
591
|
170 |
\begin{center}
|
|
171 |
$\Left \; (\Seq \; a \; (\Left \; bc))$
|
|
172 |
\end{center}
|
|
173 |
where the $\Left$-alternatives get priority.
|
630
|
174 |
This violates the POSIX rules.
|
591
|
175 |
The reason for getting this undesired value
|
|
176 |
is that the new rule splits this regular expression up into
|
630
|
177 |
a topmost alternative
|
591
|
178 |
\begin{center}
|
|
179 |
$a\cdot(b c + c) + ab \cdot (bc + c)$,
|
533
|
180 |
\end{center}
|
630
|
181 |
which is a regular expression with a
|
|
182 |
quite different meaning: the original regular expression
|
|
183 |
is a sequence, but the simplified regular expression is an alternative.
|
|
184 |
With an alternative the maximal munch rule no longer works.
|
|
185 |
|
|
186 |
|
|
187 |
A method to reconcile this problem is to do the
|
628
|
188 |
transformation in \eqref{eqn:partialDedup} ``non-invasively'',
|
591
|
189 |
meaning that we traverse the list of regular expressions
|
630
|
190 |
%\begin{center}
|
|
191 |
% $rs_a@[a]@rs_c$
|
|
192 |
%\end{center}
|
|
193 |
inside alternatives
|
591
|
194 |
\begin{center}
|
|
195 |
$\sum ( rs_a@[a]@rs_c)$
|
533
|
196 |
\end{center}
|
591
|
197 |
using a function similar to $\distinctBy$,
|
|
198 |
but this time
|
630
|
199 |
we allow the following more general rewrite rule:
|
|
200 |
\begin{equation}
|
|
201 |
\label{eqn:cubicRule}
|
|
202 |
%\mbox{
|
|
203 |
%\begin{mathpar}
|
621
|
204 |
\inferrule * [Right = cubicRule]{\vspace{0mm} }{rs_a@[a]@rs_c
|
591
|
205 |
\stackrel{s}{\rightsquigarrow }
|
|
206 |
rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
|
630
|
207 |
%\end{mathpar}
|
|
208 |
%\caption{The rule capturing the pruning simplification needed to achieve
|
|
209 |
%a cubic bound}
|
|
210 |
\end{equation}
|
|
211 |
\noindent
|
591
|
212 |
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
|
|
213 |
where $\textit{prune} \;a \; acc$ traverses $a$
|
630
|
214 |
without altering the structure of $a$, but removing components in $a$
|
591
|
215 |
that have appeared in the accumulator $acc$.
|
|
216 |
For example
|
|
217 |
\begin{center}
|
|
218 |
$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
|
|
219 |
\end{center}
|
|
220 |
should be equal to
|
|
221 |
\begin{center}
|
|
222 |
$(r_g+r_h)r_d$
|
|
223 |
\end{center}
|
|
224 |
because $r_gr_d$ and
|
|
225 |
$r_hr_d$ are the only terms
|
630
|
226 |
that do not appeared in the accumulator list
|
591
|
227 |
\begin{center}
|
|
228 |
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
|
|
229 |
\end{center}
|
630
|
230 |
We implemented the
|
|
231 |
function $\textit{prune}$ in Scala (see figure \ref{fig:pruneFunc})
|
|
232 |
The function $\textit{prune}$
|
|
233 |
is a stronger version of $\textit{distinctBy}$.
|
|
234 |
It does not just walk through a list looking for exact duplicates,
|
|
235 |
but prunes sub-expressions recursively.
|
|
236 |
It manages proper contexts by the helper functions
|
|
237 |
$\textit{removeSeqTail}$, $\textit{isOne}$ and $\textit{atMostEmpty}$.
|
|
238 |
\begin{figure}%[H]
|
|
239 |
\begin{lstlisting}[numbers=left]
|
591
|
240 |
def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
|
628
|
241 |
case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != AZERO) match
|
591
|
242 |
{
|
|
243 |
//all components have been removed, meaning this is effectively a duplicate
|
|
244 |
//flats will take care of removing this AZERO
|
|
245 |
case Nil => AZERO
|
|
246 |
case r::Nil => fuse(bs, r)
|
|
247 |
case rs1 => AALTS(bs, rs1)
|
|
248 |
}
|
|
249 |
case ASEQ(bs, r1, r2) =>
|
|
250 |
//remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1
|
|
251 |
prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
|
|
252 |
//after pruning, returns 0
|
|
253 |
case AZERO => AZERO
|
|
254 |
//after pruning, got r1'.r2, where r1' is equal to 1
|
|
255 |
case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
|
|
256 |
//assemble the pruned head r1p with r2
|
|
257 |
case r1p => ASEQ(bs, r1p, r2)
|
|
258 |
}
|
|
259 |
//this does the duplicate component removal task
|
|
260 |
case r => if(acc(erase(r))) AZERO else r
|
|
261 |
}
|
|
262 |
\end{lstlisting}
|
630
|
263 |
\caption{The function $\textit{prune}$ is called recursively in the
|
|
264 |
alternative case (line 2) and in the sequence case (line 12).
|
|
265 |
In the alternative case we keep all the accumulators the same, but
|
|
266 |
in the sequence case we are making necessary changes to the accumulators
|
|
267 |
to allow correct de-duplication.}\label{fig:pruneFunc}
|
591
|
268 |
\end{figure}
|
|
269 |
\noindent
|
630
|
270 |
|
|
271 |
\begin{figure}
|
|
272 |
\begin{lstlisting}[numbers=left]
|
628
|
273 |
def atMostEmpty(r: Rexp) : Boolean = r match {
|
|
274 |
case ZERO => true
|
|
275 |
case ONE => true
|
|
276 |
case STAR(r) => atMostEmpty(r)
|
|
277 |
case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
|
|
278 |
case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
|
|
279 |
case CHAR(_) => false
|
|
280 |
}
|
|
281 |
|
|
282 |
def isOne(r: Rexp) : Boolean = r match {
|
|
283 |
case ONE => true
|
|
284 |
case SEQ(r1, r2) => isOne(r1) && isOne(r2)
|
|
285 |
case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))
|
|
286 |
case STAR(r0) => atMostEmpty(r0)
|
|
287 |
case CHAR(c) => false
|
|
288 |
case ZERO => false
|
|
289 |
}
|
|
290 |
|
|
291 |
def removeSeqTail(r: Rexp, tail: Rexp) : Rexp =
|
|
292 |
if (r == tail)
|
|
293 |
ONE
|
|
294 |
else {
|
|
295 |
r match {
|
|
296 |
case SEQ(r1, r2) =>
|
630
|
297 |
if(r2 == tail) r1 else ZERO
|
628
|
298 |
case r => ZERO
|
|
299 |
}
|
591
|
300 |
}
|
628
|
301 |
|
|
302 |
|
|
303 |
\end{lstlisting}
|
638
|
304 |
\caption{The helper functions of $\textit{prune}$:
|
|
305 |
$\textit{atMostEmpty}$, $\textit{isOne}$ and $\textit{removeSeqTail}$}
|
628
|
306 |
\end{figure}
|
|
307 |
\noindent
|
|
308 |
Suppose we feed
|
|
309 |
\begin{center}
|
|
310 |
$r= (\underline{\ONE}+(\underline{f}+b)\cdot g)\cdot (a\cdot(d\cdot e))$
|
|
311 |
\end{center}
|
|
312 |
and
|
|
313 |
\begin{center}
|
|
314 |
$acc = \{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}$
|
|
315 |
\end{center}
|
630
|
316 |
as the input into $\textit{prune}$.
|
628
|
317 |
The end result will be
|
|
318 |
\[
|
|
319 |
b\cdot(g\cdot(a\cdot(d\cdot e)))
|
|
320 |
\]
|
|
321 |
where the underlined components in $r$ are eliminated.
|
|
322 |
Looking more closely, at the topmost call
|
|
323 |
\[
|
|
324 |
\textit{prune} \quad (\ONE+
|
|
325 |
(f+b)\cdot g)\cdot (a\cdot(d\cdot e)) \quad
|
|
326 |
\{a\cdot(d\cdot e),f\cdot (g \cdot (a \cdot (d \cdot e))) \}
|
|
327 |
\]
|
|
328 |
The sequence clause will be called,
|
|
329 |
where a sub-call
|
|
330 |
\[
|
|
331 |
\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}
|
|
332 |
\]
|
|
333 |
is made. The terms in the new accumulator $\{\ONE,\; f\cdot g \}$ come from
|
|
334 |
the two calls to $\textit{removeSeqTail}$:
|
|
335 |
\[
|
|
336 |
\textit{removeSeqTail} \quad\;\; a \cdot(d\cdot e) \quad\;\; a \cdot(d\cdot e)
|
|
337 |
\]
|
|
338 |
and
|
|
339 |
\[
|
|
340 |
\textit{removeSeqTail} \quad \;\;
|
|
341 |
f\cdot(g\cdot (a \cdot(d\cdot e)))\quad \;\; a \cdot(d\cdot e).
|
|
342 |
\]
|
630
|
343 |
|
628
|
344 |
The idea behind $\textit{removeSeqTail}$ is that
|
|
345 |
when pruning recursively, we need to ``zoom in''
|
|
346 |
to sub-expressions, and this ``zoom in'' needs to be performed
|
|
347 |
on the
|
630
|
348 |
accumulators as well, otherwise the deletion will not work.
|
628
|
349 |
The sub-call
|
|
350 |
$\textit{prune} \;\; (\ONE+(f+b)\cdot g)\;\; \{\ONE, f\cdot g \}$
|
|
351 |
is simpler, which will trigger the alternative clause, causing
|
|
352 |
a pruning on each element in $(\ONE+(f+b)\cdot g)$,
|
630
|
353 |
leaving us with $b\cdot g$ only.
|
628
|
354 |
|
|
355 |
Our new lexer with stronger simplification
|
|
356 |
uses $\textit{prune}$ by making it
|
|
357 |
the core component of the deduplicating function
|
|
358 |
called $\textit{distinctWith}$.
|
|
359 |
$\textit{DistinctWith}$ ensures that all verbose
|
|
360 |
parts of a regular expression are pruned away.
|
|
361 |
|
|
362 |
\begin{figure}[H]
|
630
|
363 |
\begin{lstlisting}[numbers=left]
|
628
|
364 |
def turnIntoTerms(r: Rexp): List[Rexp] = r match {
|
|
365 |
case SEQ(r1, r2) =>
|
|
366 |
turnIntoTerms(r1).flatMap(r11 => furtherSEQ(r11, r2))
|
|
367 |
case ALTS(r1, r2) => turnIntoTerms(r1) ::: turnIntoTerms(r2)
|
|
368 |
case ZERO => Nil
|
|
369 |
case _ => r :: Nil
|
|
370 |
}
|
|
371 |
|
|
372 |
def distinctWith(rs: List[ARexp],
|
|
373 |
pruneFunction: (ARexp, Set[Rexp]) => ARexp,
|
|
374 |
acc: Set[Rexp] = Set()) : List[ARexp] =
|
|
375 |
rs match{
|
|
376 |
case Nil => Nil
|
|
377 |
case r :: rs =>
|
|
378 |
if(acc(erase(r)))
|
|
379 |
distinctWith(rs, pruneFunction, acc)
|
|
380 |
else {
|
|
381 |
val pruned_r = pruneFunction(r, acc)
|
|
382 |
pruned_r ::
|
|
383 |
distinctWith(rs,
|
|
384 |
pruneFunction,
|
|
385 |
turnIntoTerms(erase(pruned_r)) ++: acc
|
|
386 |
)
|
|
387 |
}
|
|
388 |
}
|
|
389 |
|
591
|
390 |
\end{lstlisting}
|
630
|
391 |
\caption{A Stronger Version of $\textit{distinctBy}$ XXX}
|
591
|
392 |
\end{figure}
|
|
393 |
\noindent
|
628
|
394 |
Once a regular expression has been pruned,
|
|
395 |
all its components will be added to the accumulator
|
|
396 |
to remove any future regular expressions' duplicate components.
|
|
397 |
|
|
398 |
The function $\textit{bsimpStrong}$
|
|
399 |
is very much the same as $\textit{bsimp}$, just with
|
|
400 |
$\textit{distinctBy}$ replaced
|
|
401 |
by $\textit{distinctWith}$.
|
591
|
402 |
\begin{figure}[H]
|
|
403 |
\begin{lstlisting}
|
628
|
404 |
|
591
|
405 |
def bsimpStrong(r: ARexp): ARexp =
|
|
406 |
{
|
|
407 |
r match {
|
|
408 |
case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
|
|
409 |
case (AZERO, _) => AZERO
|
|
410 |
case (_, AZERO) => AZERO
|
|
411 |
case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
|
|
412 |
case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
|
|
413 |
case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
|
|
414 |
}
|
|
415 |
case AALTS(bs1, rs) => {
|
|
416 |
distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
|
|
417 |
case Nil => AZERO
|
|
418 |
case s :: Nil => fuse(bs1, s)
|
|
419 |
case rs => AALTS(bs1, rs)
|
|
420 |
}
|
|
421 |
}
|
|
422 |
case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
|
|
423 |
case r => r
|
|
424 |
}
|
|
425 |
}
|
628
|
426 |
def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
|
591
|
427 |
case Nil => r
|
|
428 |
case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
|
|
429 |
}
|
628
|
430 |
|
591
|
431 |
\end{lstlisting}
|
628
|
432 |
\caption{The function
|
|
433 |
$\textit{bsimpStrong}$: a stronger version of $\textit{bsimp}$}
|
591
|
434 |
\end{figure}
|
|
435 |
\noindent
|
628
|
436 |
The benefits of using
|
|
437 |
$\textit{prune}$ refining the finiteness bound
|
|
438 |
to a cubic bound has not been formalised yet.
|
|
439 |
Therefore we choose to use Scala code rather than an Isabelle-style formal
|
|
440 |
definition like we did for $\simp$, as the definitions might change
|
|
441 |
to suit our proof needs.
|
|
442 |
In the rest of the chapter we will use this convention consistently.
|
|
443 |
|
|
444 |
%The function $\textit{prune}$ is used in $\distinctWith$.
|
|
445 |
%$\distinctWith$ is a stronger version of $\distinctBy$
|
|
446 |
%which not only removes duplicates as $\distinctBy$ would
|
|
447 |
%do, but also uses the $\textit{pruneFunction}$
|
|
448 |
%argument to prune away verbose components in a regular expression.\\
|
|
449 |
%\begin{figure}[H]
|
|
450 |
%\begin{lstlisting}
|
|
451 |
% //a stronger version of simp
|
|
452 |
% def bsimpStrong(r: ARexp): ARexp =
|
|
453 |
% {
|
|
454 |
% r match {
|
|
455 |
% case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
|
|
456 |
% //normal clauses same as simp
|
|
457 |
% case (AZERO, _) => AZERO
|
|
458 |
% case (_, AZERO) => AZERO
|
|
459 |
% case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
|
|
460 |
% //bs2 can be discarded
|
|
461 |
% case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
|
|
462 |
% case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
|
|
463 |
% }
|
|
464 |
% case AALTS(bs1, rs) => {
|
|
465 |
% //distinctBy(flat_res, erase)
|
|
466 |
% distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
|
|
467 |
% case Nil => AZERO
|
|
468 |
% case s :: Nil => fuse(bs1, s)
|
|
469 |
% case rs => AALTS(bs1, rs)
|
|
470 |
% }
|
|
471 |
% }
|
|
472 |
% //stars that can be treated as 1
|
|
473 |
% case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
|
|
474 |
% case r => r
|
|
475 |
% }
|
|
476 |
% }
|
|
477 |
% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
|
|
478 |
% case Nil => r
|
|
479 |
% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
|
|
480 |
% }
|
|
481 |
%\end{lstlisting}
|
|
482 |
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
|
|
483 |
%\end{figure}
|
|
484 |
%\noindent
|
|
485 |
%$\distinctWith$, is in turn used in $\bsimpStrong$:
|
|
486 |
%\begin{figure}[H]
|
|
487 |
%\begin{lstlisting}
|
|
488 |
% //Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)
|
|
489 |
% def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
|
|
490 |
% case Nil => r
|
|
491 |
% case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
|
|
492 |
% }
|
|
493 |
%\end{lstlisting}
|
|
494 |
%\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
|
|
495 |
%\end{figure}
|
|
496 |
%\noindent
|
591
|
497 |
We conjecture that the above Scala function $\bdersStrongs$,
|
|
498 |
written $\bdersStrong{\_}{\_}$ as an infix notation,
|
|
499 |
satisfies the following property:
|
|
500 |
\begin{conjecture}
|
|
501 |
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
|
|
502 |
\end{conjecture}
|
628
|
503 |
\noindent
|
591
|
504 |
The stronger version of $\blexersimp$'s
|
|
505 |
code in Scala looks like:
|
|
506 |
\begin{figure}[H]
|
|
507 |
\begin{lstlisting}
|
|
508 |
def strongBlexer(r: Rexp, s: String) : Option[Val] = {
|
|
509 |
Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)
|
|
510 |
}
|
|
511 |
def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {
|
|
512 |
case Nil => {
|
|
513 |
if (bnullable(r)) {
|
|
514 |
mkepsBC(r)
|
|
515 |
}
|
|
516 |
else
|
|
517 |
throw new Exception("Not matched")
|
|
518 |
}
|
|
519 |
case c::cs => {
|
|
520 |
strong_blex_simp(strongBsimp(bder(c, r)), cs)
|
|
521 |
}
|
|
522 |
}
|
|
523 |
\end{lstlisting}
|
|
524 |
\end{figure}
|
|
525 |
\noindent
|
621
|
526 |
We call this lexer $\blexerStrong$.
|
630
|
527 |
This version is able to reduce the
|
|
528 |
size of the derivatives which
|
|
529 |
otherwise
|
|
530 |
triggered exponential behaviour in
|
621
|
531 |
$\blexersimp$.
|
630
|
532 |
Consider again the runtime for matching
|
|
533 |
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
|
|
534 |
of the form $\protect\underbrace{aa..a}_{n}$.
|
|
535 |
They produce the folloiwng graphs ($\blexerStrong$ on the left-hand-side and
|
|
536 |
$\blexersimp$ on the right-hand-side).
|
621
|
537 |
\begin{figure}[H]
|
|
538 |
\centering
|
|
539 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
|
|
540 |
\begin{tikzpicture}
|
|
541 |
\begin{axis}[
|
|
542 |
%xlabel={$n$},
|
|
543 |
myplotstyle,
|
|
544 |
xlabel={input length},
|
|
545 |
ylabel={size},
|
|
546 |
width = 7cm,
|
|
547 |
height = 5cm,
|
|
548 |
]
|
|
549 |
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
|
|
550 |
\end{axis}
|
|
551 |
\end{tikzpicture}
|
|
552 |
&
|
|
553 |
\begin{tikzpicture}
|
|
554 |
\begin{axis}[
|
|
555 |
%xlabel={$n$},
|
|
556 |
myplotstyle,
|
|
557 |
xlabel={input length},
|
|
558 |
ylabel={size},
|
|
559 |
width = 7cm,
|
|
560 |
height = 5cm,
|
|
561 |
]
|
|
562 |
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
|
|
563 |
\end{axis}
|
|
564 |
\end{tikzpicture}\\
|
|
565 |
\multicolumn{2}{l}{}
|
|
566 |
\end{tabular}
|
630
|
567 |
\caption{}\label{fig:aaaaaStarStar}
|
621
|
568 |
\end{figure}
|
|
569 |
\noindent
|
630
|
570 |
We hope the correctness is preserved.
|
|
571 |
%We would like to preserve the correctness like the one
|
|
572 |
%we had for $\blexersimp$:
|
|
573 |
The proof idea is to preserve the key lemma in chapter \ref{Bitcoded2}
|
|
574 |
such as in equation \eqref{eqn:cubicRule}.
|
591
|
575 |
\begin{conjecture}\label{cubicConjecture}
|
|
576 |
$\blexerStrong \;r \; s = \blexer\; r\;s$
|
|
577 |
\end{conjecture}
|
592
|
578 |
\noindent
|
621
|
579 |
The idea is to maintain key lemmas in
|
|
580 |
chapter \ref{Bitcoded2} like
|
|
581 |
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$
|
628
|
582 |
with the new rewriting rule
|
630
|
583 |
shown in figure \eqref{eqn:cubicRule} .
|
591
|
584 |
|
621
|
585 |
In the next sub-section,
|
592
|
586 |
we will describe why we
|
628
|
587 |
believe a cubic size bound can be achieved with
|
|
588 |
the stronger simplification.
|
|
589 |
For this we give a short introduction to the
|
592
|
590 |
partial derivatives,
|
628
|
591 |
which were invented by Antimirov \cite{Antimirov95},
|
|
592 |
and then link them with the result of the function
|
592
|
593 |
$\bdersStrongs$.
|
|
594 |
|
621
|
595 |
\subsection{Antimirov's partial derivatives}
|
|
596 |
Partial derivatives were first introduced by
|
|
597 |
Antimirov \cite{Antimirov95}.
|
630
|
598 |
They are very similar
|
628
|
599 |
to Brzozowski derivatives,
|
630
|
600 |
but split children of alternative regular expressions into
|
628
|
601 |
multiple independent terms. This means the output of
|
630
|
602 |
partial derivatives is a
|
|
603 |
set of regular expressions, defined as follows
|
621
|
604 |
\begin{center}
|
630
|
605 |
\begin{tabular}{lcl@{\hspace{-5mm}}l}
|
628
|
606 |
$\partial_x \; (r_1 \cdot r_2)$ &
|
|
607 |
$\dn$ & $(\partial_x \; r_1) \cdot r_2 \cup
|
630
|
608 |
\partial_x \; r_2 \;$ & $ \textit{if} \; \; \nullable\; r_1$\\
|
|
609 |
& & $(\partial_x \; r_1)\cdot r_2 \quad\quad$ & $
|
621
|
610 |
\textit{otherwise}$\\
|
628
|
611 |
$\partial_x \; r^*$ & $\dn$ & $(\partial_x \; r) \cdot r^*$\\
|
621
|
612 |
$\partial_x \; c $ & $\dn$ & $\textit{if} \; x = c \;
|
|
613 |
\textit{then} \;
|
|
614 |
\{ \ONE\} \;\;\textit{else} \; \varnothing$\\
|
628
|
615 |
$\partial_x(r_1+r_2)$ & $=$ & $\partial_x(r_1) \cup \partial_x(r_2)$\\
|
621
|
616 |
$\partial_x(\ONE)$ & $=$ & $\varnothing$\\
|
|
617 |
$\partial_x(\ZERO)$ & $\dn$ & $\varnothing$\\
|
|
618 |
\end{tabular}
|
591
|
619 |
\end{center}
|
621
|
620 |
\noindent
|
630
|
621 |
The $\cdot$ in the example
|
628
|
622 |
$(\partial_x \; r_1) \cdot r_2 $
|
621
|
623 |
is a shorthand notation for the cartesian product
|
628
|
624 |
$(\partial_x \; r_1) \times \{ r_2\}$.
|
621
|
625 |
%Each element in the set generated by a partial derivative
|
|
626 |
%corresponds to a (potentially partial) match
|
|
627 |
%TODO: define derivatives w.r.t string s
|
628
|
628 |
Rather than joining the calculated derivatives $\partial_x r_1$ and $\partial_x r_2$ together
|
621
|
629 |
using the $\sum$ constructor, Antimirov put them into
|
628
|
630 |
a set. This means many subterms will be de-duplicated
|
630
|
631 |
because they are sets.
|
|
632 |
For example, to compute what the derivative of the regular expression
|
|
633 |
$x^*(xx + y)^*$
|
|
634 |
w.r.t. $x$ is, one can compute a partial derivative
|
628
|
635 |
and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
|
591
|
636 |
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
|
621
|
637 |
|
628
|
638 |
The partial derivative w.r.t. a string is defined recursively:
|
|
639 |
\[
|
|
640 |
\partial_{c::cs} r \dn \bigcup_{r'\in (\partial_c r)}
|
|
641 |
\partial_{cs} r'
|
|
642 |
\]
|
630
|
643 |
Suppose an alphabet $\Sigma$, we use $\Sigma^*$ for the set of all possible strings
|
|
644 |
from the alphabet.
|
|
645 |
The set of all possible partial derivatives is then defined
|
|
646 |
as the union of derivatives w.r.t all the strings:
|
621
|
647 |
\begin{center}
|
|
648 |
\begin{tabular}{lcl}
|
628
|
649 |
$\textit{PDER}_{\Sigma^*} \; r $ & $\dn $ & $\bigcup_{w \in \Sigma^*}\partial_w \; r$
|
621
|
650 |
\end{tabular}
|
591
|
651 |
\end{center}
|
621
|
652 |
\noindent
|
630
|
653 |
Consider now again our pathological case where we apply the more
|
|
654 |
aggressive
|
|
655 |
simplification
|
621
|
656 |
\begin{center}
|
|
657 |
$((a^* + (aa)^* + \ldots + (\underbrace{a\ldots a}_{n a's})^* )^*)^*$
|
|
658 |
\end{center}
|
630
|
659 |
let use abbreviate theis regular expression with $r$,
|
|
660 |
then we have that
|
621
|
661 |
\begin{center}
|
628
|
662 |
$\textit{PDER}_{\Sigma^*} \; r =
|
621
|
663 |
\bigcup_{i=1}^{n}\bigcup_{j=0}^{i-1} \{
|
|
664 |
(\underbrace{a \ldots a}_{\text{j a's}}\cdot
|
628
|
665 |
(\underbrace{a \ldots a}_{\text{i a's}})^*)\cdot r \}$,
|
591
|
666 |
\end{center}
|
630
|
667 |
The union on the right-hand-side has $n * (n + 1) / 2$ terms.
|
|
668 |
This leads us to believe that the maximum number of terms needed
|
|
669 |
in our derivative is also only $n * (n + 1) / 2$.
|
|
670 |
Therefore
|
|
671 |
we conjecture that $\bsimpStrong$ is also able to achieve this
|
621
|
672 |
upper limit in general
|
|
673 |
\begin{conjecture}\label{bsimpStrongInclusionPder}
|
630
|
674 |
Using a suitable transformation $f$, we have that
|
621
|
675 |
\begin{center}
|
|
676 |
$\forall s.\; f \; (r \bdersStrong \; s) \subseteq
|
628
|
677 |
\textit{PDER}_{\Sigma^*} \; r$
|
621
|
678 |
\end{center}
|
630
|
679 |
holds.
|
621
|
680 |
\end{conjecture}
|
|
681 |
\noindent
|
630
|
682 |
The reason is that our \eqref{eqn:cubicRule} will keep only one copy of each term,
|
621
|
683 |
where the function $\textit{prune}$ takes care of maintaining
|
|
684 |
a set like structure similar to partial derivatives.
|
630
|
685 |
%We might need to adjust $\textit{prune}$
|
|
686 |
%slightly to make sure all duplicate terms are eliminated,
|
|
687 |
%which should be doable.
|
591
|
688 |
|
621
|
689 |
Antimirov had proven that the sum of all the partial derivative
|
|
690 |
terms' sizes is bounded by the cubic of the size of that regular
|
|
691 |
expression:
|
|
692 |
\begin{property}\label{pderBound}
|
628
|
693 |
$\llbracket \textit{PDER}_{\Sigma^*} \; r \rrbracket \leq O(\llbracket r \rrbracket^3)$
|
621
|
694 |
\end{property}
|
630
|
695 |
\noindent
|
628
|
696 |
This property was formalised by Wu et al. \cite{Wu2014}, and the
|
630
|
697 |
details can be found in the Archive of Formal Froofs\footnote{https://www.isa-afp.org/entries/Myhill-Nerode.html}.
|
621
|
698 |
Once conjecture \ref{bsimpStrongInclusionPder} is proven, then property \ref{pderBound}
|
630
|
699 |
would provide us with a cubic bound for our $\blexerStrong$ algorithm:
|
621
|
700 |
\begin{conjecture}\label{strongCubic}
|
|
701 |
$\llbracket r \bdersStrong\; s \rrbracket \leq \llbracket r \rrbracket^3$
|
|
702 |
\end{conjecture}
|
628
|
703 |
\noindent
|
|
704 |
We leave this as future work.
|
591
|
705 |
|
|
706 |
|
621
|
707 |
%To get all the "atomic" components of a regular expression's possible derivatives,
|
|
708 |
%there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
|
|
709 |
%whatever character is available at the head of the string inside the language of a
|
|
710 |
%regular expression, and gives back the character and the derivative regular expression
|
|
711 |
%as a pair (which he called "monomial"):
|
|
712 |
% \begin{center}
|
|
713 |
% \begin{tabular}{ccc}
|
|
714 |
% $\lf(\ONE)$ & $=$ & $\phi$\\
|
|
715 |
%$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
|
|
716 |
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
|
|
717 |
% $\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
|
|
718 |
%\end{tabular}
|
|
719 |
%\end{center}
|
|
720 |
%%TODO: completion
|
|
721 |
%
|
|
722 |
%There is a slight difference in the last three clauses compared
|
|
723 |
%with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
|
|
724 |
%expression $r$ with every element inside $\textit{rset}$ to create a set of
|
|
725 |
%sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
|
|
726 |
%on a set of monomials (which Antimirov called "linear form") and a regular
|
|
727 |
%expression, and returns a linear form:
|
|
728 |
% \begin{center}
|
|
729 |
% \begin{tabular}{ccc}
|
|
730 |
% $l \bigodot (\ZERO)$ & $=$ & $\phi$\\
|
|
731 |
% $l \bigodot (\ONE)$ & $=$ & $l$\\
|
|
732 |
% $\phi \bigodot t$ & $=$ & $\phi$\\
|
|
733 |
% $\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
|
|
734 |
% $\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
|
|
735 |
% $\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
|
|
736 |
% $\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
|
|
737 |
% $\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
|
|
738 |
%\end{tabular}
|
|
739 |
%\end{center}
|
|
740 |
%%TODO: completion
|
|
741 |
%
|
|
742 |
% Some degree of simplification is applied when doing $\bigodot$, for example,
|
|
743 |
% $l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
|
|
744 |
% and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
|
|
745 |
% $\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
|
|
746 |
% and so on.
|
|
747 |
%
|
|
748 |
% With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with
|
|
749 |
% an iterative procedure:
|
|
750 |
% \begin{center}
|
|
751 |
% \begin{tabular}{llll}
|
|
752 |
%$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\
|
|
753 |
% & $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\
|
|
754 |
% & $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
|
|
755 |
%$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
|
|
756 |
%\end{tabular}
|
|
757 |
%\end{center}
|
|
758 |
%
|
|
759 |
%
|
|
760 |
% $(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
|
591
|
761 |
|
|
762 |
|
532
|
763 |
|
|
764 |
|
|
765 |
%----------------------------------------------------------------------------------------
|
|
766 |
% SECTION 2
|
|
767 |
%----------------------------------------------------------------------------------------
|
|
768 |
|
620
|
769 |
|
621
|
770 |
%The closed form for them looks like:
|
|
771 |
%%\begin{center}
|
|
772 |
%% \begin{tabular}{llrclll}
|
|
773 |
%% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\
|
|
774 |
%% $\textit{rsimp}$ & $($ & $
|
|
775 |
%% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
|
|
776 |
%% & & & & $\textit{nupdates} \;$ &
|
|
777 |
%% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
|
|
778 |
%% & & & & $)$ &\\
|
|
779 |
%% & & $)$ & & &\\
|
|
780 |
%% & $)$ & & & &\\
|
|
781 |
%% \end{tabular}
|
|
782 |
%%\end{center}
|
620
|
783 |
%\begin{center}
|
621
|
784 |
% \begin{tabular}{llrcllrllll}
|
|
785 |
% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\
|
|
786 |
% &&&&$\textit{rsimp}$ & $($ & $
|
|
787 |
% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
|
|
788 |
% &&&& & & & & $\;\; \textit{nupdates} \;$ &
|
|
789 |
% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
|
|
790 |
% &&&& & & & & $)$ &\\
|
|
791 |
% &&&& & & $)$ & & &\\
|
|
792 |
% &&&& & $)$ & & & &\\
|
|
793 |
% \end{tabular}
|
620
|
794 |
%\end{center}
|
621
|
795 |
%The $\textit{optermsimp}$ function with the argument $r$
|
|
796 |
%chooses from two options: $\ZERO$ or
|
|
797 |
%We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$
|
|
798 |
%and $\starupdates$:
|
620
|
799 |
%\begin{center}
|
621
|
800 |
% \begin{tabular}{lcl}
|
|
801 |
% $\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
|
|
802 |
% $\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
|
|
803 |
% & & $\textit{if} \;
|
|
804 |
% (\rnullable \; (\rders \; r \; s))$ \\
|
|
805 |
% & & $\textit{then} \;\; (s @ [c]) :: [c] :: (
|
|
806 |
% \starupdate \; c \; r \; Ss)$ \\
|
|
807 |
% & & $\textit{else} \;\; (s @ [c]) :: (
|
|
808 |
% \starupdate \; c \; r \; Ss)$
|
|
809 |
% \end{tabular}
|
620
|
810 |
%\end{center}
|
621
|
811 |
%\noindent
|
|
812 |
%As a generalisation from characters to strings,
|
|
813 |
%$\starupdates$ takes a string instead of a character
|
|
814 |
%as the first input argument, and is otherwise the same
|
|
815 |
%as $\starupdate$.
|
|
816 |
%\begin{center}
|
|
817 |
% \begin{tabular}{lcl}
|
|
818 |
% $\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
|
|
819 |
% $\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (
|
|
820 |
% \starupdate \; c \; r \; Ss)$
|
|
821 |
% \end{tabular}
|
|
822 |
%\end{center}
|
|
823 |
%\noindent
|
620
|
824 |
|
|
825 |
|
|
826 |
|
621
|
827 |
%\section{Zippers}
|
|
828 |
%Zipper is a data structure designed to operate on
|
|
829 |
%and navigate between local parts of a tree.
|
|
830 |
%It was first formally described by Huet \cite{HuetZipper}.
|
|
831 |
%Typical applications of zippers involve text editor buffers
|
|
832 |
%and proof system databases.
|
|
833 |
%In our setting, the idea is to compactify the representation
|
|
834 |
%of derivatives with zippers, thereby making our algorithm faster.
|
|
835 |
%Some initial results
|
|
836 |
%We first give a brief introduction to what zippers are,
|
|
837 |
%and other works
|
|
838 |
%that apply zippers to derivatives
|
|
839 |
%When dealing with large trees, it would be a waste to
|
|
840 |
%traverse the entire tree if
|
|
841 |
%the operation only
|
|
842 |
%involves a small fraction of it.
|
|
843 |
%The idea is to put the focus on that subtree, turning other parts
|
|
844 |
%of the tree into a context
|
|
845 |
%
|
|
846 |
%
|
|
847 |
%One observation about our derivative-based lexing algorithm is that
|
|
848 |
%the derivative operation sometimes traverses the entire regular expression
|
|
849 |
%unnecessarily:
|
620
|
850 |
|
|
851 |
|
612
|
852 |
%----------------------------------------------------------------------------------------
|
|
853 |
% SECTION 1
|
|
854 |
%----------------------------------------------------------------------------------------
|
532
|
855 |
|
612
|
856 |
%\section{Adding Support for the Negation Construct, and its Correctness Proof}
|
|
857 |
%We now add support for the negation regular expression:
|
|
858 |
%\[ r ::= \ZERO \mid \ONE
|
|
859 |
% \mid c
|
|
860 |
% \mid r_1 \cdot r_2
|
|
861 |
% \mid r_1 + r_2
|
|
862 |
% \mid r^*
|
|
863 |
% \mid \sim r
|
|
864 |
%\]
|
|
865 |
%The $\textit{nullable}$ function's clause for it would be
|
|
866 |
%\[
|
|
867 |
%\textit{nullable}(~r) = \neg \nullable(r)
|
|
868 |
%\]
|
|
869 |
%The derivative would be
|
|
870 |
%\[
|
|
871 |
%~r \backslash c = ~ (r \backslash c)
|
|
872 |
%\]
|
|
873 |
%
|
|
874 |
%The most tricky part of lexing for the $~r$ regular expression
|
|
875 |
% is creating a value for it.
|
|
876 |
% For other regular expressions, the value aligns with the
|
|
877 |
% structure of the regular expression:
|
|
878 |
% \[
|
|
879 |
% \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
|
|
880 |
% \]
|
|
881 |
%But for the $~r$ regular expression, $s$ is a member of it if and only if
|
|
882 |
%$s$ does not belong to $L(r)$.
|
|
883 |
%That means when there
|
|
884 |
%is a match for the not regular expression, it is not possible to generate how the string $s$ matched
|
|
885 |
%with $r$.
|
|
886 |
%What we can do is preserve the information of how $s$ was not matched by $r$,
|
|
887 |
%and there are a number of options to do this.
|
|
888 |
%
|
|
889 |
%We could give a partial value when there is a partial match for the regular expression inside
|
|
890 |
%the $\mathbf{not}$ construct.
|
|
891 |
%For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
|
|
892 |
%A value for it could be
|
|
893 |
% \[
|
|
894 |
% \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
|
|
895 |
% \]
|
|
896 |
% The above example demonstrates what value to construct
|
|
897 |
% when the string $s$ is at most a real prefix
|
|
898 |
% of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
|
|
899 |
% in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
|
|
900 |
% constructor.
|
|
901 |
%
|
|
902 |
% Another option would be to either store the string $s$ that resulted in
|
|
903 |
% a mis-match for $r$ or a dummy value as a placeholder:
|
|
904 |
% \[
|
|
905 |
% \vdash \textit{Not}(abcd) : ~( r_1 )
|
|
906 |
% \]
|
|
907 |
%or
|
|
908 |
% \[
|
|
909 |
% \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
|
|
910 |
% \]
|
|
911 |
% We choose to implement this as it is most straightforward:
|
|
912 |
% \[
|
|
913 |
% \mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
|
|
914 |
% \]
|
|
915 |
%
|
|
916 |
%
|
620
|
917 |
%\begin{center}
|
|
918 |
% \begin{tabular}{lcl}
|
|
919 |
% $\ntset \; r \; (n+1) \; c::cs $ & $\dn$ & $\nupdates \;
|
|
920 |
% cs \; r \; [\Some \; ([c], n)]$\\
|
|
921 |
% $\ntset \; r\; 0 \; \_$ & $\dn$ & $\None$\\
|
|
922 |
% $\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\
|
|
923 |
% \end{tabular}
|
|
924 |
%\end{center}
|
628
|
925 |
|