532
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{A Better Bound and Other Extensions} % Main chapter title
|
|
4 |
|
|
5 |
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
|
|
6 |
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
|
|
7 |
%algorithm to include constructs such as bounded repetitions and negations.
|
590
|
8 |
\lstset{style=myScalastyle}
|
|
9 |
|
|
10 |
|
|
11 |
This chapter is a ``miscellaneous''
|
|
12 |
chapter which records various
|
|
13 |
extensions to our $\blexersimp$'s formalisations.\\
|
|
14 |
Firstly we present further improvements
|
|
15 |
made to our lexer algorithm $\blexersimp$.
|
|
16 |
We devise a stronger simplification algorithm,
|
|
17 |
called $\bsimpStrong$, which can prune away
|
|
18 |
similar components in two regular expressions at the same
|
|
19 |
alternative level,
|
|
20 |
even if these regular expressions are not exactly the same.
|
|
21 |
We call the lexer that uses this stronger simplification function
|
|
22 |
$\blexerStrong$.
|
|
23 |
We conjecture that both
|
|
24 |
\begin{center}
|
|
25 |
$\blexerStrong \;r \; s = \blexer\; r\;s$
|
|
26 |
\end{center}
|
|
27 |
and
|
|
28 |
\begin{center}
|
|
29 |
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
|
|
30 |
\end{center}
|
|
31 |
hold, but formalising
|
|
32 |
them is still work in progress.
|
|
33 |
We give reasons why the correctness and cubic size bound proofs
|
|
34 |
can be achieved,
|
|
35 |
by exploring the connection between the internal
|
|
36 |
data structure of our $\blexerStrong$ and
|
|
37 |
Animirov's partial derivatives.\\
|
|
38 |
Secondly, we extend our $\blexersimp$
|
|
39 |
to support bounded repetitions ($r^{\{n\}}$).
|
|
40 |
We update our formalisation of
|
|
41 |
the correctness and finiteness properties to
|
|
42 |
include this new construct. With bounded repetitions
|
|
43 |
we are able to out-compete other verified lexers such as
|
|
44 |
Verbatim++ on regular expressions which involve a lot of
|
620
|
45 |
repetitions. %We also present the idempotency property proof
|
|
46 |
%of $\bsimp$, which leverages the idempotency proof of $\rsimp$.
|
|
47 |
%This reinforces our claim that the fixpoint construction
|
|
48 |
%originally required by Sulzmann and Lu can be removed in $\blexersimp$.
|
590
|
49 |
\\
|
|
50 |
Last but not least, we present our efforts and challenges we met
|
|
51 |
in further improving the algorithm by data
|
|
52 |
structures such as zippers.
|
|
53 |
|
|
54 |
|
|
55 |
|
532
|
56 |
%----------------------------------------------------------------------------------------
|
|
57 |
% SECTION strongsimp
|
|
58 |
%----------------------------------------------------------------------------------------
|
|
59 |
\section{A Stronger Version of Simplification}
|
|
60 |
%TODO: search for isabelle proofs of algorithms that check equivalence
|
590
|
61 |
In our bitcoded lexing algorithm, (sub)terms represent (sub)matches.
|
591
|
62 |
For example, the regular expression
|
|
63 |
\[
|
|
64 |
aa \cdot a^*+ a \cdot a^* + aa\cdot a^*
|
|
65 |
\]
|
|
66 |
contains three terms,
|
|
67 |
expressing three possibilities it will match future input.
|
|
68 |
The first and the third terms are identical, which means we can eliminate
|
|
69 |
the latter as we know it will not be picked up by $\bmkeps$.
|
|
70 |
In $\bsimps$, the $\distinctBy$ function takes care of this.
|
|
71 |
The criteria $\distinctBy$ uses for removing a duplicate
|
|
72 |
$a_2$ in the list
|
|
73 |
\begin{center}
|
|
74 |
$rs_a@[a_1]@rs_b@[a_2]@rs_c$
|
|
75 |
\end{center}
|
|
76 |
is that
|
533
|
77 |
\begin{center}
|
591
|
78 |
$\rerase{a_1} = \rerase{a_2}$.
|
|
79 |
\end{center}
|
|
80 |
It can be characterised as the $LD$
|
|
81 |
rewrite rule in \ref{rrewriteRules}.\\
|
|
82 |
The problem , however, is that identical components
|
|
83 |
in two slightly different regular expressions cannot be removed:
|
|
84 |
\begin{figure}[H]
|
|
85 |
\[
|
|
86 |
(a+b+d) \cdot r_1 + (a+c+e) \cdot r_1 \stackrel{?}{\rightsquigarrow} (a+b+d) \cdot r_1 + (c+e) \cdot r_1
|
|
87 |
\]
|
|
88 |
\caption{Desired simplification, but not done in $\blexersimp$}\label{partialDedup}
|
|
89 |
\end{figure}
|
|
90 |
\noindent
|
|
91 |
A simplification like this actually
|
|
92 |
cannot be omitted,
|
|
93 |
as without it the size could blow up even with our $\simp$ function:
|
|
94 |
\begin{figure}[H]
|
|
95 |
\centering
|
|
96 |
\begin{tikzpicture}
|
|
97 |
\begin{axis}[
|
|
98 |
%xlabel={$n$},
|
|
99 |
myplotstyle,
|
|
100 |
xlabel={input length},
|
|
101 |
ylabel={size},
|
|
102 |
]
|
|
103 |
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
|
|
104 |
\end{axis}
|
533
|
105 |
\end{tikzpicture}
|
591
|
106 |
\caption{Runtime of $\blexersimp$ for matching
|
|
107 |
$\protect((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$
|
|
108 |
with strings
|
|
109 |
of the form $\protect\underbrace{aa..a}_{n}$.}\label{blexerExp}
|
|
110 |
\end{figure}
|
|
111 |
\noindent
|
|
112 |
We would like to apply the rewriting at some stage
|
|
113 |
\begin{figure}[H]
|
|
114 |
\[
|
|
115 |
(a+b+d) \cdot r_1 \longrightarrow a \cdot r_1 + b \cdot r_1 + d \cdot r_1
|
|
116 |
\]
|
|
117 |
\caption{Desired simplification, but not done in $\blexersimp$}\label{desiredSimp}
|
533
|
118 |
\end{figure}
|
|
119 |
\noindent
|
591
|
120 |
in our $\simp$ function,
|
|
121 |
so that it makes the simplification in \ref{partialDedup} possible.
|
|
122 |
For example,
|
|
123 |
can a function like the below one be used?
|
|
124 |
%TODO: simp' that spills things
|
533
|
125 |
|
591
|
126 |
Unfortunately,
|
|
127 |
if we introduce them in our
|
|
128 |
setting we would lose the POSIX property of our calculated values.
|
|
129 |
For example given the regular expression
|
|
130 |
\begin{center}
|
|
131 |
$(a + ab)(bc + c)$
|
|
132 |
\end{center}
|
|
133 |
and the string
|
|
134 |
\begin{center}
|
|
135 |
$ab$,
|
533
|
136 |
\end{center}
|
591
|
137 |
then our algorithm generates the following
|
|
138 |
correct POSIX value
|
|
139 |
\begin{center}
|
|
140 |
$\Seq \; (\Right \; ab) \; (\Right \; c)$.
|
533
|
141 |
\end{center}
|
591
|
142 |
Essentially it matches the string with the longer Right-alternative
|
|
143 |
in the first sequence (and
|
|
144 |
then the 'rest' with the character regular expression $c$ from the second sequence).
|
|
145 |
If we add the simplification above, then we obtain the following value
|
|
146 |
\begin{center}
|
|
147 |
$\Left \; (\Seq \; a \; (\Left \; bc))$
|
|
148 |
\end{center}
|
|
149 |
where the $\Left$-alternatives get priority.
|
|
150 |
However this violates the POSIX rules.
|
|
151 |
The reason for getting this undesired value
|
|
152 |
is that the new rule splits this regular expression up into
|
|
153 |
\begin{center}
|
|
154 |
$a\cdot(b c + c) + ab \cdot (bc + c)$,
|
533
|
155 |
\end{center}
|
591
|
156 |
which becomes a regular expression with a
|
|
157 |
totally different structure--the original
|
|
158 |
was a sequence, and now it becomes an alternative.
|
|
159 |
With an alternative the maximum munch rule no longer works.\\
|
|
160 |
A method to reconcile this is to do the
|
|
161 |
transformation in \ref{desiredSimp} ``non-invasively'',
|
|
162 |
meaning that we traverse the list of regular expressions
|
|
163 |
\begin{center}
|
|
164 |
$rs_a@[a]@rs_c$
|
|
165 |
\end{center}
|
|
166 |
in the alternative
|
|
167 |
\begin{center}
|
|
168 |
$\sum ( rs_a@[a]@rs_c)$
|
533
|
169 |
\end{center}
|
591
|
170 |
using a function similar to $\distinctBy$,
|
|
171 |
but this time
|
|
172 |
we allow a more general list rewrite:
|
592
|
173 |
\begin{mathpar}\label{cubicRule}
|
591
|
174 |
\inferrule{\vspace{0mm} }{rs_a@[a]@rs_c
|
|
175 |
\stackrel{s}{\rightsquigarrow }
|
|
176 |
rs_a@[\textit{prune} \; a \; rs_a]@rs_c }
|
592
|
177 |
\end{mathpar}
|
591
|
178 |
%L \; a_1' = L \; a_1 \setminus (\cup_{a \in rs_a} L \; a)
|
|
179 |
where $\textit{prune} \;a \; acc$ traverses $a$
|
|
180 |
without altering the structure of $a$, removing components in $a$
|
|
181 |
that have appeared in the accumulator $acc$.
|
|
182 |
For example
|
|
183 |
\begin{center}
|
|
184 |
$\textit{prune} \;\;\; (r_a+r_f+r_g+r_h)r_d \;\; \; [(r_a+r_b+r_c)r_d, (r_e+r_f)r_d] $
|
|
185 |
\end{center}
|
|
186 |
should be equal to
|
|
187 |
\begin{center}
|
|
188 |
$(r_g+r_h)r_d$
|
|
189 |
\end{center}
|
|
190 |
because $r_gr_d$ and
|
|
191 |
$r_hr_d$ are the only terms
|
|
192 |
that have not appeared in the accumulator list
|
|
193 |
\begin{center}
|
|
194 |
$[(r_a+r_b+r_c)r_d, (r_e+r_f)r_d]$.
|
|
195 |
\end{center}
|
|
196 |
We implemented
|
|
197 |
function $\textit{prune}$ in Scala,
|
|
198 |
and incorporated into our lexer,
|
|
199 |
by replacing the $\simp$ function
|
|
200 |
with a stronger version called $\bsimpStrong$
|
|
201 |
that prunes regular expressions.
|
|
202 |
We call this lexer $\blexerStrong$.
|
|
203 |
$\blexerStrong$ is able to drastically reduce the
|
|
204 |
internal data structure size which could
|
|
205 |
trigger exponential behaviours in
|
|
206 |
$\blexersimp$.
|
590
|
207 |
\begin{figure}[H]
|
533
|
208 |
\centering
|
|
209 |
\begin{tabular}{@{}c@{\hspace{0mm}}c@{\hspace{0mm}}c@{}}
|
|
210 |
\begin{tikzpicture}
|
|
211 |
\begin{axis}[
|
535
|
212 |
%xlabel={$n$},
|
|
213 |
myplotstyle,
|
|
214 |
xlabel={input length},
|
533
|
215 |
ylabel={size},
|
535
|
216 |
]
|
533
|
217 |
\addplot[red,mark=*, mark options={fill=white}] table {strongSimpCurve.data};
|
|
218 |
\end{axis}
|
|
219 |
\end{tikzpicture}
|
|
220 |
&
|
|
221 |
\begin{tikzpicture}
|
|
222 |
\begin{axis}[
|
535
|
223 |
%xlabel={$n$},
|
|
224 |
myplotstyle,
|
|
225 |
xlabel={input length},
|
533
|
226 |
ylabel={size},
|
535
|
227 |
]
|
533
|
228 |
\addplot[blue,mark=*, mark options={fill=white}] table {bsimpExponential.data};
|
|
229 |
\end{axis}
|
|
230 |
\end{tikzpicture}\\
|
535
|
231 |
\multicolumn{2}{l}{Graphs: Runtime for matching $((a^* + (aa)^* + \ldots + (aaaaa)^* )^*)^*$ with strings
|
591
|
232 |
of the form $\underbrace{aa..a}_{n}$.}
|
533
|
233 |
\end{tabular}
|
|
234 |
\caption{aaaaaStarStar} \label{fig:aaaaaStarStar}
|
|
235 |
\end{figure}
|
591
|
236 |
\begin{figure}[H]
|
|
237 |
\begin{lstlisting}
|
|
238 |
def atMostEmpty(r: Rexp) : Boolean = r match {
|
|
239 |
case ZERO => true
|
|
240 |
case ONE => true
|
|
241 |
case STAR(r) => atMostEmpty(r)
|
|
242 |
case SEQ(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
|
|
243 |
case ALTS(r1, r2) => atMostEmpty(r1) && atMostEmpty(r2)
|
|
244 |
case CHAR(_) => false
|
|
245 |
}
|
|
246 |
|
|
247 |
|
|
248 |
def isOne(r: Rexp) : Boolean = r match {
|
|
249 |
case ONE => true
|
|
250 |
case SEQ(r1, r2) => isOne(r1) && isOne(r2)
|
|
251 |
case ALTS(r1, r2) => (isOne(r1) || isOne(r2)) && (atMostEmpty(r1) && atMostEmpty(r2))//rs.forall(atMostEmpty) && rs.exists(isOne)
|
|
252 |
case STAR(r0) => atMostEmpty(r0)
|
|
253 |
case CHAR(c) => false
|
|
254 |
case ZERO => false
|
|
255 |
}
|
|
256 |
|
|
257 |
//r = r' ~ tail' : If tail' matches tail => returns r'
|
|
258 |
def removeSeqTail(r: Rexp, tail: Rexp) : Rexp = r match {
|
|
259 |
case SEQ(r1, r2) =>
|
|
260 |
if(r2 == tail)
|
|
261 |
r1
|
|
262 |
else
|
|
263 |
ZERO
|
|
264 |
case r => ZERO
|
|
265 |
}
|
|
266 |
|
|
267 |
def prune(r: ARexp, acc: Set[Rexp]) : ARexp = r match{
|
|
268 |
case AALTS(bs, rs) => rs.map(r => prune(r, acc)).filter(_ != ZERO) match
|
|
269 |
{
|
|
270 |
//all components have been removed, meaning this is effectively a duplicate
|
|
271 |
//flats will take care of removing this AZERO
|
|
272 |
case Nil => AZERO
|
|
273 |
case r::Nil => fuse(bs, r)
|
|
274 |
case rs1 => AALTS(bs, rs1)
|
|
275 |
}
|
|
276 |
case ASEQ(bs, r1, r2) =>
|
|
277 |
//remove the r2 in (ra + rb)r2 to identify the duplicate contents of r1
|
|
278 |
prune(r1, acc.map(r => removeSeqTail(r, erase(r2)))) match {
|
|
279 |
//after pruning, returns 0
|
|
280 |
case AZERO => AZERO
|
|
281 |
//after pruning, got r1'.r2, where r1' is equal to 1
|
|
282 |
case r1p if(isOne(erase(r1p))) => fuse(bs ++ mkepsBC(r1p), r2)
|
|
283 |
//assemble the pruned head r1p with r2
|
|
284 |
case r1p => ASEQ(bs, r1p, r2)
|
|
285 |
}
|
|
286 |
//this does the duplicate component removal task
|
|
287 |
case r => if(acc(erase(r))) AZERO else r
|
|
288 |
}
|
|
289 |
\end{lstlisting}
|
|
290 |
\caption{pruning function together with its helper functions}
|
|
291 |
\end{figure}
|
|
292 |
\noindent
|
|
293 |
The benefits of using
|
|
294 |
$\textit{prune}$ such as refining the finiteness bound
|
|
295 |
to a cubic bound has not been formalised yet.
|
|
296 |
Therefore we choose to use Scala code rather than an Isabelle-style formal
|
|
297 |
definition like we did for $\simp$, as the definitions might change
|
|
298 |
to suit proof needs.
|
|
299 |
In the rest of the chapter we will use this convention consistently.
|
|
300 |
\begin{figure}[H]
|
|
301 |
\begin{lstlisting}
|
|
302 |
def distinctWith(rs: List[ARexp],
|
|
303 |
pruneFunction: (ARexp, Set[Rexp]) => ARexp,
|
|
304 |
acc: Set[Rexp] = Set()) : List[ARexp] =
|
|
305 |
rs match{
|
|
306 |
case Nil => Nil
|
|
307 |
case r :: rs =>
|
|
308 |
if(acc(erase(r)))
|
|
309 |
distinctWith(rs, pruneFunction, acc)
|
|
310 |
else {
|
|
311 |
val pruned_r = pruneFunction(r, acc)
|
|
312 |
pruned_r ::
|
|
313 |
distinctWith(rs,
|
|
314 |
pruneFunction,
|
|
315 |
turnIntoTerms(erase(pruned_r)) ++: acc
|
|
316 |
)
|
|
317 |
}
|
|
318 |
}
|
|
319 |
\end{lstlisting}
|
|
320 |
\caption{A Stronger Version of $\textit{distinctBy}$}
|
|
321 |
\end{figure}
|
|
322 |
\noindent
|
|
323 |
The function $\textit{prune}$ is used in $\distinctWith$.
|
|
324 |
$\distinctWith$ is a stronger version of $\distinctBy$
|
|
325 |
which not only removes duplicates as $\distinctBy$ would
|
|
326 |
do, but also uses the $\textit{pruneFunction}$
|
|
327 |
argument to prune away verbose components in a regular expression.\\
|
|
328 |
\begin{figure}[H]
|
|
329 |
\begin{lstlisting}
|
|
330 |
//a stronger version of simp
|
|
331 |
def bsimpStrong(r: ARexp): ARexp =
|
|
332 |
{
|
|
333 |
r match {
|
|
334 |
case ASEQ(bs1, r1, r2) => (bsimpStrong(r1), bsimpStrong(r2)) match {
|
|
335 |
//normal clauses same as simp
|
|
336 |
case (AZERO, _) => AZERO
|
|
337 |
case (_, AZERO) => AZERO
|
|
338 |
case (AONE(bs2), r2s) => fuse(bs1 ++ bs2, r2s)
|
|
339 |
//bs2 can be discarded
|
|
340 |
case (r1s, AONE(bs2)) => fuse(bs1, r1s) //assert bs2 == Nil
|
|
341 |
case (r1s, r2s) => ASEQ(bs1, r1s, r2s)
|
|
342 |
}
|
|
343 |
case AALTS(bs1, rs) => {
|
|
344 |
//distinctBy(flat_res, erase)
|
|
345 |
distinctWith(flats(rs.map(bsimpStrong(_))), prune) match {
|
|
346 |
case Nil => AZERO
|
|
347 |
case s :: Nil => fuse(bs1, s)
|
|
348 |
case rs => AALTS(bs1, rs)
|
|
349 |
}
|
|
350 |
}
|
|
351 |
//stars that can be treated as 1
|
|
352 |
case ASTAR(bs, r0) if(atMostEmpty(erase(r0))) => AONE(bs)
|
|
353 |
case r => r
|
|
354 |
}
|
|
355 |
}
|
|
356 |
\end{lstlisting}
|
|
357 |
\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
|
|
358 |
\end{figure}
|
|
359 |
\noindent
|
|
360 |
$\distinctWith$, is in turn used in $\bsimpStrong$:
|
|
361 |
\begin{figure}[H]
|
|
362 |
\begin{lstlisting}
|
|
363 |
//Conjecture: [| bdersStrong(s, r) |] = O([| r |]^3)
|
|
364 |
def bdersStrong(s: List[Char], r: ARexp) : ARexp = s match {
|
|
365 |
case Nil => r
|
|
366 |
case c::s => bdersStrong(s, bsimpStrong(bder(c, r)))
|
|
367 |
}
|
|
368 |
\end{lstlisting}
|
|
369 |
\caption{The function $\bsimpStrong$ and $\bdersStrongs$}
|
|
370 |
\end{figure}
|
|
371 |
\noindent
|
|
372 |
We conjecture that the above Scala function $\bdersStrongs$,
|
|
373 |
written $\bdersStrong{\_}{\_}$ as an infix notation,
|
|
374 |
satisfies the following property:
|
|
375 |
\begin{conjecture}
|
|
376 |
$\llbracket \bdersStrong{a}{s} \rrbracket = O(\llbracket a \rrbracket^3)$
|
|
377 |
\end{conjecture}
|
|
378 |
The stronger version of $\blexersimp$'s
|
|
379 |
code in Scala looks like:
|
|
380 |
\begin{figure}[H]
|
|
381 |
\begin{lstlisting}
|
|
382 |
def strongBlexer(r: Rexp, s: String) : Option[Val] = {
|
|
383 |
Try(Some(decode(r, strong_blex_simp(internalise(r), s.toList)))).getOrElse(None)
|
|
384 |
}
|
|
385 |
def strong_blex_simp(r: ARexp, s: List[Char]) : Bits = s match {
|
|
386 |
case Nil => {
|
|
387 |
if (bnullable(r)) {
|
|
388 |
mkepsBC(r)
|
|
389 |
}
|
|
390 |
else
|
|
391 |
throw new Exception("Not matched")
|
|
392 |
}
|
|
393 |
case c::cs => {
|
|
394 |
strong_blex_simp(strongBsimp(bder(c, r)), cs)
|
|
395 |
}
|
|
396 |
}
|
|
397 |
\end{lstlisting}
|
|
398 |
\end{figure}
|
|
399 |
\noindent
|
|
400 |
We would like to preserve the correctness like the one
|
|
401 |
we had for $\blexersimp$:
|
|
402 |
\begin{conjecture}\label{cubicConjecture}
|
|
403 |
$\blexerStrong \;r \; s = \blexer\; r\;s$
|
|
404 |
\end{conjecture}
|
592
|
405 |
\noindent
|
|
406 |
We introduce the new rule \ref{cubicRule}
|
|
407 |
and augment our proofs in chapter \ref{Bitcoded2}.
|
|
408 |
The idea is to maintain the properties like
|
|
409 |
$r \stackrel{*}{\rightsquigarrow} \textit{bsimp} \; r$ etc.
|
591
|
410 |
|
592
|
411 |
In the next section,
|
|
412 |
we will describe why we
|
|
413 |
believe a cubic bound can be achieved.
|
|
414 |
We give an introduction to the
|
|
415 |
partial derivatives,
|
|
416 |
which was invented by Antimirov \cite{Antimirov95},
|
|
417 |
and then link it with the result of the function
|
|
418 |
$\bdersStrongs$.
|
|
419 |
|
|
420 |
\section{Antimirov's partial derivatives}
|
591
|
421 |
This suggests a link towrads "partial derivatives"
|
592
|
422 |
introduced The idea behind Antimirov's partial derivatives
|
591
|
423 |
is to do derivatives in a similar way as suggested by Brzozowski,
|
|
424 |
but maintain a set of regular expressions instead of a single one:
|
|
425 |
|
|
426 |
%TODO: antimirov proposition 3.1, needs completion
|
|
427 |
\begin{center}
|
|
428 |
\begin{tabular}{ccc}
|
|
429 |
$\partial_x(a+b)$ & $=$ & $\partial_x(a) \cup \partial_x(b)$\\
|
|
430 |
$\partial_x(\ONE)$ & $=$ & $\phi$
|
|
431 |
\end{tabular}
|
|
432 |
\end{center}
|
|
433 |
|
|
434 |
Rather than joining the calculated derivatives $\partial_x a$ and $\partial_x b$ together
|
|
435 |
using the alternatives constructor, Antimirov cleverly chose to put them into
|
|
436 |
a set instead. This breaks the terms in a derivative regular expression up,
|
|
437 |
allowing us to understand what are the "atomic" components of it.
|
|
438 |
For example, To compute what regular expression $x^*(xx + y)^*$'s
|
|
439 |
derivative against $x$ is made of, one can do a partial derivative
|
|
440 |
of it and get two singleton sets $\{x^* \cdot (xx + y)^*\}$ and $\{x \cdot (xx + y) ^* \}$
|
|
441 |
from $\partial_x(x^*) \cdot (xx + y) ^*$ and $\partial_x((xx + y)^*)$.
|
|
442 |
To get all the "atomic" components of a regular expression's possible derivatives,
|
|
443 |
there is a procedure Antimirov called $\textit{lf}$, short for "linear forms", that takes
|
|
444 |
whatever character is available at the head of the string inside the language of a
|
|
445 |
regular expression, and gives back the character and the derivative regular expression
|
|
446 |
as a pair (which he called "monomial"):
|
|
447 |
\begin{center}
|
|
448 |
\begin{tabular}{ccc}
|
|
449 |
$\lf(\ONE)$ & $=$ & $\phi$\\
|
|
450 |
$\lf(c)$ & $=$ & $\{(c, \ONE) \}$\\
|
|
451 |
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
|
|
452 |
$\lf(r^*)$ & $=$ & $\lf(r) \bigodot \lf(r^*)$\\
|
|
453 |
\end{tabular}
|
|
454 |
\end{center}
|
|
455 |
%TODO: completion
|
|
456 |
|
|
457 |
There is a slight difference in the last three clauses compared
|
|
458 |
with $\partial$: instead of a dot operator $ \textit{rset} \cdot r$ that attaches the regular
|
|
459 |
expression $r$ with every element inside $\textit{rset}$ to create a set of
|
|
460 |
sequence derivatives, it uses the "circle dot" operator $\bigodot$ which operates
|
|
461 |
on a set of monomials (which Antimirov called "linear form") and a regular
|
|
462 |
expression, and returns a linear form:
|
|
463 |
\begin{center}
|
|
464 |
\begin{tabular}{ccc}
|
|
465 |
$l \bigodot (\ZERO)$ & $=$ & $\phi$\\
|
|
466 |
$l \bigodot (\ONE)$ & $=$ & $l$\\
|
|
467 |
$\phi \bigodot t$ & $=$ & $\phi$\\
|
|
468 |
$\{ (x, \ZERO) \} \bigodot t$ & $=$ & $\{(x,\ZERO) \}$\\
|
|
469 |
$\{ (x, \ONE) \} \bigodot t$ & $=$ & $\{(x,t) \}$\\
|
|
470 |
$\{ (x, p) \} \bigodot t$ & $=$ & $\{(x,p\cdot t) \}$\\
|
|
471 |
$\lf(a+b)$ & $=$ & $\lf(a) \cup \lf(b)$\\
|
|
472 |
$\lf(r^*)$ & $=$ & $\lf(r) \cdot \lf(r^*)$\\
|
|
473 |
\end{tabular}
|
|
474 |
\end{center}
|
|
475 |
%TODO: completion
|
|
476 |
|
|
477 |
Some degree of simplification is applied when doing $\bigodot$, for example,
|
|
478 |
$l \bigodot (\ZERO) = \phi$ corresponds to $r \cdot \ZERO \rightsquigarrow \ZERO$,
|
|
479 |
and $l \bigodot (\ONE) = l$ to $l \cdot \ONE \rightsquigarrow l$, and
|
|
480 |
$\{ (x, \ZERO) \} \bigodot t = \{(x,\ZERO) \}$ to $\ZERO \cdot x \rightsquigarrow \ZERO$,
|
|
481 |
and so on.
|
|
482 |
|
|
483 |
With the function $\lf$ one can compute all possible partial derivatives $\partial_{UNIV}(r)$ of a regular expression $r$ with
|
|
484 |
an iterative procedure:
|
|
485 |
\begin{center}
|
|
486 |
\begin{tabular}{llll}
|
|
487 |
$\textit{while}$ & $(\Delta_i \neq \phi)$ & & \\
|
|
488 |
& $\Delta_{i+1}$ & $ =$ & $\lf(\Delta_i) - \PD_i$ \\
|
|
489 |
& $\PD_{i+1}$ & $ =$ & $\Delta_{i+1} \cup \PD_i$ \\
|
|
490 |
$\partial_{UNIV}(r)$ & $=$ & $\PD$ &
|
|
491 |
\end{tabular}
|
|
492 |
\end{center}
|
|
493 |
|
|
494 |
|
|
495 |
$(r_1 + r_2) \cdot r_3 \longrightarrow (r_1 \cdot r_3) + (r_2 \cdot r_3)$,
|
|
496 |
|
|
497 |
|
532
|
498 |
|
|
499 |
|
|
500 |
%----------------------------------------------------------------------------------------
|
|
501 |
% SECTION 2
|
|
502 |
%----------------------------------------------------------------------------------------
|
|
503 |
|
|
504 |
\section{Bounded Repetitions}
|
612
|
505 |
We have promised in chapter \ref{Introduction}
|
|
506 |
that our lexing algorithm can potentially be extended
|
|
507 |
to handle bounded repetitions
|
|
508 |
in natural and elegant ways.
|
|
509 |
Now we fulfill our promise by adding support for
|
|
510 |
the ``exactly-$n$-times'' bounded regular expression $r^{\{n\}}$.
|
|
511 |
We add clauses in our derivatives-based lexing algorithms (with simplifications)
|
|
512 |
introduced in chapter \ref{Bitcoded2}.
|
|
513 |
|
|
514 |
\subsection{Augmented Definitions}
|
|
515 |
There are a number of definitions that needs to be augmented.
|
|
516 |
The most notable one would be the POSIX rules for $r^{\{n\}}$:
|
|
517 |
\begin{mathpar}
|
|
518 |
\inferrule{v \in vs}{\textit{Stars} place holder}
|
|
519 |
\end{mathpar}
|
|
520 |
|
|
521 |
|
|
522 |
\subsection{Proofs for the Augmented Lexing Algorithm}
|
|
523 |
We need to maintain two proofs with the additional $r^{\{n\}}$
|
|
524 |
construct: the
|
|
525 |
correctness proof in chapter \ref{Bitcoded2},
|
|
526 |
and the finiteness proof in chapter \ref{Finite}.
|
|
527 |
|
|
528 |
\subsubsection{Correctness Proof Augmentation}
|
|
529 |
The correctness of $\textit{lexer}$ and $\textit{blexer}$ with bounded repetitions
|
|
530 |
have been proven by Ausaf and Urban\cite{AusafDyckhoffUrban2016}.
|
|
531 |
As they have commented, once the definitions are in place,
|
|
532 |
the proofs given for the basic regular expressions will extend to
|
|
533 |
bounded regular expressions, and there are no ``surprises''.
|
|
534 |
We confirm this point because the correctness theorem would also
|
|
535 |
extend without surprise to $\blexersimp$.
|
|
536 |
The rewrite rules such as $\rightsquigarrow$, $\stackrel{s}{\rightsquigarrow}$ and so on
|
|
537 |
do not need to be changed,
|
|
538 |
and only a few lemmas such as lemma \ref{fltsPreserves} need to be adjusted to
|
|
539 |
add one more line which can be solved by sledgehammer
|
|
540 |
to solve the $r^{\{n\}}$ inductive case.
|
|
541 |
|
|
542 |
|
|
543 |
\subsubsection{Finiteness Proof Augmentation}
|
620
|
544 |
The bounded repetitions are
|
|
545 |
very similar to stars, and therefore the treatment
|
|
546 |
is similar, with minor changes to handle some slight complications
|
|
547 |
when the counter reaches 0.
|
|
548 |
The exponential growth is similar:
|
|
549 |
\begin{center}
|
|
550 |
\begin{tabular}{ll}
|
|
551 |
$r^{\{n\}} $ & $\longrightarrow_{\backslash c}$\\
|
|
552 |
$(r\backslash c) \cdot
|
|
553 |
r^{\{n - 1\}}*$ & $\longrightarrow_{\backslash c'}$\\
|
|
554 |
\\
|
|
555 |
$r \backslash cc' \cdot r^{\{n - 2\}}* +
|
|
556 |
r \backslash c' \cdot r^{\{n - 1\}}*$ &
|
|
557 |
$\longrightarrow_{\backslash c''}$\\
|
|
558 |
\\
|
|
559 |
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
|
|
560 |
r \backslash c''\cdot r^{\{n-1\}}) +
|
|
561 |
(r \backslash c'c'' \cdot r^{\{n-2\}}* +
|
|
562 |
r \backslash c'' \cdot r^{\{n-1\}}*)$ &
|
|
563 |
$\longrightarrow_{\backslash c'''}$ \\
|
|
564 |
\\
|
|
565 |
$\ldots$\\
|
|
566 |
\end{tabular}
|
|
567 |
\end{center}
|
|
568 |
Again, we assume that $r\backslash c$, $r \backslash cc'$ and so on
|
|
569 |
are all nullable.
|
|
570 |
The flattened list of terms for $r^{\{n\}} \backslash_{rs} s$
|
|
571 |
\begin{center}
|
|
572 |
$[r_1 \backslash cc'c'' \cdot r^{\{n-3\}}*,\;
|
|
573 |
r \backslash c''\cdot r^{\{n-1\}}, \;
|
|
574 |
r \backslash c'c'' \cdot r^{\{n-2\}}*, \;
|
|
575 |
r \backslash c'' \cdot r^{\{n-1\}}*,\; \ldots ]$
|
|
576 |
\end{center}
|
|
577 |
that comes from
|
|
578 |
\begin{center}
|
|
579 |
$(r_1 \backslash cc'c'' \cdot r^{\{n-3\}}* +
|
|
580 |
r \backslash c''\cdot r^{\{n-1\}}) +
|
|
581 |
(r \backslash c'c'' \cdot r^{\{n-2\}}* +
|
|
582 |
r \backslash c'' \cdot r^{\{n-1\}}*)+ \ldots$
|
|
583 |
\end{center}
|
|
584 |
are made of sequences with different tails, where the counters
|
|
585 |
might differ.
|
|
586 |
The observation for maintaining the bound is that
|
|
587 |
these counters never exceed $n$, the original
|
|
588 |
counter. With the number of counters staying finite,
|
|
589 |
$\rDistinct$ will deduplicate and keep the list finite.
|
|
590 |
We introduce this idea as a lemma once we describe all
|
|
591 |
the necessary helper functions.
|
|
592 |
|
|
593 |
Similar to the star case, we want
|
|
594 |
\begin{center}
|
|
595 |
$\rderssimp{r^{\{n\}}}{s} = \rsimp{\sum rs}$.
|
|
596 |
\end{center}
|
|
597 |
where $rs$
|
|
598 |
shall be in the form of
|
|
599 |
$\map \; f \; Ss$, where $f$ is a function and
|
|
600 |
$Ss$ a list of objects to act on.
|
|
601 |
For star, the object's datatype is string.
|
|
602 |
The list of strings $Ss$
|
|
603 |
is generated using functions
|
|
604 |
$\starupdate$ and $\starupdates$.
|
|
605 |
The function that takes a string and returns a regular expression
|
|
606 |
is the anonymous function $
|
|
607 |
(\lambda s'. \; r\backslash s' \cdot r^{\{m\}})$.
|
|
608 |
In the NTIMES setting,
|
|
609 |
the $\starupdate$ and $\starupdates$ functions are replaced by
|
|
610 |
$\textit{nupdate}$ and $\textit{nupdates}$:
|
|
611 |
\begin{center}
|
|
612 |
\begin{tabular}{lcl}
|
|
613 |
$\nupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
|
|
614 |
$\nupdate \; c \; r \;
|
|
615 |
(\Some \; (s, \; n + 1) \; :: \; Ss)$ & $\dn$ & %\\
|
|
616 |
$\textit{if} \;
|
|
617 |
(\rnullable \; (r \backslash_{rs} s))$ \\
|
|
618 |
& & $\;\;\textit{then}
|
|
619 |
\;\; \Some \; (s @ [c], n + 1) :: \Some \; ([c], n) :: (
|
|
620 |
\nupdate \; c \; r \; Ss)$ \\
|
|
621 |
& & $\textit{else} \;\; \Some \; (s @ [c], n+1) :: (
|
|
622 |
\nupdate \; c \; r \; Ss)$\\
|
|
623 |
$\nupdate \; c \; r \; (\textit{None} :: Ss)$ & $\dn$ &
|
|
624 |
$(\None :: \nupdate \; c \; r \; Ss)$\\
|
|
625 |
& & \\
|
|
626 |
%\end{tabular}
|
|
627 |
%\end{center}
|
|
628 |
%\begin{center}
|
|
629 |
%\begin{tabular}{lcl}
|
|
630 |
$\nupdates \; [] \; r \; Ss$ & $\dn$ & $Ss$\\
|
|
631 |
$\nupdates \; (c :: cs) \; r \; Ss$ & $\dn$ & $\nupdates \; cs \; r \; (
|
|
632 |
\nupdate \; c \; r \; Ss)$
|
|
633 |
\end{tabular}
|
|
634 |
\end{center}
|
|
635 |
\noindent
|
|
636 |
which take into account when a subterm
|
|
637 |
\begin{center}
|
|
638 |
$r \backslash_s s \cdot r^{\{n\}}$
|
|
639 |
\end{center}
|
|
640 |
counter $n$
|
|
641 |
is 0, and therefore expands to
|
|
642 |
\begin{center}
|
|
643 |
$r \backslash_s (s@[c]) \cdot r^{\{n\}} \;+
|
|
644 |
\; \ZERO$
|
|
645 |
\end{center}
|
|
646 |
after taking a derivative.
|
|
647 |
The object now has type
|
|
648 |
\begin{center}
|
|
649 |
$\textit{option} \;(\textit{string}, \textit{nat})$
|
|
650 |
\end{center}
|
|
651 |
and therefore the function for converting such an option into
|
|
652 |
a regular expression term is called $\opterm$:
|
|
653 |
|
|
654 |
\begin{center}
|
|
655 |
\begin{tabular}{lcl}
|
|
656 |
$\opterm \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
|
|
657 |
& & $\;\;\Some \; (s, n) \Rightarrow
|
|
658 |
(r\backslash_{rs} s)\cdot r^{\{n\}}$\\
|
|
659 |
& & $\;\;\None \Rightarrow
|
|
660 |
\ZERO$\\
|
|
661 |
\end{tabular}
|
|
662 |
\end{center}
|
|
663 |
\noindent
|
|
664 |
Put together, the list $\map \; f \; Ss$ is instantiated as
|
|
665 |
\begin{center}
|
|
666 |
$\map \; (\opterm \; r) \; (\nupdates \; s \; r \;
|
|
667 |
[\Some \; ([c], n)])$.
|
|
668 |
\end{center}
|
|
669 |
For the closed form to be bounded, we would like
|
|
670 |
simplification to be applied to each term in the list.
|
|
671 |
Therefore we introduce some variants of $\opterm$,
|
|
672 |
which help conveniently express the rewriting steps
|
|
673 |
needed in the closed form proof.
|
|
674 |
\begin{center}
|
|
675 |
\begin{tabular}{lcl}
|
|
676 |
$\optermOsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
|
|
677 |
& & $\;\;\Some \; (s, n) \Rightarrow
|
|
678 |
\textit{rsimp} \; ((r\backslash_{rs} s)\cdot r^{\{n\}})$\\
|
|
679 |
& & $\;\;\None \Rightarrow
|
|
680 |
\ZERO$\\
|
|
681 |
\\
|
|
682 |
$\optermosimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
|
|
683 |
& & $\;\;\Some \; (s, n) \Rightarrow
|
|
684 |
(\textit{rsimp} \; (r\backslash_{rs} s))
|
|
685 |
\cdot r^{\{n\}}$\\
|
|
686 |
& & $\;\;\None \Rightarrow
|
|
687 |
\ZERO$\\
|
|
688 |
\\
|
|
689 |
$\optermsimp \; r \; SN$ & $\dn$ & $\textit{case} \; SN\; of$\\
|
|
690 |
& & $\;\;\Some \; (s, n) \Rightarrow
|
|
691 |
(r\backslash_{rsimps} s)\cdot r^{\{n\}}$\\
|
|
692 |
& & $\;\;\None \Rightarrow
|
|
693 |
\ZERO$\\
|
|
694 |
\end{tabular}
|
|
695 |
\end{center}
|
|
696 |
|
|
697 |
|
|
698 |
For a list of
|
|
699 |
$\textit{option} \;(\textit{string}, \textit{nat})$ elements,
|
|
700 |
we define the highest power for it recursively:
|
|
701 |
\begin{center}
|
|
702 |
\begin{tabular}{lcl}
|
|
703 |
$\hpa \; [] \; n $ & $\dn$ & $n$\\
|
|
704 |
$\hpa \; (\None :: os) \; n $ & $\dn$ & $\hpa \; os \; n$\\
|
|
705 |
$\hpa \; (\Some \; (s, n) :: os) \; m$ & $\dn$ &
|
|
706 |
$\hpa \;os \; (\textit{max} \; n\; m)$\\
|
|
707 |
\\
|
|
708 |
$\hpower \; rs $ & $\dn$ & $\hpa \; rs \; 0$\\
|
|
709 |
\end{tabular}
|
|
710 |
\end{center}
|
|
711 |
|
|
712 |
Now the intuition that an NTIMES regular expression's power
|
|
713 |
does not increase can be easily expressed as
|
|
714 |
\begin{lemma}\label{nupdatesMono2}
|
|
715 |
$\hpower \; (\nupdates \;s \; r \; [\Some \; ([c], n)]) \leq n$
|
|
716 |
\end{lemma}
|
|
717 |
\begin{proof}
|
|
718 |
Note that the power is non-increasing after a $\nupdate$ application:
|
|
719 |
\begin{center}
|
|
720 |
$\hpa \;\; (\nupdate \; c \; r \; Ss)\;\; m \leq
|
|
721 |
\hpa\; \; Ss \; m$.
|
|
722 |
\end{center}
|
|
723 |
This is also the case for $\nupdates$:
|
|
724 |
\begin{center}
|
|
725 |
$\hpa \;\; (\nupdates \; s \; r \; Ss)\;\; m \leq
|
|
726 |
\hpa\; \; Ss \; m$.
|
|
727 |
\end{center}
|
|
728 |
Therefore we have that
|
|
729 |
\begin{center}
|
|
730 |
$\hpower \;\; (\nupdates \; s \; r \; Ss) \leq
|
|
731 |
\hpower \;\; Ss$
|
|
732 |
\end{center}
|
|
733 |
which leads to the lemma being proven.
|
|
734 |
|
|
735 |
\end{proof}
|
|
736 |
|
|
737 |
|
|
738 |
We also define the inductive rules for
|
|
739 |
the shape of derivatives of the NTIMES regular expressions:\\[-3em]
|
|
740 |
\begin{center}
|
|
741 |
\begin{mathpar}
|
|
742 |
\inferrule{\mbox{}}{\cbn \;\ZERO}
|
|
743 |
|
|
744 |
\inferrule{\mbox{}}{\cbn \; \; r_a \cdot (r^{\{n\}})}
|
|
745 |
|
|
746 |
\inferrule{\cbn \; r_1 \;\; \; \cbn \; r_2}{\cbn \; r_1 + r_2}
|
|
747 |
|
|
748 |
\inferrule{\cbn \; r}{\cbn \; r + \ZERO}
|
|
749 |
\end{mathpar}
|
|
750 |
\end{center}
|
|
751 |
\noindent
|
|
752 |
A derivative of NTIMES fits into the shape described by $\cbn$:
|
|
753 |
\begin{lemma}\label{ntimesDersCbn}
|
|
754 |
$\cbn \; ((r' \cdot r^{\{n\}}) \backslash_{rs} s)$ holds.
|
|
755 |
\end{lemma}
|
|
756 |
\begin{proof}
|
|
757 |
By a reverse induction on $s$.
|
|
758 |
For the inductive case, note that if $\cbn \; r$ holds,
|
|
759 |
then $\cbn \; (r\backslash_r c)$ holds.
|
|
760 |
\end{proof}
|
|
761 |
In addition, for $\cbn$-shaped regular expressioins, one can flatten
|
|
762 |
them:
|
|
763 |
\begin{lemma}\label{ntimesHfauPushin}
|
|
764 |
If $\cbn \; r$ holds, then $\hflataux{r \backslash_r c} =
|
|
765 |
\textit{concat} \; (\map \; \hflataux{\map \; (\_\backslash_r c) \;
|
|
766 |
(\hflataux{r})})$
|
|
767 |
\end{lemma}
|
|
768 |
\begin{proof}
|
|
769 |
By an induction on the inductive cases of $\cbn$.
|
|
770 |
\end{proof}
|
|
771 |
|
|
772 |
|
|
773 |
|
|
774 |
This time we do not need to define the flattening functions for NTIMES only,
|
|
775 |
because $\hflat{\_}$ and $\hflataux{\_}$ work on NTIMES already:
|
|
776 |
\begin{lemma}\label{ntimesHfauInduct}
|
|
777 |
$\hflataux{( (r\backslash_r c) \cdot r^{\{n\}}) \backslash_{rsimps} s} =
|
|
778 |
\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [\Some \; ([c], n)])$
|
|
779 |
\end{lemma}
|
|
780 |
\begin{proof}
|
|
781 |
By a reverse induction on $s$.
|
|
782 |
The lemma \ref{ntimesDersCbn} is used.
|
|
783 |
\end{proof}
|
|
784 |
\noindent
|
|
785 |
We have a recursive property for NTIMES with $\nupdate$
|
|
786 |
similar to that for STAR,
|
|
787 |
and one for $\nupdates $ as well:
|
|
788 |
\begin{lemma}\label{nupdateInduct1}
|
|
789 |
(i) $\textit{concat} \; (\map \; (\hflataux{\_} \circ (
|
|
790 |
\opterm \; r)) \; Ss) = \map \; (\opterm \; r) \; (\nupdate \;
|
|
791 |
c \; r \; Ss)$\\
|
|
792 |
(ii) $\textit{concat} \; (\map \; \hflataux{\_}\;
|
|
793 |
\map \; (\_\backslash_r x) \;
|
|
794 |
(\map \; (\opterm \; r) \; (\nupdates \; xs \; r \; Ss))) =
|
|
795 |
\map \; (\opterm \; r) \; (\nupdates \;(xs@[x]) \; r\;Ss)$ holds.
|
|
796 |
\end{lemma}
|
|
797 |
\begin{proof}
|
|
798 |
(i) is by an induction on $Ss$.
|
|
799 |
(ii) is by an induction on $xs$.
|
|
800 |
\end{proof}
|
|
801 |
|
|
802 |
The $\nString$ predicate is defined for conveniently
|
|
803 |
expressing that there are no empty strings in the
|
|
804 |
$\Some \;(s, n)$ elements generated by $\nupdate$:
|
|
805 |
\begin{center}
|
|
806 |
\begin{tabular}{lcl}
|
|
807 |
$\nString \; \None$ & $\dn$ & $ \textit{true}$\\
|
|
808 |
$\nString \; (\Some \; ([], n))$ & $\dn$ & $ \textit{false}$\\
|
|
809 |
$\nString \; (\Some \; (c::s, n))$ & $\dn$ & $ \textit{true}$\\
|
|
810 |
\end{tabular}
|
|
811 |
\end{center}
|
|
812 |
\begin{lemma}\label{nupdatesNonempty}
|
|
813 |
If for all elements $opt$ in $\textit{set} \; Ss$,
|
|
814 |
$\nString \; opt$ holds, the we have that
|
|
815 |
for all elements $opt$ in $\textit{set} \; (\nupdates \; s \; r \; Ss)$,
|
|
816 |
$\nString \; opt$ holds.
|
|
817 |
\end{lemma}
|
|
818 |
\begin{proof}
|
|
819 |
By an induction on $s$, where $Ss$ is set to vary over all possible values.
|
|
820 |
\end{proof}
|
|
821 |
|
|
822 |
\noindent
|
|
823 |
|
|
824 |
\begin{lemma}\label{ntimesClosedFormsSteps}
|
|
825 |
The following list of equalities or rewriting relations hold:\\
|
|
826 |
(i) $r^{\{n+1\}} \backslash_{rsimps} (c::s) =
|
|
827 |
\textit{rsimp} \; (\sum (\map \; (\opterm \;r \;\_) \; (\nupdates \;
|
|
828 |
s \; r \; [\Some \; ([c], n)])))$\\
|
|
829 |
(ii)
|
|
830 |
\begin{center}
|
|
831 |
$\sum (\map \; (\opterm \; r) \; (\nupdates \; s \; r \; [
|
|
832 |
\Some \; ([c], n)]))$ \\ $ \sequal$\\
|
|
833 |
$\sum (\map \; (\textit{rsimp} \circ (\opterm \; r))\; (\nupdates \;
|
|
834 |
s\;r \; [\Some \; ([c], n)]))$\\
|
|
835 |
\end{center}
|
|
836 |
(iii)
|
|
837 |
\begin{center}
|
|
838 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
|
|
839 |
([c], n)]))$\\
|
|
840 |
$\sequal$\\
|
|
841 |
$\sum \;(\map \; (\optermsimp r) \; (\nupdates \; s \; r \; [\Some \;
|
|
842 |
([c], n)])) $\\
|
|
843 |
\end{center}
|
|
844 |
(iv)
|
|
845 |
\begin{center}
|
|
846 |
$\sum \;(\map \; (\optermosimp \; r) \; (\nupdates \; s \; r\; [\Some \;
|
|
847 |
([c], n)])) $ \\ $\sequal$\\
|
|
848 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
|
|
849 |
([c], n)])) $\\
|
|
850 |
\end{center}
|
|
851 |
(v)
|
|
852 |
\begin{center}
|
|
853 |
$\sum \;(\map \; (\optermOsimp r) \; (\nupdates \; s \; r \; [\Some \;
|
|
854 |
([c], n)])) $ \\ $\sequal$\\
|
|
855 |
$\sum \; (\map \; (\textit{rsimp} \circ (\opterm \; r)) \;
|
|
856 |
(\nupdates \; s \; r \; [\Some \; ([c], n)]))$
|
|
857 |
\end{center}
|
|
858 |
\end{lemma}
|
|
859 |
\begin{proof}
|
|
860 |
Routine.
|
|
861 |
(iii) and (iv) make use of the fact that all the strings $s$
|
|
862 |
inside $\Some \; (s, m)$ which are elements of the list
|
|
863 |
$\nupdates \; s\;r\;[\Some\; ([c], n)]$ are non-empty,
|
|
864 |
which is from lemma \ref{nupdatesNonempty}.
|
|
865 |
Once the string in $o = \Some \; (s, n)$ is
|
|
866 |
nonempty, $\optermsimp \; r \;o$,
|
|
867 |
$\optermosimp \; r \; o$ and $\optermosimp \; \; o$ are guaranteed
|
|
868 |
to be equal.
|
|
869 |
\end{proof}
|
|
870 |
|
|
871 |
\begin{theorem}\label{ntimesClosedForm}
|
|
872 |
The derivative of $r^{\{n+1\}}$ can be described as an alternative
|
|
873 |
containing a list
|
|
874 |
of terms:\\
|
|
875 |
$r^{\{n+1\}} \backslash_{rsimps} (c::s) = \textit{rsimp} \; (
|
|
876 |
\sum (\map \; (\optermsimp \; r) \; (\nupdates \; s \; r \;
|
|
877 |
[\Some \; ([c], n)])))$
|
|
878 |
\end{theorem}
|
|
879 |
\begin{proof}
|
|
880 |
By the rewriting steps described in lemma \ref{ntimesClosedFormsSteps}.
|
|
881 |
\end{proof}
|
|
882 |
|
|
883 |
\begin{lemma}\label{nupdatesNLeqN}
|
|
884 |
For an element $o$ in $\textit{set} \; (\nupdates \; s \; r \;
|
|
885 |
[\Some \; ([c], n)])$, either $o = \None$, or $o = \Some
|
|
886 |
\; (s', m)$ for some string $s'$ and number $m \leq n$.
|
|
887 |
\end{lemma}
|
|
888 |
|
|
889 |
|
|
890 |
\begin{lemma}\label{ntimesClosedFormListElemShape}
|
|
891 |
For any element $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
|
|
892 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
|
|
893 |
we have that $r'$ is either $\ZERO$ or $r \backslash_{rsimps} s' \cdot
|
|
894 |
r^{\{m\}}$ for some string $s'$ and number $m \leq n$.
|
|
895 |
\end{lemma}
|
|
896 |
\begin{proof}
|
|
897 |
Using lemma \ref{nupdatesNLeqN}.
|
|
898 |
\end{proof}
|
|
899 |
|
|
900 |
\begin{theorem}\label{ntimesClosedFormBounded}
|
|
901 |
Assuming that for any string $s$, $\llbracket r \backslash_{rsimps} s
|
|
902 |
\rrbracket_r \leq N$ holds, then we have that\\
|
|
903 |
$\llbracket r^{\{n+1\}} \backslash_{rsimps} s \rrbracket_r \leq
|
|
904 |
\textit{max} \; (c_N+1)* (N + \llbracket r^{\{n\}} \rrbracket+1)$,
|
|
905 |
where $c_N = \textit{card} \; (\textit{sizeNregex} \; (
|
|
906 |
N + \llbracket r^{\{n\}} \rrbracket_r+1))$.
|
|
907 |
\end{theorem}
|
|
908 |
\begin{proof}
|
|
909 |
We have that for all regular expressions $r'$ in $\textit{set} \; (\map \; (\optermsimp \; r) \; (
|
|
910 |
\nupdates \; s \; r \; [\Some \; ([c], n)]))$,
|
|
911 |
$r'$'s size is less than or equal to $N + \llbracket r^{\{n\}}
|
|
912 |
\rrbracket_r + 1$
|
|
913 |
because $r'$ can only be either a $\ZERO$ or $r \backslash_{rsimps} s' \cdot
|
|
914 |
r^{\{m\}}$ for some string $s'$ and number
|
|
915 |
$m \leq n$ (lemma \ref{ntimesClosedFormListElemShape}).
|
|
916 |
In addition, we know that the list
|
|
917 |
$\map \; (\optermsimp \; r) \; (
|
|
918 |
\nupdates \; s \; r \; [\Some \; ([c], n)])$'s size is at most
|
|
919 |
$c_N = \textit{card} \;
|
|
920 |
(\sizeNregex \; ((N + \llbracket r^{\{n\}} \rrbracket) + 1))$.
|
|
921 |
This gives us $\llbracket \llbracket r \backslash_{rsimps} \;s \llbracket_r
|
|
922 |
\leq N * c_N$.
|
|
923 |
\end{proof}
|
|
924 |
|
|
925 |
|
|
926 |
Assuming we have that
|
|
927 |
\begin{center}
|
|
928 |
$\rderssimp{r^*}{s} = \rsimp{(\sum \map \; (\lambda s'. r\backslash s' \cdot r^*) \; Ss)}$.
|
|
929 |
\end{center}
|
|
930 |
holds.
|
|
931 |
The idea of $\starupdate$ and $\starupdates$
|
|
932 |
is to update $Ss$ when another
|
|
933 |
derivative is taken on $\rderssimp{r^*}{s}$
|
|
934 |
w.r.t a character $c$ and a string $s'$
|
|
935 |
respectively.
|
|
936 |
Both $\starupdate$ and $\starupdates$ take three arguments as input:
|
|
937 |
the new character $c$ or string $s$ to take derivative with,
|
|
938 |
the regular expression
|
|
939 |
$r$ under the star $r^*$, and the
|
|
940 |
list of strings $Ss$ for the derivative $r^* \backslash s$
|
|
941 |
up until this point
|
|
942 |
such that
|
|
943 |
\begin{center}
|
|
944 |
$(r^*) \backslash s = \sum_{s' \in sSet} (r\backslash s') \cdot r^*$
|
|
945 |
\end{center}
|
|
946 |
is satisfied.
|
|
947 |
|
|
948 |
Functions $\starupdate$ and $\starupdates$ characterise what the
|
|
949 |
star derivatives will look like once ``straightened out'' into lists.
|
|
950 |
The helper functions for such operations will be similar to
|
|
951 |
$\sflat{\_}$, $\sflataux{\_}$ and $\sflataux{\_}$, which we defined for sequence.
|
|
952 |
We use similar symbols to denote them, with a $*$ subscript to mark the difference.
|
|
953 |
\begin{center}
|
|
954 |
\begin{tabular}{lcl}
|
|
955 |
$\hflataux{r_1 + r_2}$ & $\dn$ & $\hflataux{r_1} @ \hflataux{r_2}$\\
|
|
956 |
$\hflataux{r}$ & $\dn$ & $[r]$
|
|
957 |
\end{tabular}
|
|
958 |
\end{center}
|
|
959 |
|
|
960 |
\begin{center}
|
|
961 |
\begin{tabular}{lcl}
|
|
962 |
$\hflat{r_1 + r_2}$ & $\dn$ & $\sum (\hflataux {r_1} @ \hflataux {r_2}) $\\
|
|
963 |
$\hflat{r}$ & $\dn$ & $r$
|
|
964 |
\end{tabular}
|
|
965 |
\end{center}
|
|
966 |
\noindent
|
|
967 |
These definitions are tailor-made for dealing with alternatives that have
|
|
968 |
originated from a star's derivatives.
|
|
969 |
A typical star derivative always has the structure of a balanced binary tree:
|
|
970 |
\begin{center}
|
|
971 |
$(r_1 \backslash cc'c'' \cdot r^* + r \backslash c'') +
|
|
972 |
(r \backslash c'c'' \cdot r^* + r \backslash c'' \cdot r^*) $
|
|
973 |
\end{center}
|
|
974 |
All of the nested structures of alternatives
|
|
975 |
generated from derivatives are binary, and therefore
|
|
976 |
$\hflat{\_}$ and $\hflataux{\_}$ only deal with binary alternatives.
|
|
977 |
$\hflat{\_}$ ``untangles'' like the following:
|
|
978 |
\[
|
|
979 |
\sum ((r_1 + r_2) + (r_3 + r_4)) + \ldots \;
|
|
980 |
\stackrel{\hflat{\_}}{\longrightarrow} \;
|
|
981 |
\RALTS{[r_1, r_2, \ldots, r_n]}
|
|
982 |
\]
|
|
983 |
Here is a lemma stating the recursive property of $\starupdate$ and $\starupdates$,
|
|
984 |
with the helpers $\hflat{\_}$ and $\hflataux{\_}$\footnote{The function $\textit{concat}$ takes a list of lists
|
|
985 |
and merges each of the element lists to form a flattened list.}:
|
|
986 |
\begin{lemma}\label{stupdateInduct1}
|
|
987 |
\mbox
|
|
988 |
For a list of strings $Ss$, the following hold.
|
|
989 |
\begin{itemize}
|
|
990 |
\item
|
|
991 |
If we do a derivative on the terms
|
|
992 |
$r\backslash_r s \cdot r^*$ (where $s$ is taken from the list $Ss$),
|
|
993 |
the result will be the same as if we apply $\starupdate$ to $Ss$.
|
|
994 |
\begin{center}
|
|
995 |
\begin{tabular}{c}
|
|
996 |
$\textit{concat} \; (\map \; (\hflataux{\_} \circ ( (\_\backslash_r x)
|
|
997 |
\circ (\lambda s.\;\; (r \backslash_r s) \cdot r^*)))\; Ss )\;
|
|
998 |
$\\
|
|
999 |
\\
|
|
1000 |
$=$ \\
|
|
1001 |
\\
|
|
1002 |
$\map \; (\lambda s. (r \backslash_r s) \cdot (r^*)) \;
|
|
1003 |
(\starupdate \; x \; r \; Ss)$
|
|
1004 |
\end{tabular}
|
|
1005 |
\end{center}
|
|
1006 |
\item
|
|
1007 |
$\starupdates$ is ``composable'' w.r.t a derivative.
|
|
1008 |
It piggybacks the character $x$ to the tail of the string
|
|
1009 |
$xs$.
|
|
1010 |
\begin{center}
|
|
1011 |
\begin{tabular}{c}
|
|
1012 |
$\textit{concat} \; (\map \; \hflataux{\_} \;
|
|
1013 |
(\map \; (\_\backslash_r x) \;
|
|
1014 |
(\map \; (\lambda s.\;\; (r \backslash_r s) \cdot
|
|
1015 |
(r^*) ) \; (\starupdates \; xs \; r \; Ss))))$\\
|
|
1016 |
\\
|
|
1017 |
$=$\\
|
|
1018 |
\\
|
|
1019 |
$\map \; (\lambda s.\;\; (r\backslash_r s) \cdot (r^*)) \;
|
|
1020 |
(\starupdates \; (xs @ [x]) \; r \; Ss)$
|
|
1021 |
\end{tabular}
|
|
1022 |
\end{center}
|
|
1023 |
\end{itemize}
|
|
1024 |
\end{lemma}
|
|
1025 |
|
|
1026 |
\begin{proof}
|
|
1027 |
Part 1 is by induction on $Ss$.
|
|
1028 |
Part 2 is by induction on $xs$, where $Ss$ is left to take arbitrary values.
|
|
1029 |
\end{proof}
|
|
1030 |
|
|
1031 |
|
|
1032 |
Like $\textit{createdBySequence}$, we need
|
|
1033 |
a predicate for ``star-created'' regular expressions:
|
|
1034 |
\begin{center}
|
|
1035 |
\begin{mathpar}
|
|
1036 |
\inferrule{\mbox{}}{ \textit{createdByStar}\; \RSEQ{ra}{\RSTAR{rb}} }
|
|
1037 |
|
|
1038 |
\inferrule{ \textit{createdByStar} \; r_1\; \land \; \textit{createdByStar} \; r_2 }{\textit{createdByStar} \; (r_1 + r_2) }
|
|
1039 |
\end{mathpar}
|
|
1040 |
\end{center}
|
|
1041 |
\noindent
|
|
1042 |
All regular expressions created by taking derivatives of
|
|
1043 |
$r_1 \cdot (r_2)^*$ satisfy the $\textit{createdByStar}$ predicate:
|
|
1044 |
\begin{lemma}\label{starDersCbs}
|
|
1045 |
$\textit{createdByStar} \; ((r_1 \cdot r_2^*) \backslash_r s) $ holds.
|
|
1046 |
\end{lemma}
|
|
1047 |
\begin{proof}
|
|
1048 |
By a reverse induction on $s$.
|
|
1049 |
\end{proof}
|
|
1050 |
If a regular expression conforms to the shape of a star's derivative,
|
|
1051 |
then we can push an application of $\hflataux{\_}$ inside a derivative of it:
|
|
1052 |
\begin{lemma}\label{hfauPushin}
|
|
1053 |
If $\textit{createdByStar} \; r$ holds, then
|
|
1054 |
\begin{center}
|
|
1055 |
$\hflataux{r \backslash_r c} = \textit{concat} \; (
|
|
1056 |
\map \; \hflataux{\_} (\map \; (\_\backslash_r c) \;(\hflataux{r})))$
|
|
1057 |
\end{center}
|
|
1058 |
holds.
|
|
1059 |
\end{lemma}
|
|
1060 |
\begin{proof}
|
|
1061 |
By an induction on the inductivev cases of $\textit{createdByStar}$.
|
|
1062 |
\end{proof}
|
|
1063 |
%This is not entirely true for annotated regular expressions:
|
|
1064 |
%%TODO: bsimp bders \neq bderssimp
|
|
1065 |
%\begin{center}
|
|
1066 |
% $(1+ (c\cdot \ASEQ{bs}{c^*}{c} ))$
|
|
1067 |
%\end{center}
|
|
1068 |
%For bit-codes, the order in which simplification is applied
|
|
1069 |
%might cause a difference in the location they are placed.
|
|
1070 |
%If we want something like
|
|
1071 |
%\begin{center}
|
|
1072 |
% $\bderssimp{r}{s} \myequiv \bsimp{\bders{r}{s}}$
|
|
1073 |
%\end{center}
|
|
1074 |
%Some "canonicalization" procedure is required,
|
|
1075 |
%which either pushes all the common bitcodes to nodes
|
|
1076 |
%as senior as possible:
|
|
1077 |
%\begin{center}
|
|
1078 |
% $_{bs}(_{bs_1 @ bs'}r_1 + _{bs_1 @ bs''}r_2) \rightarrow _{bs @ bs_1}(_{bs'}r_1 + _{bs''}r_2) $
|
|
1079 |
%\end{center}
|
|
1080 |
%or does the reverse. However bitcodes are not of interest if we are talking about
|
|
1081 |
%the $\llbracket r \rrbracket$ size of a regex.
|
|
1082 |
%Therefore for the ease and simplicity of producing a
|
|
1083 |
%proof for a size bound, we are happy to restrict ourselves to
|
|
1084 |
%unannotated regular expressions, and obtain such equalities as
|
|
1085 |
%TODO: rsimp sflat
|
|
1086 |
% The simplification of a flattened out regular expression, provided it comes
|
|
1087 |
%from the derivative of a star, is the same as the one nested.
|
|
1088 |
|
|
1089 |
|
|
1090 |
|
|
1091 |
Now we introduce an inductive property
|
|
1092 |
for $\starupdate$ and $\hflataux{\_}$.
|
|
1093 |
\begin{lemma}\label{starHfauInduct}
|
|
1094 |
If we do derivatives of $r^*$
|
|
1095 |
with a string that starts with $c$,
|
|
1096 |
then flatten it out,
|
|
1097 |
we obtain a list
|
|
1098 |
of the shape $\sum_{s' \in sS} (r\backslash_r s') \cdot r^*$,
|
|
1099 |
where $sS = \starupdates \; s \; r \; [[c]]$. Namely,
|
|
1100 |
\begin{center}
|
|
1101 |
$\hflataux{(\rders{( (\rder{c}{r_0})\cdot(r_0^*))}{s})} =
|
|
1102 |
\map \; (\lambda s_1. (r_0 \backslash_r s_1) \cdot (r_0^*)) \;
|
|
1103 |
(\starupdates \; s \; r_0 \; [[c]])$
|
|
1104 |
\end{center}
|
|
1105 |
holds.
|
|
1106 |
\end{lemma}
|
|
1107 |
\begin{proof}
|
|
1108 |
By an induction on $s$, the inductive cases
|
|
1109 |
being $[]$ and $s@[c]$. The lemmas \ref{hfauPushin} and \ref{starDersCbs} are used.
|
|
1110 |
\end{proof}
|
|
1111 |
\noindent
|
|
1112 |
|
|
1113 |
$\hflataux{\_}$ has a similar effect as $\textit{flatten}$:
|
|
1114 |
\begin{lemma}\label{hflatauxGrewrites}
|
|
1115 |
$a :: rs \grewrites \hflataux{a} @ rs$
|
|
1116 |
\end{lemma}
|
|
1117 |
\begin{proof}
|
|
1118 |
By induction on $a$. $rs$ is set to take arbitrary values.
|
|
1119 |
\end{proof}
|
|
1120 |
It is also not surprising that $\textit{rsimp}$ will cover
|
|
1121 |
the effect of $\hflataux{\_}$:
|
|
1122 |
\begin{lemma}\label{cbsHfauRsimpeq1}
|
|
1123 |
$\rsimp{(r_1 + r_2)} = \rsimp{(\RALTS{\hflataux{r_1} @ \hflataux{r_2}})}$
|
|
1124 |
\end{lemma}
|
|
1125 |
|
|
1126 |
\begin{proof}
|
|
1127 |
By using the rewriting relation $\rightsquigarrow$
|
|
1128 |
\end{proof}
|
|
1129 |
And from this we obtain a proof that a star's derivative will be the same
|
|
1130 |
as if it had all its nested alternatives created during deriving being flattened out:
|
|
1131 |
For example,
|
|
1132 |
\begin{lemma}\label{hfauRsimpeq2}
|
|
1133 |
$\createdByStar{r} \implies \rsimp{r} = \rsimp{\RALTS{\hflataux{r}}}$
|
|
1134 |
\end{lemma}
|
|
1135 |
\begin{proof}
|
|
1136 |
By structural induction on $r$, where the induction rules
|
|
1137 |
are these of $\createdByStar{\_}$.
|
|
1138 |
Lemma \ref{cbsHfauRsimpeq1} is used in the inductive case.
|
|
1139 |
\end{proof}
|
|
1140 |
|
|
1141 |
|
|
1142 |
%Here is a corollary that states the lemma in
|
|
1143 |
%a more intuitive way:
|
|
1144 |
%\begin{corollary}
|
|
1145 |
% $\hflataux{r^* \backslash_r (c::xs)} = \map \; (\lambda s. (r \backslash_r s) \cdot
|
|
1146 |
% (r^*))\; (\starupdates \; c\; r\; [[c]])$
|
|
1147 |
%\end{corollary}
|
|
1148 |
%\noindent
|
|
1149 |
%Note that this is also agnostic of the simplification
|
|
1150 |
%function we defined, and is therefore of more general interest.
|
|
1151 |
|
|
1152 |
Together with the rewriting relation
|
|
1153 |
\begin{lemma}\label{starClosedForm6Hrewrites}
|
|
1154 |
We have the following set of rewriting relations or equalities:
|
|
1155 |
\begin{itemize}
|
|
1156 |
\item
|
|
1157 |
$\textit{rsimp} \; (r^* \backslash_r (c::s))
|
|
1158 |
\sequal
|
|
1159 |
\sum \; ( ( \sum (\lambda s. (r\backslash_r s) \cdot r^*) \; (
|
|
1160 |
\starupdates \; s \; r \; [ c::[]] ) ) )$
|
|
1161 |
\item
|
|
1162 |
$r \backslash_{rsimp} (c::s) = \textit{rsimp} \; ( (
|
|
1163 |
\sum ( (\map \; (\lambda s_1. (r\backslash s_1) \; r^*) \;
|
|
1164 |
(\starupdates \;s \; r \; [ c::[] ])))))$
|
|
1165 |
\item
|
|
1166 |
$\sum ( (\map \; (\lambda s. (r\backslash s) \; r^*) \; Ss))
|
|
1167 |
\sequal
|
|
1168 |
\sum ( (\map \; (\lambda s. \textit{rsimp} \; (r\backslash s) \;
|
|
1169 |
r^*) \; Ss) )$
|
|
1170 |
\item
|
|
1171 |
$\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss
|
|
1172 |
\scfrewrites
|
|
1173 |
\map \; (\lambda s. (\rsimp{r \backslash_r s}) \cdot (r^*)) \; Ss$
|
|
1174 |
\item
|
|
1175 |
$( ( \sum ( ( \map \ (\lambda s. \;\;
|
|
1176 |
(\textit{rsimp} \; (r \backslash_r s)) \cdot r^*) \; (\starupdates \;
|
|
1177 |
s \; r \; [ c::[] ])))))$\\
|
|
1178 |
$\sequal$\\
|
|
1179 |
$( ( \sum ( ( \map \ (\lambda s. \;\;
|
|
1180 |
( r \backslash_{rsimp} s)) \cdot r^*) \; (\starupdates \;
|
|
1181 |
s \; r \; [ c::[] ]))))$\\
|
|
1182 |
\end{itemize}
|
|
1183 |
\end{lemma}
|
|
1184 |
\begin{proof}
|
|
1185 |
Part 1 leads to part 2.
|
|
1186 |
The rest of them are routine.
|
|
1187 |
\end{proof}
|
|
1188 |
\noindent
|
|
1189 |
Next the closed form for star regular expressions can be derived:
|
|
1190 |
\begin{theorem}\label{starClosedForm}
|
|
1191 |
$\rderssimp{r^*}{c::s} =
|
|
1192 |
\rsimp{
|
|
1193 |
(\sum (\map \; (\lambda s. (\rderssimp{r}{s})\cdot r^*) \;
|
|
1194 |
(\starupdates \; s\; r \; [[c]])
|
|
1195 |
)
|
|
1196 |
)
|
|
1197 |
}
|
|
1198 |
$
|
|
1199 |
\end{theorem}
|
|
1200 |
\begin{proof}
|
|
1201 |
By an induction on $s$.
|
|
1202 |
The lemmas \ref{rsimpIdem}, \ref{starHfauInduct}, \ref{starClosedForm6Hrewrites}
|
|
1203 |
and \ref{hfauRsimpeq2}
|
|
1204 |
are used.
|
|
1205 |
In \ref{starClosedForm6Hrewrites}, the equalities are
|
|
1206 |
used to link the LHS and RHS.
|
|
1207 |
\end{proof}
|
|
1208 |
|
|
1209 |
|
|
1210 |
|
|
1211 |
|
612
|
1212 |
The closed form for them looks like:
|
|
1213 |
%\begin{center}
|
|
1214 |
% \begin{tabular}{llrclll}
|
|
1215 |
% $r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & \\
|
|
1216 |
% $\textit{rsimp}$ & $($ & $
|
|
1217 |
% \sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
|
|
1218 |
% & & & & $\textit{nupdates} \;$ &
|
|
1219 |
% $ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
|
|
1220 |
% & & & & $)$ &\\
|
|
1221 |
% & & $)$ & & &\\
|
|
1222 |
% & $)$ & & & &\\
|
|
1223 |
% \end{tabular}
|
|
1224 |
%\end{center}
|
|
1225 |
\begin{center}
|
|
1226 |
\begin{tabular}{llrcllrllll}
|
|
1227 |
$r^{\{n+1\}}$ & $ \backslash_{rsimps}$ & $(c::s)$ & $=$ & & &&&&\\
|
|
1228 |
&&&&$\textit{rsimp}$ & $($ & $
|
|
1229 |
\sum \; ( $ & $\map$ & $(\textit{optermsimp}\;r)$ & $($\\
|
|
1230 |
&&&& & & & & $\;\; \textit{nupdates} \;$ &
|
|
1231 |
$ s \; r_0 \; [ \textit{Some} \; ([c], n)]$\\
|
|
1232 |
&&&& & & & & $)$ &\\
|
|
1233 |
&&&& & & $)$ & & &\\
|
|
1234 |
&&&& & $)$ & & & &\\
|
|
1235 |
\end{tabular}
|
|
1236 |
\end{center}
|
613
|
1237 |
The $\textit{optermsimp}$ function with the argument $r$
|
|
1238 |
chooses from two options: $\ZERO$ or
|
612
|
1239 |
We define for the $r^{\{n\}}$ constructor something similar to $\starupdate$
|
596
|
1240 |
and $\starupdates$:
|
|
1241 |
\begin{center}
|
|
1242 |
\begin{tabular}{lcl}
|
|
1243 |
$\starupdate \; c \; r \; [] $ & $\dn$ & $[]$\\
|
|
1244 |
$\starupdate \; c \; r \; (s :: Ss)$ & $\dn$ & \\
|
|
1245 |
& & $\textit{if} \;
|
|
1246 |
(\rnullable \; (\rders \; r \; s))$ \\
|
|
1247 |
& & $\textit{then} \;\; (s @ [c]) :: [c] :: (
|
|
1248 |
\starupdate \; c \; r \; Ss)$ \\
|
|
1249 |
& & $\textit{else} \;\; (s @ [c]) :: (
|
|
1250 |
\starupdate \; c \; r \; Ss)$
|
|
1251 |
\end{tabular}
|
|
1252 |
\end{center}
|
|
1253 |
\noindent
|
|
1254 |
As a generalisation from characters to strings,
|
|
1255 |
$\starupdates$ takes a string instead of a character
|
|
1256 |
as the first input argument, and is otherwise the same
|
|
1257 |
as $\starupdate$.
|
|
1258 |
\begin{center}
|
|
1259 |
\begin{tabular}{lcl}
|
|
1260 |
$\starupdates \; [] \; r \; Ss$ & $=$ & $Ss$\\
|
|
1261 |
$\starupdates \; (c :: cs) \; r \; Ss$ & $=$ & $\starupdates \; cs \; r \; (
|
|
1262 |
\starupdate \; c \; r \; Ss)$
|
|
1263 |
\end{tabular}
|
|
1264 |
\end{center}
|
|
1265 |
\noindent
|
532
|
1266 |
|
620
|
1267 |
|
|
1268 |
|
|
1269 |
|
612
|
1270 |
%----------------------------------------------------------------------------------------
|
|
1271 |
% SECTION 1
|
|
1272 |
%----------------------------------------------------------------------------------------
|
532
|
1273 |
|
612
|
1274 |
%\section{Adding Support for the Negation Construct, and its Correctness Proof}
|
|
1275 |
%We now add support for the negation regular expression:
|
|
1276 |
%\[ r ::= \ZERO \mid \ONE
|
|
1277 |
% \mid c
|
|
1278 |
% \mid r_1 \cdot r_2
|
|
1279 |
% \mid r_1 + r_2
|
|
1280 |
% \mid r^*
|
|
1281 |
% \mid \sim r
|
|
1282 |
%\]
|
|
1283 |
%The $\textit{nullable}$ function's clause for it would be
|
|
1284 |
%\[
|
|
1285 |
%\textit{nullable}(~r) = \neg \nullable(r)
|
|
1286 |
%\]
|
|
1287 |
%The derivative would be
|
|
1288 |
%\[
|
|
1289 |
%~r \backslash c = ~ (r \backslash c)
|
|
1290 |
%\]
|
|
1291 |
%
|
|
1292 |
%The most tricky part of lexing for the $~r$ regular expression
|
|
1293 |
% is creating a value for it.
|
|
1294 |
% For other regular expressions, the value aligns with the
|
|
1295 |
% structure of the regular expression:
|
|
1296 |
% \[
|
|
1297 |
% \vdash \Seq(\Char(a), \Char(b)) : a \cdot b
|
|
1298 |
% \]
|
|
1299 |
%But for the $~r$ regular expression, $s$ is a member of it if and only if
|
|
1300 |
%$s$ does not belong to $L(r)$.
|
|
1301 |
%That means when there
|
|
1302 |
%is a match for the not regular expression, it is not possible to generate how the string $s$ matched
|
|
1303 |
%with $r$.
|
|
1304 |
%What we can do is preserve the information of how $s$ was not matched by $r$,
|
|
1305 |
%and there are a number of options to do this.
|
|
1306 |
%
|
|
1307 |
%We could give a partial value when there is a partial match for the regular expression inside
|
|
1308 |
%the $\mathbf{not}$ construct.
|
|
1309 |
%For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
|
|
1310 |
%A value for it could be
|
|
1311 |
% \[
|
|
1312 |
% \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
|
|
1313 |
% \]
|
|
1314 |
% The above example demonstrates what value to construct
|
|
1315 |
% when the string $s$ is at most a real prefix
|
|
1316 |
% of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
|
|
1317 |
% in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
|
|
1318 |
% constructor.
|
|
1319 |
%
|
|
1320 |
% Another option would be to either store the string $s$ that resulted in
|
|
1321 |
% a mis-match for $r$ or a dummy value as a placeholder:
|
|
1322 |
% \[
|
|
1323 |
% \vdash \textit{Not}(abcd) : ~( r_1 )
|
|
1324 |
% \]
|
|
1325 |
%or
|
|
1326 |
% \[
|
|
1327 |
% \vdash \textit{Not}(\textit{Dummy}) : ~( r_1 )
|
|
1328 |
% \]
|
|
1329 |
% We choose to implement this as it is most straightforward:
|
|
1330 |
% \[
|
|
1331 |
% \mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
|
|
1332 |
% \]
|
|
1333 |
%
|
|
1334 |
%
|
620
|
1335 |
%\begin{center}
|
|
1336 |
% \begin{tabular}{lcl}
|
|
1337 |
% $\ntset \; r \; (n+1) \; c::cs $ & $\dn$ & $\nupdates \;
|
|
1338 |
% cs \; r \; [\Some \; ([c], n)]$\\
|
|
1339 |
% $\ntset \; r\; 0 \; \_$ & $\dn$ & $\None$\\
|
|
1340 |
% $\ntset \; r \; \_ \; [] $ & $ \dn$ & $[]$\\
|
|
1341 |
% \end{tabular}
|
|
1342 |
%\end{center}
|