532
|
1 |
% Chapter Template
|
|
2 |
|
|
3 |
\chapter{A Better Bound and Other Extensions} % Main chapter title
|
|
4 |
|
|
5 |
\label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound
|
|
6 |
%in Chapter 4 to a polynomial one, and demonstrate how one can extend the
|
|
7 |
%algorithm to include constructs such as bounded repetitions and negations.
|
|
8 |
|
|
9 |
%----------------------------------------------------------------------------------------
|
|
10 |
% SECTION strongsimp
|
|
11 |
%----------------------------------------------------------------------------------------
|
|
12 |
\section{A Stronger Version of Simplification}
|
|
13 |
%TODO: search for isabelle proofs of algorithms that check equivalence
|
|
14 |
Two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair
|
|
15 |
are always expressed in the "derivative regular expression" as two
|
|
16 |
disjoint alternative terms at the current (sub-)regex level. The fact that they
|
|
17 |
are parallel tells us when we retrieve the information from this (sub-)regex
|
|
18 |
there will always be a choice of which alternative term to take.
|
|
19 |
|
|
20 |
Here is an example for this.
|
|
21 |
|
|
22 |
Assume we have the derivative regex
|
|
23 |
\[(a^* + (aa)^* + (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^* +
|
|
24 |
(a^* + a\cdot(aa)^* + (aa)\cdot (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^*
|
|
25 |
\]
|
|
26 |
|
|
27 |
occurring in an intermediate step in successive
|
|
28 |
derivatives of an input string $\underbrace{aa\ldots a}_{\text{n \; a's}}$.
|
|
29 |
In this expression, there will be 6 "parallel" terms if we break up regexes
|
|
30 |
of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$).
|
|
31 |
\begin{align}
|
|
32 |
(\nonumber \\
|
|
33 |
a^* + & \label{term:1} \\
|
|
34 |
(aa)^* + & \label{term:2} \\
|
|
35 |
(aaa)^* & \label{term:3} \\
|
|
36 |
& )\cdot(a^* + (aa)^* + (aaa)^*)^* \; + \; (\nonumber \\
|
|
37 |
a^* + & \label{term:4} \\
|
|
38 |
a \cdot (aa)^* + & \label{term:5} \\
|
|
39 |
aa \cdot (aaa)^* \label{term:6}\\
|
|
40 |
& )\cdot(a^* + (aa)^* + (aaa)^*)^*\nonumber
|
|
41 |
\end{align}
|
|
42 |
|
|
43 |
|
|
44 |
They have been labelled, and each label denotes
|
|
45 |
one term, for example, \ref{term:1} denotes
|
|
46 |
\[
|
|
47 |
a^*\cdot(a^* + (aa)^* + (aaa)^*)^*
|
|
48 |
\]
|
|
49 |
\noindent
|
|
50 |
and \ref{term:3} denotes
|
|
51 |
\[
|
|
52 |
(aaa)^*\cdot(a^* + (aa)^* + (aaa)^*)^*.
|
|
53 |
\]
|
|
54 |
These terms can be interpreted in terms of
|
|
55 |
their current input position's most recent
|
|
56 |
partial match.
|
|
57 |
Term \ref{term:1}, \ref{term:2}, and \ref{term:3}
|
|
58 |
denote the partial-match ending at the current position:
|
|
59 |
\mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$}
|
|
60 |
|
|
61 |
|
|
62 |
|
|
63 |
%----------------------------------------------------------------------------------------
|
|
64 |
% SECTION 1
|
|
65 |
%----------------------------------------------------------------------------------------
|
|
66 |
|
|
67 |
\section{Adding Support for the Negation Construct, and its Correctness Proof}
|
|
68 |
We now add support for the negation regular expression:
|
|
69 |
\[ r ::= \ZERO \mid \ONE
|
|
70 |
\mid c
|
|
71 |
\mid r_1 \cdot r_2
|
|
72 |
\mid r_1 + r_2
|
|
73 |
\mid r^*
|
|
74 |
\mid \sim r
|
|
75 |
\]
|
|
76 |
The $\textit{nullable}$ function's clause for it would be
|
|
77 |
\[
|
|
78 |
\textit{nullable}(~r) = \neg \nullable(r)
|
|
79 |
\]
|
|
80 |
The derivative would be
|
|
81 |
\[
|
|
82 |
~r \backslash c = ~ (r \backslash c)
|
|
83 |
\]
|
|
84 |
|
|
85 |
The most tricky part of lexing for the $~r$ regex
|
|
86 |
is creating a value for it.
|
|
87 |
For other regular expressions, the value aligns with the
|
|
88 |
structure of the regex:
|
|
89 |
\[
|
|
90 |
\vdash \Seq(\Char(a), \Char(b)) : a \cdot b
|
|
91 |
\]
|
|
92 |
But for the $~r$ regex, $s$ is a member of it if and only if
|
|
93 |
$s$ does not belong to $L(r)$.
|
|
94 |
That means when there
|
|
95 |
is a match for the not regex, it is not possible to generate how the string $s$ matched
|
|
96 |
with $r$.
|
|
97 |
What we can do is preserve the information of how $s$ was not matched by $r$,
|
|
98 |
and there are a number of options to do this.
|
|
99 |
|
|
100 |
We could give a partial value when there is a partial match for the regex inside
|
|
101 |
the $\mathbf{not}$ construct.
|
|
102 |
For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$,
|
|
103 |
A value for it could be
|
|
104 |
\[
|
|
105 |
\vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c)
|
|
106 |
\]
|
|
107 |
The above example demonstrates what value to construct
|
|
108 |
when the string $s$ is at most a real prefix
|
|
109 |
of the strings in $L(r)$. When $s$ instead is not a prefix of any strings
|
|
110 |
in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$
|
|
111 |
constructor.
|
|
112 |
|
|
113 |
Another option would be to either store the string $s$ that resulted in
|
|
114 |
a mis-match for $r$ or a dummy value as a placeholder:
|
|
115 |
\[
|
|
116 |
\vdash \textit{Not}(abcd) : ~(a^*)
|
|
117 |
\]
|
|
118 |
or
|
|
119 |
\[
|
|
120 |
\vdash \textit{Not}(\textit{Dummy}) : ~(a^*)
|
|
121 |
\]
|
|
122 |
We choose to implement this as it is most straightforward:
|
|
123 |
\[
|
|
124 |
\mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy})
|
|
125 |
\]
|
|
126 |
|
|
127 |
%----------------------------------------------------------------------------------------
|
|
128 |
% SECTION 2
|
|
129 |
%----------------------------------------------------------------------------------------
|
|
130 |
|
|
131 |
\section{Bounded Repetitions}
|
|
132 |
|
|
133 |
|