|
1 % Chapter Template |
|
2 |
|
3 \chapter{A Better Bound and Other Extensions} % Main chapter title |
|
4 |
|
5 \label{Cubic} %In Chapter 5\ref{Chapter5} we discuss stronger simplifications to improve the finite bound |
|
6 %in Chapter 4 to a polynomial one, and demonstrate how one can extend the |
|
7 %algorithm to include constructs such as bounded repetitions and negations. |
|
8 |
|
9 %---------------------------------------------------------------------------------------- |
|
10 % SECTION strongsimp |
|
11 %---------------------------------------------------------------------------------------- |
|
12 \section{A Stronger Version of Simplification} |
|
13 %TODO: search for isabelle proofs of algorithms that check equivalence |
|
14 Two alternative (distinct) sub-matches for the (sub-)string and (sub-)regex pair |
|
15 are always expressed in the "derivative regular expression" as two |
|
16 disjoint alternative terms at the current (sub-)regex level. The fact that they |
|
17 are parallel tells us when we retrieve the information from this (sub-)regex |
|
18 there will always be a choice of which alternative term to take. |
|
19 |
|
20 Here is an example for this. |
|
21 |
|
22 Assume we have the derivative regex |
|
23 \[(a^* + (aa)^* + (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^* + |
|
24 (a^* + a\cdot(aa)^* + (aa)\cdot (aaa)^*)\cdot(a^* + (aa)^* + (aaa)^*)^* |
|
25 \] |
|
26 |
|
27 occurring in an intermediate step in successive |
|
28 derivatives of an input string $\underbrace{aa\ldots a}_{\text{n \; a's}}$. |
|
29 In this expression, there will be 6 "parallel" terms if we break up regexes |
|
30 of shape $(a+b)\cdot c$ using the distributivity law (into $ac + bc$). |
|
31 \begin{align} |
|
32 (\nonumber \\ |
|
33 a^* + & \label{term:1} \\ |
|
34 (aa)^* + & \label{term:2} \\ |
|
35 (aaa)^* & \label{term:3} \\ |
|
36 & )\cdot(a^* + (aa)^* + (aaa)^*)^* \; + \; (\nonumber \\ |
|
37 a^* + & \label{term:4} \\ |
|
38 a \cdot (aa)^* + & \label{term:5} \\ |
|
39 aa \cdot (aaa)^* \label{term:6}\\ |
|
40 & )\cdot(a^* + (aa)^* + (aaa)^*)^*\nonumber |
|
41 \end{align} |
|
42 |
|
43 |
|
44 They have been labelled, and each label denotes |
|
45 one term, for example, \ref{term:1} denotes |
|
46 \[ |
|
47 a^*\cdot(a^* + (aa)^* + (aaa)^*)^* |
|
48 \] |
|
49 \noindent |
|
50 and \ref{term:3} denotes |
|
51 \[ |
|
52 (aaa)^*\cdot(a^* + (aa)^* + (aaa)^*)^*. |
|
53 \] |
|
54 These terms can be interpreted in terms of |
|
55 their current input position's most recent |
|
56 partial match. |
|
57 Term \ref{term:1}, \ref{term:2}, and \ref{term:3} |
|
58 denote the partial-match ending at the current position: |
|
59 \mybox{previous input $\ldots$}\mybox{$aaa$ }\mybox{rest of input $\ldots$} |
|
60 |
|
61 |
|
62 |
|
63 %---------------------------------------------------------------------------------------- |
|
64 % SECTION 1 |
|
65 %---------------------------------------------------------------------------------------- |
|
66 |
|
67 \section{Adding Support for the Negation Construct, and its Correctness Proof} |
|
68 We now add support for the negation regular expression: |
|
69 \[ r ::= \ZERO \mid \ONE |
|
70 \mid c |
|
71 \mid r_1 \cdot r_2 |
|
72 \mid r_1 + r_2 |
|
73 \mid r^* |
|
74 \mid \sim r |
|
75 \] |
|
76 The $\textit{nullable}$ function's clause for it would be |
|
77 \[ |
|
78 \textit{nullable}(~r) = \neg \nullable(r) |
|
79 \] |
|
80 The derivative would be |
|
81 \[ |
|
82 ~r \backslash c = ~ (r \backslash c) |
|
83 \] |
|
84 |
|
85 The most tricky part of lexing for the $~r$ regex |
|
86 is creating a value for it. |
|
87 For other regular expressions, the value aligns with the |
|
88 structure of the regex: |
|
89 \[ |
|
90 \vdash \Seq(\Char(a), \Char(b)) : a \cdot b |
|
91 \] |
|
92 But for the $~r$ regex, $s$ is a member of it if and only if |
|
93 $s$ does not belong to $L(r)$. |
|
94 That means when there |
|
95 is a match for the not regex, it is not possible to generate how the string $s$ matched |
|
96 with $r$. |
|
97 What we can do is preserve the information of how $s$ was not matched by $r$, |
|
98 and there are a number of options to do this. |
|
99 |
|
100 We could give a partial value when there is a partial match for the regex inside |
|
101 the $\mathbf{not}$ construct. |
|
102 For example, the string $ab$ is not in the language of $(a\cdot b) \cdot c$, |
|
103 A value for it could be |
|
104 \[ |
|
105 \vdash \textit{Not}(\Seq(\Char(a), \Char(b))) : ~((a \cdot b ) \cdot c) |
|
106 \] |
|
107 The above example demonstrates what value to construct |
|
108 when the string $s$ is at most a real prefix |
|
109 of the strings in $L(r)$. When $s$ instead is not a prefix of any strings |
|
110 in $L(r)$, it becomes unclear what to return as a value inside the $\textit{Not}$ |
|
111 constructor. |
|
112 |
|
113 Another option would be to either store the string $s$ that resulted in |
|
114 a mis-match for $r$ or a dummy value as a placeholder: |
|
115 \[ |
|
116 \vdash \textit{Not}(abcd) : ~(a^*) |
|
117 \] |
|
118 or |
|
119 \[ |
|
120 \vdash \textit{Not}(\textit{Dummy}) : ~(a^*) |
|
121 \] |
|
122 We choose to implement this as it is most straightforward: |
|
123 \[ |
|
124 \mkeps(~(r)) = \textit{if}(\nullable(r)) \; \textit{Error} \; \textit{else} \; \textit{Not}(\textit{Dummy}) |
|
125 \] |
|
126 |
|
127 %---------------------------------------------------------------------------------------- |
|
128 % SECTION 2 |
|
129 %---------------------------------------------------------------------------------------- |
|
130 |
|
131 \section{Bounded Repetitions} |
|
132 |
|
133 |