\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}
\usepackage{../data}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}


\section*{Handout 2 (Regular Expression Matching)}

This lecture is about implementing a regular expression matcher (the
plots on the right) that is more efficient than the matchers from the
regular expression libraries in Ruby, Python and Java (the plots on
the left). The first pair of plots shows the running time for the
regular expression $a^{?\{n\}} \cdot a^{\{n\}}$ and strings composed
of $n$ \pcode{a}s. The second pair of plots shows the running time
for the regular expression $(a^*)^*\cdot b$, again with strings composed
of $n$ \pcode{a}s (meaning this regular expression actually does not
match the strings). To see the substantial differences between the left
and right plots below, note the different scales of the $x$-axes.


\begin{center}
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\
\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.05,0.0)}},
  ylabel={\small time in secs},
  enlargelimits=false,
  xtick={0,5,...,30},
  xmax=33,
  ymax=35,
  ytick={0,5,...,30},
  scaled ticks=false,
  axis lines=left,
  width=5cm,
  height=5cm,
  legend entries={Python,Ruby},
  legend pos=north west,
  legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.1,0.05)}},
  ylabel={\small time in secs},
  enlargelimits=false,
  xtick={0,3000,...,9000},
  xmax=10000,
  ymax=35,
  ytick={0,5,...,30},
  scaled ticks=false,
  axis lines=left,
  width=6.5cm,
  height=5cm]
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}

\begin{center}
Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$\\
\begin{tabular}{@{}cc@{}}
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.05,0.0)}},
  ylabel={time in secs},
  enlargelimits=false,
  xtick={0,5,...,30},
  xmax=33,
  ymax=35,
  ytick={0,5,...,30},
  scaled ticks=false,
  axis lines=left,
  width=5cm,
  height=5cm,
  legend entries={Java},
  legend pos=north west,
  legend cell align=left]
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
  xlabel={$n$},
  x label style={at={(1.05,0.0)}},
  ylabel={time in secs},
  enlargelimits=false,
  ymax=35,
  ytick={0,5,...,30},
  axis lines=left,
  scaled ticks=false,
  width=6.5cm,
  height=5cm]
\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}\medskip

\noindent
We will use these regular expressions and strings
as running examples.

Having specified in the previous lecture what
problem our regular expression matcher is supposed to solve,
namely for any given regular expression $r$ and string $s$
answer \textit{true} if and only if

\[
s \in L(r)
\]

\noindent we can look at an algorithm to solve this problem. Clearly
we cannot use the function $L$ directly for this, because in general
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is).
In such cases there is no way we can implement an exhaustive test for
whether a string is a member of this set or not. In contrast, our
matching algorithm will operate only on the regular expression $r$ and
the string $s$, both of which are finite objects. Before we explain
the matching algorithm, however, let us have a closer look at what it
means when two regular expressions are equivalent.

\subsection*{Regular Expression Equivalences}

We already defined in Handout 1 what it means for two regular
expressions to be equivalent, namely if their meaning is the
same language:

\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent
It is relatively easy to verify that some concrete equivalences
hold, for example

\begin{center}
\begin{tabular}{rcl}
$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\
$a + a$ & $\equiv$ & $a$\\
$a + b$ & $\equiv$ & $b + a$\\
$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\
$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\
\end{tabular}
\end{center}

\noindent
but it is also easy to verify that the following regular expressions
are \emph{not} equivalent

\begin{center}
\begin{tabular}{rcl}
$a \cdot a$ & $\not\equiv$ & $a$\\
$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\
\end{tabular}
\end{center}

\noindent I leave it to you to verify these equivalences and
non-equivalences. It is also interesting to look at some
corner cases involving $\ONE$ and $\ZERO$:

\begin{center}
\begin{tabular}{rcl}
$a \cdot \ZERO$ & $\not\equiv$ & $a$\\
$a + \ONE$ & $\not\equiv$ & $a$\\
$\ONE$ & $\equiv$ & $\ZERO^*$\\
$\ONE^*$ & $\equiv$ & $\ONE$\\
$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\
\end{tabular}
\end{center}

\noindent Again I leave it to you to make sure you agree
with these equivalences and non-equivalences.
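
\noindent As a sketch of how such a check might go, consider the
corner case $\ONE \equiv \ZERO^*$. The calculation assumes the usual
definitions, which are not restated in this handout, namely
$L(r^*) \dn (L(r))^*$ and $A^* \dn \bigcup_{0\le n} A^n$ with
$A^0 \dn \{[]\}$ and $A^{n+1} \dn A \,@\, A^n$:

\begin{center}
\begin{tabular}{rcl}
$L(\ZERO^*)$ & $=$ & $(L(\ZERO))^*$\\
             & $=$ & $\{\}^0 \,\cup\, \{\}^1 \,\cup\, \{\}^2 \,\cup\, \ldots$\\
             & $=$ & $\{[]\} \,\cup\, \{\} \,\cup\, \{\} \,\cup\, \ldots$\\
             & $=$ & $\{[]\}$\\
             & $=$ & $L(\ONE)$\\
\end{tabular}
\end{center}

\noindent The same calculation also explains $\ZERO^* \not\equiv \ZERO$:
the language on the left contains the empty string, while $L(\ZERO)$
contains no string at all.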


For our matching algorithm, however, the following seven
equivalences will play an important role:

\begin{center}
\begin{tabular}{rcl}
$r + \ZERO$ & $\equiv$ & $r$\\
$\ZERO + r$ & $\equiv$ & $r$\\
$r \cdot \ONE$ & $\equiv$ & $r$\\
$\ONE \cdot r$ & $\equiv$ & $r$\\
$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\
$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\
$r + r$ & $\equiv$ & $r$
\end{tabular}
\end{center}

\noindent which always hold no matter what the regular expression $r$
looks like. The first two are easy to verify since $L(\ZERO)$ is the
empty set. The next two are also easy to verify since $L(\ONE) =
\{[]\}$ and appending the empty string to every string of another set
leaves the set unchanged. Be careful to fully comprehend the fifth and
sixth equivalence: if you concatenate two sets of strings and one is
the empty set, then the concatenation will also be the empty set. To
see this, check the definition of $\_ @ \_$ for sets. The last
equivalence is again trivial.
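
\noindent To spell out the fifth equivalence, here is the calculation
written as a sketch. It assumes the definition
$A \,@\, B \dn \{s_1 \,@\, s_2 \;|\; s_1 \in A \wedge s_2 \in B\}$
from Handout 1, which is not restated here:

\begin{center}
\begin{tabular}{rcl}
$L(r \cdot \ZERO)$ & $=$ & $L(r) \,@\, L(\ZERO)$\\
                   & $=$ & $L(r) \,@\, \{\}$\\
                   & $=$ & $\{s_1 \,@\, s_2 \;|\; s_1 \in L(r) \wedge s_2 \in \{\}\}$\\
                   & $=$ & $\{\}$\\
                   & $=$ & $L(\ZERO)$\\
\end{tabular}
\end{center}

\noindent The fourth step holds because there is no string $s_2$ in
the empty set, so the condition can never be satisfied. The sixth
equivalence follows in the same way with the roles of the two sets
swapped.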

What will be important later on is that we can orient these
equivalences and read them from left to right. In this way we
can view them as \emph{simplification rules}. Consider for
example the regular expression

\begin{equation}
(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)
\label{big}
\end{equation}

\noindent If we can find an equivalent regular expression that is
simpler (smaller for example), then this might make our
matching algorithm run faster. We can look for such a simpler regular
expression $r'$ because whether a string $s$ is in $L(r)$ or in
$L(r')$ with $r\equiv r'$ will always give the same answer. In the
example above you will see that the regular expression is equivalent
to just $r_1$. You can verify this by iteratively applying the
simplification rules from above:

\begin{center}
\begin{tabular}{ll}
 & $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot
 (\underline{r_4 \cdot \ZERO})$\smallskip\\
$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot
 \ZERO}$\smallskip\\
$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\
$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\
$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\
$\equiv$ & $r_1$
\end{tabular}
\end{center}

\noindent In each step, I underlined where a simplification
rule is applied. Our matching algorithm in the next section
will often generate such ``useless'' $\ONE$s and
$\ZERO$s, therefore simplifying them away will make the
algorithm quite a bit faster.
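
\noindent To make the idea of simplification rules concrete, here is
a minimal sketch in Scala of a function applying the seven rules from
left to right. The datatype \texttt{Rexp}, its constructor names and
the function name \texttt{simp} are assumptions made for this
illustration; nothing above fixes them. The code is meant to be pasted
into the Scala REPL or run as a script.

\begin{verbatim}
// Hypothetical datatype for regular expressions (names are assumptions)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class STAR(r: Rexp) extends Rexp             // r*

// Simplifies the subexpressions first and then applies the seven
// rules at the top. Nothing underneath a star is simplified.
def simp(r: Rexp): Rexp = r match {
  case ALT(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, s2) => s2                          // ZERO + r == r
    case (s1, ZERO) => s1                          // r + ZERO == r
    case (s1, s2) => if (s1 == s2) s1 else ALT(s1, s2)  // r + r == r
  }
  case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
    case (ZERO, _) => ZERO                         // ZERO . r == ZERO
    case (_, ZERO) => ZERO                         // r . ZERO == ZERO
    case (ONE, s2) => s2                           // ONE . r == r
    case (s1, ONE) => s1                           // r . ONE == r
    case (s1, s2) => SEQ(s1, s2)
  }
  case r => r
}

// The example from above, with characters standing in for r1..r4
val (r1, r2, r3, r4) = (CHAR('a'), CHAR('b'), CHAR('c'), CHAR('d'))
val big = ALT(SEQ(ALT(r1, ZERO), ONE),
              SEQ(ALT(ALT(ONE, r2), r3), SEQ(r4, ZERO)))
println(simp(big))   // prints CHAR(a)
\end{verbatim}

\noindent Simplifying the subexpressions before looking at the top
mirrors the inside-out order of the underlined steps above; other
strategies are possible.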

\subsection*{The Matching Algorithm}

The algorithm we will define below consists of two parts. One
is the function $\textit{nullable}$ which takes a regular expression as
argument and decides whether it can match the empty string
(this means it returns a boolean in Scala). This can be easily
defined recursively as follows:

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\
\end{tabular}
\end{center}

\noindent The idea behind this function is that the following
property holds:

\[
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
\]

\noindent Note that on the left-hand side of the if-and-only-if we
have a function we can implement; on the right we have its
specification (which we cannot implement in a programming
language).
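
\noindent As an illustration, the clauses of $\textit{nullable}$
translate almost one-to-one into Scala. This is only a sketch: the
datatype \texttt{Rexp} and its constructor names are assumptions made
here and are not prescribed by the definition above.

\begin{verbatim}
// Hypothetical datatype for regular expressions (names are assumptions)
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp   // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp   // r1 . r2
case class STAR(r: Rexp) extends Rexp             // r*

// One case per clause of the definition of nullable
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

println(nullable(STAR(CHAR('a'))))            // prints true
println(nullable(SEQ(ONE, CHAR('a'))))        // prints false
println(nullable(ALT(CHAR('a'), ONE)))        // prints true
\end{verbatim}

\noindent The function only inspects the finite structure of the
regular expression, so it always terminates, even when the language
of the regular expression is infinite.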

The other function of our matching algorithm calculates a
\emph{derivative} of a regular expression. This is a function
which takes a regular expression, say $r$, and a
character, say $c$, as arguments and returns a new regular
expression. Be warned that the intuition behind this function
is not so easy to grasp on first reading. Essentially this
function solves the following problem: if $r$ can match a
string of the form $c\!::\!s$, what does the regular
expression look like that can match just $s$? The definition
of this function is as follows:

\begin{center}
\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l}
$\textit{der}\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\
$\textit{der}\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\
$\textit{der}\, c\, (r_1 + r_2)$ & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\
$\textit{der}\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\
 & & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\
 & & else $(\textit{der}\, c\, r_1) \cdot r_2$\\
$\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$
\end{tabular}
\end{center}

\noindent The first two clauses can be rationalised as
follows: recall that $\textit{der}$ should calculate a regular
expression such that, whenever the ``input'' regular expression can
match a string of the form $c\!::\!s$, the result can match
$s$. Since neither $\ZERO$ nor $\ONE$
can match a string of the form $c\!::\!s$, we return
$\ZERO$. In the third case we have to make a
case-distinction: in case the regular expression is $c$, then
clearly it can recognise a string of the form $c\!::\!s$, just
that $s$ is the empty string. Therefore we return the
$\ONE$-regular expression. In the other case we again
return $\ZERO$ since no string of the form $c\!::\!s$ can be
matched. Next come the recursive cases, which are a bit more
involved. Fortunately, the $+$-case is still relatively
straightforward: all strings of the form $c\!::\!s$ are either
matched by the regular expression $r_1$ or $r_2$. So we just
have to recursively call $\textit{der}$ with these two regular
expressions and compose the results again with $+$. Makes
sense?
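
\noindent As a small worked example of the clauses explained so far,
consider the derivative of $a + b$ with respect to the character $a$
(the choice of this regular expression is just for illustration):

\begin{center}
\begin{tabular}{rcl}
$\textit{der}\, a\, (a + b)$ & $=$ & $\textit{der}\, a\, a + \textit{der}\, a\, b$\\
                             & $=$ & $\ONE + \ZERO$\\
\end{tabular}
\end{center}

\noindent The only string of the form $a\!::\!s$ that $a + b$ can
match is the one where $s$ is the empty string, and $\ONE + \ZERO$
indeed matches exactly the empty string.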

The $\cdot$-case is more complicated: if $r_1\cdot r_2$
matches a string of the form $c\!::\!s$, then the first part
must be matched by $r_1$. Consequently, it makes sense to
construct the regular expression for $s$ by calling $\textit{der}$ with
$r_1$ and ``appending'' $r_2$. There is however one exception
to this simple rule: if $r_1$ can match the empty string, then
all of $c\!::\!s$ may be matched by $r_2$. So in case $r_1$ is
nullable (that is, it can match the empty string) we have to allow
the choice $\textit{der}\,c\,r_2$ for calculating the regular
expression that can match $s$. Therefore we have to add the
regular expression $\textit{der}\,c\,r_2$ to the result. The $*$-case
is again simple: if $r^*$ matches a string of the form
$c\!::\!s$, then the first part must be ``matched'' by a
single copy of $r$. Therefore we recursively call $\textit{der}\,c\,r$
and ``append'' $r^*$ in order to match the rest of $s$. Still
makes sense?
|
125
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
345 |
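
As a small worked example of the sequence rule (only a sketch, assuming the
clauses of $\textit{der}$ are exactly as described above), take the regular
expression $(a + \ONE)\cdot a$. Since $a + \ONE$ is nullable, the
derivative consists of two alternatives:

\begin{center}
\begin{tabular}{lcl}
$\textit{der}\,a\,((a + \ONE)\cdot a)$
  & $=$ & $((\textit{der}\,a\,(a + \ONE))\cdot a) + \textit{der}\,a\,a$\\
  & $=$ & $((\textit{der}\,a\,a + \textit{der}\,a\,\ONE)\cdot a) + \ONE$\\
  & $=$ & $((\ONE + \ZERO)\cdot a) + \ONE$
\end{tabular}
\end{center}

\noindent The left alternative covers the case where the first $a$ is
consumed by $a + \ONE$, so that another $a$ remains to be matched. The right
alternative covers the case where $a + \ONE$ matches the empty string and the
$a$ is consumed by the second component. The original regular expression
matches the strings $aa$ and $a$; the result matches $a$ and the empty
string, which are those two strings with the leading $a$ stripped off.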
|
If all this did not make sense yet, here is another way to rationalise
the definition of $\textit{der}$ by considering the following operation
on sets:

\begin{equation}\label{Der}
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
\end{equation}

\noindent This operation essentially transforms a set of
strings $A$ by filtering out all strings that do not start
with $c$ and then stripping off the $c$ from all the remaining
strings. For example, suppose $A = \{f\!oo, bar, f\!rak\}$; then

\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad
   \textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad
   \textit{Der}\,a\,A = \{\}
\]
|
\noindent
Note that in the last case the result of $\textit{Der}$ is the empty set,
because no string in $A$
starts with $a$. With this operation we can state the following
property about $\textit{der}$:

\[
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]

\noindent
This property clarifies which regular expression $\textit{der}$ calculates:
take the set of strings that $r$ can match (that is $L(r)$),
filter out all strings not starting with $c$ and strip off the $c$
from the remaining strings. This is exactly the language that
$\textit{der}\,c\,r$ can match.
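
The operation $\textit{Der}$ is easy to try out on finite sets of strings.
The following Scala lines are only a sketch: the name \texttt{Der} and the
representation of a language as a \texttt{Set[String]} are choices made here
for illustration. Languages of regular expressions are in general infinite,
so this is not how the matcher itself works.

\begin{lstlisting}[numbers=none]
// Der c A: keep the strings starting with c, then strip off that c
def Der(c: Char, A: Set[String]): Set[String] =
  A.filter(s => s.nonEmpty && s.head == c).map(_.tail)

val A = Set("foo", "bar", "frak")

Der('f', A)   // == Set("oo", "rak")
Der('b', A)   // == Set("ar")
Der('a', A)   // == Set()
\end{lstlisting}

\noindent The three calls reproduce the example sets given for $A$ above.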
|
If we want to find out whether the string $abc$ is matched by
the regular expression $r_1$ then we can iteratively apply $\textit{der}$
as follows

\begin{center}
\begin{tabular}{rll}
Input: $r_1$, $abc$\medskip\\
Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\
Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\
Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,c\,r_3)$\smallskip\\
Step 4: & the string is exhausted: & $(\textit{nullable}(r_4))$\\
        & test whether $r_4$ can recognise the\\
        & empty string\smallskip\\
Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\
\end{tabular}
\end{center}
|
\noindent Again the operation $\textit{Der}$ might help to rationalise
this algorithm. We want to know whether $abc \in L(r_1)$. We
do not know yet, but let us assume it is. Then $\textit{Der}\,a\,(L(r_1))$
builds the set where all the strings not starting with $a$ are
filtered out. Of the remaining strings, the $a$ is stripped
off. So we should still have $bc$ in the set.
Then we continue with filtering out all strings not
starting with $b$ and stripping off the $b$ from the remaining
strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$.
Finally we filter out all strings not starting with $c$ and
strip off $c$ from the remaining strings. This is
$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the
original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$
must contain the empty string. If it does not, then $abc$ was not in the
language we started with.
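
To see the three steps on a concrete set, suppose (purely for illustration)
that $L(r_1)$ were the finite set $\{abc, abd, bc\}$. Then

\[
\begin{array}{lcl}
\textit{Der}\,a\,\{abc, abd, bc\} & = & \{bc, bd\}\\
\textit{Der}\,b\,\{bc, bd\} & = & \{c, d\}\\
\textit{Der}\,c\,\{c, d\} & = & \{[]\}
\end{array}
\]

\noindent The final set contains the empty string $[]$, so $abc$ was in the
set we started with. For the string $bcd$ the same procedure gives
$\{c\}$, then $\{[]\}$ and finally $\{\}$. This last set does not contain the
empty string, so $bcd$ was not in the original set.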
|
Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works
similarly, just using regular expressions instead of sets. In order to
define our algorithm we need to extend the notion of derivatives from single
characters to strings. This can be done using the following
function, taking a string and a regular expression as input and
producing a regular expression as output.

\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
  $\textit{ders}\, []\, r$     & $\dn$ & $r$ & \\
  $\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\
\end{tabular}
\end{center}
|
\noindent This function iterates $\textit{der}$ taking one character at
a time from the original string until it is exhausted.
Having $\textit{ders}$ in place, we can finally define our matching
algorithm:

\[
\textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r)
\]

\noindent
and we can claim that

\[
\textit{matches}\,s\,r\quad\text{if and only if}\quad s\in L(r)
\]

\noindent holds, which means our algorithm satisfies the
specification. Of course we can claim many things\ldots
whether the claim holds any water is a different question,
which for example is the point of the Strand-2 Coursework.
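
As an illustration (a sketch that assumes the clauses for
$\textit{nullable}$ and $\textit{der}$ are as described earlier in this
handout), here is the calculation for the regular expression $a\cdot b$ and
the string $ab$:

\begin{center}
\begin{tabular}{lcl}
$\textit{ders}\,(a\!::\!b\!::\![])\,(a\cdot b)$
  & $=$ & $\textit{ders}\,(b\!::\![])\,(\textit{der}\,a\,(a\cdot b))$\\
  & $=$ & $\textit{ders}\,(b\!::\![])\,(\ONE\cdot b)$\\
  & $=$ & $\textit{ders}\,[]\,(\textit{der}\,b\,(\ONE\cdot b))$\\
  & $=$ & $\textit{ders}\,[]\,((\ZERO\cdot b) + \ONE)$\\
  & $=$ & $(\ZERO\cdot b) + \ONE$
\end{tabular}
\end{center}

\noindent In the second line $a$ is not nullable, so there is only one
alternative. In the fourth line $\ONE$ is nullable, so the derivative of $b$
is added as a second alternative. Since
$\textit{nullable}((\ZERO\cdot b) + \ONE)$ holds, $\textit{matches}$ returns
true for $ab$ and $a\cdot b$, as it should.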
|
This algorithm was introduced by Janusz Brzozowski in 1964, but
has become more widely known only in the last 10 or so years. Its
main attractions are simplicity and speed, as well as
being easily extendable to other regular expressions such as
$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is the subject of
Strand-1 Coursework 1).
|
\subsection*{The Matching Algorithm in Scala}

Another attraction of the algorithm is that it can be easily
implemented in a functional programming language, like Scala.
Given the implementation of regular expressions in Scala shown
in the first lecture and handout, the functions and subfunctions
for \pcode{matches} are shown in Figure~\ref{scala1}.
|
\begin{figure}[p]
\lstinputlisting{../progs/app5.scala}
\caption{Scala implementation of the \textit{nullable} and
  derivative functions. These functions are easy to
  implement in functional languages, because their built-in pattern
  matching and recursion allow us to mimic the mathematical
  definitions very closely.\label{scala1}}
\end{figure}
|
%Remember our second example involving the regular expression
%$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s.
%Java needed around 30 seconds to find this out a string with $n=28$.
%It seems our algorithm is doing rather well in comparison:
%
%\begin{center}
%\begin{tikzpicture}
%\begin{axis}[
%    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
%    xlabel={$n$},
%    x label style={at={(1.05,0.0)}},
%    ylabel={time in secs},
%    enlargelimits=false,
%    xtick={0,1000,...,6500},
%    xmax=6800,
%    ytick={0,5,...,30},
%    ymax=34,
%    scaled ticks=false,
%    axis lines=left,
%    width=8cm,
%    height=4.5cm,
%    legend entries={Java,Scala V1},
%    legend pos=north east,
%    legend cell align=left]
%\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
%\addplot[red,mark=triangle*,mark options={fill=white}] table {re1a.data};
%\end{axis}
%\end{tikzpicture}
%\end{center}
%
%\noindent
%This is not an error: it hardly takes more than half a second for
%strings up to the length of 6500. After that we receive a
%StackOverflow exception, but still\ldots
|
For running the algorithm with our first example, the evil
regular expression $a^?{}^{\{n\}}a^{\{n\}}$, we need to implement
the optional regular expression and the exactly $n$-times
regular expression. This can be done with the translations

\lstinputlisting[numbers=none]{../progs/app51.scala}

\noindent Running the matcher with this example, we find it is
slightly worse than the matchers in Ruby and Python.
Ooops\ldots
|
\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.05,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,5,...,30},
    xmax=32,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=6cm,
    height=5cm,
    legend entries={Python,Ruby,Scala V1},
    legend pos=outer north east,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data};
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data};
\end{axis}
\end{tikzpicture}
\end{center}
|
\noindent Analysing this failure we notice that for
$a^{\{n\}}$ we generate quite big regular expressions:

\begin{center}
\begin{tabular}{rl}
1:  & $a$\\
2:  & $a\cdot a$\\
3:  & $a\cdot a\cdot a$\\
& \ldots\\
13:  & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\
& \ldots
\end{tabular}
\end{center}
|
\noindent Our algorithm traverses such regular expressions at
least once every time a derivative is calculated. So having
large regular expressions will cause problems. This problem
is aggravated by $a^?$ being represented as $a + \ONE$.
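
To get a feeling for the sizes involved, one can count the nodes of the
expanded regular expressions. The sketch below is self-contained and rests
on a few assumptions: the constructor names follow a common Scala encoding of
regular expressions, the helper \texttt{size} is introduced here only for
illustration, and \texttt{OPT} and \texttt{NTIMES} are one plausible way of
writing the translations (the code in the listing above may differ in
detail).

\begin{lstlisting}[numbers=none]
abstract class Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp

// assumed translations: r? as r + 1, and r{n} as n copies of r
def OPT(r: Rexp): Rexp = ALT(r, ONE)
def NTIMES(r: Rexp, n: Int): Rexp = n match {
  case 0 => ONE
  case 1 => r
  case _ => SEQ(r, NTIMES(r, n - 1))
}

// number of nodes in the tree representing a regular expression
def size(r: Rexp): Int = r match {
  case ALT(r1, r2) => 1 + size(r1) + size(r2)
  case SEQ(r1, r2) => 1 + size(r1) + size(r2)
  case _ => 1
}

// the evil regular expression a?{n} . a{n}
def EVIL(n: Int): Rexp =
  SEQ(NTIMES(OPT(CHAR('a')), n), NTIMES(CHAR('a'), n))

size(NTIMES(CHAR('a'), 13))   // == 25
size(EVIL(13))                // == 77
\end{lstlisting}

\noindent Under these assumptions the expanded $a^{\{n\}}$ has $2n - 1$
nodes and the whole evil regular expression has $6n - 1$ nodes, so the tree
that every derivative step has to traverse grows with $n$ even before any
derivative is taken.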
|
We can however fix this by having an explicit constructor for
$r^{\{n\}}$. In Scala we would introduce a constructor like

\begin{center}
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp}
\end{center}

\noindent With this fix we have a constant ``size'' regular
expression for our running example no matter how large $n$ is.
This means we also have to add cases for \pcode{NTIMES} in the
functions $\textit{nullable}$ and $\textit{der}$. Does the change have any
effect?
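
Before looking at the timings, here is a sketch of what the additional cases
might look like. It is a self-contained reconstruction rather than the code
used for the measurements: the constructor names other than
\texttt{NTIMES}, the helper \texttt{EVIL} and the argument order of
\texttt{matches} (string first, as in the mathematical definition) are
assumptions made for this example.

\begin{lstlisting}[numbers=none]
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp
case class NTIMES(r: Rexp, n: Int) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
  // r{0} matches the empty string; otherwise r must be nullable
  case NTIMES(r1, n) => if (n == 0) true else nullable(r1)
}

def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
  // one copy of r consumes c, the other n - 1 copies stay folded
  case NTIMES(r1, n) =>
    if (n == 0) ZERO else SEQ(der(c, r1), NTIMES(r1, n - 1))
}

def ders(s: List[Char], r: Rexp): Rexp = s match {
  case Nil => r
  case c :: cs => ders(cs, der(c, r))
}

def matches(s: String, r: Rexp): Boolean =
  nullable(ders(s.toList, r))

// a?{n} . a{n}, with a? still encoded as a + 1
def EVIL(n: Int): Rexp =
  SEQ(NTIMES(ALT(CHAR('a'), ONE), n), NTIMES(CHAR('a'), n))

matches("a" * 20, EVIL(20))   // true
matches("a" * 19, EVIL(20))   // false
\end{lstlisting}

\noindent The point of the \texttt{NTIMES} clause in \texttt{der} is that the
remaining $n - 1$ copies are never unfolded into a long sequence; only the
counter is decreased.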
|
\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.01,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,100,...,1000},
    xmax=1100,
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=10cm,
    height=5cm,
    legend entries={Python,Ruby,Scala V1,Scala V2},
    legend pos=outer north east,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data};
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data};
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data};
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\end{axis}
\end{tikzpicture}
\end{center}
|
\noindent Now we are talking business! The modified matcher
can within 30 seconds handle regular expressions up to
$n = 950$ before a StackOverflow is raised. Recall that Python and Ruby
(and our first version, Scala V1) could only handle $n = 27$ or so in 30
seconds. There is no change for our second example
$(a^*)^* \cdot b$, so this is still good.
|
The moral is that our algorithm is rather sensitive to the
size of regular expressions it needs to handle. This is of
course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently
need to traverse the whole regular expression. There seems to be,
however, one more issue for making the algorithm run faster.
The derivative function often produces ``useless''
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a
\cdot b) + b)^*$ and the following three derivatives

\begin{center}
\begin{tabular}{l}
$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\
$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\
$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$
\end{tabular}
\end{center}
|
\noindent
If we simplify them according to the simple rules from the
beginning, we can replace the right-hand sides by the
smaller equivalent regular expressions

\begin{center}
\begin{tabular}{l}
$\textit{der}\,a\,r \equiv b \cdot r$\\
$\textit{der}\,b\,r \equiv r$\\
$\textit{der}\,c\,r \equiv \ZERO$
\end{tabular}
\end{center}
|
\noindent I leave it to you to contemplate whether such a
simplification can have any impact on the correctness of our
algorithm (will it change any answers?). Figure~\ref{scala2}
gives a simplification function that recursively traverses a
regular expression and simplifies it according to the rules
given at the beginning. There are only rules for $+$, $\cdot$
and $n$-times (the latter because we added it in the second
version of our matcher). There is no rule for a star, because
empirical data and also a little thought showed that
simplifying under a star is a waste of computation time. The
simplification function will be called after every derivative step.
This additional step removes all the ``junk'' the derivative
function introduced. Does this improve the speed? You bet!!
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
654 |
\begin{figure}[p]
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
655 |
\lstinputlisting{../progs/app6.scala}
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
656 |
\caption{The simplification function and modified
|
325
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
657 |
\texttt{ders}-function; this function now
|
333
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
658 |
calls \texttt{der} first, but then simplifies
|
343
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
659 |
the resulting derivative regular expressions before
|
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
660 |
building the next derivative, see
|
399
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
661 |
Line~\ref{simpline}.\label{scala2}}
|
262
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
|
662 |
\end{figure}
|
\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.04,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xtick={0,3000,...,9000},
    xmax=10000,
    ytick={0,5,...,30},
    ymax=32,
    scaled ticks=false,
    axis lines=left,
    width=9cm,
    height=5cm,
    legend entries={Scala V2,Scala V3},
    legend pos=outer north east,
    legend cell align=left]
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\end{axis}
\end{tikzpicture}
\end{center}
\noindent
To recap, Python and Ruby needed approximately 30 seconds to match
a string of 28 \texttt{a}s against the regular expression $a^{?\{n\}} \cdot a^{\{n\}}$.
We need a third of this time to do the same with strings up to 12,000 \texttt{a}s.
Similarly, Java needed 30 seconds to find out that the regular expression
$(a^*)^* \cdot b$ does not match the string of 28 \texttt{a}s. We can do
the same in approximately 5 seconds for strings of 6,000,000 \texttt{a}s:

\begin{center}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.09,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xmax=7700000,
    ytick={0,5,...,30},
    ymax=32,
    %scaled ticks=false,
    axis lines=left,
    width=9cm,
    height=5cm,
    legend entries={Scala V2, Scala V3},
    legend pos=outer north east,
    legend cell align=left]
\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data};
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}
\subsection*{Epilogue}

(23/Aug/2016) I recently found another place where this algorithm can be
sped up (this idea is not integrated with what is coming next,
but I present it nonetheless). The idea is to define \texttt{ders}
so that it does not iterate the derivative character-by-character, but
works in bigger chunks. The resulting code for \texttt{ders2} looks as
follows:

\lstinputlisting[numbers=none]{../progs/app52.scala}

\noindent
I have not fully understood why this version is much faster,
but it seems it is a combination of the clauses for \texttt{ALT}
and \texttt{SEQ}. In the latter case we call \texttt{der} with
a single character and this potentially produces an alternative.
The derivative of such an alternative can then be more efficiently
calculated by \texttt{ders2} since it pushes a whole string
under an \texttt{ALT}. The measurements show that in the second case
$(a^*)^* \cdot b$ both versions perform pretty much the same, but in the
first case $a^{?\{n\}} \cdot a^{\{n\}}$ the improvement gives
another factor of 100 speedup. Nice!
\begin{center}
\begin{tabular}{cc}
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.04,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xmax=7100000,
    ytick={0,5,...,30},
    ymax=33,
    %scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=5cm,
    legend entries={Scala V3, Scala V4},
    legend style={at={(0.1,-0.2)},anchor=north}]
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data};
\addplot[purple,mark=square*,mark options={fill=white}] table {re4.data};
\end{axis}
\end{tikzpicture}
&
\begin{tikzpicture}
\begin{axis}[
    title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$},
    xlabel={$n$},
    x label style={at={(1.09,0.0)}},
    ylabel={time in secs},
    enlargelimits=false,
    xmax=8100000,
    ytick={0,5,...,30},
    ymax=33,
    %scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=5cm,
    legend entries={Scala V3, Scala V4},
    legend style={at={(0.1,-0.2)},anchor=north}]
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data};
\end{axis}
\end{tikzpicture}
\end{tabular}
\end{center}
\section*{Proofs}

You might not like doing proofs. But they serve a very
important purpose in Computer Science: How can we be sure that
our algorithm matches its specification? We can try to test
the algorithm, but that often overlooks corner cases and
exhaustive testing is impossible (since there are infinitely
many inputs). Proofs allow us to ensure that an algorithm
really meets its specification.

For the programs we look at in this module, the proofs will
mostly be by some form of induction. Remember that regular
expressions are defined as

\begin{center}
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
  $r$ & $::=$   & $\ZERO$         & null language\\
      & $\mid$  & $\ONE$          & empty string / \texttt{""} / []\\
      & $\mid$  & $c$             & single character\\
      & $\mid$  & $r_1 + r_2$     & alternative / choice\\
      & $\mid$  & $r_1 \cdot r_2$ & sequence\\
      & $\mid$  & $r^*$           & star (zero or more)\\
\end{tabular}
\end{center}
\noindent If you want to show a property $P(r)$ for all
regular expressions $r$, then you have to follow essentially
the following recipe:

\begin{itemize}
\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$
  (these are the base cases).
\item $P$ has to hold for $r_1 + r_2$ under the assumption
  that $P$ already holds for $r_1$ and $r_2$.
\item $P$ has to hold for $r_1 \cdot r_2$ under the
  assumption that $P$ already holds for $r_1$ and $r_2$.
\item $P$ has to hold for $r^*$ under the assumption
  that $P$ already holds for $r$.
\end{itemize}
\noindent
A simple proof is for example showing the following
property:

\begin{equation}
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r)
\label{nullableprop}
\end{equation}

\noindent
Let us say that this property is $P(r)$, then the first case
we need to check is whether $P(\ZERO)$ holds (see recipe
above). So we have to show that

\[
\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\;
[]\in L(\ZERO)
\]
\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of
the function $\textit{nullable}$ always $\textit{false}$. We also have
that $L(\ZERO)$ is by definition $\{\}$. It is
impossible that the empty string $[]$ is in the empty set.
Therefore also the right-hand side is false. Consequently we
verified this case: both sides are false. We would still need
to do this for $P(\ONE)$ and $P(c)$. I leave this to
you to verify.

Next we need to check the inductive cases, for example
$P(r_1 + r_2)$, which is

\begin{equation}
\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\;
[]\in L(r_1 + r_2)
\label{propalt}
\end{equation}

\noindent The difference from the base cases is that in this
case we can already assume we proved

\begin{center}
\begin{tabular}{l}
$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\
$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\
\end{tabular}
\end{center}

\noindent These are the induction hypotheses. To check this
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by
definition is

\[
\textit{nullable}(r_1) \vee \textit{nullable}(r_2)
\]
\noindent Using the two induction hypotheses from above,
we can transform this into

\[
[] \in L(r_1) \vee []\in L(r_2)
\]

\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by
the equivalent $[] \in L(\ldots)$ from the induction
hypotheses. A bit of thinking convinces you that
$[] \in L(r_1) \vee []\in L(r_2)$ holds exactly when the empty string
is in the union $L(r_1)\cup L(r_2)$, that is

\[
[] \in L(r_1)\cup L(r_2)
\]

\noindent but this is by definition of $L$ exactly $[] \in
L(r_1 + r_2)$, which we needed to establish according to
\eqref{propalt}. What we have shown is that starting from
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations
to end up with $[] \in L(r_1 + r_2)$. Consequently we have
established that $P(r_1 + r_2)$ holds.

In order to complete the proof we would now need to look
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you
check the details.

You might have to do induction proofs over strings.
That means you want to establish a property $P(s)$ for all
strings $s$. For this remember strings are lists of
characters. These lists can be either the empty list or a
list of the form $c::s$. If you want to perform an induction
proof for strings you need to consider the cases

\begin{itemize}
\item $P$ has to hold for $[]$ (this is the base case).
\item $P$ has to hold for $c::s$ under the assumption
  that $P$ already holds for $s$.
\end{itemize}
\noindent
Given this recipe, I let you show

\begin{equation}
\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r)
\label{dersprop}
\end{equation}

\noindent by induction on $s$. Recall $\textit{Der}$ is defined for
characters (see \eqref{Der}); $\textit{Ders}$ is similar, but for strings:

\[
\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\}
\]
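
\noindent As a small illustration of this definition (the language $A$
below is made up for the example and does not come from the matcher),
take $A = \{abc,\, abd,\, b\}$. Then

\[
\begin{array}{lcl}
\textit{Ders}\,[]\,A & = & A\\
\textit{Ders}\,a\,A  & = & \{bc,\, bd\}\\
\textit{Ders}\,ab\,A & = & \{c,\, d\}\\
\textit{Ders}\,ba\,A & = & \{\}
\end{array}
\]

\noindent A fact that is likely to be useful in the induction step is
that a string derivative can be peeled off one character at a time.
Assuming $\textit{Der}\,c\,A$ is the set $\{s\,|\,c\!::\!s \in A\}$,
as in \eqref{Der}, this follows by unfolding both definitions:

\[
\begin{array}{lcl}
\textit{Ders}\,(c\!::\!s)\,A
  & = & \{s'\,|\,(c\!::\!s) @ s' \in A\}\\
  & = & \{s'\,|\,s @ s' \in \textit{Der}\,c\,A\}\\
  & = & \textit{Ders}\,s\,(\textit{Der}\,c\,A)
\end{array}
\]
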
\noindent In this proof you can assume the following property
for $\textit{der}$ and $\textit{Der}$ has already been proved, that is you can
assume

\[
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r))
\]
\noindent holds (this would be of course a property that
needs to be proved in a side-lemma by induction on $r$).

To sum up, using reasoning like the one shown above allows us
to show the correctness of our algorithm. To see this,
start from the specification

\[
s \in L(r)
\]

\noindent That is the problem we want to solve. Thinking a
little, you will see that this problem is equivalent to the
following problem

\begin{equation}
[] \in \textit{Ders}\,s\,(L(r))
\label{dersstep}
\end{equation}

\noindent But we know from \eqref{dersprop} that
the $\textit{Ders}$ can be replaced by $L(\textit{ders}\ldots)$. That means
\eqref{dersstep} is equivalent to

\begin{equation}
[] \in L(\textit{ders}\,s\,r)
\label{prefinalstep}
\end{equation}

\noindent We have also shown that testing whether the empty
string is in the language of a regular expression is equivalent to
evaluating the $\textit{nullable}$
function; see \eqref{nullableprop}. That means
\eqref{prefinalstep} is equivalent to

\[
\textit{nullable}(\textit{ders}\,s\,r)
\]

\noindent But this is just the definition of $\textit{matches}$

\[
\textit{matches}\,s\,r \dn \textit{nullable}(\textit{ders}\,s\,r)
\]

\noindent In effect we have shown

\[
\textit{matches}\,s\,r\;\;\text{if and only if}\;\;
s\in L(r)
\]

\noindent which is the property we set out to prove:
our algorithm meets its specification. Doing
so requires a few induction proofs about strings and
regular expressions. Following the recipes is already a big
step in performing these proofs.
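
To see the chain of equivalences at work on a concrete
input, here is a small worked example. It assumes the unsimplified
definitions of $\textit{nullable}$, $\textit{der}$ and $\textit{ders}$
from earlier in this handout; the regular expression $a \cdot b$ and
the string $ab$ are chosen only for illustration.

\begin{center}
\begin{tabular}{lcl}
$\textit{ders}\,ab\,(a \cdot b)$
  & $=$ & $\textit{der}\,b\,(\textit{der}\,a\,(a \cdot b))$\\
  & $=$ & $\textit{der}\,b\,(\ONE \cdot b)$\\
  & $=$ & $(\ZERO \cdot b) + \ONE$\\[2mm]
$\textit{nullable}((\ZERO \cdot b) + \ONE)$
  & $=$ & $\textit{nullable}(\ZERO \cdot b) \vee \textit{nullable}(\ONE)$\\
  & $=$ & $\textit{false} \vee \textit{true}$\\
  & $=$ & $\textit{true}$
\end{tabular}
\end{center}

\noindent So $\textit{matches}\,ab\,(a \cdot b)$ holds, which agrees with
the specification since $ab \in L(a \cdot b)$.
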
\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: