author | Christian Urban <christian.urban@kcl.ac.uk> |
Sat, 23 Sep 2023 22:26:52 +0100 | |
changeset 926 | 42ecc3186944 |
parent 874 | ffe02fd574a5 |
child 932 | 5678414a3898 |
permissions | -rw-r--r-- |
646 | 1 |
% !TEX program = xelatex |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
2 |
\documentclass{article} |
251
5b5a68df6d16
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
3 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
140
diff
changeset
|
4 |
\usepackage{../langs} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
5 |
\usepackage{../graphics} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
6 |
\usepackage{../data} |
480 | 7 |
|
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
8 |
|
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
9 |
\begin{document} |
727 | 10 |
\fnote{\copyright{} Christian Urban, King's College London, |
926 | 11 |
2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
12 |
|
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
13 |
|
272
1446bc47a294
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
268
diff
changeset
|
14 |
\section*{Handout 2 (Regular Expression Matching)} |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
15 |
|
757 | 16 |
%\noindent |
17 |
%{\bf Checklist} |
|
18 |
% |
|
19 |
%\begin{itemize} |
|
20 |
% \item You have understood the derivative-based matching algorithm. |
|
21 |
% \item You know how the derivative is related to the meaning of regular |
|
22 |
% expressions. |
|
23 |
% \item You can extend the algorithm to non-basic regular expressions. |
|
24 |
%\end{itemize}\bigskip\bigskip\bigskip |
|
727 | 25 |
|
26 |
\noindent |
|
412 | 27 |
This lecture is about implementing a more efficient regular expression |
478 | 28 |
matcher (the plots on the right below)---more efficient than the |
926 | 29 |
matchers from regular expression libraries in Ruby, Python, JavaScript, Swift |
831 | 30 |
and Java (the plots on the left).\footnote{Have a look at KEATS: students |
926 | 31 |
last year contributed also data for the Dart language.}\medskip |
831 | 32 |
|
33 |
\noindent |
|
34 |
To start with let us look more closely at the experimental data: The |
|
35 |
first pair of plots shows the running time for the regular expression |
|
36 |
$(a^*)^*\cdot b$ and strings composed of $n$ \pcode{a}s, like |
|
727 | 37 |
\[ |
764 | 38 |
\underbrace{\pcode{a}\ldots\pcode{a}}_{n} |
727 | 39 |
\] |
40 |
||
41 |
\noindent |
|
42 |
This means the regular expression actually does not match the strings. |
|
43 |
The second pair of plots shows the running time for the regular |
|
44 |
expressions of the form $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and corresponding |
|
45 |
strings composed of $n$ \pcode{a}s (this time the regular expressions |
|
46 |
match the strings). To see the substantial differences in the left and |
|
618 | 47 |
right plots below, note the different scales of the $x$-axes. |
478 | 48 |
|
510 | 49 |
|
478 | 50 |
\begin{center} |
51 |
Graphs: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$ |
|
52 |
\begin{tabular}{@{}cc@{}} |
|
550 | 53 |
\begin{tikzpicture}[baseline=(current bounding box.north)] |
54 |
\begin{axis}[ |
|
478 | 55 |
xlabel={$n$}, |
56 |
x label style={at={(1.05,0.0)}}, |
|
57 |
ylabel={time in secs}, |
|
58 |
enlargelimits=false, |
|
59 |
xtick={0,5,...,30}, |
|
60 |
xmax=33, |
|
61 |
ymax=35, |
|
62 |
ytick={0,5,...,30}, |
|
63 |
scaled ticks=false, |
|
64 |
axis lines=left, |
|
65 |
width=5cm, |
|
926 | 66 |
height=4.5cm, |
67 |
legend entries={Java 8, Python, JavaScript, Swift}, |
|
478 | 68 |
legend pos=north west, |
69 |
legend cell align=left] |
|
70 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data}; |
|
71 |
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
|
618 | 72 |
\addplot[red,mark=*, mark options={fill=white}] table {re-js.data}; |
926 | 73 |
\addplot[magenta,mark=*, mark options={fill=white}] table {re-swift.data}; |
478 | 74 |
\end{axis} |
75 |
\end{tikzpicture} |
|
76 |
& |
|
550 | 77 |
\begin{tikzpicture}[baseline=(current bounding box.north)] |
478 | 78 |
\begin{axis}[ |
79 |
xlabel={$n$}, |
|
488 | 80 |
x label style={at={(1.1,0.0)}}, |
81 |
%%xtick={0,1000000,...,5000000}, |
|
478 | 82 |
ylabel={time in secs}, |
83 |
enlargelimits=false, |
|
84 |
ymax=35, |
|
85 |
ytick={0,5,...,30}, |
|
86 |
axis lines=left, |
|
488 | 87 |
%scaled ticks=false, |
478 | 88 |
width=6.5cm, |
926 | 89 |
height=4.5cm, |
488 | 90 |
legend entries={Our matcher}, |
478 | 91 |
legend pos=north east, |
92 |
legend cell align=left] |
|
93 |
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
|
94 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
|
95 |
\end{axis} |
|
96 |
\end{tikzpicture} |
|
97 |
\end{tabular} |
|
488 | 98 |
\end{center}\bigskip |
263
92e6985018ae
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
262
diff
changeset
|
99 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
100 |
\begin{center} |
415 | 101 |
Graphs: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
102 |
\begin{tabular}{@{}cc@{}} |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
103 |
\begin{tikzpicture} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
104 |
\begin{axis}[ |
414 | 105 |
xlabel={$n$}, |
106 |
x label style={at={(1.05,0.0)}}, |
|
412 | 107 |
ylabel={\small time in secs}, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
108 |
enlargelimits=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
109 |
xtick={0,5,...,30}, |
291
201c2c6d8696
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
272
diff
changeset
|
110 |
xmax=33, |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
111 |
ymax=35, |
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
112 |
ytick={0,5,...,30}, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
113 |
scaled ticks=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
114 |
axis lines=left, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
115 |
width=5cm, |
926 | 116 |
height=4.55cm, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
117 |
legend entries={Python,Ruby}, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
118 |
legend pos=north west, |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
119 |
legend cell align=left] |
434
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
120 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
121 |
\addplot[brown,mark=triangle*, mark options={fill=white}] table {re-ruby.data}; |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
122 |
\end{axis} |
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
123 |
\end{tikzpicture} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
124 |
& |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
125 |
\begin{tikzpicture} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
126 |
\begin{axis}[ |
414 | 127 |
xlabel={$n$}, |
128 |
x label style={at={(1.1,0.05)}}, |
|
412 | 129 |
ylabel={\small time in secs}, |
130 |
enlargelimits=false, |
|
477 | 131 |
xtick={0,2500,...,11000}, |
132 |
xmax=12000, |
|
412 | 133 |
ymax=35, |
134 |
ytick={0,5,...,30}, |
|
135 |
scaled ticks=false, |
|
136 |
axis lines=left, |
|
137 |
width=6.5cm, |
|
926 | 138 |
height=4.5cm, |
488 | 139 |
legend entries={Our matcher}, |
478 | 140 |
legend pos=north east, |
141 |
legend cell align=left] |
|
142 |
%\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
|
412 | 143 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
144 |
\end{axis} |
|
145 |
\end{tikzpicture} |
|
146 |
\end{tabular} |
|
147 |
\end{center} |
|
488 | 148 |
\bigskip |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
149 |
|
412 | 150 |
\noindent |
488 | 151 |
In what follows we will use these regular expressions and strings as |
152 |
running examples. There will be several versions (V1, V2, V3,\ldots) |
|
153 |
of our matcher.\footnote{The corresponding files are |
|
831 | 154 |
\texttt{re1.sc}, \texttt{re2.sc} and so on. As usual, you can |
727 | 155 |
find the code on KEATS.} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
156 |
|
412 | 157 |
Having specified in the previous lecture what |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
158 |
problem our regular expression matcher is supposed to solve, |
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
159 |
namely for any given regular expression $r$ and string $s$ |
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
160 |
answer \textit{true} if and only if |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
161 |
|
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
162 |
\[ |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
163 |
s \in L(r) |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
164 |
\] |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
165 |
|
488 | 166 |
\noindent we can look for an algorithm to solve this problem. Clearly |
412 | 167 |
we cannot use the function $L$ directly for this, because in general |
168 |
the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). |
|
169 |
In such cases there is no way we can implement an exhaustive test for |
|
170 |
whether a string is member of this set or not. In contrast our |
|
171 |
matching algorithm will operate on the regular expression $r$ and |
|
414 | 172 |
string $s$, only, which are both finite objects. Before we explain |
646 | 173 |
the matching algorithm, let us have a closer look at what it |
412 | 174 |
means when two regular expressions are equivalent. |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
175 |
|
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
176 |
\subsection*{Regular Expression Equivalences} |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
177 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
178 |
We already defined in Handout 1 what it means for two regular |
727 | 179 |
expressions to be equivalent, namely whether their |
180 |
\emph{meaning} is the same language: |
|
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
181 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
182 |
\[ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
183 |
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
184 |
\] |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
185 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
186 |
\noindent |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
187 |
It is relatively easy to verify that some concrete equivalences |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
188 |
hold, for example |
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
189 |
|
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
190 |
\begin{center} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
191 |
\begin{tabular}{rcl} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
192 |
$(a + b) + c$ & $\equiv$ & $a + (b + c)$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
193 |
$a + a$ & $\equiv$ & $a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
194 |
$a + b$ & $\equiv$ & $b + a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
195 |
$(a \cdot b) \cdot c$ & $\equiv$ & $a \cdot (b \cdot c)$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
196 |
$c \cdot (a + b)$ & $\equiv$ & $(c \cdot a) + (c \cdot b)$\\ |
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
197 |
\end{tabular} |
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
198 |
\end{center} |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
199 |
|
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
200 |
\noindent |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
201 |
but also easy to verify that the following regular expressions |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
202 |
are \emph{not} equivalent |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
203 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
204 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
205 |
\begin{tabular}{rcl} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
206 |
$a \cdot a$ & $\not\equiv$ & $a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
207 |
$a + (b \cdot c)$ & $\not\equiv$ & $(a + b) \cdot (a + c)$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
208 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
209 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
210 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
211 |
\noindent I leave it to you to verify these equivalences and |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
212 |
non-equivalences. It is also interesting to look at some |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
213 |
corner cases involving $\ONE$ and $\ZERO$: |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
214 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
215 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
216 |
\begin{tabular}{rcl} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
217 |
$a \cdot \ZERO$ & $\not\equiv$ & $a$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
218 |
$a + \ONE$ & $\not\equiv$ & $a$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
219 |
$\ONE$ & $\equiv$ & $\ZERO^*$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
220 |
$\ONE^*$ & $\equiv$ & $\ONE$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
221 |
$\ZERO^*$ & $\not\equiv$ & $\ZERO$\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
222 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
223 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
224 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
225 |
\noindent Again I leave it to you to make sure you agree |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
226 |
with these equivalences and non-equivalences. |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
227 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
228 |
|
318
7975e4f0d4de
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
296
diff
changeset
|
229 |
For our matching algorithm however the following seven |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
230 |
equivalences will play an important role: |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
231 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
232 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
233 |
\begin{tabular}{rcl} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
234 |
$r + \ZERO$ & $\equiv$ & $r$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
235 |
$\ZERO + r$ & $\equiv$ & $r$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
236 |
$r \cdot \ONE$ & $\equiv$ & $r$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
237 |
$\ONE \cdot r$ & $\equiv$ & $r$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
238 |
$r \cdot \ZERO$ & $\equiv$ & $\ZERO$\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
239 |
$\ZERO \cdot r$ & $\equiv$ & $\ZERO$\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
240 |
$r + r$ & $\equiv$ & $r$ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
241 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
242 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
243 |
|
727 | 244 |
\noindent They always hold no matter what the regular expression $r$ |
412 | 245 |
looks like. The first two are easy to verify since $L(\ZERO)$ is the |
246 |
empty set. The next two are also easy to verify since $L(\ONE) = |
|
247 |
\{[]\}$ and appending the empty string to every string of another set, |
|
248 |
leaves the set unchanged. Be careful to fully comprehend the fifth and |
|
249 |
sixth equivalence: if you concatenate two sets of strings and one is |
|
250 |
the empty set, then the concatenation will also be the empty set. To |
|
251 |
see this, check the definition of $\_ @ \_$ for sets. The last |
|
252 |
equivalence is again trivial. |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
253 |
|
727 | 254 |
What will be critical later on is that we can orient these |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
255 |
equivalences and read them from left to right. In this way we |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
256 |
can view them as \emph{simplification rules}. Consider for |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
257 |
example the regular expression |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
258 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
259 |
\begin{equation} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
260 |
(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO) |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
261 |
\label{big} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
262 |
\end{equation} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
263 |
|
412 | 264 |
\noindent If we can find an equivalent regular expression that is |
488 | 265 |
simpler (that usually means smaller), then this might potentially make |
727 | 266 |
our matching algorithm run faster. We can look for such a simpler, but |
267 |
equivalent, regular expression $r'$ because whether a string $s$ is in |
|
926 | 268 |
$L(r)$ or in $L(r')$ does not matter as long as $r\equiv r'$. Yes? \footnote{You have checked this for yourself? Your friendly lecturer might talk rubbish\ldots{}one never knows.} |
488 | 269 |
|
727 | 270 |
In the example above you will see that the regular expression in |
271 |
\eqref{big} is equivalent to just $r_1$. You can verify this by |
|
272 |
iteratively applying the simplification rules from above: |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
273 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
274 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
275 |
\begin{tabular}{ll} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
276 |
& $(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
277 |
(\underline{r_4 \cdot \ZERO})$\smallskip\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
278 |
$\equiv$ & $(r_1 + \ZERO) \cdot \ONE + \underline{((\ONE + r_2) + r_3) \cdot |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
279 |
\ZERO}$\smallskip\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
280 |
$\equiv$ & $\underline{(r_1 + \ZERO) \cdot \ONE} + \ZERO$\smallskip\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
281 |
$\equiv$ & $(\underline{r_1 + \ZERO}) + \ZERO$\smallskip\\ |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
282 |
$\equiv$ & $\underline{r_1 + \ZERO}$\smallskip\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
283 |
$\equiv$ & $r_1$\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
284 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
285 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
286 |
|
296
796b9b81ac8d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
291
diff
changeset
|
287 |
\noindent In each step, I underlined where a simplification |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
288 |
rule is applied. Our matching algorithm in the next section |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
289 |
will often generate such ``useless'' $\ONE$s and |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
290 |
$\ZERO$s, therefore simplifying them away will make the |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
291 |
algorithm quite a bit faster. |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
292 |
|
488 | 293 |
Finally here are three equivalences between regular expressions which are |
479 | 294 |
not so obvious: |
295 |
||
296 |
\begin{center} |
|
297 |
\begin{tabular}{rcl} |
|
727 | 298 |
$r^*$ & $\equiv$ & $\ONE + r\cdot r^*$\\ |
479 | 299 |
$(r_1 + r_2)^*$ & $\equiv$ & $r_1^* \cdot (r_2\cdot r_1^*)^*$\\ |
727 | 300 |
$(r_1 \cdot r_2)^*$ & $\equiv$ & $\ONE + r_1\cdot (r_2 \cdot r_1)^* \cdot r_2$\\ |
479 | 301 |
\end{tabular} |
302 |
\end{center} |
|
303 |
||
304 |
\noindent |
|
727 | 305 |
We will not use them in our algorithm, but feel free to convince |
306 |
yourself that they actually hold. As an aside, there has been a lot of |
|
926 | 307 |
research on questions like: Can one always decide whether two regular |
727 | 308 |
expressions are equivalent or not? What does an algorithm look like to |
831 | 309 |
decide this efficiently? Surprisingly, many of such questions |
310 |
turn out to be non-trivial problems. |
|
311 |
||
479 | 312 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
313 |
\subsection*{The Matching Algorithm} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
314 |
|
926 | 315 |
The regular expression matching algorithm we will introduce below |
316 |
consists of two parts: One is the function $\textit{nullable}$ which |
|
317 |
takes a regular expression as an argument and decides whether it can |
|
318 |
match the empty string (this means it returns a boolean in |
|
319 |
Scala). This can be easily defined recursively as follows: |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
320 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
321 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
322 |
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} |
412 | 323 |
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
324 |
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
|
325 |
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
|
326 |
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
|
327 |
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
|
328 |
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$ \\ |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
329 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
330 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
331 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
332 |
\noindent The idea behind this function is that the following |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
333 |
property holds: |
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
334 |
|
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
335 |
\[ |
412 | 336 |
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
337 |
\] |
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
338 |
|
727 | 339 |
\noindent Note on the left-hand side of the if-and-only-if we have a |
874 | 340 |
function we can implement, for example in Scala; on the right we have |
727 | 341 |
its specification (which we cannot implement in a programming language). |
124
dd8b5a3dac0a
adde
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
123
diff
changeset
|
342 |
|
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
343 |
The other function of our matching algorithm calculates a |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
344 |
\emph{derivative} of a regular expression. This is a function |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
345 |
which will take a regular expression, say $r$, and a |
412 | 346 |
character, say $c$, as arguments and returns a new regular |
488 | 347 |
expression. Be mindful that the intuition behind this function |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
348 |
is not so easy to grasp on first reading. Essentially this |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
349 |
function solves the following problem: if $r$ can match a |
488 | 350 |
string of the form $c\!::\!s$, what does a regular |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
351 |
expression look like that can match just $s$? The definition |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
352 |
of this function is as follows: |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
353 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
354 |
\begin{center} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
355 |
\begin{tabular}{l@ {\hspace{2mm}}c@ {\hspace{2mm}}l} |
414 | 356 |
$\textit{der}\, c\, (\ZERO)$ & $\dn$ & $\ZERO$\\ |
357 |
$\textit{der}\, c\, (\ONE)$ & $\dn$ & $\ZERO$ \\ |
|
358 |
$\textit{der}\, c\, (d)$ & $\dn$ & if $c = d$ then $\ONE$ else $\ZERO$\\ |
|
359 |
$\textit{der}\, c\, (r_1 + r_2)$ & $\dn$ & $\textit{der}\, c\, r_1 + \textit{der}\, c\, r_2$\\ |
|
360 |
$\textit{der}\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\textit{nullable} (r_1)$\\ |
|
361 |
& & then $(\textit{der}\,c\,r_1) \cdot r_2 + \textit{der}\, c\, r_2$\\ |
|
362 |
& & else $(\textit{der}\, c\, r_1) \cdot r_2$\\ |
|
363 |
$\textit{der}\, c\, (r^*)$ & $\dn$ & $(\textit{der}\,c\,r) \cdot (r^*)$ |
|
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
364 |
\end{tabular} |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
365 |
\end{center} |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
366 |
|
727 | 367 |
\noindent The first two clauses can be rationalised as follows: recall |
368 |
that $\textit{der}$ should calculate a regular expression so that |
|
926 | 369 |
provided the ``input'' regular expression can match a string of the |
727 | 370 |
form $c\!::\!s$, we want a regular expression for $s$. Since neither |
926 | 371 |
$\ZERO$ nor $\ONE$ can match a string of the form $c\!::\!s$, we |
372 |
return $\ZERO$. In the third case we have to make a case-distinction: |
|
373 |
In case the regular expression is $c$, then clearly it can recognise a |
|
374 |
string of the form $c\!::\!s$, just that $s$ is the empty |
|
375 |
string. Therefore we return the $\ONE$-regular expression in this |
|
376 |
case, as it can match the empty string. In the other case we again |
|
377 |
return $\ZERO$ since no string of the $c\!::\!s$ can be matched. Next |
|
378 |
come the recursive cases, which are a bit more involved. Fortunately, |
|
379 |
the $+$-case is still relatively straightforward: all strings of the |
|
380 |
form $c\!::\!s$ are either matched by the regular expression $r_1$ or |
|
381 |
$r_2$. So we just have to recursively call $\textit{der}$ with these |
|
382 |
two regular expressions and compose the results again with $+$. Makes |
|
383 |
sense? |
|
727 | 384 |
|
412 | 385 |
|
386 |
The $\cdot$-case is more complicated: if $r_1\cdot r_2$ |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
387 |
matches a string of the form $c\!::\!s$, then the first part |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
388 |
must be matched by $r_1$. Consequently, it makes sense to |
414 | 389 |
construct the regular expression for $s$ by calling $\textit{der}$ with |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
390 |
$r_1$ and ``appending'' $r_2$. There is however one exception |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
391 |
to this simple rule: if $r_1$ can match the empty string, then |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
392 |
all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
393 |
nullable (that is can match the empty string) we have to allow |
414 | 394 |
the choice $\textit{der}\,c\,r_2$ for calculating the regular |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
395 |
expression that can match $s$. Therefore we have to add the |
414 | 396 |
regular expression $\textit{der}\,c\,r_2$ in the result. The $*$-case |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
397 |
is again simple: if $r^*$ matches a string of the form |
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
398 |
$c\!::\!s$, then the first part must be ``matched'' by a |
414 | 399 |
single copy of $r$. Therefore we call recursively $\textit{der}\,c\,r$ |
400 |
and ``append'' $r^*$ in order to match the rest of $s$. Still |
|
401 |
makes sense? |
|
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
402 |
|
488 | 403 |
If all this did not make sense yet, here is another way to explain the |
404 |
definition of $\textit{der}$ by considering the following operation on |
|
405 |
sets: |
|
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
406 |
|
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
407 |
\begin{equation}\label{Der} |
414 | 408 |
\textit{Der}\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
409 |
\end{equation} |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
410 |
|
291
201c2c6d8696
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
272
diff
changeset
|
411 |
\noindent This operation essentially transforms a set of |
201c2c6d8696
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
272
diff
changeset
|
412 |
strings $A$ by filtering out all strings that do not start |
201c2c6d8696
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
272
diff
changeset
|
413 |
with $c$ and then strips off the $c$ from all the remaining |
201c2c6d8696
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
272
diff
changeset
|
414 |
strings. For example suppose $A = \{f\!oo, bar, f\!rak\}$ then |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
415 |
|
414 | 416 |
\[ \textit{Der}\,f\,A = \{oo, rak\}\quad,\quad |
417 |
\textit{Der}\,b\,A = \{ar\} \quad \text{and} \quad |
|
418 |
\textit{Der}\,a\,A = \{\} |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
419 |
\] |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
420 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
421 |
\noindent |
414 | 422 |
Note that in the last case $\textit{Der}$ is empty, because no string in $A$ |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
423 |
starts with $a$. With this operation we can state the following |
414 | 424 |
property about $\textit{der}$: |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
425 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
426 |
\[ |
414 | 427 |
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r)) |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
428 |
\] |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
429 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
430 |
\noindent |
414 | 431 |
This property clarifies what regular expression $\textit{der}$ calculates, |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
432 |
namely take the set of strings that $r$ can match (that is $L(r)$), |
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
433 |
filter out all strings not starting with $c$ and strip off the $c$ |
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
434 |
from the remaining strings---this is exactly the language that |
414 | 435 |
$\textit{der}\,c\,r$ can match. |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
436 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
437 |
If we want to find out whether the string $abc$ is matched by |
414 | 438 |
the regular expression $r_1$ then we can iteratively apply $\textit{der}$ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
439 |
as follows |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
440 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
441 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
442 |
\begin{tabular}{rll} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
443 |
Input: $r_1$, $abc$\medskip\\ |
414 | 444 |
Step 1: & build derivative of $a$ and $r_1$ & $(r_2 = \textit{der}\,a\,r_1)$\smallskip\\ |
445 |
Step 2: & build derivative of $b$ and $r_2$ & $(r_3 = \textit{der}\,b\,r_2)$\smallskip\\ |
|
433
c08290ee4f1f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
416
diff
changeset
|
446 |
Step 3: & build derivative of $c$ and $r_3$ & $(r_4 = \textit{der}\,c\,r_3)$\smallskip\\ |
c08290ee4f1f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
416
diff
changeset
|
447 |
Step 4: & the string is exhausted: & $(\textit{nullable}(r_4))$\\ |
c08290ee4f1f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
416
diff
changeset
|
448 |
& test whether $r_4$ can recognise the\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
449 |
& empty string\smallskip\\ |
412 | 450 |
Output: & result of this test $\Rightarrow \textit{true} \,\text{or}\, \textit{false}$\\ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
451 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
452 |
\end{center} |
140
1be892087df2
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
133
diff
changeset
|
453 |
|
414 | 454 |
\noindent Again the operation $\textit{Der}$ might help to rationalise |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
455 |
this algorithm. We want to know whether $abc \in L(r_1)$. We |
414 | 456 |
do not know yet---but let us assume it is. Then $\textit{Der}\,a\,L(r_1)$ |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
457 |
builds the set where all the strings not starting with $a$ are |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
458 |
filtered out. Of the remaining strings, the $a$ is stripped |
412 | 459 |
off. So we should still have $bc$ in the set. |
460 |
Then we continue with filtering out all strings not |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
461 |
starting with $b$ and stripping off the $b$ from the remaining |
414 | 462 |
strings, that means we build $\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1)))$. |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
463 |
Finally we filter out all strings not starting with $c$ and |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
464 |
strip off $c$ from the remaining string. This is |
414 | 465 |
$\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$. Now if $abc$ was in the |
466 |
original set ($L(r_1)$), then $\textit{Der}\,c\,(\textit{Der}\,b\,(\textit{Der}\,a\,(L(r_1))))$ |
|
412 | 467 |
must contain the empty string. If not, then $abc$ was not in the |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
468 |
language we started with. |
140
1be892087df2
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
133
diff
changeset
|
469 |
|
414 | 470 |
Our matching algorithm using $\textit{der}$ and $\textit{nullable}$ works |
571 | 471 |
similarly, just using regular expressions instead of sets. In order to |
414 | 472 |
define our algorithm we need to extend the notion of derivatives from single |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
473 |
characters to strings. This can be done using the following |
414 | 474 |
function, taking a string and a regular expression as input and |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
475 |
a regular expression as output. |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
476 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
477 |
\begin{center} |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
478 |
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
479 |
$\textit{ders}\, []\, r$ & $\dn$ & $r$ & \\ |
414 | 480 |
$\textit{ders}\, (c\!::\!s)\, r$ & $\dn$ & $\textit{ders}\,s\,(\textit{der}\,c\,r)$ & \\ |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
481 |
\end{tabular} |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
482 |
\end{center} |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
483 |
|
414 | 484 |
\noindent This function iterates $\textit{der}$ taking one character at |
488 | 485 |
the time from the original string until the string is exhausted. |
414 | 486 |
Having $\textit{der}s$ in place, we can finally define our matching |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
487 |
algorithm: |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
488 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
489 |
\[ |
764 | 490 |
\textit{matcher}\,r\,s \dn \textit{nullable}(\textit{ders}\,s\,r) |
125
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
491 |
\] |
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
492 |
|
39c75cf4e079
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
124
diff
changeset
|
493 |
\noindent |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
494 |
and we can claim that |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
495 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
496 |
\[ |
764 | 497 |
\textit{matcher}\,r\,s\quad\text{if and only if}\quad s\in L(r) |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
498 |
\] |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
499 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
500 |
\noindent holds, which means our algorithm satisfies the |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
501 |
specification. Of course we can claim many things\ldots |
831 | 502 |
whether the claim holds any water is a different question. |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
503 |
|
566 | 504 |
This algorithm was introduced by Janusz Brzozowski in 1964, but |
414 | 505 |
is more widely known only in the last 10 or so years. Its |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
506 |
main attractions are simplicity and being fast, as well as |
566 | 507 |
being easily extendible for other regular expressions such as |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
508 |
$r^{\{n\}}$, $r^?$, $\sim{}r$ and so on (this is subject of |
831 | 509 |
Coursework 1). |
258
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
510 |
|
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
511 |
\subsection*{The Matching Algorithm in Scala} |
1e4da6d2490c
updated programs
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
251
diff
changeset
|
512 |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
513 |
Another attraction of the algorithm is that it can be easily |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
514 |
implemented in a functional programming language, like Scala. |
296
796b9b81ac8d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
291
diff
changeset
|
515 |
Given the implementation of regular expressions in Scala shown |
796b9b81ac8d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
291
diff
changeset
|
516 |
in the first lecture and handout, the functions and subfunctions |
764 | 517 |
for \pcode{matcher} are shown in Figure~\ref{scala1}. |
126
7c7185cb4f2b
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
125
diff
changeset
|
518 |
|
7c7185cb4f2b
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
125
diff
changeset
|
519 |
\begin{figure}[p] |
477 | 520 |
\lstinputlisting[numbers=left,linebackgroundcolor= |
521 |
{\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
522 |
{../progs/app5.scala} |
|
874 | 523 |
\caption{A Scala implementation of the \textit{nullable} and |
524 |
the derivative function. These functions are easy to |
|
512 | 525 |
implement in functional programming languages. This is because pattern |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
526 |
matching and recursion allow us to mimic the mathematical |
488 | 527 |
definitions very closely. Nearly all functional |
528 |
programming languages support pattern matching and |
|
529 |
recursion out of the box.\label{scala1}} |
|
126
7c7185cb4f2b
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
125
diff
changeset
|
530 |
\end{figure} |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
531 |
|
414 | 532 |
|
443
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
533 |
%Remember our second example involving the regular expression |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
534 |
%$(a^*)^* \cdot b$ which could not match strings of $n$ \texttt{a}s. |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
535 |
%Java needed around 30 seconds to find this out a string with $n=28$. |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
536 |
%It seems our algorithm is doing rather well in comparison: |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
537 |
% |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
538 |
%\begin{center} |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
539 |
%\begin{tikzpicture} |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
540 |
%\begin{axis}[ |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
541 |
% title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
542 |
% xlabel={$n$}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
543 |
% x label style={at={(1.05,0.0)}}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
544 |
% ylabel={time in secs}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
545 |
% enlargelimits=false, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
546 |
% xtick={0,1000,...,6500}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
547 |
% xmax=6800, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
548 |
% ytick={0,5,...,30}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
549 |
% ymax=34, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
550 |
% scaled ticks=false, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
551 |
% axis lines=left, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
552 |
% width=8cm, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
553 |
% height=4.5cm, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
554 |
% legend entries={Java,Scala V1}, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
555 |
% legend pos=north east, |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
556 |
% legend cell align=left] |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
557 |
%\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data}; |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
558 |
%\addplot[red,mark=triangle*,mark options={fill=white}] table {re1a.data}; |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
559 |
%\end{axis} |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
560 |
%\end{tikzpicture} |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
561 |
%\end{center} |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
562 |
% |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
563 |
%\noindent |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
564 |
%This is not an error: it hardly takes more than half a second for |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
565 |
%strings up to the length of 6500. After that we receive a |
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
566 |
%StackOverflow exception, but still\ldots |
414 | 567 |
|
568 |
For running the algorithm with our first example, the evil |
|
566 | 569 |
regular expression $a^?{}^{\{n\}}\cdot a^{\{n\}}$, we need to implement |
488 | 570 |
the optional regular expression and the `exactly $n$-times |
571 |
regular expression'. This can be done with the translations |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
572 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
573 |
\lstinputlisting[numbers=none]{../progs/app51.scala} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
574 |
|
414 | 575 |
\noindent Running the matcher with this example, we find it is |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
576 |
slightly worse then the matcher in Ruby and Python. |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
577 |
Ooops\ldots |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
578 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
579 |
\begin{center} |
414 | 580 |
\begin{tikzpicture} |
581 |
\begin{axis}[ |
|
415 | 582 |
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
414 | 583 |
xlabel={$n$}, |
584 |
x label style={at={(1.05,0.0)}}, |
|
585 |
ylabel={time in secs}, |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
586 |
enlargelimits=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
587 |
xtick={0,5,...,30}, |
415 | 588 |
xmax=32, |
414 | 589 |
ytick={0,5,...,30}, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
590 |
scaled ticks=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
591 |
axis lines=left, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
592 |
width=6cm, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
593 |
height=5cm, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
594 |
legend entries={Python,Ruby,Scala V1}, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
595 |
legend pos=outer north east, |
415 | 596 |
legend cell align=left] |
434
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
597 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
598 |
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
599 |
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; |
414 | 600 |
\end{axis} |
601 |
\end{tikzpicture} |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
602 |
\end{center} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
603 |
|
488 | 604 |
\noindent Analysing this failure we notice that for $a^{\{n\}}$, for |
605 |
example, we generate quite big regular expressions: |
|
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
606 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
607 |
\begin{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
608 |
\begin{tabular}{rl} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
609 |
1: & $a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
610 |
2: & $a\cdot a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
611 |
3: & $a\cdot a\cdot a$\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
612 |
& \ldots\\ |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
613 |
13: & $a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a\cdot a$\\ |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
614 |
& \ldots |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
615 |
\end{tabular} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
616 |
\end{center} |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
617 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
618 |
\noindent Our algorithm traverses such regular expressions at |
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
619 |
least once every time a derivative is calculated. So having |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
620 |
large regular expressions will cause problems. This problem |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
621 |
is aggravated by $a^?$ being represented as $a + \ONE$. |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
622 |
|
488 | 623 |
We can however fix this easily by having an explicit constructor for |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
624 |
$r^{\{n\}}$. In Scala we would introduce a constructor like |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
625 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
626 |
\begin{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
627 |
\code{case class NTIMES(r: Rexp, n: Int) extends Rexp} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
628 |
\end{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
629 |
|
478 | 630 |
\noindent With this fix we have a constant ``size'' regular expression |
631 |
for our running example no matter how large $n$ is (see the |
|
632 |
\texttt{size} section in the implementations). This means we have to |
|
633 |
also add cases for \pcode{NTIMES} in the functions $\textit{nullable}$ |
|
634 |
and $\textit{der}$. Does the change have any effect? |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
635 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
636 |
\begin{center} |
414 | 637 |
\begin{tikzpicture} |
638 |
\begin{axis}[ |
|
415 | 639 |
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
414 | 640 |
xlabel={$n$}, |
641 |
x label style={at={(1.01,0.0)}}, |
|
642 |
ylabel={time in secs}, |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
643 |
enlargelimits=false, |
477 | 644 |
xtick={0,200,...,1100}, |
645 |
xmax=1200, |
|
414 | 646 |
ytick={0,5,...,30}, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
647 |
scaled ticks=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
648 |
axis lines=left, |
414 | 649 |
width=10cm, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
650 |
height=5cm, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
651 |
legend entries={Python,Ruby,Scala V1,Scala V2}, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
652 |
legend pos=outer north east, |
414 | 653 |
legend cell align=left] |
434
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
654 |
\addplot[blue,mark=*, mark options={fill=white}] table {re-python.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
655 |
\addplot[brown,mark=pentagon*, mark options={fill=white}] table {re-ruby.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
656 |
\addplot[red,mark=triangle*,mark options={fill=white}] table {re1.data}; |
8664ff87cd77
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
433
diff
changeset
|
657 |
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
414 | 658 |
\end{axis} |
659 |
\end{tikzpicture} |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
660 |
\end{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
661 |
|
478 | 662 |
\noindent Now we are talking business! The modified matcher can within |
663 |
25 seconds handle regular expressions up to $n = 1,100$ before a |
|
664 |
StackOverflow is raised. Recall that Python and Ruby (and our first |
|
665 |
version, Scala V1) could only handle $n = 27$ or so in 30 |
|
488 | 666 |
seconds. We have not tried our algorithm on the second example $(a^*)^* \cdot |
511 | 667 |
b$---I leave this to you. |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
668 |
|
412 | 669 |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
670 |
The moral is that our algorithm is rather sensitive to the |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
671 |
size of regular expressions it needs to handle. This is of |
414 | 672 |
course obvious because both $\textit{nullable}$ and $\textit{der}$ frequently |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
673 |
need to traverse the whole regular expression. There seems, |
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
674 |
however, one more issue for making the algorithm run faster. |
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
675 |
The derivative function often produces ``useless'' |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
676 |
$\ZERO$s and $\ONE$s. To see this, consider $r = ((a |
478 | 677 |
\cdot b) + b)^*$ and the following three derivatives |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
678 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
679 |
\begin{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
680 |
\begin{tabular}{l} |
414 | 681 |
$\textit{der}\,a\,r = ((\ONE \cdot b) + \ZERO) \cdot r$\\ |
682 |
$\textit{der}\,b\,r = ((\ZERO \cdot b) + \ONE)\cdot r$\\ |
|
683 |
$\textit{der}\,c\,r = ((\ZERO \cdot b) + \ZERO)\cdot r$ |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
684 |
\end{tabular} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
685 |
\end{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
686 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
687 |
\noindent |
488 | 688 |
If we simplify them according to the simplification rules from the |
689 |
beginning, we can replace the right-hand sides by the smaller |
|
690 |
equivalent regular expressions |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
691 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
692 |
\begin{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
693 |
\begin{tabular}{l} |
414 | 694 |
$\textit{der}\,a\,r \equiv b \cdot r$\\ |
695 |
$\textit{der}\,b\,r \equiv r$\\ |
|
696 |
$\textit{der}\,c\,r \equiv \ZERO$ |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
697 |
\end{tabular} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
698 |
\end{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
699 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
700 |
\noindent I leave it to you to contemplate whether such a |
478 | 701 |
simplification can have any impact on the correctness of our algorithm |
702 |
(will it change any answers?). Figure~\ref{scala2} gives a |
|
703 |
simplification function that recursively traverses a regular |
|
704 |
expression and simplifies it according to the rules given at the |
|
571 | 705 |
beginning. There are only rules for $+$ and $\cdot$. There is |
706 |
no simplification rule for a star, because |
|
478 | 707 |
empirical data and also a little thought showed that simplifying under |
708 |
a star is a waste of computation time. The simplification function |
|
709 |
will be called after every derivation. This additional step removes |
|
710 |
all the ``junk'' the derivative function introduced. Does this improve |
|
711 |
the speed? You bet!! |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
712 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
713 |
\begin{figure}[p] |
477 | 714 |
\lstinputlisting[numbers=left,linebackgroundcolor= |
715 |
{\ifodd\value{lstnumber}\color{capri!3}\fi}] |
|
716 |
{../progs/app6.scala} |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
717 |
\caption{The simplification function and modified |
325
794c599cee53
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
318
diff
changeset
|
718 |
\texttt{ders}-function; this function now |
333
8890852e18b7
updated coursework
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
332
diff
changeset
|
719 |
calls \texttt{der} first, but then simplifies |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
720 |
the resulting derivative regular expressions before |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
721 |
building the next derivative, see |
566 | 722 |
Line~24.\label{scala2}} |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
723 |
\end{figure} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
724 |
|
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
725 |
\begin{center} |
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
726 |
\begin{tikzpicture} |
414 | 727 |
\begin{axis}[ |
415 | 728 |
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
414 | 729 |
xlabel={$n$}, |
730 |
x label style={at={(1.04,0.0)}}, |
|
731 |
ylabel={time in secs}, |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
732 |
enlargelimits=false, |
478 | 733 |
xtick={0,2500,...,10000}, |
734 |
xmax=12000, |
|
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
735 |
ytick={0,5,...,30}, |
443
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
736 |
ymax=32, |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
737 |
scaled ticks=false, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
738 |
axis lines=left, |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
739 |
width=9cm, |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
740 |
height=5cm, |
415 | 741 |
legend entries={Scala V2,Scala V3}, |
742 |
legend pos=outer north east, |
|
743 |
legend cell align=left] |
|
744 |
\addplot[green,mark=square*,mark options={fill=white}] table {re2.data}; |
|
268
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
745 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
746 |
\end{axis} |
18bef085a7ca
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
263
diff
changeset
|
747 |
\end{tikzpicture} |
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
748 |
\end{center} |
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
749 |
|
415 | 750 |
\noindent |
510 | 751 |
To recap, Python and Ruby needed approximately 30 seconds to match a |
478 | 752 |
string of 28 \texttt{a}s and the regular expression $a^{?\{n\}} \cdot |
753 |
a^{\{n\}}$. We need a third of this time to do the same with strings |
|
566 | 754 |
up to 11,000 \texttt{a}s. Similarly, Java 8 and Python needed 30 |
478 | 755 |
seconds to find out the regular expression $(a^*)^* \cdot b$ does not |
566 | 756 |
match the string of 28 \texttt{a}s. In Java 9 and later this has been |
757 |
cranked up to 39,000 \texttt{a}s, but we can do the same in the same |
|
571 | 758 |
amount of time for strings composed of nearly 6,000,000 \texttt{a}s. |
759 |
This is shown in the following plot. |
|
415 | 760 |
|
761 |
||
414 | 762 |
\begin{center} |
763 |
\begin{tikzpicture} |
|
764 |
\begin{axis}[ |
|
415 | 765 |
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
414 | 766 |
xlabel={$n$}, |
767 |
ylabel={time in secs}, |
|
768 |
enlargelimits=false, |
|
478 | 769 |
ymax=35, |
414 | 770 |
ytick={0,5,...,30}, |
771 |
axis lines=left, |
|
550 | 772 |
%%scaled ticks=false, |
478 | 773 |
x label style={at={(1.09,0.0)}}, |
550 | 774 |
%%xmax=7700000, |
414 | 775 |
width=9cm, |
776 |
height=5cm, |
|
478 | 777 |
legend entries={Scala V3}, |
415 | 778 |
legend pos=outer north east, |
779 |
legend cell align=left] |
|
478 | 780 |
%\addplot[green,mark=square*,mark options={fill=white}] table {re2a.data}; |
414 | 781 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
782 |
\end{axis} |
|
783 |
\end{tikzpicture} |
|
784 |
\end{center} |
|
785 |
||
415 | 786 |
\subsection*{Epilogue} |
787 |
||
550 | 788 |
(23/Aug/2016) I found another place where this algorithm can |
488 | 789 |
be sped up (this idea is not integrated with what is coming next, but |
790 |
I present it nonetheless). The idea is to not define \texttt{ders} |
|
791 |
that it iterates the derivative character-by-character, but in bigger |
|
792 |
chunks. The resulting code for \texttt{ders2} looks as follows: |
|
415 | 793 |
|
794 |
\lstinputlisting[numbers=none]{../progs/app52.scala} |
|
795 |
||
796 |
\noindent |
|
797 |
I have not fully understood why this version is much faster, |
|
798 |
but it seems it is a combination of the clauses for \texttt{ALT} |
|
799 |
and \texttt{SEQ}. In the latter case we call \texttt{der} with |
|
800 |
a single character and this potentially produces an alternative. |
|
510 | 801 |
The derivative of such an alternative can then be more efficiently |
415 | 802 |
calculated by \texttt{ders2} since it pushes a whole string |
803 |
under an \texttt{ALT}. The numbers are that in the second case |
|
804 |
$(a^*)^* \cdot b$ both versions are pretty much the same, but in the |
|
805 |
first case $a^{?\{n\}} \cdot a^{\{n\}}$ the improvement gives |
|
806 |
another factor of 100 speedup. Nice! |
|
414 | 807 |
|
415 | 808 |
\begin{center} |
809 |
\begin{tabular}{cc} |
|
810 |
\begin{tikzpicture} |
|
811 |
\begin{axis}[ |
|
812 |
title={Graph: $a^{?\{n\}} \cdot a^{\{n\}}$ and strings $\underbrace{a\ldots a}_{n}$}, |
|
813 |
xlabel={$n$}, |
|
814 |
x label style={at={(1.04,0.0)}}, |
|
815 |
ylabel={time in secs}, |
|
816 |
enlargelimits=false, |
|
817 |
xmax=7100000, |
|
818 |
ytick={0,5,...,30}, |
|
819 |
ymax=33, |
|
820 |
%scaled ticks=false, |
|
821 |
axis lines=left, |
|
488 | 822 |
width=5.3cm, |
415 | 823 |
height=5cm, |
824 |
legend entries={Scala V3, Scala V4}, |
|
443
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
825 |
legend style={at={(0.1,-0.2)},anchor=north}] |
415 | 826 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3.data}; |
827 |
\addplot[purple,mark=square*,mark options={fill=white}] table {re4.data}; |
|
828 |
\end{axis} |
|
829 |
\end{tikzpicture} |
|
830 |
& |
|
831 |
\begin{tikzpicture} |
|
832 |
\begin{axis}[ |
|
833 |
title={Graph: $(a^*)^* \cdot b$ and strings $\underbrace{a\ldots a}_{n}$}, |
|
834 |
xlabel={$n$}, |
|
835 |
x label style={at={(1.09,0.0)}}, |
|
836 |
ylabel={time in secs}, |
|
837 |
enlargelimits=false, |
|
488 | 838 |
xmax=8200000, |
415 | 839 |
ytick={0,5,...,30}, |
840 |
ymax=33, |
|
841 |
%scaled ticks=false, |
|
842 |
axis lines=left, |
|
488 | 843 |
width=5.3cm, |
415 | 844 |
height=5cm, |
845 |
legend entries={Scala V3, Scala V4}, |
|
443
cd43d8c6eb84
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
434
diff
changeset
|
846 |
legend style={at={(0.1,-0.2)},anchor=north}] |
415 | 847 |
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data}; |
848 |
\addplot[purple,mark=square*,mark options={fill=white}] table {re4a.data}; |
|
849 |
\end{axis} |
|
850 |
\end{tikzpicture} |
|
851 |
\end{tabular} |
|
852 |
\end{center} |
|
414 | 853 |
|
412 | 854 |
|
334
fd89a63e9db3
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
333
diff
changeset
|
855 |
\section*{Proofs} |
fd89a63e9db3
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
333
diff
changeset
|
856 |
|
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
857 |
You might not like doing proofs. But they serve a very |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
858 |
important purpose in Computer Science: How can we be sure that |
488 | 859 |
our algorithm matches its specification? We can try to test |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
860 |
the algorithm, but that often overlooks corner cases and an |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
861 |
exhaustive testing is impossible (since there are infinitely |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
862 |
many inputs). Proofs allow us to ensure that an algorithm |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
863 |
really meets its specification. |
338
f16120cb4e19
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
334
diff
changeset
|
864 |
|
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
865 |
For the programs we look at in this module, the proofs will |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
866 |
mostly by some form of induction. Remember that regular |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
867 |
expressions are defined as |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
868 |
|
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
869 |
\begin{center} |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
870 |
\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} |
512 | 871 |
$r$ & $::=$ & $\ZERO$ & nothing\\ |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
872 |
& $\mid$ & $\ONE$ & empty string / \texttt{""} / []\\ |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
873 |
& $\mid$ & $c$ & single character\\ |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
874 |
& $\mid$ & $r_1 + r_2$ & alternative / choice\\ |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
875 |
& $\mid$ & $r_1 \cdot r_2$ & sequence\\ |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
876 |
& $\mid$ & $r^*$ & star (zero or more)\\ |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
877 |
\end{tabular} |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
878 |
\end{center} |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
879 |
|
488 | 880 |
\noindent If you want to show a property $P(r)$ for \emph{all} |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
881 |
regular expressions $r$, then you have to follow essentially |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
882 |
the recipe: |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
883 |
|
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
884 |
\begin{itemize} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
885 |
\item $P$ has to hold for $\ZERO$, $\ONE$ and $c$ |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
886 |
(these are the base cases). |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
887 |
\item $P$ has to hold for $r_1 + r_2$ under the assumption |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
888 |
that $P$ already holds for $r_1$ and $r_2$. |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
889 |
\item $P$ has to hold for $r_1 \cdot r_2$ under the |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
890 |
assumption that $P$ already holds for $r_1$ and $r_2$. |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
891 |
\item $P$ has to hold for $r^*$ under the assumption |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
892 |
that $P$ already holds for $r$. |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
893 |
\end{itemize} |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
894 |
|
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
895 |
\noindent |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
896 |
A simple proof is for example showing the following |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
897 |
property: |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
898 |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
899 |
\begin{equation} |
412 | 900 |
\textit{nullable}(r) \;\;\text{if and only if}\;\; []\in L(r) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
901 |
\label{nullableprop} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
902 |
\end{equation} |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
903 |
|
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
904 |
\noindent |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
905 |
Let us say that this property is $P(r)$, then the first case |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
906 |
we need to check is whether $P(\ZERO)$ (see recipe |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
907 |
above). So we have to show that |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
908 |
|
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
909 |
\[ |
412 | 910 |
\textit{nullable}(\ZERO) \;\;\text{if and only if}\;\; |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
911 |
[]\in L(\ZERO) |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
912 |
\] |
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
913 |
|
412 | 914 |
\noindent whereby $\textit{nullable}(\ZERO)$ is by definition of |
915 |
the function $\textit{nullable}$ always $\textit{false}$. We also have |
|
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
916 |
that $L(\ZERO)$ is by definition $\{\}$. It is |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
917 |
impossible that the empty string $[]$ is in the empty set. |
339
bc395ccfba7f
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
338
diff
changeset
|
918 |
Therefore also the right-hand side is false. Consequently we |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
919 |
verified this case: both sides are false. We would still need |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
920 |
to do this for $P(\ONE)$ and $P(c)$. I leave this to |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
921 |
you to verify. |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
922 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
923 |
Next we need to check the inductive cases, for example |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
924 |
$P(r_1 + r_2)$, which is |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
925 |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
926 |
\begin{equation} |
412 | 927 |
\textit{nullable}(r_1 + r_2) \;\;\text{if and only if}\;\; |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
928 |
[]\in L(r_1 + r_2) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
929 |
\label{propalt} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
930 |
\end{equation} |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
931 |
|
488 | 932 |
\noindent The difference to the base cases is that in the inductive |
933 |
cases we can already assume we proved $P$ for the components, that is |
|
934 |
we can assume. |
|
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
935 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
936 |
\begin{center} |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
937 |
\begin{tabular}{l} |
412 | 938 |
$\textit{nullable}(r_1) \;\;\text{if and only if}\;\; []\in L(r_1)$ and\\ |
939 |
$\textit{nullable}(r_2) \;\;\text{if and only if}\;\; []\in L(r_2)$\\ |
|
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
940 |
\end{tabular} |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
941 |
\end{center} |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
942 |
|
488 | 943 |
\noindent These are called the induction hypotheses. To check this |
412 | 944 |
case, we can start from $\textit{nullable}(r_1 + r_2)$, which by |
488 | 945 |
definition of $\textit{nullable}$ is |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
946 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
947 |
\[ |
412 | 948 |
\textit{nullable}(r_1) \vee \textit{nullable}(r_2) |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
949 |
\] |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
950 |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
951 |
\noindent Using the two induction hypotheses from above, |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
952 |
we can transform this into |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
953 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
954 |
\[ |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
955 |
[] \in L(r_1) \vee []\in(r_2) |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
956 |
\] |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
957 |
|
412 | 958 |
\noindent We just replaced the $\textit{nullable}(\ldots)$ parts by |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
959 |
the equivalent $[] \in L(\ldots)$ from the induction |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
960 |
hypotheses. A bit of thinking convinces you that if |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
961 |
$[] \in L(r_1) \vee []\in L(r_2)$ then the empty string |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
962 |
must be in the union $L(r_1)\cup L(r_2)$, that is |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
963 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
964 |
\[ |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
965 |
[] \in L(r_1)\cup L(r_2) |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
966 |
\] |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
967 |
|
488 | 968 |
\noindent but this is by definition of $L$ exactly $[] \in L(r_1 + |
969 |
r_2)$, which we needed to establish according to statement in |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
970 |
\eqref{propalt}. What we have shown is that starting from |
412 | 971 |
$\textit{nullable}(r_1 + r_2)$ we have done equivalent transformations |
488 | 972 |
to end up with $[] \in L(r_1 + r_2)$. Consequently we have established |
973 |
that $P(r_1 + r_2)$ holds. |
|
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
974 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
975 |
In order to complete the proof we would now need to look |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
976 |
at the cases \mbox{$P(r_1\cdot r_2)$} and $P(r^*)$. Again I let you |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
977 |
check the details. |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
978 |
|
488 | 979 |
You might also have to do induction proofs over strings. |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
980 |
That means you want to establish a property $P(s)$ for all |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
981 |
strings $s$. For this remember strings are lists of |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
982 |
characters. These lists can be either the empty list or a |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
983 |
list of the form $c::s$. If you want to perform an induction |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
984 |
proof for strings you need to consider the cases |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
985 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
986 |
\begin{itemize} |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
987 |
\item $P$ has to hold for $[]$ (this is the base case). |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
988 |
\item $P$ has to hold for $c::s$ under the assumption |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
989 |
that $P$ already holds for $s$. |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
990 |
\end{itemize} |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
991 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
992 |
\noindent |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
993 |
Given this recipe, I let you show |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
994 |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
995 |
\begin{equation} |
414 | 996 |
\textit{Ders}\,s\,(L(r)) = L(\textit{ders}\,s\,r) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
997 |
\label{dersprop} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
998 |
\end{equation} |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
999 |
|
414 | 1000 |
\noindent by induction on $s$. Recall $\textit{Der}$ is defined for |
1001 |
character---see \eqref{Der}; $\textit{Ders}$ is similar, but for strings: |
|
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1002 |
|
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1003 |
\[ |
414 | 1004 |
\textit{Ders}\,s\,A\;\dn\;\{s'\,|\,s @ s' \in A\} |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1005 |
\] |
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1006 |
|
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1007 |
\noindent In this proof you can assume the following property |
414 | 1008 |
for $der$ and $\textit{Der}$ has already been proved, that is you can |
399
5c1fbb39c93e
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
394
diff
changeset
|
1009 |
assume |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
1010 |
|
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
1011 |
\[ |
414 | 1012 |
L(\textit{der}\,c\,r) = \textit{Der}\,c\,(L(r)) |
340
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
1013 |
\] |
c49122dbcdd1
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
339
diff
changeset
|
1014 |
|
488 | 1015 |
\noindent holds (this would be of course another property that needs |
1016 |
to be proved in a side-lemma by induction on $r$). This is a bit |
|
1017 |
more challenging, but not impossible. |
|
338
f16120cb4e19
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
334
diff
changeset
|
1018 |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1019 |
To sum up, using reasoning like the one shown above allows us |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1020 |
to show the correctness of our algorithm. To see this, |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1021 |
start from the specification |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1022 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1023 |
\[ |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1024 |
s \in L(r) |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1025 |
\] |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1026 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1027 |
\noindent That is the problem we want to solve. Thinking a |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1028 |
little, you will see that this problem is equivalent to the |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1029 |
following problem |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1030 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1031 |
\begin{equation} |
414 | 1032 |
[] \in \textit{Ders}\,s\,(L(r)) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1033 |
\label{dersstep} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1034 |
\end{equation} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1035 |
|
488 | 1036 |
\noindent You agree? But we have shown above in \eqref{dersprop}, |
1037 |
that the $\textit{Ders}$ can be replaced by |
|
1038 |
$L(\textit{ders}\ldots)$. That means \eqref{dersstep} is equivalent to |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1039 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1040 |
\begin{equation} |
414 | 1041 |
[] \in L(\textit{ders}\,s\,r) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1042 |
\label{prefinalstep} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1043 |
\end{equation} |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1044 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1045 |
\noindent We have also shown that testing whether the empty |
412 | 1046 |
string is in a language is equivalent to the $\textit{nullable}$ |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1047 |
function; see \eqref{nullableprop}. That means |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1048 |
\eqref{prefinalstep} is equivalent with |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1049 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1050 |
\[ |
414 | 1051 |
\textit{nullable}(\textit{ders}\,s\,r) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1052 |
\] |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1053 |
|
764 | 1054 |
\noindent But this is just the definition of $matcher$ |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1055 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1056 |
\[ |
764 | 1057 |
matcher\,s\,r \dn nullable(\textit{ders}\,s\,r) |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1058 |
\] |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1059 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1060 |
\noindent In effect we have shown |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1061 |
|
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1062 |
\[ |
764 | 1063 |
matcher\,s\,r\;\;\text{if and only if}\;\; |
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1064 |
s\in L(r) |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1065 |
\] |
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1066 |
|
488 | 1067 |
\noindent which is the property we set out to prove: our algorithm |
1068 |
meets its specification. To have done so, requires a few induction |
|
926 | 1069 |
proofs about strings and regular expressions. Following the |
1070 |
\emph{induction recipes} is already a big step in actually performing |
|
1071 |
these proofs. If you do not believe it, proofs have helped me to make |
|
1072 |
sure my code is correct and in several instances prevented me of |
|
1073 |
letting slip embarrassing mistakes into the `wild'. In fact I have |
|
1074 |
found a number of mistakes in the brilliant work by Sulzmann and Lu, |
|
1075 |
which we are going to have a look at in Lecture 4. But in Lecture 3 we |
|
1076 |
should first find out why on earth are existing regular expression |
|
1077 |
matchers so abysmally slow. Are the people in Python, Ruby, Swift, |
|
1078 |
JavaScript, Java and also in Rust\footnote{Interestingly the regex |
|
1079 |
engine in Rust says it guarantees a linear behaviour when deciding |
|
1080 |
when a regular expression matches a |
|
1081 |
string. \here{https://youtu.be/3N_ywhx6_K0?t=31}} just idiots? For |
|
1082 |
example could it possibly be that what we have implemented here in |
|
1083 |
Scala is faster than the regex engine that has been implemented in |
|
1084 |
Rust? See you next week\ldots |
|
343
539b2e88f5b9
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
340
diff
changeset
|
1085 |
|
262
ee4304bc6350
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
261
diff
changeset
|
1086 |
\end{document} |
261
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
1087 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
1088 |
|
24531cfaa36a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
259
diff
changeset
|
1089 |
|
566 | 1090 |
% !TeX program = latexmk -xelatex |
123
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1091 |
%%% Local Variables: |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1092 |
%%% mode: latex |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1093 |
%%% TeX-master: t |
a75f9c9d8f94
added
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1094 |
%%% End: |