1
00:00:05,880 --> 00:00:09,700
Welcome back.
Remember this slide.
2
00:00:09,700 --> 00:00:11,500
This slide said what is
3
00:00:11,500 --> 00:00:14,500
our regular expression matcher
actually supposed to do?
4
00:00:14,500 --> 00:00:16,570
It will take two arguments,
5
00:00:16,570 --> 00:00:18,670
a regular expression r and a string s.
6
00:00:18,670 --> 00:00:21,580
And it's supposed to say yes,
7
00:00:21,580 --> 00:00:23,440
the regular expression matches
8
00:00:23,440 --> 00:00:26,920
the string if and only
if the string is in
9
00:00:26,920 --> 00:00:28,855
the language of r.
10
00:00:28,855 --> 00:00:32,410
And if the string is not
in the language of r,
11
00:00:32,410 --> 00:00:35,515
then our algorithm has to say no.
12
00:00:35,515 --> 00:00:37,210
And we can't use
13
00:00:37,210 --> 00:00:39,565
this specification
directly because remember,
14
00:00:39,565 --> 00:00:43,305
this L sometimes
produces infinite sets.
15
00:00:43,305 --> 00:00:47,585
And we can't test whether a
string is in infinite set,
16
00:00:47,585 --> 00:00:50,090
at least not easily.
17
00:00:50,090 --> 00:00:52,340
And so what we have to do
18
00:00:52,340 --> 00:00:54,260
instead is we have
to be a bit more
19
00:00:54,260 --> 00:00:57,050
clever and implement
some operations
20
00:00:57,050 --> 00:00:59,284
on regular expressions instead,
21
00:00:59,284 --> 00:01:03,275
because regular expressions
are always finite trees.
22
00:01:03,275 --> 00:01:05,870
I should say the
person who has been
23
00:01:05,870 --> 00:01:08,150
clever is called Brzozowski.
24
00:01:08,150 --> 00:01:11,435
It's his algorithm
I'm introducing here.
25
00:01:11,435 --> 00:01:15,515
And his algorithm consists
of two functions.
26
00:01:15,515 --> 00:01:17,840
One is called
nullable and it takes
27
00:01:17,840 --> 00:01:20,104
a regular expression as argument.
28
00:01:20,104 --> 00:01:24,155
And the idea is that
this function nullable is
29
00:01:24,155 --> 00:01:26,450
testing whether
the regular expression
30
00:01:26,450 --> 00:01:27,950
can match the empty string.
31
00:01:27,950 --> 00:01:30,305
So 0 cannot match any string.
32
00:01:30,305 --> 00:01:33,275
So it cannot match the
empty string either.
33
00:01:33,275 --> 00:01:35,465
So that's defined as false.
34
00:01:35,465 --> 00:01:37,775
This regular expression,
the whole purpose
35
00:01:37,775 --> 00:01:39,680
is that it can match
the empty string.
36
00:01:39,680 --> 00:01:41,645
So it's defined as true.
37
00:01:41,645 --> 00:01:44,599
If a regular expression
can match a character,
38
00:01:44,599 --> 00:01:47,045
then it cannot match
the empty string.
39
00:01:47,045 --> 00:01:49,445
So that is defined as false.
40
00:01:49,445 --> 00:01:53,180
If an alternative can
match the empty string,
41
00:01:53,180 --> 00:01:56,960
then either r1 can
match the empty string,
42
00:01:56,960 --> 00:01:59,720
or r2 can match the empty string.
43
00:01:59,720 --> 00:02:03,110
So either nullable
of r1 has to hold,
44
00:02:03,110 --> 00:02:06,945
or nullable of r2 has to hold.
45
00:02:06,945 --> 00:02:09,925
In this sequence, it's
the other way around.
46
00:02:09,925 --> 00:02:12,790
If this regular expression can
match the empty string,
47
00:02:12,790 --> 00:02:16,615
then r1 has to be able to
match the empty string,
48
00:02:16,615 --> 00:02:20,290
and r2 has to be able to
match the empty string.
49
00:02:20,290 --> 00:02:22,555
So here it's just the opposite.
50
00:02:22,555 --> 00:02:25,660
In one case it is or and
the other case it is and.
51
00:02:25,660 --> 00:02:27,970
The star regular
expression can match
52
00:02:27,970 --> 00:02:30,445
always the empty string.
So that is true.
53
00:02:30,445 --> 00:02:33,340
So this is a simple
recursive function
54
00:02:33,340 --> 00:02:37,179
and it should not be too
difficult to implement.
55
00:02:37,179 --> 00:02:42,025
Okay, this nullable function
that is easy-peasy.
56
00:02:42,025 --> 00:02:44,604
The second function, however,
57
00:02:44,604 --> 00:02:49,155
is a bit more involved and
that's just to be expected.
58
00:02:49,155 --> 00:02:53,075
Remember people working in
this area already for decades.
59
00:02:53,075 --> 00:02:57,305
If they have some problems
with runtime, for example,
60
00:02:57,305 --> 00:02:58,940
we can't expect that
61
00:02:58,940 --> 00:03:03,260
a simple fix will solve all
the problems in the world.
62
00:03:03,260 --> 00:03:06,530
So I admit the second
function is a bit more
63
00:03:06,530 --> 00:03:10,085
complicated and make sure
that you understand it.
64
00:03:10,085 --> 00:03:12,140
And it also just
shows this Brzozowski
65
00:03:12,140 --> 00:03:14,345
is a very clever guy.
66
00:03:14,345 --> 00:03:15,800
Actually, I have to say he was
67
00:03:15,800 --> 00:03:17,720
a clever guy because
I think he either
68
00:03:17,720 --> 00:03:21,650
died last year or
the year before.
69
00:03:21,650 --> 00:03:25,505
And he came up with this
algorithm already in 1964.
70
00:03:25,505 --> 00:03:27,260
It somehow got lost,
71
00:03:27,260 --> 00:03:30,305
but has been rediscovered
in the last ten years.
72
00:03:30,305 --> 00:03:34,685
So the idea of the second
function is the following.
73
00:03:34,685 --> 00:03:38,120
Imagine you have a
regular expression and it can
74
00:03:38,120 --> 00:03:41,930
match a string of the
form c followed by s.
75
00:03:41,930 --> 00:03:44,810
So the c is the first
character of that string.
76
00:03:44,810 --> 00:03:48,605
So I imagine that r can
match this kind of string.
77
00:03:48,605 --> 00:03:50,330
The question is now,
78
00:03:50,330 --> 00:03:52,400
what would a regular
expression look
79
00:03:52,400 --> 00:03:54,695
like that can match just
80
00:03:54,695 --> 00:03:57,140
s. You probably remember
81
00:03:57,140 --> 00:03:59,300
that from the semantic
derivative,
82
00:03:59,300 --> 00:04:00,860
there was also
something of chopping
83
00:04:00,860 --> 00:04:02,210
off the first character.
84
00:04:02,210 --> 00:04:04,940
Here it's the same.
If a regular expression
85
00:04:04,940 --> 00:04:07,835
can match a string
starting with a c,
86
00:04:07,835 --> 00:04:11,090
we're looking for a regular
expression which can match
87
00:04:11,090 --> 00:04:15,215
the rest of the string where
the c has been chopped off.
88
00:04:15,215 --> 00:04:17,690
And this operation will be called
89
00:04:17,690 --> 00:04:21,049
the derivative of a
regular expression.
90
00:04:21,049 --> 00:04:22,205
And it will also take
91
00:04:22,205 --> 00:04:25,460
a character as argument
and a regular expression.
92
00:04:25,460 --> 00:04:28,730
And in contrast to
the semantic
93
00:04:28,730 --> 00:04:31,310
derivative, which works
94
00:04:31,310 --> 00:04:34,430
over languages or
sets of strings,
95
00:04:34,430 --> 00:04:39,620
this derivative works
over regular expressions.
96
00:04:39,620 --> 00:04:41,330
Okay, here's this function.
97
00:04:41,330 --> 00:04:43,970
It's defined recursively over
98
00:04:43,970 --> 00:04:46,370
the structure of regular expressions.
99
00:04:46,370 --> 00:04:48,814
The first argument
is the character,
100
00:04:48,814 --> 00:04:52,700
and the second one is
a regular expression.
101
00:04:52,700 --> 00:04:56,510
And remember the idea
is we're looking for
102
00:04:56,510 --> 00:05:00,680
a regular expression that
can match everything
103
00:05:00,680 --> 00:05:03,125
this regular expression
as argument can match
104
00:05:03,125 --> 00:05:07,040
except for the c. So now let's
look at this first case.
105
00:05:07,040 --> 00:05:10,550
So this reg expression
cannot match any string.
106
00:05:10,550 --> 00:05:14,270
So it certainly cannot
match any string starting
107
00:05:14,270 --> 00:05:16,910
a c. So we have to look
108
00:05:16,910 --> 00:05:20,090
for a regular expression which
can not match anything.
109
00:05:20,090 --> 00:05:22,700
Well, that's the purpose
of this regular expression,
110
00:05:22,700 --> 00:05:24,815
so we define it as 0.
111
00:05:24,815 --> 00:05:28,130
In the next case,
this regular expression
112
00:05:28,130 --> 00:05:30,440
can match the empty string,
113
00:05:30,440 --> 00:05:33,440
but it cannot match
any string that starts
114
00:05:33,440 --> 00:05:36,350
with c. So also in this case,
115
00:05:36,350 --> 00:05:39,560
we just define it as
the regular expression,
116
00:05:39,560 --> 00:05:41,615
which cannot match anything.
117
00:05:41,615 --> 00:05:45,170
In the next case, this
c is the argument to
118
00:05:45,170 --> 00:05:48,335
the derivative and this d
is the regular expression.
119
00:05:48,335 --> 00:05:50,225
So now there are two cases.
120
00:05:50,225 --> 00:05:53,525
If this c and this d
is actually equal,
121
00:05:53,525 --> 00:05:55,970
Thaat means this regular
expression can match
122
00:05:55,970 --> 00:05:59,240
a string which just contains c.
123
00:05:59,240 --> 00:06:01,505
So if we strip that off,
124
00:06:01,505 --> 00:06:04,790
what remains is
the empty string.
125
00:06:04,790 --> 00:06:06,890
What is a regular expression
126
00:06:06,890 --> 00:06:09,170
look like that can
match the empty string.
127
00:06:09,170 --> 00:06:11,915
Well, that's the
purpose of the 1.
128
00:06:11,915 --> 00:06:15,440
And if c is not equal to d,
129
00:06:15,440 --> 00:06:17,630
then this regular expression
130
00:06:17,630 --> 00:06:19,220
cannot match anything
which starts with
131
00:06:19,220 --> 00:06:23,780
a c. So again it will
be defined as just 0.
132
00:06:23,780 --> 00:06:29,390
Now, the alternative case,
133
00:06:29,390 --> 00:06:31,970
if this regular expression can
134
00:06:31,970 --> 00:06:36,050
match strings
starting with c,
135
00:06:36,050 --> 00:06:40,820
then it can either be
matched by the r1
136
00:06:40,820 --> 00:06:44,495
or it can be matched by the r2.
137
00:06:44,495 --> 00:06:50,090
If the r1 can match c
and then following a string,
138
00:06:50,090 --> 00:06:53,975
then we just have to recursively
call the derivative for
139
00:06:53,975 --> 00:06:56,570
r1 to get a regular expression
140
00:06:56,570 --> 00:06:59,135
that can match the
rest of the string.
141
00:06:59,135 --> 00:07:02,690
And the same we can do with r2.
142
00:07:02,690 --> 00:07:06,110
You can find a regular expression
which can match everything
143
00:07:06,110 --> 00:07:07,850
this r2 can match,
144
00:07:07,850 --> 00:07:09,710
starting with c, but
145
00:07:09,710 --> 00:07:12,590
we are chopping off this c.
146
00:07:12,590 --> 00:07:16,370
So now if you have to find
the regular expression,
147
00:07:16,370 --> 00:07:20,030
which can match all the strings
where c is chopped off,
148
00:07:20,030 --> 00:07:22,295
then we just have to
take the alternative
149
00:07:22,295 --> 00:07:24,965
of these two derivatives.
150
00:07:24,965 --> 00:07:29,390
Now to sequence case.
151
00:07:29,390 --> 00:07:33,005
This sequence case is the
most complicated one.
152
00:07:33,005 --> 00:07:35,180
And let's look first at
153
00:07:35,180 --> 00:07:39,335
the second case, in the
else branch.
154
00:07:39,335 --> 00:07:42,830
So if this
regular expression
155
00:07:42,830 --> 00:07:46,145
can match a string
starting with c,
156
00:07:46,145 --> 00:07:48,155
then the following
must have happened.
157
00:07:48,155 --> 00:07:51,905
First, the r1 must have matched
a string starting with c
158
00:07:51,905 --> 00:07:56,420
and then anything coming
afterwards with r2.
159
00:07:56,420 --> 00:07:59,660
So in this case I only
160
00:07:59,660 --> 00:08:03,260
have to call this
derivative for this r1,
161
00:08:03,260 --> 00:08:06,830
and find the regular expression
which can match everything
162
00:08:06,830 --> 00:08:11,555
this r1 can match except
for this c chopped off.
163
00:08:11,555 --> 00:08:15,830
And I have to build the
sequence derivative of that.
164
00:08:15,830 --> 00:08:18,260
So that's what the else branch says:
165
00:08:18,260 --> 00:08:21,860
I take the
derivative of r1 and I
166
00:08:21,860 --> 00:08:23,480
put the r2 on
167
00:08:23,480 --> 00:08:25,865
the back because that's
the rest of the string.
168
00:08:25,865 --> 00:08:29,240
So that's the only
case we have to consider
169
00:08:29,240 --> 00:08:32,750
this sequence case
except if not the
170
00:08:32,750 --> 00:08:37,895
r1 can match the
c and something else.
171
00:08:37,895 --> 00:08:42,965
But if r1 is matching
actually the empty string.
172
00:08:42,965 --> 00:08:48,890
In this case actually the r2
is in charge of matching
173
00:08:48,890 --> 00:08:51,590
the string starting with c. So in
174
00:08:51,590 --> 00:08:55,490
this case we have to call
the derivative for r2.
175
00:08:55,490 --> 00:08:57,875
So that's why we have
these two cases.
176
00:08:57,875 --> 00:09:00,455
So if r1 is nullable,
177
00:09:00,455 --> 00:09:03,245
then it can match
the empty string.
178
00:09:03,245 --> 00:09:05,330
And we have to
consider the case that
179
00:09:05,330 --> 00:09:08,045
this r2 is matching a string
180
00:09:08,045 --> 00:09:10,700
starting with c. And so we have
181
00:09:10,700 --> 00:09:14,210
to call the derivative
for this r2 in this case.
182
00:09:14,210 --> 00:09:18,680
Otherwise, the r1 will
be in charge of matching
183
00:09:18,680 --> 00:09:20,840
a part of that
string starting with
184
00:09:20,840 --> 00:09:24,695
c. And I have to call the
derivative on this r1.
185
00:09:24,695 --> 00:09:30,670
I hope this makes sense.
186
00:09:30,670 --> 00:09:34,150
Go over that and
also the handouts
187
00:09:34,150 --> 00:09:37,465
again. That case
is really subtle.
188
00:09:37,465 --> 00:09:40,945
And how do I remember this case?
189
00:09:40,945 --> 00:09:42,430
Well, I know it's
190
00:09:42,430 --> 00:09:45,310
an if condition and the
condition is nullable,
191
00:09:45,310 --> 00:09:48,160
then I will always remember
192
00:09:48,160 --> 00:09:53,275
the else branch where it pushes
the derivative over the r1.
193
00:09:53,275 --> 00:09:55,780
So I usually write
down the else branch first,
194
00:09:55,780 --> 00:09:59,650
and then construct
the then branch by saying,
195
00:09:59,650 --> 00:10:01,525
well, I just repeat this.
196
00:10:01,525 --> 00:10:03,760
And I have to remember
in that case I
197
00:10:03,760 --> 00:10:06,580
have to build also
derivative of r2.
198
00:10:06,580 --> 00:10:12,695
Finally, the star case.
199
00:10:12,695 --> 00:10:15,665
So here again
we're looking for
200
00:10:15,665 --> 00:10:17,300
a regular expression which
201
00:10:17,300 --> 00:10:19,745
can match the string
starting with c,
202
00:10:19,745 --> 00:10:22,355
except that the c has
been chopped off.
203
00:10:22,355 --> 00:10:28,640
So if r* has to match
a string starting with c,
204
00:10:28,640 --> 00:10:32,735
then at least we need one
copy of this r.
205
00:10:32,735 --> 00:10:34,310
So at least one copy of
206
00:10:34,310 --> 00:10:37,010
this r has matched
something which starts with
207
00:10:37,010 --> 00:10:38,870
a c and then afterwards
208
00:10:38,870 --> 00:10:41,570
come 0 more copies
of this r*.
209
00:10:41,570 --> 00:10:45,530
What we do there
is we are trying to find
210
00:10:45,530 --> 00:10:47,960
the regular expression
which can match
211
00:10:47,960 --> 00:10:50,915
this first part of the string.
212
00:10:50,915 --> 00:10:53,255
However, where the
c is chopped off.
213
00:10:53,255 --> 00:10:55,130
And then we just say, well,
214
00:10:55,130 --> 00:10:57,050
the rest has to be
matched again with
215
00:10:57,050 --> 00:10:59,030
0 or more copies of
216
00:10:59,030 --> 00:11:02,600
this r. So that's why
it's defined like this.
217
00:11:02,600 --> 00:11:09,050
Okay? ...As said, please take care
with this definition.
218
00:11:09,050 --> 00:11:11,435
That's not so simple.
219
00:11:11,435 --> 00:11:13,250
Once you get a hang of it,
220
00:11:13,250 --> 00:11:15,170
however, it makes perfect sense.
221
00:11:15,170 --> 00:11:17,210
So let me explain it in
222
00:11:17,210 --> 00:11:20,825
different ways in
the next slides.
223
00:11:20,825 --> 00:11:24,695
Let's look
first at some examples.
224
00:11:24,695 --> 00:11:27,140
So here is a regular expression,
225
00:11:27,140 --> 00:11:29,390
r. And let's have
226
00:11:29,390 --> 00:11:32,450
a look at these three
derivatives according to a,
227
00:11:32,450 --> 00:11:38,405
b, and c. And we shall do
the one for a.
228
00:11:38,405 --> 00:11:42,379
So here is our regular expression
229
00:11:42,379 --> 00:11:45,334
and I was very generous
with parentheses.
230
00:11:45,334 --> 00:11:48,140
And the outermost is a star.
231
00:11:48,140 --> 00:11:52,550
So we now build the
derivative according to a,
232
00:11:52,550 --> 00:11:55,474
the character a, of
that regular expression.
233
00:11:55,474 --> 00:11:57,380
So the first thing we
234
00:11:57,380 --> 00:11:59,555
have to analyse is the case star.
235
00:11:59,555 --> 00:12:04,370
So here's the regular expression,
which we are looking at.
236
00:12:04,370 --> 00:12:09,170
This r and the outermost
constructor is this star.
237
00:12:09,170 --> 00:12:11,510
If you go back to the definition,
238
00:12:11,510 --> 00:12:13,625
I hope you have it next to you,
239
00:12:13,625 --> 00:12:16,340
then this star-case is defined
240
00:12:16,340 --> 00:12:20,000
as you're taking just the
inside of the star
241
00:12:20,000 --> 00:12:23,030
and apply this derivative and
242
00:12:23,030 --> 00:12:26,765
leave the r on the
outside at the end.
243
00:12:26,765 --> 00:12:29,990
Ok. So that's the first step.
244
00:12:29,990 --> 00:12:32,030
Now we have to analyze
245
00:12:32,030 --> 00:12:36,035
the derivative according to
a of this regular expression,
246
00:12:36,035 --> 00:12:38,000
which is an alternative.
247
00:12:38,000 --> 00:12:39,665
So the outermost is a plus.
248
00:12:39,665 --> 00:12:41,375
Ok, that's very easy again,
249
00:12:41,375 --> 00:12:45,185
we just have to push the
derivative into each component,
250
00:12:45,185 --> 00:12:47,705
into the a followed by b.
251
00:12:47,705 --> 00:12:49,145
And into the second
252
00:12:49,145 --> 00:12:51,185
alternative, into b.
253
00:12:51,185 --> 00:12:56,030
We take the derivative
of each according to a.
254
00:12:56,030 --> 00:13:00,635
Now this one is a sequence
regular expression.
255
00:13:00,635 --> 00:13:02,210
This was the complicated case.
256
00:13:02,210 --> 00:13:04,160
First of all
we have to test is
257
00:13:04,160 --> 00:13:07,910
the first component
nullable of this sequence?
258
00:13:07,910 --> 00:13:09,200
Well, that is a,
259
00:13:09,200 --> 00:13:12,740
in this case, a on its
own is not nullable.
260
00:13:12,740 --> 00:13:14,210
So we are the easy case,
261
00:13:14,210 --> 00:13:17,000
we only have a single derivative
262
00:13:17,000 --> 00:13:19,370
pushed in the first component.
263
00:13:19,370 --> 00:13:25,160
So we have the derivative
of a with the character a.
264
00:13:25,160 --> 00:13:27,920
Okay, that's now
the character case.
265
00:13:27,920 --> 00:13:29,720
And in this case the character
266
00:13:29,720 --> 00:13:31,715
and the regular expression agree.
267
00:13:31,715 --> 00:13:33,890
So it's defined as one.
268
00:13:33,890 --> 00:13:37,550
Ok? In the other case,
269
00:13:37,550 --> 00:13:39,710
the regular expression is b,
270
00:13:39,710 --> 00:13:41,675
but the characters a.
271
00:13:41,675 --> 00:13:46,385
So in this case
it's defined as 0.
272
00:13:46,385 --> 00:13:50,630
Okay? So that's what the
derivative would be.
273
00:13:50,630 --> 00:13:52,160
This r is there
274
00:13:52,160 --> 00:13:55,280
because originally we
started with a star.
275
00:13:55,280 --> 00:13:58,295
This regular expression is that star.
276
00:13:58,295 --> 00:14:02,780
So the derivative
according to a
277
00:14:02,780 --> 00:14:07,610
of that regular expression
is this expression.
278
00:14:07,610 --> 00:14:10,970
We just have to
substitute this r back in.
279
00:14:10,970 --> 00:14:13,745
Just coming back to this slide.
280
00:14:13,745 --> 00:14:16,160
So far, we only analyzes
281
00:14:16,160 --> 00:14:19,505
the derivative according
to a single character.
282
00:14:19,505 --> 00:14:23,960
But we can also very easily
extend that to whole strings.
283
00:14:23,960 --> 00:14:26,360
So if you build the
derivative according
284
00:14:26,360 --> 00:14:27,905
to the empty string,
285
00:14:27,905 --> 00:14:30,875
we just return the
regular expression.
286
00:14:30,875 --> 00:14:35,585
If we have a string
starting with character c,
287
00:14:35,585 --> 00:14:37,850
remember that can
be any character,
288
00:14:37,850 --> 00:14:42,170
then we build first the simple
derivative according to
289
00:14:42,170 --> 00:14:44,360
that first character and
290
00:14:44,360 --> 00:14:46,925
continue with the
rest of the string.
291
00:14:46,925 --> 00:14:50,615
So here you see again,
my personal convention:
292
00:14:50,615 --> 00:14:54,365
everything which works on
lists has this s at the end.
293
00:14:54,365 --> 00:14:57,125
So this function is
for single characters.
294
00:14:57,125 --> 00:14:59,179
This one is for strings,
295
00:14:59,179 --> 00:15:02,450
but it uses the one
for the character.
296
00:15:02,450 --> 00:15:04,025
Essentially what it does is
297
00:15:04,025 --> 00:15:06,185
it chops off the first character,
298
00:15:06,185 --> 00:15:09,800
builds the derivative, then
chops off the next character,
299
00:15:09,800 --> 00:15:13,760
builds the derivative of
the result, and so on.
300
00:15:13,760 --> 00:15:17,000
Having this function,
we can actually now
301
00:15:17,000 --> 00:15:20,600
state what the algorithm
is, the complete algorithm.
302
00:15:20,600 --> 00:15:23,465
So the Brzozowski matcher
303
00:15:23,465 --> 00:15:24,860
takes a regular expression as
304
00:15:24,860 --> 00:15:26,915
argument and a
string as argument.
305
00:15:26,915 --> 00:15:30,920
And is supposed to say yes if
the regular expression matches
306
00:15:30,920 --> 00:15:33,560
the string or no
307
00:15:33,560 --> 00:15:36,065
if the regular expression does
not match the string.
308
00:15:36,065 --> 00:15:37,715
And how does it do that?
309
00:15:37,715 --> 00:15:42,560
Well, it takes this string
s and this regular expression,
310
00:15:42,560 --> 00:15:43,925
and it first builds
311
00:15:43,925 --> 00:15:48,845
successive derivatives until
that string is exhausted.
312
00:15:48,845 --> 00:15:52,115
Then you have
a final derivative,
313
00:15:52,115 --> 00:15:53,839
a final regular expression
314
00:15:53,839 --> 00:15:55,370
and you test whether
315
00:15:55,370 --> 00:15:57,920
this regular expression can
match the empty string.
316
00:15:57,920 --> 00:16:01,370
If yes, then the
original regular expression,
317
00:16:01,370 --> 00:16:03,245
this r, can match the string.
318
00:16:03,245 --> 00:16:05,210
If no, if it cannot match
319
00:16:05,210 --> 00:16:08,030
the final derivative
with the empty string,
320
00:16:08,030 --> 00:16:10,280
then no this
regular expression r
321
00:16:10,280 --> 00:16:12,905
cannot match that string.
322
00:16:12,905 --> 00:16:16,010
I know it looks
very anticlimactic,
323
00:16:16,010 --> 00:16:19,625
but that's actually the
beauty of this algorithm,
324
00:16:19,625 --> 00:16:22,760
that it's not that complicated.
325
00:16:22,760 --> 00:16:25,340
So how does the algorithm work
326
00:16:25,340 --> 00:16:27,634
in a concrete example?
327
00:16:27,634 --> 00:16:31,580
Imagine you have a string, abc
328
00:16:31,580 --> 00:16:34,370
and you have a regular
expression, say r1.
329
00:16:34,370 --> 00:16:37,520
And you want to find out
whether this r1 can match
330
00:16:37,520 --> 00:16:41,300
that string abc or not.
How do you do that?
331
00:16:41,300 --> 00:16:45,140
Well, you would first
take this r1 and you
332
00:16:45,140 --> 00:16:47,150
build the derivative according
333
00:16:47,150 --> 00:16:49,880
to the first character, the a.
334
00:16:49,880 --> 00:16:53,015
Okay? You get the derivative,
335
00:16:53,015 --> 00:16:55,294
which I call here r2.
336
00:16:55,294 --> 00:16:58,640
Then you take the next
character, the b.
337
00:16:58,640 --> 00:17:04,535
You now build the derivative
according to b of this r2.
338
00:17:04,535 --> 00:17:07,460
Okay? So you take the
result of the first step,
339
00:17:07,460 --> 00:17:09,530
you feed it into the second step,
340
00:17:09,530 --> 00:17:11,810
and you take the
second character.
341
00:17:11,810 --> 00:17:17,075
Then you do this also with c.
So you get a derivative r3,
342
00:17:17,075 --> 00:17:22,460
and you build the derivative
of r3 according to c,
343
00:17:22,460 --> 00:17:24,185
you get an r4.
344
00:17:24,185 --> 00:17:26,300
Okay, so that's the
final derivative.
345
00:17:26,300 --> 00:17:27,530
The string is exhausted.
346
00:17:27,530 --> 00:17:29,570
We build derivatives
according to a, b,
347
00:17:29,570 --> 00:17:34,610
and c. Now we just test whether
this r4 is nullable.
348
00:17:34,610 --> 00:17:37,175
If it says yes,
349
00:17:37,175 --> 00:17:41,510
then the regular expression matcher
will just say true, yes,
350
00:17:41,510 --> 00:17:43,340
this original regular expression,
351
00:17:43,340 --> 00:17:47,270
the r1, will be able to
match that string abc.
352
00:17:47,270 --> 00:17:50,585
And if this test returns false,
353
00:17:50,585 --> 00:17:53,015
then the algorithm says false.
354
00:17:53,015 --> 00:17:56,975
This regular expression will
not match the string.
355
00:17:56,975 --> 00:18:00,260
Ok, you might ask:
Why on earth does
356
00:18:00,260 --> 00:18:02,960
that algorithm
actually work?
357
00:18:02,960 --> 00:18:06,515
Here's anather explanation
for why it works.
358
00:18:06,515 --> 00:18:10,190
Imagine you have a regular
expression r1, okay?
359
00:18:10,190 --> 00:18:13,220
And you have a string abc,
360
00:18:13,220 --> 00:18:14,270
and you want to find out
361
00:18:14,270 --> 00:18:17,180
whether r1 can
match that string.
362
00:18:17,180 --> 00:18:18,799
And for the moment,
363
00:18:18,799 --> 00:18:22,610
let's assume that it
can match that string.
364
00:18:22,610 --> 00:18:26,315
Ok? So the language L of
365
00:18:26,315 --> 00:18:30,185
r1 will actually
contain that string,
366
00:18:30,185 --> 00:18:31,805
otherwise it wouldn't match that.
367
00:18:31,805 --> 00:18:36,710
Okay? So abc is in
this language, okay?
368
00:18:36,710 --> 00:18:39,785
If I now take the
semantic derivative,
369
00:18:39,785 --> 00:18:43,145
that means I look at all
the strings in this L(r1)
370
00:18:43,145 --> 00:18:46,820
and filter out
371
00:18:46,820 --> 00:18:48,740
all the ones which
do not start with
372
00:18:48,740 --> 00:18:51,260
an a, I discharge them.
373
00:18:51,260 --> 00:18:54,545
And I only look at the ones
which start with an a.
374
00:18:54,545 --> 00:18:56,465
And of those strings,
375
00:18:56,465 --> 00:18:58,475
I chop off this a.
376
00:18:58,475 --> 00:19:01,025
So after this
semantic derivative,
377
00:19:01,025 --> 00:19:05,735
this set of strings will
contain just bc.
378
00:19:05,735 --> 00:19:12,830
Ok. Now if I build the next
semantic derivative of that,
379
00:19:12,830 --> 00:19:14,345
then I would look at
380
00:19:14,345 --> 00:19:16,850
all the strings which
start with a b,
381
00:19:16,850 --> 00:19:21,350
and forget about everything
else. Of the remaining ones
382
00:19:21,350 --> 00:19:27,905
I know they start with b.
I just chop of the b. Ok.
383
00:19:27,905 --> 00:19:31,655
So in this whole set here,
384
00:19:31,655 --> 00:19:33,785
in this whole set here,
385
00:19:33,785 --> 00:19:39,030
there will be now a string
which is just c. Okay?
386
00:19:39,190 --> 00:19:44,420
Then I built the third
semantic derivative
387
00:19:44,420 --> 00:19:47,300
because I want to find out
whether abc is matched.
388
00:19:47,300 --> 00:19:50,540
Okay? So now I look
at all the strings in
389
00:19:50,540 --> 00:19:52,820
here and look at
390
00:19:52,820 --> 00:19:55,340
them whether they start
with a c. If yes,
391
00:19:55,340 --> 00:19:56,885
I chop off the c.
392
00:19:56,885 --> 00:19:59,120
And put in what is remaining.
393
00:19:59,120 --> 00:20:00,425
So in this case,
394
00:20:00,425 --> 00:20:02,510
if I have the string c
395
00:20:02,510 --> 00:20:04,550
in this language and
I chop off this c,
396
00:20:04,550 --> 00:20:07,700
what is remaining
is the empty string.
397
00:20:07,700 --> 00:20:09,695
So we have to check of
398
00:20:09,695 --> 00:20:14,510
that language whether it
contains the empty string.
399
00:20:14,510 --> 00:20:18,800
If yes, then the
original r1 can match
400
00:20:18,800 --> 00:20:21,050
this abc because this abc
401
00:20:21,050 --> 00:20:24,119
must have been in this language.
402
00:20:24,130 --> 00:20:28,565
And if in the end there wasn't
the empty string, then
403
00:20:28,565 --> 00:20:33,575
this abs was not in
this language of r1.
404
00:20:33,575 --> 00:20:36,260
And so the lexer must have,
405
00:20:36,260 --> 00:20:38,880
or the matcher must have failed.
406
00:20:39,040 --> 00:20:42,530
The clever bit is that here
407
00:20:42,530 --> 00:20:45,530
the explanation is for languages.
408
00:20:45,530 --> 00:20:49,835
Remember, this
semantic derivative
409
00:20:49,835 --> 00:20:53,450
works over languages and they
sometimes can be in finite.
410
00:20:53,450 --> 00:20:55,730
So that's not really
an algorithm.
411
00:20:55,730 --> 00:20:58,880
That's just
explaining the idea.
412
00:20:58,880 --> 00:21:02,525
What Brzozowski
achieved was that he
413
00:21:02,525 --> 00:21:06,440
now works with this derivative
regular expressions and
414
00:21:06,440 --> 00:21:10,715
somehow imitates what
happens on these languages.
415
00:21:10,715 --> 00:21:14,135
Because remember if you
have a regular expression,
416
00:21:14,135 --> 00:21:17,405
and you want to test
whether it can match abc,
417
00:21:17,405 --> 00:21:22,550
then you take first
derivative according to a.
418
00:21:22,550 --> 00:21:25,760
So you will get a regular
expression which can match b
419
00:21:25,760 --> 00:21:29,464
and c, if r could match abc.
420
00:21:29,464 --> 00:21:31,430
So after the first derivative,
421
00:21:31,430 --> 00:21:33,620
you will get a regular expression
which can match b and
422
00:21:33,620 --> 00:21:37,070
c. If you take the
second derivative,
423
00:21:37,070 --> 00:21:41,225
you will get a regular expression
which can match c alone.
424
00:21:41,225 --> 00:21:44,180
And if you take the
final derivative,
425
00:21:44,180 --> 00:21:46,070
then you will get
426
00:21:46,070 --> 00:21:48,260
a regular expression which hopefully
427
00:21:48,260 --> 00:21:49,715
can match the empty string.
428
00:21:49,715 --> 00:21:53,780
If it does, then this
r can match the abc.
429
00:21:53,780 --> 00:21:55,655
And if it doesn't, then
430
00:21:55,655 --> 00:21:58,680
abc couldn't be
matched by this r.
431
00:21:58,900 --> 00:22:02,990
Okay, let's have a look
how this pans out in code.
432
00:22:02,990 --> 00:22:06,050
Here's the file re1.sc.
433
00:22:06,050 --> 00:22:07,940
It's also uploaded on KEATS,
434
00:22:07,940 --> 00:22:10,625
so you can see exactly
what I'm doing.
435
00:22:10,625 --> 00:22:13,970
And actually you already saw
that file because I showed you
436
00:22:13,970 --> 00:22:15,710
how my regular expressions are
437
00:22:15,710 --> 00:22:17,960
defined, with the
abstract classes here.
438
00:22:17,960 --> 00:22:21,155
And here, the six cases
for 0, 1, character,
439
00:22:21,155 --> 00:22:23,540
alternative,
sequence and star.
440
00:22:23,540 --> 00:22:26,705
Ok. So the first
function nullable,
441
00:22:26,705 --> 00:22:28,760
the simple one, takes
442
00:22:28,760 --> 00:22:32,120
a regular expression as
argument and returns a boolean.
443
00:22:32,120 --> 00:22:34,280
And then with this
pattern matching,
444
00:22:34,280 --> 00:22:37,040
we just go through
all these six cases
445
00:22:37,040 --> 00:22:38,900
0 is defined as false.
446
00:22:38,900 --> 00:22:43,234
One is defined as true.
Character, for any character,
447
00:22:43,234 --> 00:22:45,455
this nullable will return false.
448
00:22:45,455 --> 00:22:47,540
The alternative is defined here,
449
00:22:47,540 --> 00:22:50,000
so that's the or in Scala.
450
00:22:50,000 --> 00:22:52,700
And for the sequence,
that's the and.
451
00:22:52,700 --> 00:22:56,180
And this star, no matter
what the regular expression is,
452
00:22:56,180 --> 00:22:59,540
it will always match the
empty string, so true.
453
00:22:59,540 --> 00:23:02,225
So nullable is very easy.
454
00:23:02,225 --> 00:23:07,430
The derivative is also not
so much more complicated.
455
00:23:07,430 --> 00:23:08,974
It takes two arguments,
456
00:23:08,974 --> 00:23:11,810
a character and the
regular expression,
457
00:23:11,810 --> 00:23:14,405
and returns a regular expression.
458
00:23:14,405 --> 00:23:16,340
So now we just have to analyze
459
00:23:16,340 --> 00:23:18,890
what's that regular
expression looks like.
460
00:23:18,890 --> 00:23:23,870
If it's 0, we return
0; if it's 1 we return 0.
461
00:23:23,870 --> 00:23:26,990
If the character... If the
regular expressions character
462
00:23:26,990 --> 00:23:30,260
d, we test whether it's
equal to this character
463
00:23:30,260 --> 00:23:32,225
we want to take the
derivative off.
464
00:23:32,225 --> 00:23:36,185
If yes, return 1, otherwise 0.
465
00:23:36,185 --> 00:23:38,600
The alternative which has pushed
466
00:23:38,600 --> 00:23:39,860
the derivative under
467
00:23:39,860 --> 00:23:42,710
this alternative,
that's the easy one.
468
00:23:42,710 --> 00:23:44,690
Here's the sequence case where we
469
00:23:44,690 --> 00:23:47,015
first have to test
if it's nullable,
470
00:23:47,015 --> 00:23:49,040
If it's not the have to push
471
00:23:49,040 --> 00:23:52,160
the derivative over
the first, the r1.
472
00:23:52,160 --> 00:23:56,135
And otherwise if it is nullable,
we have two cases.
473
00:23:56,135 --> 00:24:00,605
Either you have to push
it over the r1 or r2.
474
00:24:00,605 --> 00:24:03,860
And a star case, this one.
475
00:24:03,860 --> 00:24:07,160
We just build the sequence
of the derivative of
476
00:24:07,160 --> 00:24:11,600
r1 and append the
r1 as a sequence,
477
00:24:11,600 --> 00:24:14,195
put the star at the end.
478
00:24:14,195 --> 00:24:19,430
Okay, so here's the
function for strings.
479
00:24:19,430 --> 00:24:21,740
So a list of characters.
480
00:24:21,740 --> 00:24:23,870
This function takes an s,
481
00:24:23,870 --> 00:24:26,480
a list of characters as argument
and a regular expression
482
00:24:26,480 --> 00:24:29,360
as argument and returns
a regular expression.
483
00:24:29,360 --> 00:24:31,460
And we just have to
pattern match what
484
00:24:31,460 --> 00:24:34,055
that string looks like
or this list looks like.
485
00:24:34,055 --> 00:24:35,360
If it's the empty list,
486
00:24:35,360 --> 00:24:38,030
it just immediately
return the r. If
487
00:24:38,030 --> 00:24:42,020
it's something of the
form c followed by s,
488
00:24:42,020 --> 00:24:45,050
then we build first the
derivative according
489
00:24:45,050 --> 00:24:48,345
to c. And then
the result of that,
490
00:24:48,345 --> 00:24:50,090
we recursively call
491
00:24:50,090 --> 00:24:53,675
the derivative
according to s. Okay?
492
00:24:53,675 --> 00:24:56,060
And now the main matcher,
493
00:24:56,060 --> 00:24:59,360
it takes a regular
expression and takes
494
00:24:59,360 --> 00:25:02,870
a string and returns a
boolean, true or false.
495
00:25:02,870 --> 00:25:05,705
And it first builds
the derivative.
496
00:25:05,705 --> 00:25:07,430
The only thing I have to do here
497
00:25:07,430 --> 00:25:09,410
because I'm implementing
it in Scala,
498
00:25:09,410 --> 00:25:15,064
I have to coerce between strings
and lists of characters.
499
00:25:15,064 --> 00:25:16,580
So he, I take first
500
00:25:16,580 --> 00:25:20,810
the toList of the s. That
gives me a list of characters.
501
00:25:20,810 --> 00:25:23,135
Then I build this derivative
502
00:25:23,135 --> 00:25:26,045
with the s, because I'm
looking at strings.
503
00:25:26,045 --> 00:25:28,160
And then at the end,
504
00:25:28,160 --> 00:25:33,050
built the nullable of
the final derivative.
505
00:25:33,050 --> 00:25:37,310
The beauty of all this
is that in Scala,
506
00:25:37,310 --> 00:25:40,085
these definition which
I had on the slides
507
00:25:40,085 --> 00:25:43,835
go almost one-to-one into code.
508
00:25:43,835 --> 00:25:45,605
And that's of course,
509
00:25:45,605 --> 00:25:47,480
makes it very easy
to implement in
510
00:25:47,480 --> 00:25:49,730
a functional language like Scala.
511
00:25:49,730 --> 00:25:52,865
We can also now try
out some examples.
512
00:25:52,865 --> 00:25:56,960
This was the example
I had on the slide.
513
00:25:56,960 --> 00:25:58,370
And we could now calculate
514
00:25:58,370 --> 00:26:00,305
what's the derivative
according to a,
515
00:26:00,305 --> 00:26:02,720
b, and c of this
regular expression.
516
00:26:02,720 --> 00:26:07,040
And I hope you didn't
make any mistake.
517
00:26:07,040 --> 00:26:09,260
Now of course we want
to see whether we
518
00:26:09,260 --> 00:26:11,390
do any better with
this algorithm...
519
00:26:11,390 --> 00:26:13,715
than Python and Ruby.
520
00:26:13,715 --> 00:26:16,070
And let's first have a
look at this example.
521
00:26:16,070 --> 00:26:18,079
So this regular expression.
522
00:26:18,079 --> 00:26:19,880
The problem is that in
523
00:26:19,880 --> 00:26:22,070
our regular expression
matcher so far we have
524
00:26:22,070 --> 00:26:24,530
the sequence rregular
expression, but we
525
00:26:24,530 --> 00:26:27,200
don't have this optional
regular expression.
526
00:26:27,200 --> 00:26:30,800
And we don't have this n
times regular expression,
527
00:26:30,800 --> 00:26:36,605
but we can quickly implement
that in our implementation.
528
00:26:36,605 --> 00:26:40,549
So if you want to build the
optional regular expression,
529
00:26:40,549 --> 00:26:41,870
which is defined here,
530
00:26:41,870 --> 00:26:44,570
a little function which
takes a reg expression and
531
00:26:44,570 --> 00:26:47,360
returns the alternative of r and 1.
532
00:26:47,360 --> 00:26:49,624
So it essentially
takes the definition
533
00:26:49,624 --> 00:26:53,240
of optional r and
replaces it by that.
534
00:26:53,240 --> 00:26:56,150
The n-times what we
essentially do there is
535
00:26:56,150 --> 00:27:01,535
we built n copies of this r. Ok?
536
00:27:01,535 --> 00:27:04,745
So if this n-times was 0,
537
00:27:04,745 --> 00:27:06,245
we just return one.
538
00:27:06,245 --> 00:27:11,570
If it's one, then we return
just the regular expression.
539
00:27:11,570 --> 00:27:15,575
And if it's something
bigger than one,
540
00:27:15,575 --> 00:27:18,560
then we just built the
sequence of one of
541
00:27:18,560 --> 00:27:20,270
these copies and call
542
00:27:20,270 --> 00:27:22,925
this function again
with n - 1.
543
00:27:22,925 --> 00:27:26,330
So we just build now n-copies of that.
544
00:27:26,330 --> 00:27:30,710
Okay, so we can look
at our first example.
545
00:27:30,710 --> 00:27:32,629
This is the regular expression,
546
00:27:32,629 --> 00:27:36,035
and I call that here
evil regular expression1.
547
00:27:36,035 --> 00:27:37,670
It doesn't look as beautiful
548
00:27:37,670 --> 00:27:39,140
as what we write down on paper.
549
00:27:39,140 --> 00:27:41,240
We will actually look
at this later on
550
00:27:41,240 --> 00:27:43,640
if this can be improved.
But here it is.
551
00:27:43,640 --> 00:27:45,724
Here's the regular expression,
552
00:27:45,724 --> 00:27:49,520
and here's a little function
which measures the time.
553
00:27:49,520 --> 00:27:53,180
And we can now start testing
554
00:27:53,180 --> 00:27:57,845
this regular expression with
strings just containing a's.
555
00:27:57,845 --> 00:28:00,020
And we are first a bit cautious,
556
00:28:00,020 --> 00:28:04,985
we test it between 0 and 20,
and we count by two.
557
00:28:04,985 --> 00:28:08,330
And I actually will not
start that within Scala,
558
00:28:08,330 --> 00:28:12,965
but actually use Ammonite.
559
00:28:12,965 --> 00:28:15,305
And that's out.
560
00:28:15,305 --> 00:28:17,269
And that calculates
561
00:28:17,269 --> 00:28:20,675
for 18 a's. It's pretty fast.
562
00:28:20,675 --> 00:28:24,815
But for 20 it already
needs 2 seconds.
563
00:28:24,815 --> 00:28:28,265
And you can see
actually this jump from
564
00:28:28,265 --> 00:28:32,825
18 to 20 in the runtime
is pretty bad too.
565
00:28:32,825 --> 00:28:37,460
And if we actually measure
it more accurately,
566
00:28:37,460 --> 00:28:39,770
then we get this picture.
567
00:28:39,770 --> 00:28:41,255
And what turns out,
568
00:28:41,255 --> 00:28:45,830
we are actually worse than Python
and Ruby in this example.
569
00:28:45,830 --> 00:28:49,230
So I guess that's a fail.