afl-material: videos/02-algo1.srt@5678414a3898


1
00:00:05,880 --> 00:00:09,700
Welcome back.
Remember this slide.

2
00:00:09,700 --> 00:00:11,500
This slide said what is

3
00:00:11,500 --> 00:00:14,500
our regular expression matcher
actually supposed to do?

4
00:00:14,500 --> 00:00:16,570
It will take two arguments,

5
00:00:16,570 --> 00:00:18,670
a regular expression r and a string s.

6
00:00:18,670 --> 00:00:21,580
And it's supposed to say yes,

7
00:00:21,580 --> 00:00:23,440
the regular expression matches

8
00:00:23,440 --> 00:00:26,920
the string if and only
if the string is in

9
00:00:26,920 --> 00:00:28,855
the language of r.

10
00:00:28,855 --> 00:00:32,410
And if the string is not
in the language of r,

11
00:00:32,410 --> 00:00:35,515
then our algorithm has to say no.

12
00:00:35,515 --> 00:00:37,210
And we can't use

13
00:00:37,210 --> 00:00:39,565
this specification
directly because remember,

14
00:00:39,565 --> 00:00:43,305
this L sometimes
produces infinite sets.

15
00:00:43,305 --> 00:00:47,585
And we can't test whether a
string is in infinite set,

16
00:00:47,585 --> 00:00:50,090
at least not easily.

17
00:00:50,090 --> 00:00:52,340
And so what we have to do

18
00:00:52,340 --> 00:00:54,260
instead is we have
to be a bit more

19
00:00:54,260 --> 00:00:57,050
clever and implement
some operations

20
00:00:57,050 --> 00:00:59,284
on regular expressions instead,

21
00:00:59,284 --> 00:01:03,275
because regular expressions
are always finite trees.

22
00:01:03,275 --> 00:01:05,870
I should say the
person who has been

23
00:01:05,870 --> 00:01:08,150
clever is called Brzozowski.

24
00:01:08,150 --> 00:01:11,435
It's his algorithm
I'm introducing here.

25
00:01:11,435 --> 00:01:15,515
And his algorithm consists
of two functions.

26
00:01:15,515 --> 00:01:17,840
One is called
nullable and it takes

27
00:01:17,840 --> 00:01:20,104
a regular expression as argument.

28
00:01:20,104 --> 00:01:24,155
And the idea is that
this function nullable is

29
00:01:24,155 --> 00:01:26,450
testing whether
the regular expression

30
00:01:26,450 --> 00:01:27,950
can match the empty string.

31
00:01:27,950 --> 00:01:30,305
So 0 cannot match any string.

32
00:01:30,305 --> 00:01:33,275
So it cannot match the
empty string either.

33
00:01:33,275 --> 00:01:35,465
So that's defined as false.

34
00:01:35,465 --> 00:01:37,775
This regular expression,
the whole purpose

35
00:01:37,775 --> 00:01:39,680
is that it can match
the empty string.

36
00:01:39,680 --> 00:01:41,645
So it's defined as true.

37
00:01:41,645 --> 00:01:44,599
If a regular expression
can match a character,

38
00:01:44,599 --> 00:01:47,045
then it cannot match
the empty string.

39
00:01:47,045 --> 00:01:49,445
So that is defined as false.

40
00:01:49,445 --> 00:01:53,180
If an alternative can
match the empty string,

41
00:01:53,180 --> 00:01:56,960
then either r1 can
match the empty string,

42
00:01:56,960 --> 00:01:59,720
or r2 can match the empty string.

43
00:01:59,720 --> 00:02:03,110
So either nullable
of r1 has to hold,

44
00:02:03,110 --> 00:02:06,945
or nullable of r2 has to hold.

45
00:02:06,945 --> 00:02:09,925
In this sequence, it's
the other way around.

46
00:02:09,925 --> 00:02:12,790
If this regular expression can
match the empty string,

47
00:02:12,790 --> 00:02:16,615
then r1 has to be able to
match the empty string,

48
00:02:16,615 --> 00:02:20,290
and r2 has to be able to
match the empty string.

49
00:02:20,290 --> 00:02:22,555
So here it's just the opposite.

50
00:02:22,555 --> 00:02:25,660
In one case it is or and
the other case it is and.

51
00:02:25,660 --> 00:02:27,970
The star regular
expression can match

52
00:02:27,970 --> 00:02:30,445
always the empty string.
So that is true.

53
00:02:30,445 --> 00:02:33,340
So this is a simple
recursive function

54
00:02:33,340 --> 00:02:37,179
and it should not be too
difficult to implement.

55
00:02:37,179 --> 00:02:42,025
Okay, this nullable function
that is easy-peasy.

56
00:02:42,025 --> 00:02:44,604
The second function, however,

57
00:02:44,604 --> 00:02:49,155
is a bit more involved and
that's just to be expected.

58
00:02:49,155 --> 00:02:53,075
Remember people working in
this area already for decades.

59
00:02:53,075 --> 00:02:57,305
If they have some problems
with runtime, for example,

60
00:02:57,305 --> 00:02:58,940
we can't expect that 

61
00:02:58,940 --> 00:03:03,260
a simple fix will solve all
the problems in the world.

62
00:03:03,260 --> 00:03:06,530
So I admit the second
function is a bit more

63
00:03:06,530 --> 00:03:10,085
complicated and make sure
that you understand it.

64
00:03:10,085 --> 00:03:12,140
And it also just
shows this Brzozowski

65
00:03:12,140 --> 00:03:14,345
is a very clever guy.

66
00:03:14,345 --> 00:03:15,800
Actually, I have to say he was

67
00:03:15,800 --> 00:03:17,720
a clever guy because
I think he either

68
00:03:17,720 --> 00:03:21,650
died last year or
the year before.

69
00:03:21,650 --> 00:03:25,505
And he came up with this
algorithm already in 1964.

70
00:03:25,505 --> 00:03:27,260
It somehow got lost,

71
00:03:27,260 --> 00:03:30,305
but has been rediscovered
in the last ten years.

72
00:03:30,305 --> 00:03:34,685
So the idea of the second
function is the following.

73
00:03:34,685 --> 00:03:38,120
Imagine you have a
regular expression and it can

74
00:03:38,120 --> 00:03:41,930
match a string of the
form c followed by s.

75
00:03:41,930 --> 00:03:44,810
So the c is the first
character of that string.

76
00:03:44,810 --> 00:03:48,605
So I imagine that r can
match this kind of string.

77
00:03:48,605 --> 00:03:50,330
The question is now,

78
00:03:50,330 --> 00:03:52,400
what would a regular
expression look

79
00:03:52,400 --> 00:03:54,695
like that can match just

80
00:03:54,695 --> 00:03:57,140
s. You probably remember

81
00:03:57,140 --> 00:03:59,300
that from the semantic
derivative,

82
00:03:59,300 --> 00:04:00,860
there was also
something of chopping

83
00:04:00,860 --> 00:04:02,210
off the first character.

84
00:04:02,210 --> 00:04:04,940
Here it's the same.
If a regular expression

85
00:04:04,940 --> 00:04:07,835
can match a string
starting with a c,

86
00:04:07,835 --> 00:04:11,090
we're looking for a regular
expression which can match

87
00:04:11,090 --> 00:04:15,215
the rest of the string where
the c has been chopped off.

88
00:04:15,215 --> 00:04:17,690
And this operation will be called

89
00:04:17,690 --> 00:04:21,049
the derivative of a
regular expression.

90
00:04:21,049 --> 00:04:22,205
And it will also take

91
00:04:22,205 --> 00:04:25,460
a character as argument
and a regular expression.

92
00:04:25,460 --> 00:04:28,730
And in contrast to
the semantic 

93
00:04:28,730 --> 00:04:31,310
derivative, which works

94
00:04:31,310 --> 00:04:34,430
over languages or
sets of strings,

95
00:04:34,430 --> 00:04:39,620
this derivative works
over regular expressions.

96
00:04:39,620 --> 00:04:41,330
Okay, here's this function.

97
00:04:41,330 --> 00:04:43,970
It's defined recursively over

98
00:04:43,970 --> 00:04:46,370
the structure of regular expressions.

99
00:04:46,370 --> 00:04:48,814
The first argument
is the character,

100
00:04:48,814 --> 00:04:52,700
and the second one is
a regular expression.

101
00:04:52,700 --> 00:04:56,510
And remember the idea
is we're looking for

102
00:04:56,510 --> 00:05:00,680
a regular expression that
can match everything

103
00:05:00,680 --> 00:05:03,125
this regular expression
as argument can match

104
00:05:03,125 --> 00:05:07,040
except for the c. So now let's
look at this first case.

105
00:05:07,040 --> 00:05:10,550
So this reg expression
cannot match any string.

106
00:05:10,550 --> 00:05:14,270
So it certainly cannot
match any string starting

107
00:05:14,270 --> 00:05:16,910
a c. So we have to look

108
00:05:16,910 --> 00:05:20,090
for a regular expression which
can not match anything.

109
00:05:20,090 --> 00:05:22,700
Well, that's the purpose
of this regular expression,

110
00:05:22,700 --> 00:05:24,815
so we define it as 0.

111
00:05:24,815 --> 00:05:28,130
In the next case,
this regular expression

112
00:05:28,130 --> 00:05:30,440
can match the empty string,

113
00:05:30,440 --> 00:05:33,440
but it cannot match
any string that starts

114
00:05:33,440 --> 00:05:36,350
with c. So also in this case,

115
00:05:36,350 --> 00:05:39,560
we just define it as
the regular expression,

116
00:05:39,560 --> 00:05:41,615
which cannot match anything.

117
00:05:41,615 --> 00:05:45,170
In the next case, this
c is the argument to

118
00:05:45,170 --> 00:05:48,335
the derivative and this d
is the regular expression.

119
00:05:48,335 --> 00:05:50,225
So now there are two cases.

120
00:05:50,225 --> 00:05:53,525
If this c and this d
is actually equal,

121
00:05:53,525 --> 00:05:55,970
Thaat means this regular
expression can match

122
00:05:55,970 --> 00:05:59,240
a string which just contains c.

123
00:05:59,240 --> 00:06:01,505
So if we strip that off,

124
00:06:01,505 --> 00:06:04,790
what remains is
the empty string.

125
00:06:04,790 --> 00:06:06,890
What is a regular expression

126
00:06:06,890 --> 00:06:09,170
look like that can
match the empty string.

127
00:06:09,170 --> 00:06:11,915
Well, that's the
purpose of the 1.

128
00:06:11,915 --> 00:06:15,440
And if c is not equal to d,

129
00:06:15,440 --> 00:06:17,630
then this regular expression

130
00:06:17,630 --> 00:06:19,220
cannot match anything
which starts with

131
00:06:19,220 --> 00:06:23,780
a c. So again it will
be defined as just 0.

132
00:06:23,780 --> 00:06:29,390
Now, the alternative case,

133
00:06:29,390 --> 00:06:31,970
if this regular expression can

134
00:06:31,970 --> 00:06:36,050
match strings
starting with c,

135
00:06:36,050 --> 00:06:40,820
then it can either be
matched by the r1

136
00:06:40,820 --> 00:06:44,495
or it can be matched by the r2.

137
00:06:44,495 --> 00:06:50,090
If the r1 can match c
and then following a string,

138
00:06:50,090 --> 00:06:53,975
then we just have to recursively
call the derivative for

139
00:06:53,975 --> 00:06:56,570
r1 to get a regular expression

140
00:06:56,570 --> 00:06:59,135
that can match the
rest of the string.

141
00:06:59,135 --> 00:07:02,690
And the same we can do with r2.

142
00:07:02,690 --> 00:07:06,110
You can find a regular expression
which can match everything

143
00:07:06,110 --> 00:07:07,850
this r2 can match,

144
00:07:07,850 --> 00:07:09,710
starting with c, but

145
00:07:09,710 --> 00:07:12,590
we are chopping off this c. 

146
00:07:12,590 --> 00:07:16,370
So now if you have to find
the regular expression,

147
00:07:16,370 --> 00:07:20,030
which can match all the strings
where c is chopped off,

148
00:07:20,030 --> 00:07:22,295
then we just have to
take the alternative

149
00:07:22,295 --> 00:07:24,965
of these two derivatives.

150
00:07:24,965 --> 00:07:29,390
Now to sequence case.

151
00:07:29,390 --> 00:07:33,005
This sequence case is the
most complicated one.

152
00:07:33,005 --> 00:07:35,180
And let's look first at

153
00:07:35,180 --> 00:07:39,335
the second case, in the
else branch.

154
00:07:39,335 --> 00:07:42,830
So if this
regular expression

155
00:07:42,830 --> 00:07:46,145
can match a string
starting with c,

156
00:07:46,145 --> 00:07:48,155
then the following
must have happened.

157
00:07:48,155 --> 00:07:51,905
First, the r1 must have matched
a string starting with c

158
00:07:51,905 --> 00:07:56,420
and then anything coming
afterwards with r2.

159
00:07:56,420 --> 00:07:59,660
So in this case I only

160
00:07:59,660 --> 00:08:03,260
have to call this
derivative for this r1,

161
00:08:03,260 --> 00:08:06,830
and find the regular expression
which can match everything

162
00:08:06,830 --> 00:08:11,555
this r1 can match except
for this c chopped off.

163
00:08:11,555 --> 00:08:15,830
And I have to build the
sequence derivative of that.

164
00:08:15,830 --> 00:08:18,260
So that's what the else branch says:

165
00:08:18,260 --> 00:08:21,860
I take the
derivative of r1 and I

166
00:08:21,860 --> 00:08:23,480
put the r2 on

167
00:08:23,480 --> 00:08:25,865
the back because that's
the rest of the string.

168
00:08:25,865 --> 00:08:29,240
So that's the only
case we have to consider

169
00:08:29,240 --> 00:08:32,750
this sequence case
except if not the

170
00:08:32,750 --> 00:08:37,895
r1 can match the
c and something else.

171
00:08:37,895 --> 00:08:42,965
But if r1 is matching
actually the empty string.

172
00:08:42,965 --> 00:08:48,890
In this case actually the r2
is in charge of matching

173
00:08:48,890 --> 00:08:51,590
the string starting with c. So in

174
00:08:51,590 --> 00:08:55,490
this case we have to call
the derivative for r2.

175
00:08:55,490 --> 00:08:57,875
So that's why we have
these two cases.

176
00:08:57,875 --> 00:09:00,455
So if r1 is nullable,

177
00:09:00,455 --> 00:09:03,245
then it can match
the empty string.

178
00:09:03,245 --> 00:09:05,330
And we have to
consider the case that

179
00:09:05,330 --> 00:09:08,045
this r2 is matching a string

180
00:09:08,045 --> 00:09:10,700
starting with c. And so we have

181
00:09:10,700 --> 00:09:14,210
to call the derivative
for this r2 in this case.

182
00:09:14,210 --> 00:09:18,680
Otherwise, the r1 will
be in charge of matching

183
00:09:18,680 --> 00:09:20,840
a part of that
string starting with

184
00:09:20,840 --> 00:09:24,695
c. And I have to call the
derivative on this r1.

185
00:09:24,695 --> 00:09:30,670
I hope this makes sense.

186
00:09:30,670 --> 00:09:34,150
Go over that and
also the handouts

187
00:09:34,150 --> 00:09:37,465
again. That case
is really subtle.

188
00:09:37,465 --> 00:09:40,945
And how do I remember this case?

189
00:09:40,945 --> 00:09:42,430
Well, I know it's

190
00:09:42,430 --> 00:09:45,310
an if condition and the
condition is nullable,

191
00:09:45,310 --> 00:09:48,160
then I will always remember

192
00:09:48,160 --> 00:09:53,275
the else branch where it pushes
the derivative over the r1.

193
00:09:53,275 --> 00:09:55,780
So I usually write
down the else branch first,

194
00:09:55,780 --> 00:09:59,650
and then construct
the then branch by saying,

195
00:09:59,650 --> 00:10:01,525
well, I just repeat this.

196
00:10:01,525 --> 00:10:03,760
And I have to remember
in that case I

197
00:10:03,760 --> 00:10:06,580
have to build also
derivative of r2.

198
00:10:06,580 --> 00:10:12,695
Finally, the star case.

199
00:10:12,695 --> 00:10:15,665
So here again
we're looking for

200
00:10:15,665 --> 00:10:17,300
a regular expression which

201
00:10:17,300 --> 00:10:19,745
can match the string
starting with c,

202
00:10:19,745 --> 00:10:22,355
except that the c has
been chopped off.

203
00:10:22,355 --> 00:10:28,640
So if r* has to match
a string starting with c,

204
00:10:28,640 --> 00:10:32,735
then at least we need one
copy of this r. 

205
00:10:32,735 --> 00:10:34,310
So at least one copy of

206
00:10:34,310 --> 00:10:37,010
this r has matched
something which starts with

207
00:10:37,010 --> 00:10:38,870
a c and then afterwards

208
00:10:38,870 --> 00:10:41,570
come 0 more copies
of this r*.

209
00:10:41,570 --> 00:10:45,530
What we do there
is we are trying to find

210
00:10:45,530 --> 00:10:47,960
the regular expression
which can match

211
00:10:47,960 --> 00:10:50,915
this first part of the string.

212
00:10:50,915 --> 00:10:53,255
However, where the
c is chopped off.

213
00:10:53,255 --> 00:10:55,130
And then we just say, well,

214
00:10:55,130 --> 00:10:57,050
the rest has to be
matched again with

215
00:10:57,050 --> 00:10:59,030
0 or more copies of

216
00:10:59,030 --> 00:11:02,600
this r. So that's why
it's defined like this.

217
00:11:02,600 --> 00:11:09,050
Okay? ...As said, please take care
with this definition.

218
00:11:09,050 --> 00:11:11,435
That's not so simple.

219
00:11:11,435 --> 00:11:13,250
Once you get a hang of it,

220
00:11:13,250 --> 00:11:15,170
however, it makes perfect sense.

221
00:11:15,170 --> 00:11:17,210
So let me explain it in

222
00:11:17,210 --> 00:11:20,825
different ways in
the next slides.

223
00:11:20,825 --> 00:11:24,695
Let's look
first at some examples.

224
00:11:24,695 --> 00:11:27,140
So here is a regular expression,

225
00:11:27,140 --> 00:11:29,390
r. And let's have

226
00:11:29,390 --> 00:11:32,450
a look at these three
derivatives according to a,

227
00:11:32,450 --> 00:11:38,405
b, and c. And we shall do
the one for a. 

228
00:11:38,405 --> 00:11:42,379
So here is our regular expression

229
00:11:42,379 --> 00:11:45,334
and I was very generous
with parentheses.

230
00:11:45,334 --> 00:11:48,140
And the outermost is a star.

231
00:11:48,140 --> 00:11:52,550
So we now build the
derivative according to a,

232
00:11:52,550 --> 00:11:55,474
the character a, of
that regular expression.

233
00:11:55,474 --> 00:11:57,380
So the first thing we

234
00:11:57,380 --> 00:11:59,555
have to analyse is the case star.

235
00:11:59,555 --> 00:12:04,370
So here's the regular expression,
which we are looking at.

236
00:12:04,370 --> 00:12:09,170
This r and the outermost
constructor is this star.

237
00:12:09,170 --> 00:12:11,510
If you go back to the definition,

238
00:12:11,510 --> 00:12:13,625
I hope you have it next to you,

239
00:12:13,625 --> 00:12:16,340
then this star-case is defined

240
00:12:16,340 --> 00:12:20,000
as you're taking just the
inside of the star

241
00:12:20,000 --> 00:12:23,030
and apply this derivative and

242
00:12:23,030 --> 00:12:26,765
leave the r on the
outside at the end.

243
00:12:26,765 --> 00:12:29,990
Ok. So that's the first step.

244
00:12:29,990 --> 00:12:32,030
Now we have to analyze

245
00:12:32,030 --> 00:12:36,035
the derivative according to
a of this regular expression,

246
00:12:36,035 --> 00:12:38,000
which is an alternative.

247
00:12:38,000 --> 00:12:39,665
So the outermost is a plus.

248
00:12:39,665 --> 00:12:41,375
Ok, that's very easy again,

249
00:12:41,375 --> 00:12:45,185
we just have to push the
derivative into each component,

250
00:12:45,185 --> 00:12:47,705
into the a followed by b.

251
00:12:47,705 --> 00:12:49,145
And into the second

252
00:12:49,145 --> 00:12:51,185
alternative, into b. 

253
00:12:51,185 --> 00:12:56,030
We take the derivative
of each according to a.

254
00:12:56,030 --> 00:13:00,635
Now this one is a sequence
regular expression.

255
00:13:00,635 --> 00:13:02,210
This was the complicated case.

256
00:13:02,210 --> 00:13:04,160
First of all
we have to test is

257
00:13:04,160 --> 00:13:07,910
the first component
nullable of this sequence?

258
00:13:07,910 --> 00:13:09,200
Well, that is a,

259
00:13:09,200 --> 00:13:12,740
in this case, a on its
own is not nullable.

260
00:13:12,740 --> 00:13:14,210
So we are the easy case,

261
00:13:14,210 --> 00:13:17,000
we only have a single derivative

262
00:13:17,000 --> 00:13:19,370
pushed in the first component.

263
00:13:19,370 --> 00:13:25,160
So we have the derivative
of a with the character a.

264
00:13:25,160 --> 00:13:27,920
Okay, that's now
the character case.

265
00:13:27,920 --> 00:13:29,720
And in this case the character

266
00:13:29,720 --> 00:13:31,715
and the regular expression agree.

267
00:13:31,715 --> 00:13:33,890
So it's defined as one.

268
00:13:33,890 --> 00:13:37,550
Ok? In the other case,

269
00:13:37,550 --> 00:13:39,710
the regular expression is b,

270
00:13:39,710 --> 00:13:41,675
but the characters a.

271
00:13:41,675 --> 00:13:46,385
So in this case
it's defined as 0.

272
00:13:46,385 --> 00:13:50,630
Okay? So that's what the
derivative would be.

273
00:13:50,630 --> 00:13:52,160
This r is there

274
00:13:52,160 --> 00:13:55,280
because originally we
started with a star.

275
00:13:55,280 --> 00:13:58,295
This regular expression is that star.

276
00:13:58,295 --> 00:14:02,780
So the derivative
according to a

277
00:14:02,780 --> 00:14:07,610
of that regular expression
is this expression.

278
00:14:07,610 --> 00:14:10,970
We just have to
substitute this r back in.

279
00:14:10,970 --> 00:14:13,745
Just coming back to this slide.

280
00:14:13,745 --> 00:14:16,160
So far, we only analyzes

281
00:14:16,160 --> 00:14:19,505
the derivative according
to a single character.

282
00:14:19,505 --> 00:14:23,960
But we can also very easily
extend that to whole strings.

283
00:14:23,960 --> 00:14:26,360
So if you build the
derivative according

284
00:14:26,360 --> 00:14:27,905
to the empty string,

285
00:14:27,905 --> 00:14:30,875
we just return the
regular expression.

286
00:14:30,875 --> 00:14:35,585
If we have a string
starting with character c,

287
00:14:35,585 --> 00:14:37,850
remember that can
be any character,

288
00:14:37,850 --> 00:14:42,170
then we build first the simple
derivative according to

289
00:14:42,170 --> 00:14:44,360
that first character and

290
00:14:44,360 --> 00:14:46,925
continue with the
rest of the string.

291
00:14:46,925 --> 00:14:50,615
So here you see again,
my personal convention:

292
00:14:50,615 --> 00:14:54,365
everything which works on
lists has this s at the end.

293
00:14:54,365 --> 00:14:57,125
So this function is
for single characters.

294
00:14:57,125 --> 00:14:59,179
This one is for strings,

295
00:14:59,179 --> 00:15:02,450
but it uses the one
for the character.

296
00:15:02,450 --> 00:15:04,025
Essentially what it does is

297
00:15:04,025 --> 00:15:06,185
it chops off the first character,

298
00:15:06,185 --> 00:15:09,800
builds the derivative, then
chops off the next character,

299
00:15:09,800 --> 00:15:13,760
builds the derivative of
the result, and so on.

300
00:15:13,760 --> 00:15:17,000
Having this function,
we can actually now

301
00:15:17,000 --> 00:15:20,600
state what the algorithm
is, the complete algorithm.

302
00:15:20,600 --> 00:15:23,465
So the Brzozowski matcher

303
00:15:23,465 --> 00:15:24,860
takes a regular expression as

304
00:15:24,860 --> 00:15:26,915
argument and a
string as argument.

305
00:15:26,915 --> 00:15:30,920
And is supposed to say yes if
the regular expression matches

306
00:15:30,920 --> 00:15:33,560
the string or no

307
00:15:33,560 --> 00:15:36,065
if the regular expression does
not match the string.

308
00:15:36,065 --> 00:15:37,715
And how does it do that?

309
00:15:37,715 --> 00:15:42,560
Well, it takes this string
s and this regular expression,

310
00:15:42,560 --> 00:15:43,925
and it first builds

311
00:15:43,925 --> 00:15:48,845
successive derivatives until
that string is exhausted.

312
00:15:48,845 --> 00:15:52,115
Then you have
a final derivative,

313
00:15:52,115 --> 00:15:53,839
a final regular expression

314
00:15:53,839 --> 00:15:55,370
and you test whether

315
00:15:55,370 --> 00:15:57,920
this regular expression can
match the empty string.

316
00:15:57,920 --> 00:16:01,370
If yes, then the
original regular expression,

317
00:16:01,370 --> 00:16:03,245
this r, can match the string.

318
00:16:03,245 --> 00:16:05,210
If no, if it cannot match

319
00:16:05,210 --> 00:16:08,030
the final derivative
with the empty string,

320
00:16:08,030 --> 00:16:10,280
then no this
regular expression r

321
00:16:10,280 --> 00:16:12,905
cannot match that string.

322
00:16:12,905 --> 00:16:16,010
I know it looks
very anticlimactic,

323
00:16:16,010 --> 00:16:19,625
but that's actually the
beauty of this algorithm,

324
00:16:19,625 --> 00:16:22,760
that it's not that complicated.

325
00:16:22,760 --> 00:16:25,340
So how does the algorithm work

326
00:16:25,340 --> 00:16:27,634
in a concrete example?

327
00:16:27,634 --> 00:16:31,580
Imagine you have a string, abc

328
00:16:31,580 --> 00:16:34,370
and you have a regular
expression, say r1.

329
00:16:34,370 --> 00:16:37,520
And you want to find out
whether this r1 can match

330
00:16:37,520 --> 00:16:41,300
that string abc or not.
How do you do that?

331
00:16:41,300 --> 00:16:45,140
Well, you would first
take this r1 and you

332
00:16:45,140 --> 00:16:47,150
build the derivative according

333
00:16:47,150 --> 00:16:49,880
to the first character, the a.

334
00:16:49,880 --> 00:16:53,015
Okay? You get the derivative,

335
00:16:53,015 --> 00:16:55,294
which I call here r2.

336
00:16:55,294 --> 00:16:58,640
Then you take the next
character, the b.

337
00:16:58,640 --> 00:17:04,535
You now build the derivative
according to b of this r2.

338
00:17:04,535 --> 00:17:07,460
Okay? So you take the
result of the first step,

339
00:17:07,460 --> 00:17:09,530
you feed it into the second step,

340
00:17:09,530 --> 00:17:11,810
and you take the
second character.

341
00:17:11,810 --> 00:17:17,075
Then you do this also with c.
So you get a derivative r3,

342
00:17:17,075 --> 00:17:22,460
and you build the derivative
of r3 according to c,

343
00:17:22,460 --> 00:17:24,185
you get an r4.

344
00:17:24,185 --> 00:17:26,300
Okay, so that's the
final derivative.

345
00:17:26,300 --> 00:17:27,530
The string is exhausted.

346
00:17:27,530 --> 00:17:29,570
We build derivatives
according to a, b,

347
00:17:29,570 --> 00:17:34,610
and c. Now we just test whether
this r4 is nullable.

348
00:17:34,610 --> 00:17:37,175
If it says yes,

349
00:17:37,175 --> 00:17:41,510
then the regular expression matcher
will just say true, yes,

350
00:17:41,510 --> 00:17:43,340
this original regular expression,

351
00:17:43,340 --> 00:17:47,270
the r1, will be able to
match that string abc.

352
00:17:47,270 --> 00:17:50,585
And if this test returns false,

353
00:17:50,585 --> 00:17:53,015
then the algorithm says false.

354
00:17:53,015 --> 00:17:56,975
This regular expression will
not match the string.

355
00:17:56,975 --> 00:18:00,260
Ok, you might ask:
Why on earth does

356
00:18:00,260 --> 00:18:02,960
that algorithm
actually work?

357
00:18:02,960 --> 00:18:06,515
Here's anather explanation
for why it works.

358
00:18:06,515 --> 00:18:10,190
Imagine you have a regular
expression r1, okay?

359
00:18:10,190 --> 00:18:13,220
And you have a string abc,

360
00:18:13,220 --> 00:18:14,270
and you want to find out

361
00:18:14,270 --> 00:18:17,180
whether r1 can
match that string.

362
00:18:17,180 --> 00:18:18,799
And for the moment,

363
00:18:18,799 --> 00:18:22,610
let's assume that it
can match that string.

364
00:18:22,610 --> 00:18:26,315
Ok? So the language L of

365
00:18:26,315 --> 00:18:30,185
r1 will actually
contain that string,

366
00:18:30,185 --> 00:18:31,805
otherwise it wouldn't match that.

367
00:18:31,805 --> 00:18:36,710
Okay? So abc is in
this language, okay?

368
00:18:36,710 --> 00:18:39,785
If I now take the
semantic derivative,

369
00:18:39,785 --> 00:18:43,145
that means I look at all
the strings in this L(r1)

370
00:18:43,145 --> 00:18:46,820
and filter out

371
00:18:46,820 --> 00:18:48,740
all the ones which
do not start with

372
00:18:48,740 --> 00:18:51,260
an a, I discharge them.

373
00:18:51,260 --> 00:18:54,545
And I only look at the ones
which start with an a.

374
00:18:54,545 --> 00:18:56,465
And of those strings,

375
00:18:56,465 --> 00:18:58,475
I chop off this a.

376
00:18:58,475 --> 00:19:01,025
So after this
semantic derivative,

377
00:19:01,025 --> 00:19:05,735
this set of strings will
contain just bc.

378
00:19:05,735 --> 00:19:12,830
Ok. Now if I build the next
semantic derivative of that,

379
00:19:12,830 --> 00:19:14,345
then I would look at

380
00:19:14,345 --> 00:19:16,850
all the strings which
start with a b,

381
00:19:16,850 --> 00:19:21,350
and forget about everything
else. Of the remaining ones

382
00:19:21,350 --> 00:19:27,905
I know they start with b.
I just chop of the b. Ok.

383
00:19:27,905 --> 00:19:31,655
So in this whole set here,

384
00:19:31,655 --> 00:19:33,785
in this whole set here,

385
00:19:33,785 --> 00:19:39,030
there will be now a string
which is just c. Okay?

386
00:19:39,190 --> 00:19:44,420
Then I built the third
semantic derivative

387
00:19:44,420 --> 00:19:47,300
because I want to find out
whether abc is matched.

388
00:19:47,300 --> 00:19:50,540
Okay? So now I look
at all the strings in

389
00:19:50,540 --> 00:19:52,820
here and look at

390
00:19:52,820 --> 00:19:55,340
them whether they start
with a c. If yes,

391
00:19:55,340 --> 00:19:56,885
I chop off the c.

392
00:19:56,885 --> 00:19:59,120
And put in what is remaining.

393
00:19:59,120 --> 00:20:00,425
So in this case,

394
00:20:00,425 --> 00:20:02,510
if I have the string c

395
00:20:02,510 --> 00:20:04,550
in this language and
I chop off this c,

396
00:20:04,550 --> 00:20:07,700
what is remaining
is the empty string.

397
00:20:07,700 --> 00:20:09,695
So we have to check of

398
00:20:09,695 --> 00:20:14,510
that language whether it
contains the empty string.

399
00:20:14,510 --> 00:20:18,800
If yes, then the
original r1 can match

400
00:20:18,800 --> 00:20:21,050
this abc because this abc

401
00:20:21,050 --> 00:20:24,119
must have been in this language.

402
00:20:24,130 --> 00:20:28,565
And if in the end there wasn't
the empty string, then

403
00:20:28,565 --> 00:20:33,575
this abs was not in
this language of r1.

404
00:20:33,575 --> 00:20:36,260
And so the lexer must have,

405
00:20:36,260 --> 00:20:38,880
or the matcher must have failed.

406
00:20:39,040 --> 00:20:42,530
The clever bit is that here

407
00:20:42,530 --> 00:20:45,530
the explanation is for languages.

408
00:20:45,530 --> 00:20:49,835
Remember, this
semantic derivative

409
00:20:49,835 --> 00:20:53,450
works over languages and they
sometimes can be in finite.

410
00:20:53,450 --> 00:20:55,730
So that's not really
an algorithm.

411
00:20:55,730 --> 00:20:58,880
That's just
explaining the idea. 

412
00:20:58,880 --> 00:21:02,525
What Brzozowski
achieved was that he

413
00:21:02,525 --> 00:21:06,440
now works with this derivative
regular expressions and

414
00:21:06,440 --> 00:21:10,715
somehow imitates what
happens on these languages.

415
00:21:10,715 --> 00:21:14,135
Because remember if you
have a regular expression,

416
00:21:14,135 --> 00:21:17,405
and you want to test
whether it can match abc,

417
00:21:17,405 --> 00:21:22,550
then you take first
derivative according to a.

418
00:21:22,550 --> 00:21:25,760
So you will get a regular
expression which can match b

419
00:21:25,760 --> 00:21:29,464
and c, if r could match abc.

420
00:21:29,464 --> 00:21:31,430
So after the first derivative,

421
00:21:31,430 --> 00:21:33,620
you will get a regular expression
which can match b and

422
00:21:33,620 --> 00:21:37,070
c. If you take the
second derivative,

423
00:21:37,070 --> 00:21:41,225
you will get a regular expression
which can match c alone.

424
00:21:41,225 --> 00:21:44,180
And if you take the
final derivative,

425
00:21:44,180 --> 00:21:46,070
then you will get

426
00:21:46,070 --> 00:21:48,260
a regular expression which hopefully

427
00:21:48,260 --> 00:21:49,715
can match the empty string.

428
00:21:49,715 --> 00:21:53,780
If it does, then this
r can match the abc.

429
00:21:53,780 --> 00:21:55,655
And if it doesn't, then

430
00:21:55,655 --> 00:21:58,680
abc couldn't be
matched by this r.

431
00:21:58,900 --> 00:22:02,990
Okay, let's have a look
how this pans out in code.

432
00:22:02,990 --> 00:22:06,050
Here's the file re1.sc.

433
00:22:06,050 --> 00:22:07,940
It's also uploaded on KEATS,

434
00:22:07,940 --> 00:22:10,625
so you can see exactly
what I'm doing.

435
00:22:10,625 --> 00:22:13,970
And actually you already saw
that file because I showed you

436
00:22:13,970 --> 00:22:15,710
how my regular expressions are

437
00:22:15,710 --> 00:22:17,960
defined, with the
abstract classes here.

438
00:22:17,960 --> 00:22:21,155
And here, the six cases
for 0, 1, character,

439
00:22:21,155 --> 00:22:23,540
alternative, 
sequence and star.

440
00:22:23,540 --> 00:22:26,705
Ok. So the first
function nullable,

441
00:22:26,705 --> 00:22:28,760
the simple one, takes

442
00:22:28,760 --> 00:22:32,120
a regular expression as
argument and returns a boolean.

443
00:22:32,120 --> 00:22:34,280
And then with this
pattern matching,

444
00:22:34,280 --> 00:22:37,040
we just go through
all these six cases

445
00:22:37,040 --> 00:22:38,900
0 is defined as false.

446
00:22:38,900 --> 00:22:43,234
One is defined as true.
Character, for any character,

447
00:22:43,234 --> 00:22:45,455
this nullable will return false.

448
00:22:45,455 --> 00:22:47,540
The alternative is defined here,

449
00:22:47,540 --> 00:22:50,000
so that's the or in Scala.

450
00:22:50,000 --> 00:22:52,700
And for the sequence,
that's the and.

451
00:22:52,700 --> 00:22:56,180
And this star, no matter
what the regular expression is,

452
00:22:56,180 --> 00:22:59,540
it will always match the
empty string, so true.

453
00:22:59,540 --> 00:23:02,225
So nullable is very easy.

454
00:23:02,225 --> 00:23:07,430
The derivative is also not
so much more complicated.

455
00:23:07,430 --> 00:23:08,974
It takes two arguments,

456
00:23:08,974 --> 00:23:11,810
a character and the
regular expression,

457
00:23:11,810 --> 00:23:14,405
and returns a regular expression.

458
00:23:14,405 --> 00:23:16,340
So now we just have to analyze

459
00:23:16,340 --> 00:23:18,890
what's that regular
expression looks like.

460
00:23:18,890 --> 00:23:23,870
If it's 0, we return
0; if it's 1 we return 0.

461
00:23:23,870 --> 00:23:26,990
If the character... If the
regular expressions character

462
00:23:26,990 --> 00:23:30,260
d, we test whether it's
equal to this character

463
00:23:30,260 --> 00:23:32,225
we want to take the
derivative off.

464
00:23:32,225 --> 00:23:36,185
If yes, return 1, otherwise 0.

465
00:23:36,185 --> 00:23:38,600
The alternative which has pushed

466
00:23:38,600 --> 00:23:39,860
the derivative under

467
00:23:39,860 --> 00:23:42,710
this alternative,
that's the easy one.

468
00:23:42,710 --> 00:23:44,690
Here's the sequence case where we

469
00:23:44,690 --> 00:23:47,015
first have to test
if it's nullable,

470
00:23:47,015 --> 00:23:49,040
If it's not the have to push

471
00:23:49,040 --> 00:23:52,160
the derivative over
the first, the r1.

472
00:23:52,160 --> 00:23:56,135
And otherwise if it is nullable,
we have two cases.

473
00:23:56,135 --> 00:24:00,605
Either you have to push
it over the r1 or r2.

474
00:24:00,605 --> 00:24:03,860
And a star case, this one.

475
00:24:03,860 --> 00:24:07,160
We just build the sequence
of the derivative of

476
00:24:07,160 --> 00:24:11,600
r1 and append the
r1 as a sequence,

477
00:24:11,600 --> 00:24:14,195
put the star at the end.

478
00:24:14,195 --> 00:24:19,430
Okay, so here's the
function for strings.

479
00:24:19,430 --> 00:24:21,740
So a list of characters.

480
00:24:21,740 --> 00:24:23,870
This function takes an s,

481
00:24:23,870 --> 00:24:26,480
a list of characters as argument
and a regular expression

482
00:24:26,480 --> 00:24:29,360
as argument and returns
a regular expression.

483
00:24:29,360 --> 00:24:31,460
And we just have to
pattern match what

484
00:24:31,460 --> 00:24:34,055
that string looks like
or this list looks like.

485
00:24:34,055 --> 00:24:35,360
If it's the empty list,

486
00:24:35,360 --> 00:24:38,030
it just immediately
return the r. If

487
00:24:38,030 --> 00:24:42,020
it's something of the
form c followed by s,

488
00:24:42,020 --> 00:24:45,050
then we build first the
derivative according

489
00:24:45,050 --> 00:24:48,345
to c. And then
the result of that,

490
00:24:48,345 --> 00:24:50,090
we recursively call

491
00:24:50,090 --> 00:24:53,675
the derivative
according to s. Okay?

492
00:24:53,675 --> 00:24:56,060
And now the main matcher,

493
00:24:56,060 --> 00:24:59,360
it takes a regular
expression and takes

494
00:24:59,360 --> 00:25:02,870
a string and returns a
boolean, true or false.

495
00:25:02,870 --> 00:25:05,705
And it first builds
the derivative.

496
00:25:05,705 --> 00:25:07,430
The only thing I have to do here

497
00:25:07,430 --> 00:25:09,410
because I'm implementing
it in Scala,

498
00:25:09,410 --> 00:25:15,064
I have to coerce between strings
and lists of characters.

499
00:25:15,064 --> 00:25:16,580
So he, I take first

500
00:25:16,580 --> 00:25:20,810
the toList of the s. That
gives me a list of characters.

501
00:25:20,810 --> 00:25:23,135
Then I build this derivative

502
00:25:23,135 --> 00:25:26,045
with the s, because I'm
looking at strings.

503
00:25:26,045 --> 00:25:28,160
And then at the end,

504
00:25:28,160 --> 00:25:33,050
built the nullable of
the final derivative.

505
00:25:33,050 --> 00:25:37,310
The beauty of all this
is that in Scala,

506
00:25:37,310 --> 00:25:40,085
these definition which
I had on the slides

507
00:25:40,085 --> 00:25:43,835
go almost one-to-one into code.

508
00:25:43,835 --> 00:25:45,605
And that's of course,

509
00:25:45,605 --> 00:25:47,480
makes it very easy
to implement in

510
00:25:47,480 --> 00:25:49,730
a functional language like Scala.

511
00:25:49,730 --> 00:25:52,865
We can also now try
out some examples.

512
00:25:52,865 --> 00:25:56,960
This was the example
I had on the slide.

513
00:25:56,960 --> 00:25:58,370
And we could now calculate

514
00:25:58,370 --> 00:26:00,305
what's the derivative
according to a,

515
00:26:00,305 --> 00:26:02,720
b, and c of this
regular expression.

516
00:26:02,720 --> 00:26:07,040
And I hope you didn't
make any mistake.

517
00:26:07,040 --> 00:26:09,260
Now of course we want
to see whether we

518
00:26:09,260 --> 00:26:11,390
do any better with
this algorithm...

519
00:26:11,390 --> 00:26:13,715
than Python and Ruby.

520
00:26:13,715 --> 00:26:16,070
And let's first have a
look at this example.

521
00:26:16,070 --> 00:26:18,079
So this regular expression.

522
00:26:18,079 --> 00:26:19,880
The problem is that in

523
00:26:19,880 --> 00:26:22,070
our regular expression
matcher so far we have

524
00:26:22,070 --> 00:26:24,530
the sequence rregular
expression, but we

525
00:26:24,530 --> 00:26:27,200
don't have this optional
regular expression.

526
00:26:27,200 --> 00:26:30,800
And we don't have this n
times regular expression,

527
00:26:30,800 --> 00:26:36,605
but we can quickly implement
that in our implementation.

528
00:26:36,605 --> 00:26:40,549
So if you want to build the
optional regular expression,

529
00:26:40,549 --> 00:26:41,870
which is defined here,

530
00:26:41,870 --> 00:26:44,570
a little function which
takes a reg expression and

531
00:26:44,570 --> 00:26:47,360
returns the alternative of r and 1.

532
00:26:47,360 --> 00:26:49,624
So it essentially
takes the definition

533
00:26:49,624 --> 00:26:53,240
of optional r and
replaces it by that.

534
00:26:53,240 --> 00:26:56,150
The n-times what we
essentially do there is

535
00:26:56,150 --> 00:27:01,535
we built n copies of this r. Ok?

536
00:27:01,535 --> 00:27:04,745
So if this n-times was 0,

537
00:27:04,745 --> 00:27:06,245
we just return one.

538
00:27:06,245 --> 00:27:11,570
If it's one, then we return
just the regular expression.

539
00:27:11,570 --> 00:27:15,575
And if it's something
bigger than one,

540
00:27:15,575 --> 00:27:18,560
then we just built the
sequence of one of

541
00:27:18,560 --> 00:27:20,270
these copies and call

542
00:27:20,270 --> 00:27:22,925
this function again
with n - 1.

543
00:27:22,925 --> 00:27:26,330
So we just build now n-copies of that.

544
00:27:26,330 --> 00:27:30,710
Okay, so we can look
at our first example.

545
00:27:30,710 --> 00:27:32,629
This is the regular expression,

546
00:27:32,629 --> 00:27:36,035
and I call that here
evil regular expression1.

547
00:27:36,035 --> 00:27:37,670
It doesn't look as beautiful

548
00:27:37,670 --> 00:27:39,140
as what we write down on paper.

549
00:27:39,140 --> 00:27:41,240
We will actually look
at this later on

550
00:27:41,240 --> 00:27:43,640
if this can be improved.
But here it is.

551
00:27:43,640 --> 00:27:45,724
Here's the regular expression,

552
00:27:45,724 --> 00:27:49,520
and here's a little function
which measures the time.

553
00:27:49,520 --> 00:27:53,180
And we can now start testing

554
00:27:53,180 --> 00:27:57,845
this regular expression with
strings just containing a's.

555
00:27:57,845 --> 00:28:00,020
And we are first a bit cautious,

556
00:28:00,020 --> 00:28:04,985
we test it between 0 and 20,
and we count by two.

557
00:28:04,985 --> 00:28:08,330
And I actually will not
start that within Scala,

558
00:28:08,330 --> 00:28:12,965
but actually use Ammonite.

559
00:28:12,965 --> 00:28:15,305
And that's out.

560
00:28:15,305 --> 00:28:17,269
And that calculates

561
00:28:17,269 --> 00:28:20,675
for 18 a's. It's pretty fast.

562
00:28:20,675 --> 00:28:24,815
But for 20 it already
needs 2 seconds.

563
00:28:24,815 --> 00:28:28,265
And you can see
actually this jump from

564
00:28:28,265 --> 00:28:32,825
18 to 20 in the runtime
is pretty bad too.

565
00:28:32,825 --> 00:28:37,460
And if we actually measure
it more accurately,

566
00:28:37,460 --> 00:28:39,770
then we get this picture.

567
00:28:39,770 --> 00:28:41,255
And what turns out,

568
00:28:41,255 --> 00:28:45,830
we are actually worse than Python
and Ruby in this example.

569
00:28:45,830 --> 00:28:49,230
So I guess that's a fail.
author	Christian Urban <christian.urban@kcl.ac.uk>
	Sun, 01 Oct 2023 10:57:32 +0100
changeset 932	5678414a3898
parent 769	f9686b22db7e
permissions	-rw-r--r--