videos/02-algo2.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Mon, 05 Oct 2020 01:42:47 +0100
changeset 773 260e330f7466
parent 772 3bf3f5bb067e
permissions -rw-r--r--
updated

1
00:00:06,020 --> 00:00:09,945
Welcome back. After
this disappointment

2
00:00:09,945 --> 00:00:14,115
and case of over-promising
and under-achieving,

3
00:00:14,115 --> 00:00:17,295
let's have a look whether
we can recover from that.

4
00:00:17,295 --> 00:00:20,535
So here's one problem.

5
00:00:20,535 --> 00:00:23,790
When we looked at this 
n-times regular expression, we

6
00:00:23,790 --> 00:00:27,330
were not able to represent
that directly.

7
00:00:27,330 --> 00:00:31,275
We had to represent it as a
sequence regular expression.

8
00:00:31,275 --> 00:00:32,850
So we expanded it.

9
00:00:32,850 --> 00:00:36,510
So times-one would be just a,

10
00:00:36,510 --> 00:00:40,595
times-two would
be a o a, and so on.

11
00:00:40,595 --> 00:00:43,535
And so you can see if
you go already to 13,

12
00:00:43,535 --> 00:00:47,960
n equals 13, the regular
expression becomes quite large.

13
00:00:47,960 --> 00:00:52,865
And now the functions
nullable and also derivative,

14
00:00:52,865 --> 00:00:56,360
they need to traverse
this regular expression.

15
00:00:56,360 --> 00:00:59,060
Remember this tree, sometimes,

16
00:00:59,060 --> 00:01:01,820
they have to traverse
that even several times.

17
00:01:01,820 --> 00:01:04,535
So that will slow
down the algorithm.

18
00:01:04,535 --> 00:01:08,150
And in particular, because
our first regular expression was

19
00:01:08,150 --> 00:01:11,840
actually not just a, but
a + 1. So we had

20
00:01:11,840 --> 00:01:14,330
in the case of 13,
we had a + 1, a + 1,...

21
00:01:14,330 --> 00:01:17,330
a + 1... 13 times.

22
00:01:17,330 --> 00:01:20,150
And this regular
expression just grows

23
00:01:20,150 --> 00:01:25,475
with the number n. Just to
show you this also in code,

24
00:01:25,475 --> 00:01:28,115
I'm in the file re1.sc,

25
00:01:28,115 --> 00:01:30,920
and in the end I have a size function.

26
00:01:30,920 --> 00:01:33,140
The size function takes
a regular expression as

27
00:01:33,140 --> 00:01:36,215
argument and produces an integer.

28
00:01:36,215 --> 00:01:39,980
And essentially it takes
this regular expression, seen as

29
00:01:39,980 --> 00:01:44,045
a tree and counts how many
nodes there are in this tree.

30
00:01:44,045 --> 00:01:49,490
And so if I take this EVIL1
regular expression, this one,

31
00:01:49,490 --> 00:01:52,160
and increase now this n. So you

32
00:01:52,160 --> 00:01:54,680
can already see
for n = 1,

33
00:01:54,680 --> 00:01:56,540
the size of this
record expression is 5

34
00:01:56,540 --> 00:01:59,615
and you see it
slowly increases.

35
00:01:59,615 --> 00:02:02,150
And this 20, i.e. n = 20,

36
00:02:02,150 --> 00:02:07,100
the size of this regular
expression is now already 119.

37
00:02:07,100 --> 00:02:10,145
The interesting
line is this one.

38
00:02:10,145 --> 00:02:13,295
Remember, when we
built the derivative,

39
00:02:13,295 --> 00:02:17,150
very often parts of a regular
expression gets copied.

40
00:02:17,150 --> 00:02:19,280
So if you have like EVIL1

41
00:02:19,280 --> 00:02:22,325
of 20, and you have 119 nodes,

42
00:02:22,325 --> 00:02:25,475
now this will be often copied.

43
00:02:25,475 --> 00:02:31,475
And we want to see what's the
resulting tree look like, i.e.

44
00:02:31,475 --> 00:02:32,780
how many nodes does it have.

45
00:02:32,780 --> 00:02:34,985
So I take here EVIL1 of 20

46
00:02:34,985 --> 00:02:38,405
and the derivative
according to 20 a's.

47
00:02:38,405 --> 00:02:41,820
And just have a look
what the size is.

48
00:02:43,210 --> 00:02:45,680
Okay, that takes a while.

49
00:02:45,680 --> 00:02:48,410
You see now this recular expression,

50
00:02:48,410 --> 00:02:52,280
the derivative has already
8 million plus nodes.

51
00:02:52,280 --> 00:02:55,400
And now nullable and
derivative have to

52
00:02:55,400 --> 00:02:58,430
traverse these trees with
8 million plus nodes.

53
00:02:58,430 --> 00:03:01,400
So it's no wonder
that this is slow.

54
00:03:01,400 --> 00:03:03,860
The first thing we
can try to mitigate

55
00:03:03,860 --> 00:03:06,350
this explosion problem is to

56
00:03:06,350 --> 00:03:09,050
have an explicit 
n-times regular expression.

57
00:03:09,050 --> 00:03:11,600
So instead of expanding

58
00:03:11,600 --> 00:03:13,415
it to the sequence
regular expression,

59
00:03:13,415 --> 00:03:17,510
let's just have a single
regular expression n-times,

60
00:03:17,510 --> 00:03:20,540
which takes an expression and

61
00:03:20,540 --> 00:03:24,470
a number, and keeps that
regular expression small.

62
00:03:24,470 --> 00:03:27,095
I'm here in the file re2.sc,

63
00:03:27,095 --> 00:03:29,090
which is also on KEATS.

64
00:03:29,090 --> 00:03:32,570
And the only change I made
is I added explicitly

65
00:03:32,570 --> 00:03:33,860
a regular expression for

66
00:03:33,860 --> 00:03:36,770
n-times. The optional
regular expression

67
00:03:36,770 --> 00:03:39,920
we leave out at the
moment because this a?

68
00:03:39,920 --> 00:03:41,975
is represented as

69
00:03:41,975 --> 00:03:44,645
a + 1 and doesn't
expand too much.

70
00:03:44,645 --> 00:03:47,375
The real culprit
is this n-times.

71
00:03:47,375 --> 00:03:51,095
And instead of expanding
it n-times, we just

72
00:03:51,095 --> 00:03:54,275
take here a regular expression
and a natural number,

73
00:03:54,275 --> 00:03:56,960
which says how many times
it should be repeated.

74
00:03:56,960 --> 00:03:59,165
And in this week we
can keep it small.

75
00:03:59,165 --> 00:04:01,370
But by adding that
regular expression,

76
00:04:01,370 --> 00:04:05,150
we now have to think about
cases for nullable and

77
00:04:05,150 --> 00:04:07,340
derivative. So that we have

78
00:04:07,340 --> 00:04:10,205
to calculate next
what they look like.

79
00:04:10,205 --> 00:04:14,060
We can actually
calculate what the rule

80
00:04:14,060 --> 00:04:17,525
for nullable should be from
how we defined it earlier.

81
00:04:17,525 --> 00:04:19,580
Remember, if you have
a regular expression,

82
00:04:19,580 --> 00:04:21,980
r and we take it
n-times of this r,

83
00:04:21,980 --> 00:04:25,220
then in case if n = 0,

84
00:04:25,220 --> 00:04:28,130
we definded that as the
1 regular expression.

85
00:04:28,130 --> 00:04:30,380
If n = 1,

86
00:04:30,380 --> 00:04:32,495
it will be just a
single copy of r.

87
00:04:32,495 --> 00:04:33,725
If n = 2,

88
00:04:33,725 --> 00:04:35,270
it will be two copies in sequence.

89
00:04:35,270 --> 00:04:39,260
If three, we have
three copies and so on.

90
00:04:39,260 --> 00:04:41,390
Now we have to
somehow characterize

91
00:04:41,390 --> 00:04:44,285
when are these cases all nullable.

92
00:04:44,285 --> 00:04:47,810
Well, if n equals 0,

93
00:04:47,810 --> 00:04:49,850
then this will be defined as 1.

94
00:04:49,850 --> 00:04:52,100
So 1 can match
the empty string.

95
00:04:52,100 --> 00:04:54,785
So that should be
definitely true.

96
00:04:54,785 --> 00:04:57,725
Okay, if n = 1,

97
00:04:57,725 --> 00:05:00,470
when can this regular expression
match the empty string?

98
00:05:00,470 --> 00:05:02,000
So when should this be nullable?

99
00:05:02,000 --> 00:05:07,535
Well, if nullable of r holds,

100
00:05:07,535 --> 00:05:09,860
If it can match
the empty string,

101
00:05:09,860 --> 00:05:12,110
then we can match
the empty string.

102
00:05:12,110 --> 00:05:14,870
When can this regular expression
match the empty string?

103
00:05:14,870 --> 00:05:16,265
So these are now two copies of r.

104
00:05:16,265 --> 00:05:20,690
Well, both copies need
to match the empty string.

105
00:05:20,690 --> 00:05:22,820
So again, we have to ask

106
00:05:22,820 --> 00:05:25,774
both of them need to be nullable.

107
00:05:25,774 --> 00:05:28,520
Similarly, in the third case,

108
00:05:28,520 --> 00:05:30,710
all three copies need to be

109
00:05:30,710 --> 00:05:33,230
able to match the empty
string. When is that the case?

110
00:05:33,230 --> 00:05:38,360
Well, if nullable of
r holds and so on.

111
00:05:38,360 --> 00:05:48,754
So we can actually define that
if n = 0, then true,

112
00:05:48,754 --> 00:05:56,045
else we have to ask whether
nullable of r holds.

113
00:05:56,045 --> 00:05:58,550
So that will be the clause we

114
00:05:58,550 --> 00:06:01,625
have to implement for
n-times and nullable.

115
00:06:01,625 --> 00:06:04,280
Deriving the definition for

116
00:06:04,280 --> 00:06:06,920
the derivative of the n-times

117
00:06:06,920 --> 00:06:10,175
regular expression
is not that easy.

118
00:06:10,175 --> 00:06:12,755
We have to look again
how it was defined

119
00:06:12,755 --> 00:06:16,010
earlier and we somehow
have to spot a pattern.

120
00:06:16,010 --> 00:06:18,380
Otherwise, our
algorithm will be again

121
00:06:18,380 --> 00:06:20,975
quite slow if we don't
spot that pattern.

122
00:06:20,975 --> 00:06:22,550
So let's have a look.

123
00:06:22,550 --> 00:06:26,240
So in the case that
n is equal to 0,

124
00:06:26,240 --> 00:06:29,885
then r n-times was
defined as just 1.

125
00:06:29,885 --> 00:06:32,030
What would be the
derivative of one?

126
00:06:32,030 --> 00:06:36,140
So the derivative of c of 1

127
00:06:36,140 --> 00:06:38,990
would be defined as 0.

128
00:06:38,990 --> 00:06:41,465
Okay, fair enough.

129
00:06:41,465 --> 00:06:44,960
If he have n = 1,

130
00:06:44,960 --> 00:06:48,125
then we just have one copy
of this regular expression.

131
00:06:48,125 --> 00:06:50,120
So there's not much else we can do.

132
00:06:50,120 --> 00:06:53,735
We would have to calculate
the derivative of r.

133
00:06:53,735 --> 00:06:57,395
Now in the n = 2 case,

134
00:06:57,395 --> 00:07:00,245
that would mean we
have two copies of r.

135
00:07:00,245 --> 00:07:03,425
And they would be a sequence.

136
00:07:03,425 --> 00:07:07,775
How would be the derivative c of

137
00:07:07,775 --> 00:07:10,340
r followed by r be

138
00:07:10,340 --> 00:07:13,265
defined? Well we would
look up the definition.

139
00:07:13,265 --> 00:07:15,470
And it would say that's
the complicated case,

140
00:07:15,470 --> 00:07:16,760
the sequence.

141
00:07:16,760 --> 00:07:21,875
If nullable of r,

142
00:07:21,875 --> 00:07:25,250
then the more complicated case,

143
00:07:25,250 --> 00:07:28,225
else, that's the easy
case, that would be

144
00:07:28,225 --> 00:07:32,660
derivative of c of r, followed

145
00:07:32,660 --> 00:07:35,540
by r. And then I just have to copy

146
00:07:35,540 --> 00:07:40,775
that derivative of c
of r followed by r here.

147
00:07:40,775 --> 00:07:43,130
But in this case I have also

148
00:07:43,130 --> 00:07:51,145
the alternative derivative
of c of r. And from that,

149
00:07:51,145 --> 00:07:55,030
it looks like we can't
do much better than

150
00:07:55,030 --> 00:07:58,390
that, unless we do

151
00:07:58,390 --> 00:08:02,560
a little bit of magic and the
magic has to do with this case.

152
00:08:02,560 --> 00:08:07,420
So let me look at this
case a bit more closely.

153
00:08:07,420 --> 00:08:09,819
Let me go back to slides

154
00:08:09,819 --> 00:08:12,940
because actually the calculation
might be a bit hairy.

155
00:08:12,940 --> 00:08:16,945
So remember we are in this
case where n equals two.

156
00:08:16,945 --> 00:08:18,550
And this was defined as

157
00:08:18,550 --> 00:08:20,680
this sequence regular
expression r followed by r.

158
00:08:20,680 --> 00:08:23,080
And the question was,

159
00:08:23,080 --> 00:08:26,365
can we somehow make sense
out of this derivative

160
00:08:26,365 --> 00:08:30,370
where we have to build the
derivative of a sequence.

161
00:08:30,370 --> 00:08:33,020
So if we just unfold the definition,

162
00:08:33,020 --> 00:08:36,050
we would ask if
the r is nullable,

163
00:08:36,050 --> 00:08:39,095
In this case, we return
this alternative.

164
00:08:39,095 --> 00:08:40,640
And if it's not nullable,

165
00:08:40,640 --> 00:08:44,135
it is just this
regular expression.

166
00:08:44,135 --> 00:08:49,550
Now my claim is that
this whole if condition

167
00:08:49,550 --> 00:08:55,895
can be replaced by just this
simple derivative here.

168
00:08:55,895 --> 00:08:58,775
And if that claim is correct,

169
00:08:58,775 --> 00:09:01,520
there would be very nice
because in contrast to

170
00:09:01,520 --> 00:09:04,130
this if condition where

171
00:09:04,130 --> 00:09:06,280
we have to calculate
like nullable,

172
00:09:06,280 --> 00:09:08,330
decide in which branch we are.

173
00:09:08,330 --> 00:09:10,580
We don't have to
calculate nullable,

174
00:09:10,580 --> 00:09:13,880
we can just replace
it by this expression.

175
00:09:13,880 --> 00:09:16,670
So if we can
substantiate that claim,

176
00:09:16,670 --> 00:09:19,860
that will be definitely
good our algorithm.

177
00:09:20,140 --> 00:09:22,775
And to substantiate that,

178
00:09:22,775 --> 00:09:26,795
I will focus on this
record expression here.

179
00:09:26,795 --> 00:09:31,100
And notice that this record
expression will only be

180
00:09:31,100 --> 00:09:35,780
called or only be generated
if r is nullable.

181
00:09:35,780 --> 00:09:38,075
So in any other case,

182
00:09:38,075 --> 00:09:40,060
I will actually not go into 

183
00:09:40,060 --> 00:09:43,850
that if branch and would
be in the other one.

184
00:09:43,850 --> 00:09:45,260
So if we are in this if branch,

185
00:09:45,260 --> 00:09:47,705
we definitely know
that r is nullable.

186
00:09:47,705 --> 00:09:52,955
Okay, so here's
our regular expression.

187
00:09:52,955 --> 00:09:55,940
And we know it's nullable.

188
00:09:55,940 --> 00:09:57,920
So we have to somehow find

189
00:09:57,920 --> 00:10:00,380
an equivalent expression that

190
00:10:00,380 --> 00:10:04,100
is smaller or simpler
than that one.

191
00:10:04,100 --> 00:10:05,945
Let's see what we can do.

192
00:10:05,945 --> 00:10:10,160
So the first thing
actually is we're multiplying

193
00:10:10,160 --> 00:10:16,595
this right hand side of the
alternative with times 1.

194
00:10:16,595 --> 00:10:19,700
We can do that
because this does not

195
00:10:19,700 --> 00:10:23,090
change which strings this
regular expression can match.

196
00:10:23,090 --> 00:10:25,685
Remember we even had it
as a simplification rule,

197
00:10:25,685 --> 00:10:27,425
just in this case we

198
00:10:27,425 --> 00:10:29,705
don't apply it as a
simplification rule,

199
00:10:29,705 --> 00:10:31,310
actually make this
regular expression

200
00:10:31,310 --> 00:10:32,720
a bit more complicated.

201
00:10:32,720 --> 00:10:34,910
But times 1 doesn't make

202
00:10:34,910 --> 00:10:37,820
a difference because it
means at the end of a string,

203
00:10:37,820 --> 00:10:40,070
we still want to match
the empty string.

204
00:10:40,070 --> 00:10:42,155
Okay, so that is possible.

205
00:10:42,155 --> 00:10:45,740
I can do that. Once
we have done that,

206
00:10:45,740 --> 00:10:48,410
you will notice that this
factor derivative of

207
00:10:48,410 --> 00:10:51,860
r are exactly the
same as that one.

208
00:10:51,860 --> 00:10:54,650
And in between is a plus.

209
00:10:54,650 --> 00:10:57,440
So you probably remember the law

210
00:10:57,440 --> 00:11:00,170
from school math
that I can pull out

211
00:11:00,170 --> 00:11:02,735
this factor derivative of c of r.

212
00:11:02,735 --> 00:11:06,320
And I'm inside the parentheses
is left with that.

213
00:11:06,320 --> 00:11:09,245
So now I have r + 1.

214
00:11:09,245 --> 00:11:13,055
Usually we cannot
simplify r + 1.

215
00:11:13,055 --> 00:11:15,530
If it had been r + 0, then yes,

216
00:11:15,530 --> 00:11:18,665
we could have got rid of the 0.

217
00:11:18,665 --> 00:11:21,590
But this + 1 often
makes a difference,

218
00:11:21,590 --> 00:11:22,970
but not in our case.

219
00:11:22,970 --> 00:11:25,940
Remember, we know that
this r is nullable,

220
00:11:25,940 --> 00:11:29,840
so on its own can already
match the empty string.

221
00:11:29,840 --> 00:11:33,305
So we don't really need this
alternative plus one there,

222
00:11:33,305 --> 00:11:35,300
so we can just get rid of that.

223
00:11:35,300 --> 00:11:37,070
Okay, and so now we have

224
00:11:37,070 --> 00:11:40,535
a much simpler equivalent
regular expression.

225
00:11:40,535 --> 00:11:44,285
And this actually helps a
lot for our if-condition.

226
00:11:44,285 --> 00:11:46,925
Look, this was the
original if-condition

227
00:11:46,925 --> 00:11:50,270
and this is the regular expression
we just simplified.

228
00:11:50,270 --> 00:11:53,105
If we replace it with this one,

229
00:11:53,105 --> 00:11:56,090
then we just end up with this.

230
00:11:56,090 --> 00:11:59,510
And now you will see that
the if condition is actually

231
00:11:59,510 --> 00:12:02,750
pointless because you
check if it's nullable,

232
00:12:02,750 --> 00:12:05,060
we return this regular
expression or it's

233
00:12:05,060 --> 00:12:08,210
not nullable and we
return exactly the same.

234
00:12:08,210 --> 00:12:10,025
That doesn't make any difference.

235
00:12:10,025 --> 00:12:11,750
So we can just get rid of

236
00:12:11,750 --> 00:12:14,645
that one and can
replace it by that.

237
00:12:14,645 --> 00:12:16,865
And you see, this is now

238
00:12:16,865 --> 00:12:20,720
a much simpler case than
what we had before.

239
00:12:20,720 --> 00:12:24,170
So let's take stock
what we have so far.

240
00:12:24,170 --> 00:12:26,915
So we know in the 0-case,

241
00:12:26,915 --> 00:12:30,920
the derivative needs
to be defined as 0.

242
00:12:30,920 --> 00:12:33,095
Because we define this

243
00:12:33,095 --> 00:12:36,785
n-times as one,
the derivative is 0.

244
00:12:36,785 --> 00:12:39,889
For just r, this will
be the derivative.

245
00:12:39,889 --> 00:12:42,170
And we can't do any
better than that.

246
00:12:42,170 --> 00:12:45,620
For r followed by r, as we
just found out

247
00:12:45,620 --> 00:12:47,270
actually it is quite simple

248
00:12:47,270 --> 00:12:51,410
regular expression is equivalent
to the derivative.

249
00:12:51,410 --> 00:12:53,870
Now, we have to continue with

250
00:12:53,870 --> 00:12:56,090
that case where n is
equal to three and we

251
00:12:56,090 --> 00:12:58,099
now have three copies

252
00:12:58,099 --> 00:13:02,390
of this r. What should
be the derivative?

253
00:13:02,390 --> 00:13:05,330
Well, if you look very carefully

254
00:13:05,330 --> 00:13:08,465
at this emerging pattern,

255
00:13:08,465 --> 00:13:12,410
I have to say then
what would be nice if,

256
00:13:12,410 --> 00:13:16,400
if he could show that in
the n equals three case,

257
00:13:16,400 --> 00:13:18,275
we end up with this.

258
00:13:18,275 --> 00:13:21,290
Because then what we have is this.

259
00:13:21,290 --> 00:13:25,370
We can define our
nullable for n-times

260
00:13:25,370 --> 00:13:31,025
as if n = 0 then
true, else nullable r.

261
00:13:31,025 --> 00:13:33,875
And for the derivative,

262
00:13:33,875 --> 00:13:37,190
we can save if n is equal to 0,

263
00:13:37,190 --> 00:13:40,235
then we return the
0 regular expression.

264
00:13:40,235 --> 00:13:43,295
Otherwise, as you can see
from this pattern here,

265
00:13:43,295 --> 00:13:50,735
it would be derivative of
c r followed by r{n-1}.

266
00:13:50,735 --> 00:13:54,770
Okay? And the only

267
00:13:54,770 --> 00:13:56,330
thing we have to make csure is that

268
00:13:56,330 --> 00:13:58,175
this pattern actually holds.

269
00:13:58,175 --> 00:14:00,470
So that's, I will
show you next in

270
00:14:00,470 --> 00:14:03,735
the case for n equals three.

271
00:14:03,735 --> 00:14:07,810
If we have a regular expression r

272
00:14:07,810 --> 00:14:09,820
and it's followed
by r and another r,

273
00:14:09,820 --> 00:14:11,275
three copies of it.

274
00:14:11,275 --> 00:14:14,245
We can just unfold
again the definition.

275
00:14:14,245 --> 00:14:16,930
So we would ask if r is nullable,

276
00:14:16,930 --> 00:14:19,765
then we have this if branch.

277
00:14:19,765 --> 00:14:23,905
And if r is not nullable,
we have this else branch.

278
00:14:23,905 --> 00:14:27,010
Okay? What can we do here?

279
00:14:27,010 --> 00:14:30,310
Well, we notice that
this one is just now

280
00:14:30,310 --> 00:14:34,510
the derivative of two
r's, one after another.

281
00:14:34,510 --> 00:14:37,330
But this we just
calculated a moment ago,

282
00:14:37,330 --> 00:14:40,120
so we can just
replace it by this.

283
00:14:40,120 --> 00:14:43,255
Ok. That's what we
calculated earlier.

284
00:14:43,255 --> 00:14:46,665
But now you will see
again this factor,

285
00:14:46,665 --> 00:14:48,695
and this factor is the same.

286
00:14:48,695 --> 00:14:52,700
So I can pull that
out to be like that.

287
00:14:52,700 --> 00:14:57,380
And I have now r followed
by r plus r. 

288
00:14:57,380 --> 00:14:59,030
But now you probably

289
00:14:59,030 --> 00:15:00,680
remember how we did it earlier.

290
00:15:00,680 --> 00:15:03,080
We can now pull out one copy of

291
00:15:03,080 --> 00:15:06,020
this are to just get
something like this.

292
00:15:06,020 --> 00:15:08,765
We have to add one essentially,

293
00:15:08,765 --> 00:15:11,615
but we now get r plus one.

294
00:15:11,615 --> 00:15:15,065
And this r here is
now just pulled out.

295
00:15:15,065 --> 00:15:18,995
Well, here again kicks
in this reasoning.

296
00:15:18,995 --> 00:15:22,700
We go in this if branch
only if r is nullable.

297
00:15:22,700 --> 00:15:26,150
So r on its own can already
match the empty string.

298
00:15:26,150 --> 00:15:28,895
So I don't need
really this plus one.

299
00:15:28,895 --> 00:15:30,695
I can just get rid of it.

300
00:15:30,695 --> 00:15:33,140
And so I now just have
to rearrange the parentheses

301
00:15:33,140 --> 00:15:35,450
which we said we can also do.

302
00:15:35,450 --> 00:15:37,595
So we have something like that.

303
00:15:37,595 --> 00:15:39,740
And here again, the
same reasoning,

304
00:15:39,740 --> 00:15:41,975
we have an if condition

305
00:15:41,975 --> 00:15:43,310
where it doesn't
really matter what

306
00:15:43,310 --> 00:15:44,405
we're going to return,

307
00:15:44,405 --> 00:15:46,205
it's in both branches the same.

308
00:15:46,205 --> 00:15:48,470
So we can just
replace it by that.

309
00:15:48,470 --> 00:15:51,920
And yes, now we can be
pretty sure they'll

310
00:15:51,920 --> 00:15:55,310
work for all the n-times
regular expressions.

311
00:15:55,310 --> 00:15:57,860
And I leave 
the calculation for

312
00:15:57,860 --> 00:16:02,570
maybe r to the four to you.

313
00:16:02,570 --> 00:16:04,490
And the reason why I do it

314
00:16:04,490 --> 00:16:06,605
in such a detail,
this calculation,

315
00:16:06,605 --> 00:16:08,960
this is really meant
to help you with

316
00:16:08,960 --> 00:16:13,200
the coursework which is
coming up this week.

317
00:16:13,210 --> 00:16:16,250
I'm now back in our
implementation.

318
00:16:16,250 --> 00:16:20,660
In this re2.sc, we said we have
this explicit constructor 

319
00:16:20,660 --> 00:16:25,535
for n-times. We can now fill
in the cases for nullable.

320
00:16:25,535 --> 00:16:27,635
So if we have r n-times,

321
00:16:27,635 --> 00:16:30,995
if this n is equal to
0, we return true.

322
00:16:30,995 --> 00:16:34,190
Otherwise we have to test
whether r is nullable.

323
00:16:34,190 --> 00:16:37,565
And in the derivative case, 

324
00:16:37,565 --> 00:16:40,339
if this n is equal to 0,

325
00:16:40,339 --> 00:16:43,564
the return this 0
regular expression.

326
00:16:43,564 --> 00:16:46,700
Otherwise we return the
sequence of the derivative

327
00:16:46,700 --> 00:16:50,270
of c of r followed by n times of r,

328
00:16:50,270 --> 00:16:54,545
but n minus one times, and
everything else stays the same.

329
00:16:54,545 --> 00:16:56,510
Here's the function for strings,

330
00:16:56,510 --> 00:16:58,430
derivative function for strings.

331
00:16:58,430 --> 00:17:01,595
In the main matcher
function is all the same.

332
00:17:01,595 --> 00:17:04,820
We still require this
definition because

333
00:17:04,820 --> 00:17:06,050
we haven't done anything about

334
00:17:06,050 --> 00:17:08,090
the optional regular
expression yet.

335
00:17:08,090 --> 00:17:10,670
And we have now this

336
00:17:10,670 --> 00:17:13,250
EVIL1 and EVIL2
regular expression.

337
00:17:13,250 --> 00:17:15,290
And let's test it. And let's be

338
00:17:15,290 --> 00:17:17,000
a bit more ambitious.
We're testing it with

339
00:17:17,000 --> 00:17:20,315
strings between 0 and 1000

340
00:17:20,315 --> 00:17:22,580
and let's see what the time say.

341
00:17:22,580 --> 00:17:26,210
I'm testing this again
inside the Ammonite REPL.

342
00:17:26,210 --> 00:17:30,180
And you'll see it should
be now much quicker.

343
00:17:30,610 --> 00:17:34,640
Okay, it might slow
down also around 600.

344
00:17:34,640 --> 00:17:40,490
700 needs two seconds,
three seconds for 800.

345
00:17:40,490 --> 00:17:43,940
Let's see what it
needs for one thousand?

346
00:17:43,940 --> 00:17:47,435
But you have to remember
Ruby and Python

347
00:17:47,435 --> 00:17:51,530
needed half a
minute for just 30 a's.

348
00:17:51,530 --> 00:17:54,485
We need a little bit
more than six seconds,

349
00:17:54,485 --> 00:17:57,110
but we are processing a string of

350
00:17:57,110 --> 00:18:00,575
1000 a's. So that success.

351
00:18:00,575 --> 00:18:04,775
This speed is also explained
if you look at the sizes.

352
00:18:04,775 --> 00:18:08,975
Since we now have this explicit
n-times constructor,

353
00:18:08,975 --> 00:18:11,930
it doesn't really matter
how big this n is.

354
00:18:11,930 --> 00:18:14,540
This EVIL1 regular
expression will always be

355
00:18:14,540 --> 00:18:17,195
of this size seven, at
the beginning.

356
00:18:17,195 --> 00:18:20,315
And you can also see if you
now build the derivatives,

357
00:18:20,315 --> 00:18:22,550
they still grow in size,

358
00:18:22,550 --> 00:18:24,740
but much more moderately.

359
00:18:24,740 --> 00:18:28,100
And let's try out this
example with 20 a's.

360
00:18:28,100 --> 00:18:31,685
So we build the derivative
according to 20 character a's.

361
00:18:31,685 --> 00:18:33,890
In the earlier example,

362
00:18:33,890 --> 00:18:35,780
we ended up with a
regular expression that

363
00:18:35,780 --> 00:18:37,895
had like 8 million plus nodes.

364
00:18:37,895 --> 00:18:39,830
And if we do this now,

365
00:18:39,830 --> 00:18:43,205
then we just have a regular
expression with 211 nodes.

366
00:18:43,205 --> 00:18:44,750
And that is much smaller and

367
00:18:44,750 --> 00:18:47,165
all the calculations
would be much quicker.

368
00:18:47,165 --> 00:18:49,550
So yeah, the Brzozowski

369
00:18:49,550 --> 00:18:52,205
algorithm
and with this improvement,

370
00:18:52,205 --> 00:18:54,890
we're now running
circles around Ruby and

371
00:18:54,890 --> 00:18:58,445
Python because they're just
stuck here at the beginning.

372
00:18:58,445 --> 00:19:00,230
Because they need already

373
00:19:00,230 --> 00:19:02,975
like half a minute
for just 30 a's.

374
00:19:02,975 --> 00:19:05,930
We now can do something
like thousand a's

375
00:19:05,930 --> 00:19:07,580
in a reasonable time.

376
00:19:07,580 --> 00:19:09,740
I think this must be
timing I obtained with

377
00:19:09,740 --> 00:19:12,635
my older laptop some time ago.

378
00:19:12,635 --> 00:19:14,210
Because remember we
had something like

379
00:19:14,210 --> 00:19:16,670
six seconds. Here it says 15.

380
00:19:16,670 --> 00:19:18,080
You always have to take

381
00:19:18,080 --> 00:19:20,885
these times with
the pinch of salt.

382
00:19:20,885 --> 00:19:22,670
It's essentially only the trend,

383
00:19:22,670 --> 00:19:25,100
but it's clear we are
much, much better.

384
00:19:25,100 --> 00:19:27,065
We have worked a lot,

385
00:19:27,065 --> 00:19:30,720
but we also got something
for it in return.