videos/02-algo3.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Wed, 21 Oct 2020 16:09:44 +0100
changeset 787 4b2496bc79b2
parent 774 42733adf2a48
permissions -rw-r--r--
updated

1
00:00:06,350 --> 00:00:10,305
Welcome come back. I
can hear you saying,

2
00:00:10,305 --> 00:00:12,000
yes, you tried it out on

3
00:00:12,000 --> 00:00:14,760
one example and you
were much better.

4
00:00:14,760 --> 00:00:17,415
But how about on other examples?

5
00:00:17,415 --> 00:00:21,265
Specifically, we had two
evil regular expressions.

6
00:00:21,265 --> 00:00:23,480
How about the other case?

7
00:00:23,480 --> 00:00:27,480
Well, let's have a look at
that other case in this video.

8
00:00:29,140 --> 00:00:32,705
So I'm back here
in this re2.sc file.

9
00:00:32,705 --> 00:00:35,180
And here's this test
case which run quite

10
00:00:35,180 --> 00:00:39,665
nicely for strings between 0 and 1000.

11
00:00:39,665 --> 00:00:42,725
Here is the other test case.

12
00:00:42,725 --> 00:00:44,090
I still run it only

13
00:00:44,090 --> 00:00:48,470
for relatively small
strings between 0 and 20.

14
00:00:48,470 --> 00:00:50,240
And let's see what it says.

15
00:00:50,240 --> 00:00:51,800
So I'm running the test in

16
00:00:51,800 --> 00:00:57,320
the Amoonite REPL and it 
doesn't look too bad.

17
00:00:57,320 --> 00:01:01,160
But this might be because
20 is not generous enough.

18
00:01:01,160 --> 00:01:03,620
So let's try it with 30.

19
00:01:03,620 --> 00:01:06,530
Let's run this test again.

20
00:01:06,530 --> 00:01:11,075
And 20 is quite okay.

21
00:01:11,075 --> 00:01:15,965
22, okay, that's now
almost ten times as much.

22
00:01:15,965 --> 00:01:18,995
And then the next
one would be 24.

23
00:01:18,995 --> 00:01:23,615
And we're waiting and waiting.

24
00:01:23,615 --> 00:01:26,410
And we are waiting.

25
00:01:26,410 --> 00:01:29,300
Actually, it isn't
calculated at all.

26
00:01:29,300 --> 00:01:31,399
It run out of memory.

27
00:01:31,399 --> 00:01:33,905
So here is something going on,

28
00:01:33,905 --> 00:01:37,640
which is definitely bad and we
have to have a look at that.

29
00:01:37,640 --> 00:01:40,640
It's always
instructive with

30
00:01:40,640 --> 00:01:43,460
this algorithm to
look at the sizes of

31
00:01:43,460 --> 00:01:45,695
the regular expressions
we calculate.

32
00:01:45,695 --> 00:01:49,625
The EVIL2 is this
a star, star b.

33
00:01:49,625 --> 00:01:51,800
So there's nothing we
can compress there.

34
00:01:51,800 --> 00:01:55,490
It has just stars and
sequences and nothing else.

35
00:01:55,490 --> 00:01:58,430
And so it's not that we
can play the same trick

36
00:01:58,430 --> 00:02:01,805
of trying to introduce
an explicit constructor.

37
00:02:01,805 --> 00:02:04,190
But now let's have a
look at the derivatives.

38
00:02:04,190 --> 00:02:05,600
As you can see here.

39
00:02:05,600 --> 00:02:08,420
If I take this EVIL2 regular

40
00:02:08,420 --> 00:02:09,935
expression and then build

41
00:02:09,935 --> 00:02:12,470
longer and
longer derivatives,

42
00:02:12,470 --> 00:02:14,090
you can see this grows.

43
00:02:14,090 --> 00:02:16,055
And as you see in this line,

44
00:02:16,055 --> 00:02:17,270
if I'm trying to take

45
00:02:17,270 --> 00:02:20,180
the derivative of a
quite large string,

46
00:02:20,180 --> 00:02:21,680
it is quite quick.

47
00:02:21,680 --> 00:02:26,870
But the size of the
regular expression,

48
00:02:26,870 --> 00:02:28,310
the number of nodes,

49
00:02:28,310 --> 00:02:30,815
is also like nearly
eight millions.

50
00:02:30,815 --> 00:02:33,845
And so even if that calculates
relatively quickly,

51
00:02:33,845 --> 00:02:37,865
that is the reason why at 24,

52
00:02:37,865 --> 00:02:39,560
it just runs out of memory.

53
00:02:39,560 --> 00:02:42,785
This regular expression
just grew too much.

54
00:02:42,785 --> 00:02:46,520
So we somehow need
to still compress

55
00:02:46,520 --> 00:02:49,655
this regular expression
and make it not

56
00:02:49,655 --> 00:02:52,700
grow so much when we
build the derivative.

57
00:02:52,700 --> 00:02:54,020
So we have to look at

58
00:02:54,020 --> 00:02:57,050
where does that grow
actually come from.

59
00:02:57,050 --> 00:03:00,170
Let's look at the
derivative operation

60
00:03:00,170 --> 00:03:01,895
again in more detail.

61
00:03:01,895 --> 00:03:03,740
When we introduced
it, we looked at

62
00:03:03,740 --> 00:03:06,560
this example of a
regulat expression r,

63
00:03:06,560 --> 00:03:11,465
which is a star of some
alternative and some sequence.

64
00:03:11,465 --> 00:03:13,130
And we can build now

65
00:03:13,130 --> 00:03:15,800
the derivative of this
r according to a,

66
00:03:15,800 --> 00:03:18,965
b, and c and see
what it calculates.

67
00:03:18,965 --> 00:03:21,935
And you see in case of a,

68
00:03:21,935 --> 00:03:26,570
it's like one times b plus
0 and then followed by r,

69
00:03:26,570 --> 00:03:29,015
which is this whole
regular expression again.

70
00:03:29,015 --> 00:03:30,935
So you can also see

71
00:03:30,935 --> 00:03:34,745
the derivative operation
introduces a lot of junk.

72
00:03:34,745 --> 00:03:38,165
This plus 0 isn't
really necessary.

73
00:03:38,165 --> 00:03:42,455
And this times 1 could
be also thrown away.

74
00:03:42,455 --> 00:03:43,685
So in the first case,

75
00:03:43,685 --> 00:03:48,110
actually we could just have
one times b is b plus 0,

76
00:03:48,110 --> 00:03:49,580
it still b,

77
00:03:49,580 --> 00:03:54,905
so it would be b followed by
r. And in this second case,

78
00:03:54,905 --> 00:03:57,515
0 times b, that will be just 0.

79
00:03:57,515 --> 00:03:59,270
Then 0 plus one is

80
00:03:59,270 --> 00:04:05,330
1 ... times r would be just
r. And in the last case,

81
00:04:05,330 --> 00:04:12,155
0 times b would be 0. 0 plus
0 is 0. 0 times r would be 0.

82
00:04:12,155 --> 00:04:17,540
Okay? So we could throw out
all these useless zeros and

83
00:04:17,540 --> 00:04:20,960
ones because actually
we have to throw

84
00:04:20,960 --> 00:04:24,845
them out because over time
they just accumulate.

85
00:04:24,845 --> 00:04:27,035
And then we build
the next derivative.

86
00:04:27,035 --> 00:04:31,130
And the 0s wouldn't go away by
building a new derivative.

87
00:04:31,130 --> 00:04:34,310
So we need to somehow
a mechanism to

88
00:04:34,310 --> 00:04:39,120
delete the junk from
these derivatives.

89
00:04:39,280 --> 00:04:41,900
But how to delete junk?

90
00:04:41,900 --> 00:04:43,370
We already know we have

91
00:04:43,370 --> 00:04:47,825
the simplification rules
which say like if r plus 0,

92
00:04:47,825 --> 00:04:53,000
I can just replace by
r and the other ones.

93
00:04:53,000 --> 00:04:55,760
And so what I now
need to do is in

94
00:04:55,760 --> 00:04:58,715
my algorithm when I
built the derivative,

95
00:04:58,715 --> 00:05:01,415
this might have
introduced some junk.

96
00:05:01,415 --> 00:05:02,960
And I now have to have

97
00:05:02,960 --> 00:05:06,590
essentially an additional function

98
00:05:06,590 --> 00:05:08,945
which deletes this junk again.

99
00:05:08,945 --> 00:05:10,490
And in doing so,

100
00:05:10,490 --> 00:05:13,340
keeps the regular expression,

101
00:05:13,340 --> 00:05:16,730
relatively small, because that

102
00:05:16,730 --> 00:05:19,535
is the Achilles' heel
of this algorithm.

103
00:05:19,535 --> 00:05:22,565
If this regular expression
grows too much,

104
00:05:22,565 --> 00:05:25,070
then all these functions
will slow down.

105
00:05:25,070 --> 00:05:26,360
So the purpose of

106
00:05:26,360 --> 00:05:30,455
the simplification function
is to after the derivative,

107
00:05:30,455 --> 00:05:33,080
compress again this
regular expression

108
00:05:33,080 --> 00:05:35,570
and then do the next derivative.

109
00:05:35,570 --> 00:05:39,815
So if we introduce that
additional functions simp,

110
00:05:39,815 --> 00:05:42,440
which essentially
just goes through

111
00:05:42,440 --> 00:05:46,040
this regular expression and
deletes all this junk,

112
00:05:46,040 --> 00:05:50,045
then we should be in a
much better position.

113
00:05:50,045 --> 00:05:52,490
As you can see on this slide,

114
00:05:52,490 --> 00:05:54,440
the simplification
function can actually

115
00:05:54,440 --> 00:05:56,855
be implemented very easily.

116
00:05:56,855 --> 00:05:59,750
However, there are few
interesting points.

117
00:05:59,750 --> 00:06:02,720
First of all, there
are only two cases.

118
00:06:02,720 --> 00:06:05,060
We only simplify when we have

119
00:06:05,060 --> 00:06:08,255
an alternative or a sequence.

120
00:06:08,255 --> 00:06:12,859
So for example, we will
never simplify under star.

121
00:06:12,859 --> 00:06:15,950
If you look at the
derivative operation

122
00:06:15,950 --> 00:06:17,825
and you scratch your
head a little bit,

123
00:06:17,825 --> 00:06:20,180
we'll find out why
is that the case.

124
00:06:20,180 --> 00:06:22,145
It's essentially a waste of time.

125
00:06:22,145 --> 00:06:25,505
So you would not
simplify under star.

126
00:06:25,505 --> 00:06:27,650
You only simplify if you have

127
00:06:27,650 --> 00:06:30,530
an alternative and the sequence.

128
00:06:30,530 --> 00:06:34,880
Now, however, there
is one small point

129
00:06:34,880 --> 00:06:39,785
you have to be aware of.
These simplification rules

130
00:06:39,785 --> 00:06:42,740
they need to be essentially
applied backwards.

131
00:06:42,740 --> 00:06:45,650
So I have to first essentially
go to the leaves of

132
00:06:45,650 --> 00:06:49,085
this regular expression and
then have to find out,

133
00:06:49,085 --> 00:06:51,170
can I apply the simplification?

134
00:06:51,170 --> 00:06:52,670
And then if yes,

135
00:06:52,670 --> 00:06:55,430
I apply the simplification
and I look at the next step,

136
00:06:55,430 --> 00:06:57,815
can I now simplify
something more?

137
00:06:57,815 --> 00:06:59,390
The reason is how

138
00:06:59,390 --> 00:07:03,125
the simplification
rules are formulated.

139
00:07:03,125 --> 00:07:05,300
They might fire in

140
00:07:05,300 --> 00:07:08,765
a higher level if something
simplifies below.

141
00:07:08,765 --> 00:07:14,315
So we have to essentially
simplify from the inside out.

142
00:07:14,315 --> 00:07:16,850
And in this
simplification function,

143
00:07:16,850 --> 00:07:20,885
what that means we're
matching this regular expression.

144
00:07:20,885 --> 00:07:23,120
We test whether it's
an alternative or

145
00:07:23,120 --> 00:07:26,345
a sequence. Only then we
actually go into action.

146
00:07:26,345 --> 00:07:28,385
Once we know

147
00:07:28,385 --> 00:07:30,530
in case of an alternative,

148
00:07:30,530 --> 00:07:32,120
what are the two components?

149
00:07:32,120 --> 00:07:35,765
We first, simplify each component,

150
00:07:35,765 --> 00:07:39,095
okay, and then we
get a result back.

151
00:07:39,095 --> 00:07:41,180
And we check whether

152
00:07:41,180 --> 00:07:45,679
this simplification of
r1 resulted into a 0.

153
00:07:45,679 --> 00:07:47,884
Then because it's an alternative

154
00:07:47,884 --> 00:07:49,190
then I can just replace it

155
00:07:49,190 --> 00:07:53,375
by r2s, which a simplified
version of r2.

156
00:07:53,375 --> 00:07:58,820
If it came back r2s
is actually 0,

157
00:07:58,820 --> 00:08:00,410
then I can replace it by

158
00:08:00,410 --> 00:08:03,785
the simplified version of a r1.

159
00:08:03,785 --> 00:08:07,460
If I simplify both
alternatives and it

160
00:08:07,460 --> 00:08:11,180
happens that the simplified
versions are the same,

161
00:08:11,180 --> 00:08:15,170
then  I can throw away
one of the alternatives.

162
00:08:15,170 --> 00:08:18,080
Otherwise, I just have to
keep the alternatives,

163
00:08:18,080 --> 00:08:21,155
but the simplified components.

164
00:08:21,155 --> 00:08:23,915
In the sequence case it is
pretty much the same.

165
00:08:23,915 --> 00:08:27,950
I first have to check what
does it simplify underneath.

166
00:08:27,950 --> 00:08:31,385
So I call first simplify
and then have a look...

167
00:08:31,385 --> 00:08:33,020
Does it matches 0 one of 

168
00:08:33,020 --> 00:08:36,035
the simplifications,
then I just return 0.

169
00:08:36,035 --> 00:08:38,330
Or if the other component is 0,

170
00:08:38,330 --> 00:08:40,535
I can also return a 0.

171
00:08:40,535 --> 00:08:43,310
If it's one, then I keep
the other component.

172
00:08:43,310 --> 00:08:45,920
And if the rs2 is one,

173
00:08:45,920 --> 00:08:47,615
and I keep the first component.

174
00:08:47,615 --> 00:08:50,359
And otherwise it's
still the sequence.

175
00:08:50,359 --> 00:08:53,540
And in all the other cases I
don't have to do anything.

176
00:08:53,540 --> 00:08:55,700
If it's just a plain
0. I leave it in.

177
00:08:55,700 --> 00:08:57,860
If it's a plain
1, I leave it in.

178
00:08:57,860 --> 00:09:00,170
And if something is under a star,

179
00:09:00,170 --> 00:09:02,840
I don't open up this
star and simplify it.

180
00:09:02,840 --> 00:09:06,680
I just apply that to be
as quick as possible.

181
00:09:06,680 --> 00:09:10,280
Let's see whether this has
any effect on our code.

182
00:09:10,280 --> 00:09:12,980
So I'm now in the
file re3.sc,

183
00:09:12,980 --> 00:09:17,405
and it's the same as re2.sc.

184
00:09:17,405 --> 00:09:20,885
It still has this
explicit and n-times case,

185
00:09:20,885 --> 00:09:24,665
nullable and derivative are
still the same.

186
00:09:24,665 --> 00:09:28,610
Except now we have this
simplification function as well.

187
00:09:28,610 --> 00:09:29,960
And when we apply

188
00:09:29,960 --> 00:09:33,725
the derivative and after
we apply the derivative,

189
00:09:33,725 --> 00:09:35,870
we simplify it to throw away

190
00:09:35,870 --> 00:09:39,050
all the junk this
derivative introduced.

191
00:09:39,050 --> 00:09:41,510
Otherwise everything
stays the same.

192
00:09:41,510 --> 00:09:43,580
You still have this expansion

193
00:09:43,580 --> 00:09:45,515
of the optional regular expression.

194
00:09:45,515 --> 00:09:49,670
Here are our two EVIL regular
expressions we use as a test.

195
00:09:49,670 --> 00:09:52,460
And here's now this test case,

196
00:09:52,460 --> 00:09:55,835
the first one and we're
actually now trying it

197
00:09:55,835 --> 00:10:00,515
out for strings between
0 and 9000 a's.

198
00:10:00,515 --> 00:10:08,285
So let's see. So also the 
simplification makes 

199
00:10:08,285 --> 00:10:10,655
actually this case faster.

200
00:10:10,655 --> 00:10:16,260
So we can now run strings
up to 9000.

201
00:10:16,260 --> 00:10:19,360
And we don't actually
sweat about this at all.

202
00:10:19,360 --> 00:10:24,745
For I have to say my laptop
is now starting to run its fan.

203
00:10:24,745 --> 00:10:28,525
And also, because the times
are relatively small,

204
00:10:28,525 --> 00:10:30,610
I'm actually running
each test three

205
00:10:30,610 --> 00:10:32,785
times and then take the average,

206
00:10:32,785 --> 00:10:34,720
which I didn't do before,

207
00:10:34,720 --> 00:10:37,720
just to be a tiny bit
more accurate.

208
00:10:37,720 --> 00:10:42,025
So three seconds for a
string of 9000 a's.

209
00:10:42,025 --> 00:10:44,980
And now the more
interesting question is,

210
00:10:44,980 --> 00:10:46,240
for the second case,

211
00:10:46,240 --> 00:10:48,625
this a star, star, b.

212
00:10:48,625 --> 00:10:51,250
And you can already see
on these numbers...

213
00:10:51,250 --> 00:10:53,260
we are really ambitious.

214
00:10:53,260 --> 00:10:57,860
And let's see how our
program is coping with that.

215
00:11:02,610 --> 00:11:07,780
Again. No sweat, at
least not on my part.

216
00:11:07,780 --> 00:11:10,480
The laptop has to
calculate quite a bit.

217
00:11:10,480 --> 00:11:12,940
But this is now a string of

218
00:11:12,940 --> 00:11:16,539
3 million a's and
it needs a second.

219
00:11:16,539 --> 00:11:20,680
And let's see how far we get.

220
00:11:20,680 --> 00:11:25,280
4 million a's around two seconds.

221
00:11:27,030 --> 00:11:29,050
So say it again,

222
00:11:29,050 --> 00:11:31,690
I'm actually slowing it down

223
00:11:31,690 --> 00:11:34,150
here artificially with running the test

224
00:11:34,150 --> 00:11:36,895
three times and then
take the average.

225
00:11:36,895 --> 00:11:42,749
But it seems to be something
less than five seconds.

226
00:11:42,749 --> 00:11:48,185
And remember that is a
string of 6 million a's.

227
00:11:48,185 --> 00:11:51,170
Let's just have a
look at the graphs.

228
00:11:51,170 --> 00:11:56,105
So the simplification also
made our first case faster.

229
00:11:56,105 --> 00:11:58,880
So earlier without
simplification,

230
00:11:58,880 --> 00:12:00,710
where we just have
this explicit 

231
00:12:00,710 --> 00:12:03,050
n-times. That is this graph.

232
00:12:03,050 --> 00:12:05,210
And now we can go up to

233
00:12:05,210 --> 00:12:09,620
10 thousand and be still
under five seconds.

234
00:12:09,620 --> 00:12:12,300
So that is good news.

235
00:12:13,270 --> 00:12:16,745
In the other example, remember,

236
00:12:16,745 --> 00:12:19,220
the existing regular
expression matchers in

237
00:12:19,220 --> 00:12:22,880
Java 8, Python,
and JavaScript.

238
00:12:22,880 --> 00:12:26,030
And thanks to the
student we also 

239
00:12:26,030 --> 00:12:27,935
have a graph for Swift.

240
00:12:27,935 --> 00:12:29,750
They're pretty atrocious.

241
00:12:29,750 --> 00:12:33,320
They need like for 30 a's,

242
00:12:33,320 --> 00:12:37,490
something like on the
magnitude of 30 seconds.

243
00:12:37,490 --> 00:12:39,740
As I said already,

244
00:12:39,740 --> 00:12:42,665
Java 9 is slightly better.

245
00:12:42,665 --> 00:12:44,870
Java 9 and above, 

246
00:12:44,870 --> 00:12:46,220
if you try that example,

247
00:12:46,220 --> 00:12:48,665
the same example in Java 9,

248
00:12:48,665 --> 00:12:51,230
you would be able to
process something

249
00:12:51,230 --> 00:12:54,215
like 40 thousand a's
in half a minute.

250
00:12:54,215 --> 00:12:57,740
I have to admit I'm not
100% sure what they

251
00:12:57,740 --> 00:13:03,575
did to improve the
performance between Java 8 and 9.

252
00:13:03,575 --> 00:13:05,510
I know they did something on

253
00:13:05,510 --> 00:13:07,790
their regular expression
matching engine,

254
00:13:07,790 --> 00:13:09,770
but I haven't really looked

255
00:13:09,770 --> 00:13:12,230
at what precisely they've done.

256
00:13:12,230 --> 00:13:17,945
But that's still not
really a competition for us.

257
00:13:17,945 --> 00:13:20,915
So our third version,

258
00:13:20,915 --> 00:13:24,335
in this example, a star, star b.

259
00:13:24,335 --> 00:13:28,445
We said for something like
6 million a's we need 15 secs.

260
00:13:28,445 --> 00:13:31,760
And again, I think these
are slightly older times,

261
00:13:31,760 --> 00:13:33,770
so here it needs 20 seconds.

262
00:13:33,770 --> 00:13:37,250
I think we had something
like below five seconds.

263
00:13:37,250 --> 00:13:40,865
But again, you can see
the trends. We run.

264
00:13:40,865 --> 00:13:42,590
circles around them.

265
00:13:42,590 --> 00:13:46,530
And that's quite nice.

266
00:13:46,570 --> 00:13:49,774
So what's good about
this algorithm

267
00:13:49,774 --> 00:13:51,605
by Brzozowski?

268
00:13:51,605 --> 00:13:54,500
Well, first, it extends

269
00:13:54,500 --> 00:13:57,800
actually to quite a number
of regular expressions.

270
00:13:57,800 --> 00:13:59,900
So we saw in this example

271
00:13:59,900 --> 00:14:03,950
the n-times regular expression.
Is a little bit of work, but

272
00:14:03,950 --> 00:14:05,405
we could extend that.

273
00:14:05,405 --> 00:14:08,480
It also extends to the
not regular expression,

274
00:14:08,480 --> 00:14:10,820
which I explain on
the next slide.

275
00:14:10,820 --> 00:14:15,290
It's very easy to implement
in a functional language.

276
00:14:15,290 --> 00:14:16,610
If you don't buy

277
00:14:16,610 --> 00:14:20,675
all that functional
programming language stuff,

278
00:14:20,675 --> 00:14:22,955
you still have to agree.

279
00:14:22,955 --> 00:14:25,715
It's quite beautiful in Scala.

280
00:14:25,715 --> 00:14:28,160
And you could also
easily implemented

281
00:14:28,160 --> 00:14:31,760
in the C language or in Python.

282
00:14:31,760 --> 00:14:34,250
There's really no
problem with that.

283
00:14:34,250 --> 00:14:38,390
The algorithm is actually
quite old already. 

284
00:14:38,390 --> 00:14:44,899
Brzozowski wrote it down in
1964 and his PhD thesis.

285
00:14:44,899 --> 00:14:49,460
Somehow it was forgotten during

286
00:14:49,460 --> 00:14:54,095
the next decades and only
recently has been rediscovered.

287
00:14:54,095 --> 00:14:57,065
At the moment, we are
only solved the problem

288
00:14:57,065 --> 00:15:02,240
of given a regular expression,
givven a string, does

289
00:15:02,240 --> 00:15:05,550
the regular expression
match the string yes or no.

290
00:15:05,550 --> 00:15:08,740
We want to of course, use
it as part of our lexer.

291
00:15:08,740 --> 00:15:12,370
And there we have to do
something more complicated.

292
00:15:12,370 --> 00:15:15,384
We have to essentially
generate tokens.

293
00:15:15,384 --> 00:15:18,670
And this will still take
a little bit of work.

294
00:15:18,670 --> 00:15:22,045
And that's actually relatively
recent work by somebody

295
00:15:22,045 --> 00:15:26,110
called Sulzmann and
the co-worker called Lu.

296
00:15:26,110 --> 00:15:30,880
And what I personally
also find very satisfying

297
00:15:30,880 --> 00:15:32,695
about this algorithm is

298
00:15:32,695 --> 00:15:36,040
that we can actually
prove that it's correct.

299
00:15:36,040 --> 00:15:37,735
You might remember I did

300
00:15:37,735 --> 00:15:41,500
quite some interesting
transformations

301
00:15:41,500 --> 00:15:44,830
when I calculated the derivative.

302
00:15:44,830 --> 00:15:48,270
How can I be sure that
this is all correct?

303
00:15:48,270 --> 00:15:50,270
Actually, I can now go away and

304
00:15:50,270 --> 00:15:52,865
prove the correctness
of this algorithm.

305
00:15:52,865 --> 00:15:56,645
Does it really satisfy
the specification?

306
00:15:56,645 --> 00:15:58,550
Is it really true that if a string

307
00:15:58,550 --> 00:16:00,440
is in the language
of a regular expression.

308
00:16:00,440 --> 00:16:03,050
Does that matter? I would say yes.

309
00:16:03,050 --> 00:16:07,070
However, I leave that
for another video.

310
00:16:07,070 --> 00:16:10,580
What I also like about
this algorithm that can be

311
00:16:10,580 --> 00:16:13,775
actually extended to quite a
number of regular expressions.

312
00:16:13,775 --> 00:16:17,810
So this is the NOT regular
expression that is

313
00:16:17,810 --> 00:16:19,760
supposed to match strings which

314
00:16:19,760 --> 00:16:22,235
this regular expression cannot match.

315
00:16:22,235 --> 00:16:24,245
So here's an example.

316
00:16:24,245 --> 00:16:28,640
If you're in the business of
trying to find out what

317
00:16:28,640 --> 00:16:33,530
are comments in languages like
Java or C. Then you know,

318
00:16:33,530 --> 00:16:35,060
they always start with a slash,

319
00:16:35,060 --> 00:16:36,245
then comes a star,

320
00:16:36,245 --> 00:16:38,240
then comes an arbitrary string.

321
00:16:38,240 --> 00:16:41,195
And then they finish
with a slash and a star.

322
00:16:41,195 --> 00:16:44,330
And how you want to specify
that is something like this.

323
00:16:44,330 --> 00:16:45,530
You want to say, OK,

324
00:16:45,530 --> 00:16:48,245
a comment starts with
a slash and a star.

325
00:16:48,245 --> 00:16:51,410
Then it comes any
string in between.

326
00:16:51,410 --> 00:16:55,340
But this string
in-between cannot contain

327
00:16:55,340 --> 00:16:58,280
any star and slash
because that would then

328
00:16:58,280 --> 00:17:01,415
indicate there's the
end already there.

329
00:17:01,415 --> 00:17:03,680
And then after you
have such a string

330
00:17:03,680 --> 00:17:06,410
which doesn't
contain a star and a slash,

331
00:17:06,410 --> 00:17:11,180
then at the end you want to
match a star and a slash.

332
00:17:11,180 --> 00:17:13,460
So for these kind of comments,

333
00:17:13,460 --> 00:17:15,995
this regular expression is
actually quite useful.

334
00:17:15,995 --> 00:17:19,430
And if you ever seen
how to do the negation,

335
00:17:19,430 --> 00:17:21,995
for this kind of regular
expression with automata,

336
00:17:21,995 --> 00:17:24,710
you will know that's
a bit of a pain.

337
00:17:24,710 --> 00:17:26,675
But for the Brzozowski,

338
00:17:26,675 --> 00:17:28,370
it's no pain at all.

339
00:17:28,370 --> 00:17:30,710
The meaning of this
regular expression

340
00:17:30,710 --> 00:17:32,255
would be something like that.

341
00:17:32,255 --> 00:17:35,540
If I have all the
strings there are,

342
00:17:35,540 --> 00:17:38,390
and I take away all the strings,

343
00:17:38,390 --> 00:17:40,055
this r can match.

344
00:17:40,055 --> 00:17:42,080
The remaining strings are

345
00:17:42,080 --> 00:17:44,630
what this regular expression
is supposed to match.

346
00:17:44,630 --> 00:17:47,165
So I can specify
what the meaning is.

347
00:17:47,165 --> 00:17:49,760
And then it's also very easy to

348
00:17:49,760 --> 00:17:52,174
say what is nullable
and derivative.

349
00:17:52,174 --> 00:17:54,140
So for nullable, it would be

350
00:17:54,140 --> 00:17:56,570
just a test whether r
is nullable.

351
00:17:56,570 --> 00:17:58,985
And I just take the
negation of that.

352
00:17:58,985 --> 00:18:00,680
And then I have to build

353
00:18:00,680 --> 00:18:03,620
the derivative of this
NOT regular expression.

354
00:18:03,620 --> 00:18:05,420
Then I just have to move

355
00:18:05,420 --> 00:18:07,325
this ....

356
00:18:07,325 --> 00:18:10,070
derivative inside
the regular expression

357
00:18:10,070 --> 00:18:12,575
and keep the NOT on
the outermost level.

358
00:18:12,575 --> 00:18:15,515
So there's again no
pain involved in

359
00:18:15,515 --> 00:18:19,130
adding this regular expression
to the algorithm.

360
00:18:19,130 --> 00:18:22,100
We made it to the end of
this video and we made

361
00:18:22,100 --> 00:18:24,739
it also to the end
of this algorithm.

362
00:18:24,739 --> 00:18:27,620
I can see to loose strands.

363
00:18:27,620 --> 00:18:29,420
One is we implemented

364
00:18:29,420 --> 00:18:32,855
this algorithm for the
basic regular expressions.

365
00:18:32,855 --> 00:18:38,705
And we also added the 
n-times out of necessity.

366
00:18:38,705 --> 00:18:41,120
And I also showed
you how to implement

367
00:18:41,120 --> 00:18:44,840
the NOT regular
expression on a slide.

368
00:18:44,840 --> 00:18:48,995
But your task in the
coursework actually is

369
00:18:48,995 --> 00:18:52,970
to extend that to some of
the extended regular expressions.

370
00:18:52,970 --> 00:18:57,260
So especially these bounded
repetitions like from 0 to

371
00:18:57,260 --> 00:19:01,550
n-times or between n and m times.

372
00:19:01,550 --> 00:19:04,325
And I think also
plus and option.

373
00:19:04,325 --> 00:19:07,220
The other loose strand is,

374
00:19:07,220 --> 00:19:09,200
remember I did this wild

375
00:19:09,200 --> 00:19:11,645
calculations with
regular expressions.

376
00:19:11,645 --> 00:19:13,205
Why on earth

377
00:19:13,205 --> 00:19:14,480
is that all correct?

378
00:19:14,480 --> 00:19:16,355
Why on earth should
this algorithm

379
00:19:16,355 --> 00:19:18,575
actually satisfy
our specification,

380
00:19:18,575 --> 00:19:20,450
which we set out
at the beginning.

381
00:19:20,450 --> 00:19:25,410
So that needs to be looked at
more closely. Bye for now.