videos/02-prelims.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Wed, 16 Oct 2024 13:14:13 +0100
changeset 968 d8d8911a3d6f
parent 766 e8402d8ec8e6
permissions -rw-r--r--
updated

1
00:00:09,990 --> 00:00:13,465
Welcome back to this
week's lecture.

2
00:00:13,465 --> 00:00:16,450
The task this week is to actually

3
00:00:16,450 --> 00:00:20,320
implement a regular
expression matcher.

4
00:00:20,320 --> 00:00:22,510
And we want to be a bit

5
00:00:22,510 --> 00:00:25,315
faster than the regular
expression matcher

6
00:00:25,315 --> 00:00:29,380
in Python, Ruby,
Javascript, and Java.

7
00:00:29,380 --> 00:00:31,960
Remember this regular expression

8
00:00:31,960 --> 00:00:35,785
and strings which are
just a number of a's,

9
00:00:35,785 --> 00:00:39,670
these regular expression matchers
where abysmally slow.

10
00:00:39,670 --> 00:00:43,170
They could only match
approximately 30 a's in

11
00:00:43,170 --> 00:00:45,665
something like half a minute.

12
00:00:45,665 --> 00:00:49,460
What we like to do with
our regular expression matcher.

13
00:00:49,460 --> 00:00:51,890
We will try a few times,

14
00:00:51,890 --> 00:00:55,505
but our second trial will already
be much, much better.

15
00:00:55,505 --> 00:00:58,400
It will probably
match around maybe

16
00:00:58,400 --> 00:01:02,030
thousand a's in something
in half a minute.

17
00:01:02,030 --> 00:01:03,920
But our third version in

18
00:01:03,920 --> 00:01:06,230
comparison will be
blindingly fast.

19
00:01:06,230 --> 00:01:08,180
And we'll be able to match

20
00:01:08,180 --> 00:01:10,145
strings of length 10,000 a's

21
00:01:10,145 --> 00:01:12,230
and will not require

22
00:01:12,230 --> 00:01:14,975
more than five seconds.
So let's go ahead with that.

23
00:01:14,975 --> 00:01:18,095
We will first implement

24
00:01:18,095 --> 00:01:19,880
our regular expression
matcher only

25
00:01:19,880 --> 00:01:22,069
for the basic
regular expressions.

26
00:01:22,069 --> 00:01:25,430
So we will have only six
cases to think about.

27
00:01:25,430 --> 00:01:27,620
This will simplify matters at first.

28
00:01:27,620 --> 00:01:30,350
Later we can
think about how to

29
00:01:30,350 --> 00:01:34,100
extend that to the extended
regular expressions.

30
00:01:34,100 --> 00:01:37,625
Unfortunately, before we can
start our implementation,

31
00:01:37,625 --> 00:01:39,290
we have to discuss

32
00:01:39,290 --> 00:01:42,470
when two regular
expressions are equivalent.

33
00:01:42,470 --> 00:01:46,595
And we use here this notation
of the triple equals.

34
00:01:46,595 --> 00:01:48,215
We say a regular expression

35
00:01:48,215 --> 00:01:50,180
r1 and r2 are

36
00:01:50,180 --> 00:01:56,059
equivalent if and only
if the language of r1 is

37
00:01:56,059 --> 00:01:59,360
equal to the language of r2.

38
00:01:59,360 --> 00:02:02,915
Sounds complicated,
but essentially means

39
00:02:02,915 --> 00:02:04,700
if the two regular expressions can

40
00:02:04,700 --> 00:02:06,950
match exactly the same strings,

41
00:02:06,950 --> 00:02:09,800
then we want to regard
them as equivalent.

42
00:02:09,800 --> 00:02:14,300
This equivalence justifies
why we can be sloppy

43
00:02:14,300 --> 00:02:16,040
with parentheses.

44
00:02:16,040 --> 00:02:23,870
For example, if we have
(r1 + r2) + r3,

45
00:02:23,870 --> 00:02:32,270
then this will be equivalent
to r1 + (r2 + r3).

46
00:02:32,270 --> 00:02:34,910
Remember, regular
expressions are trees,

47
00:02:34,910 --> 00:02:37,940
so these are two different
regular expressions,

48
00:02:37,940 --> 00:02:41,540
but they can match
the same strings.

49
00:02:41,540 --> 00:02:45,695
Because, well, if we take
here the meaning of that,

50
00:02:45,695 --> 00:02:54,350
that would be just L(r1 + r2 + r3),


51
00:02:54,350 --> 00:03:04,100
which is equal to L(r1 + r2) u L(r3).


52
00:03:04,100 --> 00:03:10,595
And that is equal to of 

53
00:03:10,595 --> 00:03:15,710
L(r1) u L(r2) u L(r3).


54
00:03:15,710 --> 00:03:17,930
And if you do the
same calculation

55
00:03:17,930 --> 00:03:19,445
for that regular expression,

56
00:03:19,445 --> 00:03:22,940
you will find out the
two sets are the same.

57
00:03:22,940 --> 00:03:26,195
So both regular expressions
can match the same strings.

58
00:03:26,195 --> 00:03:28,805
So even if they're different
regular expressions,

59
00:03:28,805 --> 00:03:31,220
we can regard them as eqivalent.

60
00:03:31,220 --> 00:03:33,290
And so that's why we can forget

61
00:03:33,290 --> 00:03:35,270
about writing these parentheses.

62
00:03:35,270 --> 00:03:40,205
Let's have a look
at some more examples.

63
00:03:40,205 --> 00:03:41,870
So the first one here,

64
00:03:41,870 --> 00:03:43,205
that is now clear.

65
00:03:43,205 --> 00:03:45,200
We did this calculation already

66
00:03:45,200 --> 00:03:47,480
for arbitrary regular expressions.

67
00:03:47,480 --> 00:03:49,520
Here it is for
concrete ones a, b and c.

68
00:03:49,520 --> 00:03:52,690
Here on the left-hand side,

69
00:03:52,690 --> 00:03:54,895
the regular expression
has the same meaning

70
00:03:54,895 --> 00:03:56,620
as on the right-hand side.

71
00:03:56,620 --> 00:03:58,420
So they are equivalent.

72
00:03:58,420 --> 00:04:01,375
We can ignore the parentheses.

73
00:04:01,375 --> 00:04:03,220
If I have a choice,

74
00:04:03,220 --> 00:04:06,610
but the choice is
the same a or a,

75
00:04:06,610 --> 00:04:09,265
then this is
equivalent to just a.

76
00:04:09,265 --> 00:04:13,075
So the same choice doesn't
introduce anything more.

77
00:04:13,075 --> 00:04:15,370
In the next case, if I have

78
00:04:15,370 --> 00:04:19,450
a regular expression
which can match a or b,

79
00:04:19,450 --> 00:04:23,860
that can match the same
strings as b or a.

80
00:04:23,860 --> 00:04:27,325
So you have a kind of
commutativity for the plus,

81
00:04:27,325 --> 00:04:29,485
which is like on natural numbers.

82
00:04:29,485 --> 00:04:32,880
But you would not have a
communitivity for the sequence:

83
00:04:32,880 --> 00:04:37,070
if you have
first a and then b,

84
00:04:37,070 --> 00:04:40,850
that's not the same as
matching first b and then a.

85
00:04:40,850 --> 00:04:42,800
So for the sequence ...

86
00:04:42,800 --> 00:04:44,615
they are not equivalent.

87
00:04:44,615 --> 00:04:49,475
However, for the sequence I
can, like for the plus, ignore

88
00:04:49,475 --> 00:04:51,245
prarentheses.

89
00:04:51,245 --> 00:04:55,070
If I have the parentheses
and this way,

90
00:04:55,070 --> 00:04:57,785
or I have it in this way.

91
00:04:57,785 --> 00:05:00,605
These are two different
regular expressions,

92
00:05:00,605 --> 00:05:02,120
but they have the same meaning.

93
00:05:02,120 --> 00:05:03,815
So they are equivalent.

94
00:05:03,815 --> 00:05:05,780
And so I can leave out parentheses

95
00:05:05,780 --> 00:05:09,170
for times as well.

96
00:05:09,170 --> 00:05:12,185
The next one is a slightly
more interesting one.

97
00:05:12,185 --> 00:05:15,020
If I have a regular
expression which can match

98
00:05:15,020 --> 00:05:19,655
c first followed by (a + b),

99
00:05:19,655 --> 00:05:25,355
then this is the same as
first c followed by a,

100
00:05:25,355 --> 00:05:28,640
or c followed by b.

101
00:05:28,640 --> 00:05:31,265
So that's the kind of
distributivity law

102
00:05:31,265 --> 00:05:33,545
on regular expressions.

103
00:05:33,545 --> 00:05:36,350
But they're also regular expressions
which are not equivalent.

104
00:05:36,350 --> 00:05:38,990
If I have the regular
expression which can

105
00:05:38,990 --> 00:05:42,800
match the string containing
exactly two a's.

106
00:05:42,800 --> 00:05:44,240
That is not equivalent

107
00:05:44,240 --> 00:05:46,730
to the regular
expression which can just match

108
00:05:46,730 --> 00:05:49,250
a single a. And similarly

109
00:05:49,250 --> 00:05:51,680
in this case, if I can match

110
00:05:51,680 --> 00:05:56,075
a or the string b followed by c,

111
00:05:56,075 --> 00:05:59,825
that is not the same as a or b,

112
00:05:59,825 --> 00:06:02,000
followed by a or c.

113
00:06:02,000 --> 00:06:05,990
I'll let you think about
why is that not equivalent.

114
00:06:05,990 --> 00:06:08,060
Essentially you
have to find out what's

115
00:06:08,060 --> 00:06:10,475
the meaning of both
regular expressions.

116
00:06:10,475 --> 00:06:14,090
And are they the
same sets or not?

117
00:06:14,090 --> 00:06:17,435
There are again some
interesting corner cases.

118
00:06:17,435 --> 00:06:20,540
If I have a regular expression
that can match a,

119
00:06:20,540 --> 00:06:23,450
but followed by the regular
expression which cannot match

120
00:06:23,450 --> 00:06:25,670
anything, that is not equivalent

121
00:06:25,670 --> 00:06:29,255
to the regular
expression which can match a.

122
00:06:29,255 --> 00:06:31,340
Again here you have
to do the calculation

123
00:06:31,340 --> 00:06:32,915
to convince you of that.

124
00:06:32,915 --> 00:06:35,840
Similarly, if I have a regular
expression which can

125
00:06:35,840 --> 00:06:38,750
match a or the empty string,

126
00:06:38,750 --> 00:06:40,640
this is not equivalent to

127
00:06:40,640 --> 00:06:43,355
the regular expression
which can just match a.

128
00:06:43,355 --> 00:06:46,925
Here are some interesting
ones with zeros and ones.

129
00:06:46,925 --> 00:06:48,860
So if I have the regular expression

130
00:06:48,860 --> 00:06:50,975
that can match the empty string,

131
00:06:50,975 --> 00:06:53,540
this will be actually equivalent to

132
00:06:53,540 --> 00:06:56,435
the regular expression
which can match nothing,

133
00:06:56,435 --> 00:06:59,405
but star of that.

134
00:06:59,405 --> 00:07:01,970
Remember the star
always introduces,

135
00:07:01,970 --> 00:07:04,370
no matter what, the empty string.

136
00:07:04,370 --> 00:07:06,170
So this regular expression can match

137
00:07:06,170 --> 00:07:08,930
the empty string,
just like the 1.

138
00:07:08,930 --> 00:07:12,125
So these two expressions
are not the same,

139
00:07:12,125 --> 00:07:14,210
but they are equivalent.

140
00:07:14,210 --> 00:07:16,700
And it doesn't matter
whether you take

141
00:07:16,700 --> 00:07:20,090
the empty string to  the star,

142
00:07:20,090 --> 00:07:23,855
because if I can match 0 or
more times the empty string,

143
00:07:23,855 --> 00:07:27,450
that will be equivalent to 
just the empty string.

144
00:07:27,520 --> 00:07:32,510
Slightly similar to the
third case is this one.

145
00:07:32,510 --> 00:07:35,720
Zero star is not equivalent to 0

146
00:07:35,720 --> 00:07:40,025
because that can match
exactly the empty string.

147
00:07:40,025 --> 00:07:42,740
This cannot match anything.

148
00:07:42,740 --> 00:07:44,839
While the previous

149
00:07:44,839 --> 00:07:47,540
equivalences are all very
instructive to look at,

150
00:07:47,540 --> 00:07:49,910
these are the
equivalences we need

151
00:07:49,910 --> 00:07:52,685
in our regular expression matchers.

152
00:07:52,685 --> 00:07:54,350
And they are defined for

153
00:07:54,350 --> 00:07:58,160
all regular expressions r.
So r plus 0,

154
00:07:58,160 --> 00:08:00,470
no matter what r looks
like is equivalent

155
00:08:00,470 --> 00:08:02,945
to just r. Similarly

156
00:08:02,945 --> 00:08:05,930
0 plus r is also
equivalent to r.

157
00:08:05,930 --> 00:08:08,525
The order of this
choice doesn't matter;

158
00:08:08,525 --> 00:08:11,495
r followed by 1,

159
00:08:11,495 --> 00:08:14,030
that's the sequence
regular expression, is

160
00:08:14,030 --> 00:08:16,760
equivalent to r. The
sam, r followed by

161
00:08:16,760 --> 00:08:19,010
r is equivalent to r.

162
00:08:19,010 --> 00:08:20,990
And r followed by

163
00:08:20,990 --> 00:08:23,390
the regular expression which
cannot match anything,

164
00:08:23,390 --> 00:08:27,455
is equivalent to just 0.

165
00:08:27,455 --> 00:08:30,740
And 0 followed by r is also equivalent to 0.

166
00:08:30,740 --> 00:08:33,615
And if you have the
choice of r plus r,

167
00:08:33,615 --> 00:08:37,210
that is also
equivalent to just r.

168
00:08:37,210 --> 00:08:40,270
What we're going to do
with these equivalences is

169
00:08:40,270 --> 00:08:42,820
that we regard them as
simplification rules.

170
00:08:42,820 --> 00:08:43,930
So whenever we see

171
00:08:43,930 --> 00:08:46,465
this regular expression
in our algorithm,

172
00:08:46,465 --> 00:08:48,580
we will replace it by
the right-hand side.

173
00:08:48,580 --> 00:08:51,715
So we will orient
these equivalences.

174
00:08:51,715 --> 00:08:53,650
If we see, this regular expression,

175
00:08:53,650 --> 00:08:55,225
we will replace it by that one.

176
00:08:55,225 --> 00:08:58,945
And it will not matter
because the left-hand sides

177
00:08:58,945 --> 00:09:01,255
can match exactly
the same strings

178
00:09:01,255 --> 00:09:03,475
as the right-hand sides.

179
00:09:03,475 --> 00:09:06,370
Here two quick examples.

180
00:09:06,370 --> 00:09:08,680
The first one, let's
assume you have

181
00:09:08,680 --> 00:09:10,270
a regular expression r and

182
00:09:10,270 --> 00:09:11,905
there is something
in front of it.

183
00:09:11,905 --> 00:09:13,720
This "something in front of it"

184
00:09:13,720 --> 00:09:15,870
can be simplified as follows.

185
00:09:15,870 --> 00:09:20,000
This 1 times b
can be simplified to b.

186
00:09:20,000 --> 00:09:23,555
Then this b plus 0 can
be simplified to b.

187
00:09:23,555 --> 00:09:25,490
And assuming that nothing can

188
00:09:25,490 --> 00:09:27,470
be simplified inside this r,

189
00:09:27,470 --> 00:09:28,700
then this would be

190
00:09:28,700 --> 00:09:33,050
the simplified version
of this regular expression.

191
00:09:33,050 --> 00:09:34,820
And because the simplification

192
00:09:34,820 --> 00:09:36,965
rules preserve what can be matched,

193
00:09:36,965 --> 00:09:39,080
we can be sure that
this regular expression

194
00:09:39,080 --> 00:09:41,255
can match exactly the strings

195
00:09:41,255 --> 00:09:43,940
this regular expression can match.

196
00:09:43,940 --> 00:09:45,740
Here is the other example.

197
00:09:45,740 --> 00:09:49,550
This has just a tiny change
that this becomes here as 0.

198
00:09:49,550 --> 00:09:54,485
Then 0 times b can be
replaced by just 0.

199
00:09:54,485 --> 00:09:56,705
Then they are actually two
rules which could apply

200
00:09:56,705 --> 00:09:59,014
either that we have
the same choice

201
00:09:59,014 --> 00:10:01,160
or we can just say something plus

202
00:10:01,160 --> 00:10:04,100
0 will always go to something.

203
00:10:04,100 --> 00:10:06,245
So we can simplify it to that.

204
00:10:06,245 --> 00:10:08,510
And then we have
0 times r again,

205
00:10:08,510 --> 00:10:10,460
and that can be simplified to 0.

206
00:10:10,460 --> 00:10:12,080
So actually what we find out with

207
00:10:12,080 --> 00:10:14,645
this calculation is that
this regular expression,

208
00:10:14,645 --> 00:10:16,835
even if it looks
quite complicated,

209
00:10:16,835 --> 00:10:19,295
actually doesn't
match any string.

210
00:10:19,295 --> 00:10:21,290
Because it matches exactly

211
00:10:21,290 --> 00:10:23,420
those string this regular
expression can match.

212
00:10:23,420 --> 00:10:26,285
And this reg expression
cannot match any.

213
00:10:26,285 --> 00:10:29,930
We need one more
operation on languages.

214
00:10:29,930 --> 00:10:31,700
I call this operation

215
00:10:31,700 --> 00:10:35,165
the semantic derivative
of a language.

216
00:10:35,165 --> 00:10:37,775
This operation takes
two arguments.

217
00:10:37,775 --> 00:10:40,670
It takes a character
which we take

218
00:10:40,670 --> 00:10:43,925
the semantic derivative
according to,

219
00:10:43,925 --> 00:10:45,980
and it takes a language.

220
00:10:45,980 --> 00:10:49,579
And what it does is it
looks into this language

221
00:10:49,579 --> 00:10:51,365
and looks for all the strings

222
00:10:51,365 --> 00:10:53,735
that start with this character,

223
00:10:53,735 --> 00:10:56,405
let's say c, and then

224
00:10:56,405 --> 00:11:00,920
just strips off that c.
So here's an example.

225
00:11:00,920 --> 00:11:02,975
Suppose you have a language A,

226
00:11:02,975 --> 00:11:04,460
with the strings

227
00:11:04,460 --> 00:11:06,965
foo, bar and frak.

228
00:11:06,965 --> 00:11:10,835
And you take the semantic
derivative according to f,

229
00:11:10,835 --> 00:11:14,450
then we discard all the
strings which do not

230
00:11:14,450 --> 00:11:18,320
start with an f. So
bar will be discarded.

231
00:11:18,320 --> 00:11:22,850
Foo and frac start with
an f. So we keep them

232
00:11:22,850 --> 00:11:25,265
but we strip off the first f.

233
00:11:25,265 --> 00:11:29,045
So here we have only oo and rak.

234
00:11:29,045 --> 00:11:31,609
If you take the
semantic derivative

235
00:11:31,609 --> 00:11:33,830
of that language according to b,

236
00:11:33,830 --> 00:11:37,190
then we will discard foo and
frak because they don't

237
00:11:37,190 --> 00:11:40,940
start with b, and just keep bar.

238
00:11:40,940 --> 00:11:44,585
But again, we have to
strip off this first b.

239
00:11:44,585 --> 00:11:49,305
And if you take the semantic
derivative according to a,

240
00:11:49,305 --> 00:11:52,585
then none of these
strings start with a.

241
00:11:52,585 --> 00:11:56,800
So this will be defined
as just the empty set.

242
00:11:56,800 --> 00:11:59,695
You can slightly 
generalized this

243
00:11:59,695 --> 00:12:02,560
semantic derivative
to also strings.

244
00:12:02,560 --> 00:12:05,170
So imagine you have
a language A and you

245
00:12:05,170 --> 00:12:08,274
build the semantic derivative
according to a string s.

246
00:12:08,274 --> 00:12:10,990
Then you look in
this language and

247
00:12:10,990 --> 00:12:13,420
you look for all the
strings that start with

248
00:12:13,420 --> 00:12:19,555
this s. But you strip
off that s. Ok?

249
00:12:19,555 --> 00:12:23,830
So if you have a string
starting with the s,

250
00:12:23,830 --> 00:12:26,065
then you keep that string,

251
00:12:26,065 --> 00:12:27,490
but you keep only the rest...

252
00:12:27,490 --> 00:12:28,810
what's coming after this s.

253
00:12:28,810 --> 00:12:32,010
That is here indicated
with this s'.

254
00:12:32,010 --> 00:12:35,330
I also have this convention,
this personal convention

255
00:12:35,330 --> 00:12:40,055
I have to say, that everything
which works on lists,

256
00:12:40,055 --> 00:12:42,185
remember strings are
lists of characters.

257
00:12:42,185 --> 00:12:46,655
I just put there an 's'. So
here's the one for characters.

258
00:12:46,655 --> 00:12:48,680
I just call it Der with a capital

259
00:12:48,680 --> 00:12:51,740
D. And I call that Ders,

260
00:12:51,740 --> 00:12:54,350
because that works over strings.

261
00:12:54,350 --> 00:12:55,865
And you can see how it would

262
00:12:55,865 --> 00:12:59,750
be defined if you only take this
as a primitive operation.

263
00:12:59,750 --> 00:13:01,340
We would just need to iterate

264
00:13:01,340 --> 00:13:04,050
that essentially for this Ders.

265
00:13:04,060 --> 00:13:07,970
Ok, we now have all
the theory in place.

266
00:13:07,970 --> 00:13:11,345
And we can finally start
implementing our algorithm.

267
00:13:11,345 --> 00:13:14,580
That's when we'll do
in the next video.