videos/02-cw1.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 24 Oct 2025 08:55:14 +0100
changeset 1016 c02d409ed7f4
parent 774 42733adf2a48
permissions -rw-r--r--
added amm faq

1
00:00:06,410 --> 00:00:09,390
The final video for this week.

2
00:00:09,390 --> 00:00:12,465
And let me say something
about the coursework.

3
00:00:12,465 --> 00:00:15,300
First. You can solve

4
00:00:15,300 --> 00:00:17,745
the coursework in any programming
language you're like,

5
00:00:17,745 --> 00:00:22,440
I already said that. You
have to submit a PDF file.

6
00:00:22,440 --> 00:00:23,850
There will be some questions,

7
00:00:23,850 --> 00:00:26,250
so you have to write down the
solution to this questions.

8
00:00:26,250 --> 00:00:30,315
Please use a PDF and you have
to submit your source code.

9
00:00:30,315 --> 00:00:35,580
And yes, if you do use a
language which isn't Scala,

10
00:00:35,580 --> 00:00:39,450
it might help to tell me
how I can run your code.

11
00:00:39,450 --> 00:00:42,550
If I can't run your code,
I will ask you anyway.

12
00:00:42,550 --> 00:00:50,044
Also, please submit both the
PDF and the code in a file,

13
00:00:50,044 --> 00:00:52,730
in a zip file, which generates

14
00:00:52,730 --> 00:00:55,835
a subdirectory with your
name and your family name.

15
00:00:55,835 --> 00:00:58,970
That we'll just save a
lot of time for me to try

16
00:00:58,970 --> 00:01:02,030
to figure out who
has submitted what.

17
00:01:02,030 --> 00:01:04,445
Please do that.

18
00:01:04,445 --> 00:01:07,789
So what is the task
into coursework?

19
00:01:07,789 --> 00:01:09,885
I essentially showed you how

20
00:01:09,885 --> 00:01:11,995
the Brzozowski matcher

21
00:01:11,995 --> 00:01:14,645
works for the basic
regular expressions.

22
00:01:14,645 --> 00:01:16,295
I also showed you actually how it

23
00:01:16,295 --> 00:01:18,110
works for the n-times.

24
00:01:18,110 --> 00:01:20,300
And the task in coursework

25
00:01:20,300 --> 00:01:22,970
is that you extend the
Brzozowski matcher to

26
00:01:22,970 --> 00:01:25,820
the other regular expressions

27
00:01:25,820 --> 00:01:27,800
from the extended
regular expressions.

28
00:01:27,800 --> 00:01:30,245
So you're supposed
to extended with

29
00:01:30,245 --> 00:01:34,490
r-plus, optional r, for n-times.

30
00:01:34,490 --> 00:01:38,420
You've already seen that.
For 0 or more times,

31
00:01:38,420 --> 00:01:41,120
but not more than m
times regular expression.

32
00:01:41,120 --> 00:01:45,890
For at least n-times regular
expression and for between

33
00:01:45,890 --> 00:01:47,480
n times and m times

34
00:01:47,480 --> 00:01:50,810
regular expression and also this
NOT regular expression.

35
00:01:50,810 --> 00:01:52,790
So your task is to

36
00:01:52,790 --> 00:01:55,310
essentially find the definition

37
00:01:55,310 --> 00:01:57,155
for nullable in these cases.

38
00:01:57,155 --> 00:02:00,215
Find a definition for derivative,

39
00:02:00,215 --> 00:02:02,480
implement them,
write them down in

40
00:02:02,480 --> 00:02:06,065
a PDF and then do some
experiments with them.

41
00:02:06,065 --> 00:02:08,875
Well, how can you do that?

42
00:02:08,875 --> 00:02:10,370
Well I've given you

43
00:02:10,370 --> 00:02:13,565
the meaning of these additional
regular expressions.

44
00:02:13,565 --> 00:02:16,730
So here's, for example, the
meaning of this r-plus.

45
00:02:16,730 --> 00:02:18,290
So that would be, I have

46
00:02:18,290 --> 00:02:22,115
at least one copy up to infinity.

47
00:02:22,115 --> 00:02:25,070
And the optional-r would be it's

48
00:02:25,070 --> 00:02:28,370
the language of r plus
the empty string.

49
00:02:28,370 --> 00:02:30,440
If I have it exactly n times,

50
00:02:30,440 --> 00:02:31,985
then it would be
just the language

51
00:02:31,985 --> 00:02:34,010
of r exactly n-times.

52
00:02:34,010 --> 00:02:38,105
And here you have union
from 0 to m and so on.

53
00:02:38,105 --> 00:02:42,560
This might be a slightly
interesting regular expression.

54
00:02:42,560 --> 00:02:46,580
So that's supposed to be the
character set that is very

55
00:02:46,580 --> 00:02:48,410
similar to character ranges like

56
00:02:48,410 --> 00:02:51,005
in the existing regular
expression matchers.

57
00:02:51,005 --> 00:02:52,820
So this just says

58
00:02:52,820 --> 00:02:55,565
...this regular
expression can match

59
00:02:55,565 --> 00:03:00,860
either the character c1 or
the character c2 up to cn.

60
00:03:00,860 --> 00:03:03,620
Why do I include
that regular expression?

61
00:03:03,620 --> 00:03:09,050
I could also just say
c1 plus c2 plus c3...

62
00:03:09,050 --> 00:03:12,889
a big alternative.
Well that's possible.

63
00:03:12,889 --> 00:03:16,595
But remember the Achilles'
heel of this algorithm is,

64
00:03:16,595 --> 00:03:18,425
if the regular expression is large,

65
00:03:18,425 --> 00:03:20,825
then the computation
take always a long time.

66
00:03:20,825 --> 00:03:23,840
So I'm trying to compress
that as much as I can.

67
00:03:23,840 --> 00:03:25,370
So instead of giving a big

68
00:03:25,370 --> 00:03:29,134
alternative, I am trying to
give one regular expression,

69
00:03:29,134 --> 00:03:31,340
which can not just match
a single character,

70
00:03:31,340 --> 00:03:34,230
but a set of characters.

71
00:03:34,630 --> 00:03:36,980
How can you be sure that

72
00:03:36,980 --> 00:03:39,215
definition you come
up with are correct?

73
00:03:39,215 --> 00:03:41,450
Well, I showed you which are

74
00:03:41,450 --> 00:03:45,170
the properties these two
functions need to satisfy.

75
00:03:45,170 --> 00:03:47,060
I've given you here what

76
00:03:47,060 --> 00:03:49,325
the meaning of these
expressions are.

77
00:03:49,325 --> 00:03:52,700
So you will always know what's
on the right-hand side.

78
00:03:52,700 --> 00:03:54,515
This is completely determined.

79
00:03:54,515 --> 00:03:56,720
You essentially
have to fill something

80
00:03:56,720 --> 00:03:58,910
equivalent on the left-hand side.

81
00:03:58,910 --> 00:04:02,105
That's the main task
of the coursework.

82
00:04:02,105 --> 00:04:08,090
And you can start from the
files I provided on KEATS.

83
00:04:08,090 --> 00:04:12,125
So you would just have to
add these additional cases.

84
00:04:12,125 --> 00:04:15,020
When you add these
additional cases

85
00:04:15,020 --> 00:04:17,330
and you're using the Scala language,

86
00:04:17,330 --> 00:04:18,980
you can do me a favour.

87
00:04:18,980 --> 00:04:22,550
You can call these
constructors for

88
00:04:22,550 --> 00:04:25,400
these different regular
expressions as RANGE,

89
00:04:25,400 --> 00:04:28,985
PLUS, OPTIONAL and NTIMES,
UPTO, FROM and BETWEEN.

90
00:04:28,985 --> 00:04:31,370
Remember I have this
convention to always

91
00:04:31,370 --> 00:04:34,025
use capital letters
for regular expressions.

92
00:04:34,025 --> 00:04:36,680
It would be nice if you could use

93
00:04:36,680 --> 00:04:39,260
these names because
then it will be

94
00:04:39,260 --> 00:04:42,335
very easy for me
to test your code.

95
00:04:42,335 --> 00:04:44,690
If you're using a different
programming language

96
00:04:44,690 --> 00:04:46,370
like let's say Rust,

97
00:04:46,370 --> 00:04:48,860
I expect some people will use that, where I

98
00:04:48,860 --> 00:04:51,380
have no idea what are the
conventions in your language,

99
00:04:51,380 --> 00:04:53,420
how you have to call
these constructors,

100
00:04:53,420 --> 00:04:56,420
we will see whatever you
submit. I'll have a look.

101
00:04:56,420 --> 00:04:59,360
There's one more
constraint I have to

102
00:04:59,360 --> 00:05:02,194
impose to make this
coursework interesting.

103
00:05:02,194 --> 00:05:04,730
I do not want you 
that you just take

104
00:05:04,730 --> 00:05:07,370
these extended regular
expressions and that you

105
00:05:07,370 --> 00:05:10,550
expand them using
their definition.

106
00:05:10,550 --> 00:05:12,320
Because that would make the regular

107
00:05:12,320 --> 00:05:13,955
expression matcher very slow.

108
00:05:13,955 --> 00:05:16,160
So for example, if
you want to define

109
00:05:16,160 --> 00:05:18,785
what's the derivative for r-plus,

110
00:05:18,785 --> 00:05:21,560
then what does not
count as a solution:

111
00:05:21,560 --> 00:05:24,770
if you just expand that
to the definition that r

112
00:05:24,770 --> 00:05:28,935
plus is equivalent to
r followed by r-star.

113
00:05:28,935 --> 00:05:32,845
So in code, what I
would like to not see,

114
00:05:32,845 --> 00:05:35,680
I would not give any
marks for that is,

115
00:05:35,680 --> 00:05:37,780
if you add the plus as follows,

116
00:05:37,780 --> 00:05:39,910
so that is still perfectly fine.

117
00:05:39,910 --> 00:05:42,160
There's nothing you
can do differently.

118
00:05:42,160 --> 00:05:44,065
That would be the constructor.

119
00:05:44,065 --> 00:05:46,480
But when you define nullable,

120
00:05:46,480 --> 00:05:49,180
I do not want to see
that you defined

121
00:05:49,180 --> 00:05:51,790
this plus r as nullable

122
00:05:51,790 --> 00:05:54,385
of the sequence of r and star-r,

123
00:05:54,385 --> 00:05:58,075
just unfolding
the definition.

124
00:05:58,075 --> 00:06:00,415
As you can see, when you come

125
00:06:00,415 --> 00:06:02,815
up with a much better
solution for that.

126
00:06:02,815 --> 00:06:05,110
This is a very inefficient way

127
00:06:05,110 --> 00:06:07,090
for how to calculate nullable.

128
00:06:07,090 --> 00:06:10,825
And so the same for derivative
would not be allowed.

129
00:06:10,825 --> 00:06:13,895
If you, I have the plus r here,

130
00:06:13,895 --> 00:06:16,685
you can't just unfold
the definition,

131
00:06:16,685 --> 00:06:19,790
with r-plus
being defined as r

132
00:06:19,790 --> 00:06:21,350
followed by r star and

133
00:06:21,350 --> 00:06:23,375
then just build the
derivative of that.

134
00:06:23,375 --> 00:06:25,280
You have to find something more

135
00:06:25,280 --> 00:06:28,730
primitive that involves
only the derivative of r,

136
00:06:28,730 --> 00:06:31,790
essentially, nothing
more involved. The same

137
00:06:31,790 --> 00:06:33,815
if you have n-times, for example,

138
00:06:33,815 --> 00:06:39,935
you can't just unfold this
n-times in this sequence

139
00:06:39,935 --> 00:06:43,310
of .... n-copies

140
00:06:43,310 --> 00:06:47,285
of this r. You have to get
something more primitive.

141
00:06:47,285 --> 00:06:49,760
Otherwise, as you've
seen earlier,

142
00:06:49,760 --> 00:06:53,135
your regular expression matcher
would be totally slow.

143
00:06:53,135 --> 00:06:55,475
When you submit your solution,

144
00:06:55,475 --> 00:06:58,445
please submit this
solution in the PDF.

145
00:06:58,445 --> 00:07:01,580
So give the cases of your definition

146
00:07:01,580 --> 00:07:06,004
in a form like that for
nullable and derivative.

147
00:07:06,004 --> 00:07:08,405
And also implement it in code.

148
00:07:08,405 --> 00:07:10,910
That is just helping me to

149
00:07:10,910 --> 00:07:13,400
find out whether you have
the correct solution or not.

150
00:07:13,400 --> 00:07:15,710
So you needed a kind of
mathematical notation of

151
00:07:15,710 --> 00:07:18,695
your solution, and
an actual code.

152
00:07:18,695 --> 00:07:22,414
And then once you
have your definition,

153
00:07:22,414 --> 00:07:25,699
also make sure you try
it out on some examples.

154
00:07:25,699 --> 00:07:28,970
These will be the examples
I will definitely try out,

155
00:07:28,970 --> 00:07:30,560
but probably also more

156
00:07:30,560 --> 00:07:33,215
depending on what
definitions you give me.

157
00:07:33,215 --> 00:07:38,300
There's one more question I
ask you to do. You've seen

158
00:07:38,300 --> 00:07:40,130
there's some regular
expressions that

159
00:07:40,130 --> 00:07:42,215
are involved for characters,

160
00:07:42,215 --> 00:07:46,925
for character ranges or
character sets essentially.

161
00:07:46,925 --> 00:07:48,665
And sometimes I also want to have

162
00:07:48,665 --> 00:07:51,665
just a reg expression which
can match any character!!

163
00:07:51,665 --> 00:07:56,195
And I could have for all of
them separate definitions.

164
00:07:56,195 --> 00:07:57,710
And that would mean I also need

165
00:07:57,710 --> 00:07:59,645
separate definitions for nullable,

166
00:07:59,645 --> 00:08:02,405
separate definitions
for derivative.

167
00:08:02,405 --> 00:08:05,825
But actually they can be
generalized to just one constructor.

168
00:08:05,825 --> 00:08:08,210
And that is if we introduce

169
00:08:08,210 --> 00:08:11,630
a constructor for regular expressions,

170
00:08:11,630 --> 00:08:13,760
which not just takes
a single character

171
00:08:13,760 --> 00:08:15,469
or set of characters.

172
00:08:15,469 --> 00:08:18,245
But, which I call here CFUN.

173
00:08:18,245 --> 00:08:23,165
And it takes a function from
characters to booleans.

174
00:08:23,165 --> 00:08:24,470
So if I want to match

175
00:08:24,470 --> 00:08:26,900
just a single character
then this function f

176
00:08:26,900 --> 00:08:29,060
would only say true

177
00:08:29,060 --> 00:08:32,225
if it gets as argument
the single character.

178
00:08:32,225 --> 00:08:34,850
If I have a character set,

179
00:08:34,850 --> 00:08:36,515
then I would say, well,

180
00:08:36,515 --> 00:08:38,300
I need a function
which says true

181
00:08:38,300 --> 00:08:40,970
for all the characters
in this set.

182
00:08:40,970 --> 00:08:43,460
And otherwise it's false.

183
00:08:43,460 --> 00:08:46,205
And if I want to
match any character!!,

184
00:08:46,205 --> 00:08:48,500
then they should get a function

185
00:08:48,500 --> 00:08:53,450
which says true for
all characters.

186
00:08:53,450 --> 00:08:56,630
Okay? If I have
such a constructor

187
00:08:56,630 --> 00:09:00,140
that actually I can save
myself a bit of work.

188
00:09:00,140 --> 00:09:03,200
And I can just have one case

189
00:09:03,200 --> 00:09:06,920
for nullable and one
case for CFUNS.

190
00:09:06,920 --> 00:09:09,800
So that would then subsume
the character ranges and

191
00:09:09,800 --> 00:09:13,385
the character and also
this ALL regular expression.

192
00:09:13,385 --> 00:09:15,380
Ok, this was not explicitly
included at the beginning,

193
00:09:15,380 --> 00:09:17,510
but that's what I can now define.

194
00:09:17,510 --> 00:09:21,740
And then I can define
this for characters,

195
00:09:21,740 --> 00:09:23,885
just as special cases
for these functions.

196
00:09:23,885 --> 00:09:25,610
And I told you already
what this function

197
00:09:25,610 --> 00:09:28,025
should look like in
these three cases.

198
00:09:28,025 --> 00:09:30,350
So I let ...

199
00:09:30,350 --> 00:09:33,515
you decide how you're
going to implement that.

200
00:09:33,515 --> 00:09:35,450
If you first define

201
00:09:35,450 --> 00:09:38,495
your implementation is
all these explicit cases.

202
00:09:38,495 --> 00:09:41,900
Because in these cases it's
actually more intuitive,

203
00:09:41,900 --> 00:09:45,980
what nullable and
derivative should be.

204
00:09:45,980 --> 00:09:48,035
And then in a second step,

205
00:09:48,035 --> 00:09:51,140
you implement these
more general cases.

206
00:09:51,140 --> 00:09:53,360
And just keep the original ones

207
00:09:53,360 --> 00:09:54,500
in your implementation if you

208
00:09:54,500 --> 00:09:57,710
want, or if you feel
adventurous enough,

209
00:09:57,710 --> 00:09:59,225
and want to be lazy,

210
00:09:59,225 --> 00:10:01,115
that you just implement that.

211
00:10:01,115 --> 00:10:02,660
And then you have already done

212
00:10:02,660 --> 00:10:06,530
at least two constructors
by just implementing one.

213
00:10:06,530 --> 00:10:08,915
But as said that 
requires a bit skill

214
00:10:08,915 --> 00:10:11,450
of generalizing how
that should be.

215
00:10:11,450 --> 00:10:14,180
And the other questions
are just then

216
00:10:14,180 --> 00:10:16,745
trying out your
expression matcher on

217
00:10:16,745 --> 00:10:19,580
some examples, more
power examples,

218
00:10:19,580 --> 00:10:22,400
and they should be
very easy to solve.

219
00:10:22,400 --> 00:10:24,605
So I hope it's not
too much asked for

220
00:10:24,605 --> 00:10:26,615
and I hope you have fun.

221
00:10:26,615 --> 00:10:29,675
It is not too much ask
because you can,

222
00:10:29,675 --> 00:10:32,420
I hope it's not too much
because you can start from

223
00:10:32,420 --> 00:10:35,615
my definitions or
my implementation.

224
00:10:35,615 --> 00:10:39,425
It's really essentially
mostly is brain work,

225
00:10:39,425 --> 00:10:42,170
how these nullable and
derivative should work.

226
00:10:42,170 --> 00:10:44,510
If you're in a
language like Scala,

227
00:10:44,510 --> 00:10:48,875
the implementation should
then be easy-peasy.

228
00:10:48,875 --> 00:10:51,365
If you are in a different language

229
00:10:51,365 --> 00:10:52,610
I assume you also

230
00:10:52,610 --> 00:10:54,890
dedicated to that
language to start with,

231
00:10:54,890 --> 00:10:58,475
so it should be also very
easy for you to translate

232
00:10:58,475 --> 00:11:01,100
my Scala code into whatever
language you want to

233
00:11:01,100 --> 00:11:04,280
do, first and then
start from there.

234
00:11:04,280 --> 00:11:06,230
If there are any questions,

235
00:11:06,230 --> 00:11:08,360
please ask me or the TAs.

236
00:11:08,360 --> 00:11:10,040
We are trying to be as helpful

237
00:11:10,040 --> 00:11:12,900
as possible with the coursework.

238
00:11:13,240 --> 00:11:17,885
Bear in mind, this is the
first step in our compiler.

239
00:11:17,885 --> 00:11:21,005
The coursework 2 will then
build on top of that.

240
00:11:21,005 --> 00:11:25,770
So it's better to get this
correct first. Thanks.