videos/01-basics1.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Sat, 04 Sep 2021 14:08:09 +0100
changeset 834 d3e38dd3b449
parent 763 4e628958c01a
permissions -rw-r--r--
cwupdates

1
00:00:06,710 --> 00:00:09,225
Thanks for tuning in again.

2
00:00:09,225 --> 00:00:11,640
In this video, we want to specify

3
00:00:11,640 --> 00:00:14,370
what problem our regular
expression matcher

4
00:00:14,370 --> 00:00:16,155
is actually supposed to solve.

5
00:00:16,155 --> 00:00:18,900
The reason is that
we know that some of

6
00:00:18,900 --> 00:00:21,585
the existing regular
expression matching engines

7
00:00:21,585 --> 00:00:25,200
are not just abysmally
slow in some examples,

8
00:00:25,200 --> 00:00:27,105
as you've seen in the
previous video,

9
00:00:27,105 --> 00:00:30,570
but also produce sometimes
incorrect results.

10
00:00:30,570 --> 00:00:33,330
In order to avoid
this with our matcher,

11
00:00:33,330 --> 00:00:35,325
we need to somehow explain

12
00:00:35,325 --> 00:00:39,255
precisely what is the problem
our algorithm solves.

13
00:00:39,255 --> 00:00:41,935
This will require
a bit of theory, but

14
00:00:41,935 --> 00:00:45,335
I hope it is nevertheless
a bit of fun.

15
00:00:45,335 --> 00:00:47,915
First, we have to specify

16
00:00:47,915 --> 00:00:50,585
what we mean by a
regular expression.

17
00:00:50,585 --> 00:00:53,210
You've seen earlier some
examples. They were

18
00:00:53,210 --> 00:00:56,060
actually taken or
inspired by what

19
00:00:56,060 --> 00:00:58,850
is available in standard
regular expression matching

20
00:00:58,850 --> 00:01:02,330
engines, like star, plus and n-times.

21
00:01:02,330 --> 00:01:05,690
But for many tasks,
for our algorithm,

22
00:01:05,690 --> 00:01:10,174
we will focus only what I call
basic regular expressions.

23
00:01:10,174 --> 00:01:11,840
Since I'm lazy, I will call

24
00:01:11,840 --> 00:01:13,550
these basic regular expressions

25
00:01:13,550 --> 00:01:15,485
just as regular expressions.

26
00:01:15,485 --> 00:01:17,405
And the ones you've seen earlier

27
00:01:17,405 --> 00:01:19,400
as extended regular expressions.

28
00:01:19,400 --> 00:01:22,940
So the basic regulare expressions,
or just regular expressions,

29
00:01:22,940 --> 00:01:25,280
they will have characters.

30
00:01:25,280 --> 00:01:27,170
So you can match any character,

31
00:01:27,170 --> 00:01:31,370
a,b,c to z or 0 to 9.

32
00:01:31,370 --> 00:01:35,525
Any Ascii character. 'c' here
is just a representative.

33
00:01:35,525 --> 00:01:38,825
So we can match
single characters.

34
00:01:38,825 --> 00:01:42,440
Then we can match alternatives.

35
00:01:42,440 --> 00:01:44,930
That means a string
is either matched

36
00:01:44,930 --> 00:01:46,730
by the regular expression r1

37
00:01:46,730 --> 00:01:49,324
or by the regular expression r2.

38
00:01:49,324 --> 00:01:52,790
And for the
alternative we write +.

39
00:01:52,790 --> 00:01:55,175
Then we also have sequence.

40
00:01:55,175 --> 00:01:57,410
This sequence regular
expression essentially

41
00:01:57,410 --> 00:01:59,915
says that a string needs to be matched

42
00:01:59,915 --> 00:02:02,210
the first part by
the regular expression r1

43
00:02:02,210 --> 00:02:06,275
and then the second
part by the r2.

44
00:02:06,275 --> 00:02:10,190
And then we have also the
star regular expression,

45
00:02:10,190 --> 00:02:12,980
which says the regular
expression needs to match

46
00:02:12,980 --> 00:02:16,520
the string with zero
or more copies.

47
00:02:16,520 --> 00:02:18,140
And then we also have some

48
00:02:18,140 --> 00:02:20,060
slightly strange
regular expressions.

49
00:02:20,060 --> 00:02:22,505
We have the regular expression 1,

50
00:02:22,505 --> 00:02:25,910
which can only match
the empty string.

51
00:02:25,910 --> 00:02:29,075
I'm using here the
notation 1 for that

52
00:02:29,075 --> 00:02:31,340
and in my writing I will always

53
00:02:31,340 --> 00:02:33,440
make sure that for the
regular expression

54
00:02:33,440 --> 00:02:35,765
I will write the
1 in a bold font.

55
00:02:35,765 --> 00:02:38,510
So whenever you see
a 1 in bold font,

56
00:02:38,510 --> 00:02:40,395
this is not the 1, but

57
00:02:40,395 --> 00:02:44,300
the regular expression which
can match the empty string.

58
00:02:44,300 --> 00:02:48,050
And we also have the
regular expression 0,

59
00:02:48,050 --> 00:02:50,315
which cannot match
anything at all.

60
00:02:50,315 --> 00:02:51,695
You might think, well,

61
00:02:51,695 --> 00:02:54,635
that's not much use if it cannot
match anything at all,

62
00:02:54,635 --> 00:02:58,130
but you will see why that
one is important later on.

63
00:02:58,130 --> 00:03:00,785
So our basic regular expressions,

64
00:03:00,785 --> 00:03:02,375
they will be 0,

65
00:03:02,375 --> 00:03:08,390
1, characters, alternatives,
sequences and stars.

66
00:03:08,390 --> 00:03:12,170
And these are all the
basic regular expressions.

67
00:03:12,170 --> 00:03:16,280
If this definition is a
bit too abstract for you,

68
00:03:16,280 --> 00:03:18,560
we can also look at
the concrete code,

69
00:03:18,560 --> 00:03:23,060
how that would pan out when
actually writing some Scala.

70
00:03:23,060 --> 00:03:28,040
I promised you, I show
you always my code in Scala.

71
00:03:28,040 --> 00:03:29,480
So here you would have

72
00:03:29,480 --> 00:03:32,885
first an abstract class
for regular expressions.

73
00:03:32,885 --> 00:03:37,580
Then you have one regular
expression for 0, 

74
00:03:37,580 --> 00:03:41,540
one regular expression for 1, 

75
00:03:41,540 --> 00:03:42,875
one regular expression, which
takes an argument,

76
00:03:42,875 --> 00:03:45,050
the character you want to match,

77
00:03:45,050 --> 00:03:47,915
the characters a,b, c and so on.

78
00:03:47,915 --> 00:03:50,945
Then we have an alternative
regular expression,

79
00:03:50,945 --> 00:03:53,480
which takes the first
alternative and

80
00:03:53,480 --> 00:03:56,435
the second alternative
as arguments.

81
00:03:56,435 --> 00:03:59,690
And we have a sequence
regular expression. Again,

82
00:03:59,690 --> 00:04:01,850
which takes the
first component and

83
00:04:01,850 --> 00:04:04,730
the second component
as two arguments.

84
00:04:04,730 --> 00:04:07,249
And we have the star
regular expression,

85
00:04:07,249 --> 00:04:10,880
which just take one regular
expression as argument.

86
00:04:10,880 --> 00:04:16,115
And all these reg expressions
extend our abstract class.

87
00:04:16,115 --> 00:04:20,300
For whatever I do in
this module here I have

88
00:04:20,300 --> 00:04:23,300
the convention that all
the regular expressions

89
00:04:23,300 --> 00:04:25,550
are written with capital letters.

90
00:04:25,550 --> 00:04:26,885
As you can see that here,

91
00:04:26,885 --> 00:04:31,685
O, 1,  character, these will be
always regular expressions.

92
00:04:31,685 --> 00:04:34,370
They have all capital letters.

93
00:04:34,370 --> 00:04:36,484
Let's for a moment,

94
00:04:36,484 --> 00:04:38,720
play around with this definition.

95
00:04:38,720 --> 00:04:41,945
I'm using here the
Ammonite REPL.

96
00:04:41,945 --> 00:04:46,950
And I can evaluate
this definition.

97
00:04:53,430 --> 00:04:55,810
And now I can start to

98
00:04:55,810 --> 00:04:58,570
define particular
regular expressions.

99
00:04:58,570 --> 00:05:00,340
For example, if I need
a regular expression

100
00:05:00,340 --> 00:05:02,860
which can recognise
the character a,

101
00:05:02,860 --> 00:05:06,025
then I would write
something like this.

102
00:05:06,025 --> 00:05:08,710
So this regular expression
takes an argument,

103
00:05:08,710 --> 00:05:13,615
the character 'a'  to specify
which character to match.

104
00:05:13,615 --> 00:05:16,945
We do this obviously also with 'b'.

105
00:05:16,945 --> 00:05:19,405
And I can do that with

106
00:05:19,405 --> 00:05:22,975
'c'. So now we have three
regular expressions.

107
00:05:22,975 --> 00:05:25,570
If you look very carefully
at this definition,

108
00:05:25,570 --> 00:05:27,070
you can actually see

109
00:05:27,070 --> 00:05:29,940
these regular
expressions are trees.

110
00:05:29,940 --> 00:05:33,365
So no matter what we
write down on paper,

111
00:05:33,365 --> 00:05:36,755
they are behind the
scenes always trees.

112
00:05:36,755 --> 00:05:40,010
And you can see that
actually in this definition.

113
00:05:40,010 --> 00:05:44,330
If you define two regular
expressions r1 and r2.

114
00:05:44,330 --> 00:05:49,310
They are essentially
the alternative of a, b and c.

115
00:05:49,310 --> 00:05:52,760
Then this regular expression

116
00:05:52,760 --> 00:05:54,710
can match either the character

117
00:05:54,710 --> 00:05:57,980
a or the character b
or the character c.

118
00:05:57,980 --> 00:06:01,640
And the same for the
regular expression r2.

119
00:06:01,640 --> 00:06:03,875
So let me just evaluate that.

120
00:06:03,875 --> 00:06:05,690
And even though these are

121
00:06:05,690 --> 00:06:07,175
two regular expressions
which can match

122
00:06:07,175 --> 00:06:11,750
exactly the same things,
they a different trees.

123
00:06:11,750 --> 00:06:14,195
So if I ask Scala,

124
00:06:14,195 --> 00:06:16,460
are these trees different?

125
00:06:16,460 --> 00:06:19,250
Or ask if they're

126
00:06:19,250 --> 00:06:21,865
the same, then Scala will say No,

127
00:06:21,865 --> 00:06:25,440
they actually different trees.

128
00:06:25,450 --> 00:06:28,459
Let's come back to
this definition.

129
00:06:28,459 --> 00:06:31,760
If we want to write down
regular expressions on paper,

130
00:06:31,760 --> 00:06:33,620
then we want to be sloppy as

131
00:06:33,620 --> 00:06:35,750
mathematicians rather than as

132
00:06:35,750 --> 00:06:37,745
precise as computer scientists.

133
00:06:37,745 --> 00:06:40,490
So when we want to write down
a regular expression which can

134
00:06:40,490 --> 00:06:43,955
either match the character
a or the character b,

135
00:06:43,955 --> 00:06:49,130
then we would write down
something like this, a plus b.

136
00:06:49,130 --> 00:06:51,170
And if you want to have
the regular expression

137
00:06:51,170 --> 00:06:52,625
which can either match

138
00:06:52,625 --> 00:06:55,925
the character a or b or c,

139
00:06:55,925 --> 00:06:58,340
we will write
something like this.

140
00:06:58,340 --> 00:07:01,370
But of course behind the
scenes, these are trees.

141
00:07:01,370 --> 00:07:04,460
So we should have written
them with parentheses.

142
00:07:04,460 --> 00:07:06,440
And you can see
actually, there are two

143
00:07:06,440 --> 00:07:08,990
regular expressions I
could have written down.

144
00:07:08,990 --> 00:07:11,270
They're different.

145
00:07:11,270 --> 00:07:12,710
Just by convention,

146
00:07:12,710 --> 00:07:15,575
we on't write these parentheses.

147
00:07:15,575 --> 00:07:18,740
And that is similar with sequences.

148
00:07:18,740 --> 00:07:20,000
If I want to write down

149
00:07:20,000 --> 00:07:22,955
the regular expression which
can match first an 'a',

150
00:07:22,955 --> 00:07:25,010
then a 'b', and then a 'c',

151
00:07:25,010 --> 00:07:28,160
then I would write down
something like this.

152
00:07:28,160 --> 00:07:32,120
Just, there are again

153
00:07:32,120 --> 00:07:35,735
two regular expressions I
could have written down.

154
00:07:35,735 --> 00:07:38,480
Again by convention we don't

155
00:07:38,480 --> 00:07:40,670
write these parentheses though.

156
00:07:40,670 --> 00:07:42,350
However, sometimes we have to be

157
00:07:42,350 --> 00:07:43,940
very careful with parentheses,

158
00:07:43,940 --> 00:07:47,195
especially with star. 

159
00:07:47,195 --> 00:07:50,525
Because this regular expression
is definitely not

160
00:07:50,525 --> 00:07:54,900
the same as this regular expression.

161
00:07:56,100 --> 00:07:59,410
The first one here can match

162
00:07:59,410 --> 00:08:03,610
any strings containing a or b's.

163
00:08:03,610 --> 00:08:05,860
While this regular expression can

164
00:08:05,860 --> 00:08:07,945
only match the single character

165
00:08:07,945 --> 00:08:13,300
a or any string
containing only b's.

166
00:08:13,300 --> 00:08:15,265
So to make the difference clear,

167
00:08:15,265 --> 00:08:20,065
in this example, we would have
to use the parentheses.

168
00:08:20,065 --> 00:08:23,140
There's one more issue
with this definition.

169
00:08:23,140 --> 00:08:26,635
Why do we focus on these
basic regular expressions?

170
00:08:26,635 --> 00:08:28,660
Why don't we also include

171
00:08:28,660 --> 00:08:31,285
the ones from the
extended regular expressions.

172
00:08:31,285 --> 00:08:33,055
The answers very easy.

173
00:08:33,055 --> 00:08:35,680
These basic regular
expressions can be used

174
00:08:35,680 --> 00:08:38,370
to represent also
the extended ones.

175
00:08:38,370 --> 00:08:40,220
Let me give you some examples.

176
00:08:40,220 --> 00:08:44,225
If I have a regular
expression r+, for example,

177
00:08:44,225 --> 00:08:46,280
then the meaning
was I have to use

178
00:08:46,280 --> 00:08:49,115
at least one or more copies

179
00:08:49,115 --> 00:08:51,200
of this r to
match a string. 

180
00:08:51,200 --> 00:08:53,810
Well, one or more copies
can be represented by

181
00:08:53,810 --> 00:08:58,385
the basic ones as just
r followed by r*.

182
00:08:58,385 --> 00:09:01,760
Meaning I have to use one
copy of r, followed by

183
00:09:01,760 --> 00:09:05,150
0 or more copies of r.

184
00:09:05,150 --> 00:09:07,895
Similarly, if I have the optional
regular expression,

185
00:09:07,895 --> 00:09:10,715
which is supposed to
match a string

186
00:09:10,715 --> 00:09:13,865
by using r, or match
the empty string.

187
00:09:13,865 --> 00:09:19,295
Then this can be obviously
defined as r + 1.

188
00:09:19,295 --> 00:09:23,945
So here is the bold
regular expression 1,

189
00:09:23,945 --> 00:09:26,180
which means it either can

190
00:09:26,180 --> 00:09:28,205
recognize whatever
r can recognize,

191
00:09:28,205 --> 00:09:30,470
or it can recognize
the empty string.

192
00:09:30,470 --> 00:09:35,150
And if I have ranges, like a
to z,  then I can define

193
00:09:35,150 --> 00:09:41,135
that as a + b + c + ...
and so on until z.

194
00:09:41,135 --> 00:09:45,920
Maybe this definition is not
good in terms of runtime,

195
00:09:45,920 --> 00:09:47,960
but in terms of just being able

196
00:09:47,960 --> 00:09:50,780
to recognize strings
or match strings,

197
00:09:50,780 --> 00:09:54,680
the basic regular expressions
will be just sufficient.

198
00:09:54,680 --> 00:09:56,690
Unfortunately, we
also need to have

199
00:09:56,690 --> 00:09:58,850
a quick chat about strings.

200
00:09:58,850 --> 00:10:02,255
In Scala, it's crystal
clear what a string is.

201
00:10:02,255 --> 00:10:05,480
There's a separate datatype
which is called string.

202
00:10:05,480 --> 00:10:07,895
So here, for example,
is a string.

203
00:10:07,895 --> 00:10:09,200
And as you can see,

204
00:10:09,200 --> 00:10:11,105
it is of the type string.

205
00:10:11,105 --> 00:10:13,985
And the empty string
will be just that.

206
00:10:13,985 --> 00:10:16,160
However, when we write things down on

207
00:10:16,160 --> 00:10:18,320
paper and think
about our algorithm,

208
00:10:18,320 --> 00:10:22,790
we want to think of strings
as lists of characters.

209
00:10:22,790 --> 00:10:26,070
So more something like this.

210
00:10:27,070 --> 00:10:31,745
You can see here, this is actually
a list of characters.

211
00:10:31,745 --> 00:10:35,150
And the two operations
we need are taking

212
00:10:35,150 --> 00:10:37,280
the head of this list and

213
00:10:37,280 --> 00:10:39,770
the rest of the list
or tail of the list.

214
00:10:39,770 --> 00:10:41,720
That's why we want
to regard them as

215
00:10:41,720 --> 00:10:45,260
lists rather than strings.

216
00:10:45,260 --> 00:10:48,200
So if I'm using a
string like this,

217
00:10:48,200 --> 00:10:51,935
then on paper I always will
write something like that.

218
00:10:51,935 --> 00:10:54,575
Or since I'm lazy, just that.

219
00:10:54,575 --> 00:10:56,675
And for the empty string,

220
00:10:56,675 --> 00:10:59,210
I will write either
the empty list, with

221
00:10:59,210 --> 00:11:03,920
two brackets or,
being lazy, just that.

222
00:11:03,920 --> 00:11:06,620
Actually there is one
more operation we need on

223
00:11:06,620 --> 00:11:09,410
strings and that
is concatenation.

224
00:11:09,410 --> 00:11:11,255
If you have a string s1,

225
00:11:11,255 --> 00:11:14,510
string s2, and put an
at symbol in between,

226
00:11:14,510 --> 00:11:18,050
that means we want to
concatenate both strings.

227
00:11:18,050 --> 00:11:22,625
So foo concatenated with
bar, would be foobar.

228
00:11:22,625 --> 00:11:25,085
And any string concatenated with

229
00:11:25,085 --> 00:11:27,950
the empty string
is left untouched.

230
00:11:27,950 --> 00:11:31,310
So baz concatenated with

231
00:11:31,310 --> 00:11:33,545
the empty string, is just baz.

232
00:11:33,545 --> 00:11:37,295
So that's like if we have
strings as lists of characters,

233
00:11:37,295 --> 00:11:39,755
that will be just list append.

234
00:11:39,755 --> 00:11:41,480
In the next video,

235
00:11:41,480 --> 00:11:43,160
we will use these definitions

236
00:11:43,160 --> 00:11:45,050
and introduce the notion of what

237
00:11:45,050 --> 00:11:46,850
a language is and

238
00:11:46,850 --> 00:11:49,920
what the meaning of a
regular expression is.