videos/01-evilregexes.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Sun, 01 Oct 2023 10:57:32 +0100
changeset 932 5678414a3898
parent 837 499405058cfd
permissions -rw-r--r--
updated

1
00:00:06,240 --> 00:00:11,050
Welcome back. This video
is about regular expressions.

2
00:00:11,050 --> 00:00:14,230
We want to use regular
expressions in our lexer.

3
00:00:14,230 --> 00:00:16,165
And the purpose of the
lexer is to find

4
00:00:16,165 --> 00:00:18,070
out where the words in

5
00:00:18,070 --> 00:00:21,070
our programs are. However
regular expressions

6
00:00:21,070 --> 00:00:23,875
are fundamental tool
in computer science.

7
00:00:23,875 --> 00:00:27,910
And I'm sure you've used them
already on several occasions.

8
00:00:27,910 --> 00:00:30,370
And one would expect that about

9
00:00:30,370 --> 00:00:31,750
regular expressions since they are

10
00:00:31,750 --> 00:00:33,850
so well-known and well studied,

11
00:00:33,850 --> 00:00:37,915
that everything under the
sun is known about them.

12
00:00:37,915 --> 00:00:41,080
But actually there's
still some surprising

13
00:00:41,080 --> 00:00:44,465
and interesting
problems with them.

14
00:00:44,465 --> 00:00:47,945
And I want to show you
them in this video.

15
00:00:47,945 --> 00:00:50,720
I'm sure you've seen
regular expressions

16
00:00:50,720 --> 00:00:52,445
many, many times before.

17
00:00:52,445 --> 00:00:55,100
But just to be on the same page,

18
00:00:55,100 --> 00:00:57,110
let me just recap them.

19
00:00:57,110 --> 00:00:59,210
So here in this line,

20
00:00:59,210 --> 00:01:01,790
there is a regular expression
which is supposed to

21
00:01:01,790 --> 00:01:05,285
recognize some form
of email addresses.

22
00:01:05,285 --> 00:01:07,745
So an e-mail address

23
00:01:07,745 --> 00:01:11,000
has part which is
before the @ symbol,

24
00:01:11,000 --> 00:01:13,400
which is the name of the person.

25
00:01:13,400 --> 00:01:16,880
And that can be
any number between

26
00:01:16,880 --> 00:01:20,195
0 and 9, and letters between a and z.

27
00:01:20,195 --> 00:01:24,155
Let's say we avoiding
here capital letters.

28
00:01:24,155 --> 00:01:26,045
There can be underscores.

29
00:01:26,045 --> 00:01:29,405
There can be a dot and
there can be hyphens.

30
00:01:29,405 --> 00:01:35,390
And after the @ symbol
comes the domain name.

31
00:01:35,390 --> 00:01:37,310
So as you can see here,

32
00:01:37,310 --> 00:01:40,640
we use things like star to

33
00:01:40,640 --> 00:01:44,314
match letters
zero or more times.

34
00:01:44,314 --> 00:01:45,985
Or we have a plus,

35
00:01:45,985 --> 00:01:47,420
which means you have to match

36
00:01:47,420 --> 00:01:52,489
at least once or more
times. Then we have.

37
00:01:52,489 --> 00:01:55,790
question mark, which says you

38
00:01:55,790 --> 00:01:59,105
match either it is there
or it is not there.

39
00:01:59,105 --> 00:02:01,340
You are also regular
expressions which

40
00:02:01,340 --> 00:02:03,755
match exactly n-times.

41
00:02:03,755 --> 00:02:08,720
Or this is a regular expression
for between n and m times.

42
00:02:08,720 --> 00:02:12,065
You can see in
this email address,

43
00:02:12,065 --> 00:02:13,730
the top-level domain

44
00:02:13,730 --> 00:02:16,130
name can be any letter 

45
00:02:16,130 --> 00:02:19,265
between a to z,
and contain dots,

46
00:02:19,265 --> 00:02:22,340
but can only be two
characters long

47
00:02:22,340 --> 00:02:25,685
up till six characters
and not more.

48
00:02:25,685 --> 00:02:29,240
Then you also have
something like ranges.

49
00:02:29,240 --> 00:02:31,220
So you can see, letters between a

50
00:02:31,220 --> 00:02:33,635
and z and 0 to 9 and so on.

51
00:02:33,635 --> 00:02:36,545
Here you also have regular
expressions which can

52
00:02:36,545 --> 00:02:40,070
match something which
isn't in this range.

53
00:02:40,070 --> 00:02:42,560
So for example, if
you want for example match,

54
00:02:42,560 --> 00:02:44,030
letters but not numbers,

55
00:02:44,030 --> 00:02:45,800
you would say, well, if

56
00:02:45,800 --> 00:02:48,990
this is a number that
should not match.

57
00:02:49,090 --> 00:02:52,804
Typically you also
have these ranges.

58
00:02:52,804 --> 00:02:55,565
Lowercase letters,
capital letters.

59
00:02:55,565 --> 00:02:58,550
Then you have some
special regular expressions

60
00:02:58,550 --> 00:03:02,195
like this one is only
supposed to match digits.

61
00:03:02,195 --> 00:03:05,674
A dot is supposed to
match any character.

62
00:03:05,674 --> 00:03:07,370
And then they have also something

63
00:03:07,370 --> 00:03:09,800
called groups which
is supposed to be

64
00:03:09,800 --> 00:03:12,799
used when you are
trying to extract

65
00:03:12,799 --> 00:03:15,605
a string you've matched.

66
00:03:15,605 --> 00:03:19,925
Okay, so these are the
typical regular expressions.

67
00:03:19,925 --> 00:03:23,075
And here's a particular one

68
00:03:23,075 --> 00:03:25,820
trying to match something

69
00:03:25,820 --> 00:03:28,770
which resembles
an email address.

70
00:03:29,590 --> 00:03:33,065
Clearly that should be all easy.

71
00:03:33,065 --> 00:03:36,230
And our technology should
be on top of that.

72
00:03:36,230 --> 00:03:37,865
That we can take a

73
00:03:37,865 --> 00:03:41,015
regular expressions and
we can take a string,

74
00:03:41,015 --> 00:03:43,340
and we should have programs to

75
00:03:43,340 --> 00:03:45,680
decide whether this
string is matched

76
00:03:45,680 --> 00:03:50,330
by a regular expression or
not and should be easy-peasy, no?

77
00:03:50,330 --> 00:03:56,150
Well, let's have a
look at two examples.

78
00:03:56,150 --> 00:04:00,860
The first regular expression
is a star star b.

79
00:04:00,860 --> 00:04:02,990
And it is supposed
to match strings of

80
00:04:02,990 --> 00:04:05,825
the form 0 or more a's,

81
00:04:05,825 --> 00:04:10,385
followed by a b. The parentheses
you can ignore.

82
00:04:10,385 --> 00:04:11,990
And a star star

83
00:04:11,990 --> 00:04:14,120
also doesn't
make any difference

84
00:04:14,120 --> 00:04:16,505
to what kind of strings
that can be matched.

85
00:04:16,505 --> 00:04:21,635
It can only make 0 more
a's followed by a b.

86
00:04:21,635 --> 00:04:23,900
And the other regular expression

87
00:04:23,900 --> 00:04:26,990
is possibly a character a,

88
00:04:26,990 --> 00:04:32,930
n times, followed by character
a axactly n-times.

89
00:04:32,930 --> 00:04:35,570
And we will try out

90
00:04:35,570 --> 00:04:38,360
these two regular expressions
with strings of the form a,

91
00:04:38,360 --> 00:04:39,890
aa, and so on,

92
00:04:39,890 --> 00:04:45,770
and up to the length of n. And

93
00:04:45,770 --> 00:04:49,130
this regular expression should
actually not match any of

94
00:04:49,130 --> 00:04:53,315
the strings because the
final b is missing.

95
00:04:53,315 --> 00:04:56,150
But that is
okay. For example

96
00:04:56,150 --> 00:04:57,425
if you have a regular expression

97
00:04:57,425 --> 00:05:00,110
that is supposed to
check whether a string is

98
00:05:00,110 --> 00:05:01,490
an email address and the user

99
00:05:01,490 --> 00:05:03,380
gives some random
strings in there,

100
00:05:03,380 --> 00:05:06,545
then this regular expression
should not match that string.

101
00:05:06,545 --> 00:05:08,420
And for this regular expression

102
00:05:08,420 --> 00:05:11,195
you have to scratch a
little bit of your head,

103
00:05:11,195 --> 00:05:12,620
what it can actually match.

104
00:05:12,620 --> 00:05:14,720
But after a little bit
of head scratching,

105
00:05:14,720 --> 00:05:18,260
you find out can match
any string which is of

106
00:05:18,260 --> 00:05:22,580
the length n a's up
to 2n of a's.

107
00:05:22,580 --> 00:05:24,290
So anything in this range,

108
00:05:24,290 --> 00:05:27,185
this regular expression
can actually match.

109
00:05:27,185 --> 00:05:30,395
Okay, let's
take a random tool,

110
00:05:30,395 --> 00:05:32,630
maybe for example Python.

111
00:05:32,630 --> 00:05:35,240
So here's a little
Python program.

112
00:05:35,240 --> 00:05:38,690
It uses the library
function of Python to

113
00:05:38,690 --> 00:05:42,935
match the regular expressions of
a star star b.

114
00:05:42,935 --> 00:05:46,805
And we measure the time with longer
and longer strings of a's.

115
00:05:46,805 --> 00:05:48,770
And so conveniently we can give

116
00:05:48,770 --> 00:05:51,140
the number of a's here
on the command line.

117
00:05:51,140 --> 00:05:56,900
If I just call
this on the command line,

118
00:05:56,900 --> 00:05:59,900
Let's say we first
start with five a's.

119
00:05:59,900 --> 00:06:03,920
And I get also the times which
in this case is next to nothing.

120
00:06:03,920 --> 00:06:05,960
And here's the string
we just matched.

121
00:06:05,960 --> 00:06:07,640
And obviously the
regular expression

122
00:06:07,640 --> 00:06:09,110
did not match the string.

123
00:06:09,110 --> 00:06:11,255
That's indicated by this None.

124
00:06:11,255 --> 00:06:13,925
Let's take ten a's.

125
00:06:13,925 --> 00:06:16,490
It's also pretty quick.

126
00:06:16,490 --> 00:06:20,780
Fifteen a's, even quicker,

127
00:06:20,780 --> 00:06:23,180
but these times always need to

128
00:06:23,180 --> 00:06:25,820
be taken with a grain of salt.

129
00:06:25,820 --> 00:06:28,040
They are not 100
percent accurate.

130
00:06:28,040 --> 00:06:31,490
So 15 is also OK.
Let's take 20.

131
00:06:31,490 --> 00:06:36,965
Hmmm this already takes
double the time.

132
00:06:36,965 --> 00:06:42,440
Twenty-five. Then even longer.

133
00:06:42,440 --> 00:06:45,680
Okay, then suddenly
from 0.2 seconds,

134
00:06:45,680 --> 00:06:48,960
it now takes almost four seconds.

135
00:06:49,600 --> 00:06:54,890
Twenty-Six, this

136
00:06:54,890 --> 00:07:01,415
takes six seconds...
already double. 

137
00:07:01,415 --> 00:07:07,229
Let's go to 28. That would be
...hmmm....hmmm

138
00:07:08,890 --> 00:07:11,840
You see the string
isn't very long,

139
00:07:11,840 --> 00:07:13,340
so that could be easily like

140
00:07:13,340 --> 00:07:16,070
just the size of
an email address.

141
00:07:16,070 --> 00:07:19,280
And the regular
expression matching

142
00:07:19,280 --> 00:07:22,550
engine in Python needs
quite a long time

143
00:07:22,550 --> 00:07:24,710
to find out that
this string of 28

144
00:07:24,710 --> 00:07:26,570
a's is actually not matched

145
00:07:26,570 --> 00:07:28,490
by that. You see it's
still not finished.

146
00:07:28,490 --> 00:07:32,900
I think it should take
approximately like 20 seconds.

147
00:07:32,900 --> 00:07:34,400
Okay. Already 30.

148
00:07:34,400 --> 00:07:36,530
And if we would try

149
00:07:36,530 --> 00:07:40,805
30, we would be already here
for more than a minute.

150
00:07:40,805 --> 00:07:43,940
And if I could use
something like 100,

151
00:07:43,940 --> 00:07:46,220
you remember if a doubling in

152
00:07:46,220 --> 00:07:48,770
each step or the second step,

153
00:07:48,770 --> 00:07:50,720
the story with the chess board,

154
00:07:50,720 --> 00:07:53,855
we probably would sit here
until the next century.

155
00:07:53,855 --> 00:07:56,820
So something strange here.

156
00:07:57,580 --> 00:08:01,355
Okay, that might be just
a problem of Python.

157
00:08:01,355 --> 00:08:02,990
Let's have a look at another

158
00:08:02,990 --> 00:08:04,985
regular expression
matching engine.

159
00:08:04,985 --> 00:08:06,890
This time from JavaScript,

160
00:08:06,890 --> 00:08:10,040
also are pretty well-known
programming language.

161
00:08:10,040 --> 00:08:13,610
So here you can see
it's still a star,

162
00:08:13,610 --> 00:08:16,235
star followed by b,

163
00:08:16,235 --> 00:08:18,920
by direct expression is
supposed to match that from

164
00:08:18,920 --> 00:08:21,830
the beginning of the
string up till the end.

165
00:08:21,830 --> 00:08:23,930
So there's not any difference

166
00:08:23,930 --> 00:08:26,150
in the strings this regular
expression matches.

167
00:08:26,150 --> 00:08:28,610
We'll just start at the
beginning of the string

168
00:08:28,610 --> 00:08:31,460
and finish at the
end of the string.

169
00:08:31,460 --> 00:08:35,285
And we again, we just use
repeated a's for that.

170
00:08:35,285 --> 00:08:38,195
And similarly, we can

171
00:08:38,195 --> 00:08:41,930
call it on the command line
and can do some timing.

172
00:08:41,930 --> 00:08:44,540
So ten a's is very good.

173
00:08:44,540 --> 00:08:46,340
Here's the string.

174
00:08:46,340 --> 00:08:48,320
It cannot match that string.

175
00:08:48,320 --> 00:08:50,525
And it's pretty fast.

176
00:08:50,525 --> 00:08:54,725
Twenty...also pretty fast.

177
00:08:54,725 --> 00:08:59,120
Twenty-five... Again,

178
00:08:59,120 --> 00:09:06,650
somehow is a kind of
threshold that is 25, 26.

179
00:09:06,650 --> 00:09:09,485
Suddenly it takes much longer.

180
00:09:09,485 --> 00:09:14,360
And it has essentially the
same problem as with Python.

181
00:09:14,360 --> 00:09:17,165
So you'll see in now from 26 on,

182
00:09:17,165 --> 00:09:19,250
the times always
double from

183
00:09:19,250 --> 00:09:21,860
three seconds to seven seconds.

184
00:09:21,860 --> 00:09:23,330
So you can imagine what that

185
00:09:23,330 --> 00:09:24,890
roughly takes when I put your

186
00:09:24,890 --> 00:09:30,230
27 and you see the
string isn't very long.

187
00:09:30,230 --> 00:09:32,165
It is just twenty-or-something a's.

188
00:09:32,165 --> 00:09:35,419
Imagine you have to
search a database

189
00:09:35,419 --> 00:09:38,720
with Gigabytes of data

190
00:09:38,720 --> 00:09:42,260
with these regular
expressions that would 

191
00:09:42,260 --> 00:09:48,150
need years to go through with
these regular expressions.

192
00:09:48,630 --> 00:09:51,850
Okay, maybe the people in

193
00:09:51,850 --> 00:09:55,435
Python and JavaScript,
they're just idiots.

194
00:09:55,435 --> 00:09:58,180
Surely Java must do much better.

195
00:09:58,180 --> 00:10:01,045
So here's a program.

196
00:10:01,045 --> 00:10:03,415
You can see this again

197
00:10:03,415 --> 00:10:05,980
is the regular expression
and we just having

198
00:10:05,980 --> 00:10:08,320
some scaffolding to generate

199
00:10:08,320 --> 00:10:11,905
strings from 5 up till 28.

200
00:10:11,905 --> 00:10:14,305
And if we run that,

201
00:10:14,305 --> 00:10:16,660
actually does that automatically.

202
00:10:16,660 --> 00:10:19,900
So uphill 19, pretty fast,

203
00:10:19,900 --> 00:10:24,925
but then starting from
23, it is getting pretty slow.

204
00:10:24,925 --> 00:10:27,445
So the question is
what's going on?

205
00:10:27,445 --> 00:10:29,230
By the way, I'm not gloating here.

206
00:10:29,230 --> 00:10:33,755
Scala uses internally
the regular expression

207
00:10:33,755 --> 00:10:36,665
matching engine from Java.

208
00:10:36,665 --> 00:10:39,065
So would have exactly
the same problem.

209
00:10:39,065 --> 00:10:41,480
Also, I have been
here very careful,

210
00:10:41,480 --> 00:10:43,550
I'm using here Java 8,

211
00:10:43,550 --> 00:10:46,085
which nowadays is quite old.

212
00:10:46,085 --> 00:10:50,765
But you will see also
current Java versions.

213
00:10:50,765 --> 00:10:55,490
We will see we can out-compete
them by magnitudes.

214
00:10:55,490 --> 00:10:57,605
So I think I can 

215
00:10:57,605 --> 00:10:59,165
now, just finish this here.

216
00:10:59,165 --> 00:11:04,025
You see the problem.
Just for completeness sake.

217
00:11:04,025 --> 00:11:07,010
Here is a Ruby program.

218
00:11:07,010 --> 00:11:09,935
This is using the other
regular expression.

219
00:11:09,935 --> 00:11:12,935
In this case the
string should match.

220
00:11:12,935 --> 00:11:20,300
And again it tries out
strings between 1 and 30 here.

221
00:11:20,300 --> 00:11:23,450
That's a program actually
a former student produced.

222
00:11:23,450 --> 00:11:25,565
And you can see four a's

223
00:11:25,565 --> 00:11:29,780
of links up till 20
a's is pretty fast.

224
00:11:29,780 --> 00:11:32,495
But then starting at 26,

225
00:11:32,495 --> 00:11:35,285
it's getting really slow.

226
00:11:35,285 --> 00:11:37,100
So in this case,
remember the string

227
00:11:37,100 --> 00:11:38,870
is actually matched by
the regular expression.

228
00:11:38,870 --> 00:11:40,130
So it has nothing to do

229
00:11:40,130 --> 00:11:41,540
with a regular
expression actually

230
00:11:41,540 --> 00:11:45,485
matches a string or does
not match a string.

231
00:11:45,485 --> 00:11:48,260
I admit though these
regular expressions

232
00:11:48,260 --> 00:11:49,610
are carefully chosen,

233
00:11:49,610 --> 00:11:52,250
as you will see later on.

234
00:11:52,250 --> 00:11:55,620
I also just stop that here.

235
00:11:55,710 --> 00:12:00,985
Okay, this slide collects
the information about times.

236
00:12:00,985 --> 00:12:03,400
On the right-hand side will

237
00:12:03,400 --> 00:12:05,860
be our regular expression matcher,

238
00:12:05,860 --> 00:12:08,290
which we implement next week.

239
00:12:08,290 --> 00:12:10,795
On the left-hand side,
are these times by

240
00:12:10,795 --> 00:12:14,260
various other regular
expression matching engines?

241
00:12:14,260 --> 00:12:17,809
On the top is this
regular expression.

242
00:12:19,080 --> 00:12:23,335
Possible a n-times a n-times.

243
00:12:23,335 --> 00:12:26,890
And on the lower
is (a*)* b.

244
00:12:26,890 --> 00:12:30,370
And the x-axis show here

245
00:12:30,370 --> 00:12:35,335
the length of the
string. How many a's.

246
00:12:35,335 --> 00:12:38,925
And on the y-axis is the time

247
00:12:38,925 --> 00:12:41,660
they need to decide whether

248
00:12:41,660 --> 00:12:44,615
the string is matched by
the regular expression or not.

249
00:12:44,615 --> 00:12:46,415
So you can see here, Python,

250
00:12:46,415 --> 00:12:47,945
Java 8 and JavaScript,

251
00:12:47,945 --> 00:12:52,250
they max out approximately
at between 25 and 30.

252
00:12:52,250 --> 00:12:53,900
Because then it takes already

253
00:12:53,900 --> 00:12:55,160
a half a minute to

254
00:12:55,160 --> 00:12:57,410
decide whether the string
is matched or not.

255
00:12:57,410 --> 00:13:00,815
And similarly, in
the other example,

256
00:13:00,815 --> 00:13:03,830
Python and Ruby max out

257
00:13:03,830 --> 00:13:07,220
at a similar kind of
length of the strings.

258
00:13:07,220 --> 00:13:10,400
Because then they use also
half a minute to decide

259
00:13:10,400 --> 00:13:13,940
whether this regular expression
actually matches the string.

260
00:13:13,940 --> 00:13:16,790
Contrast that with


261
00:13:16,790 --> 00:13:19,235
the regular expression matcher

262
00:13:19,235 --> 00:13:21,470
which we're going to implement.

263
00:13:21,470 --> 00:13:25,040
This can match
approximately 10 thousand

264
00:13:25,040 --> 00:13:30,065
a's in this example and
needs less than ten seconds.

265
00:13:30,065 --> 00:13:32,285
Actually, there will be
two versions of that.

266
00:13:32,285 --> 00:13:34,850
The first version will be
also relatively slow.

267
00:13:34,850 --> 00:13:36,410
But the second version,

268
00:13:36,410 --> 00:13:38,240
in contrast to Python,

269
00:13:38,240 --> 00:13:40,295
Ruby, we'll be blindingly fast.

270
00:13:40,295 --> 00:13:42,380
And in the second example,

271
00:13:42,380 --> 00:13:45,740
you have to be careful
about the x-axis because

272
00:13:45,740 --> 00:13:49,385
that means four times
ten to the power six.

273
00:13:49,385 --> 00:13:51,695
It's actually 4 million a's.

274
00:13:51,695 --> 00:13:55,100
So our regular
expression matcher needs

275
00:13:55,100 --> 00:13:57,635
less than ten seconds to

276
00:13:57,635 --> 00:14:00,725
match a string of length
of 4 million a's.

277
00:14:00,725 --> 00:14:04,430
Contrast that Python, Java 8,

278
00:14:04,430 --> 00:14:06,770
and JavaScript need half a minute

279
00:14:06,770 --> 00:14:09,905
already for a string
of length just 30.

280
00:14:09,905 --> 00:14:12,365
I was very careful with Java 8.

281
00:14:12,365 --> 00:14:15,725
Yes, Java 9 and above,

282
00:14:15,725 --> 00:14:17,180
they already have

283
00:14:17,180 --> 00:14:19,610
a much better regular
expression matching engine,

284
00:14:19,610 --> 00:14:22,805
but still we will be running
circles around them.

285
00:14:22,805 --> 00:14:27,050
with this data.
I call this slide:

286
00:14:27,050 --> 00:14:29,675
Why bother with
regular expressions?

287
00:14:29,675 --> 00:14:33,515
But you can probably
see these are

288
00:14:33,515 --> 00:14:34,910
abysmal times by

289
00:14:34,910 --> 00:14:38,015
the existing regular
expression matching engines.

290
00:14:38,015 --> 00:14:40,070
And it's actually
surprising that after

291
00:14:40,070 --> 00:14:42,695
one lecture we can already
do substantially better.

292
00:14:42,695 --> 00:14:47,495
And if you don't believe
in the times, I gave here,

293
00:14:47,495 --> 00:14:50,090
please feel free to
play on your own

294
00:14:50,090 --> 00:14:52,865
with the examples
I uploaded on KEATS.

295
00:14:52,865 --> 00:14:55,235
These are exactly the programs

296
00:14:55,235 --> 00:14:57,470
I used here in the examples.

297
00:14:57,470 --> 00:14:59,255
So feel free.

298
00:14:59,255 --> 00:15:01,970
You might however now think, hmm.

299
00:15:01,970 --> 00:15:05,449
These are two very
well chosen examples,

300
00:15:05,449 --> 00:15:07,145
and I admit that's true,

301
00:15:07,145 --> 00:15:09,410
and such problems never

302
00:15:09,410 --> 00:15:12,540
cause any problems
in real life.

303
00:15:13,300 --> 00:15:15,980
Regular expressions are used very

304
00:15:15,980 --> 00:15:19,415
frequently and they
do cause problems.

305
00:15:19,415 --> 00:15:21,410
So here's my first example from

306
00:15:21,410 --> 00:15:23,885
a company called Cloudflare.

307
00:15:23,885 --> 00:15:27,560
This is a huge hosting company

308
00:15:27,560 --> 00:15:30,935
which hosts very
well-known web pages.

309
00:15:30,935 --> 00:15:34,970
And they really try hard
to have no outage at all.

310
00:15:34,970 --> 00:15:37,340
And they manage
that for six years.

311
00:15:37,340 --> 00:15:39,320
But then a regular expression,

312
00:15:39,320 --> 00:15:41,180
actually this one, caused
a problem and you

313
00:15:41,180 --> 00:15:43,265
can see they're also
two stars

314
00:15:43,265 --> 00:15:44,630
at the end.

315
00:15:44,630 --> 00:15:46,955
And because of that string needed

316
00:15:46,955 --> 00:15:49,865
too much time to be matched.

317
00:15:49,865 --> 00:15:50,990
And because of that,

318
00:15:50,990 --> 00:15:52,430
they had some outage for,

319
00:15:52,430 --> 00:15:54,125
I think several hours,

320
00:15:54,125 --> 00:15:57,920
actually in their malware
detection subsystem.

321
00:15:57,920 --> 00:16:02,060
And the second example
comes from 2016,

322
00:16:02,060 --> 00:16:04,040
where Stack Exchange,
I guess you know

323
00:16:04,040 --> 00:16:06,650
this webpage, had
also an outage for

324
00:16:06,650 --> 00:16:08,390
I think at least an hour.

325
00:16:08,390 --> 00:16:13,070
Because a regular expression,
needed to format posts,

326
00:16:13,070 --> 00:16:15,575
needed too much time to

327
00:16:15,575 --> 00:16:19,010
recognize whether this post
should be accepted or not.

328
00:16:19,010 --> 00:16:23,390
And again, there was a
similar kind of problem.

329
00:16:23,390 --> 00:16:24,950
And you can read
the stories behind

330
00:16:24,950 --> 00:16:28,080
that on these two given links.

331
00:16:28,720 --> 00:16:31,730
When I looked at
this the first time,

332
00:16:31,730 --> 00:16:34,175
what surprised me is
that theoreticians,

333
00:16:34,175 --> 00:16:37,520
who sometimes dedicate their
life to regular expressions

334
00:16:37,520 --> 00:16:39,440
and know really a lot about

335
00:16:39,440 --> 00:16:41,690
them, didn't know
anything about this.

336
00:16:41,690 --> 00:16:43,610
But engineers, they
already created

337
00:16:43,610 --> 00:16:46,160
a name for that:
Regular Expression

338
00:16:46,160 --> 00:16:47,975
Denial of Service Attack.

339
00:16:47,975 --> 00:16:49,745
Because what you can,

340
00:16:49,745 --> 00:16:51,230
what can happen now is that

341
00:16:51,230 --> 00:16:54,920
attackers look for
certain strings

342
00:16:54,920 --> 00:16:56,780
that make your regular expression

343
00:16:56,780 --> 00:16:59,105
matching engine topple over.

344
00:16:59,105 --> 00:17:01,370
And these kind of 

345
00:17:01,370 --> 00:17:04,160
regular expressions are called
Evil Regular Expressions.

346
00:17:04,160 --> 00:17:06,350
And actually there are
quite a number of them.

347
00:17:06,350 --> 00:17:08,495
So you seen this one,

348
00:17:08,495 --> 00:17:11,255
the first one, and the
second one already.

349
00:17:11,255 --> 00:17:13,400
But there are many, many more.

350
00:17:13,400 --> 00:17:15,620
And you can easily have in

351
00:17:15,620 --> 00:17:18,560
your program one of
these regular expressions.

352
00:17:18,560 --> 00:17:21,830
And then you have the
problem that if you do have

353
00:17:21,830 --> 00:17:23,240
this regular expression and

354
00:17:23,240 --> 00:17:25,640
somebody finds the
corresponding string,

355
00:17:25,640 --> 00:17:29,945
which make the regular
matching engine topple over,

356
00:17:29,945 --> 00:17:31,820
then you have a problem

357
00:17:31,820 --> 00:17:34,295
because your webpage is
probably not available.

358
00:17:34,295 --> 00:17:36,140
This phenomenon is also sometimes 

359
00:17:36,140 --> 00:17:39,350
called
catastrophic backtracking.

360
00:17:39,350 --> 00:17:43,595
In lecture three, we will
look at this more carefully.

361
00:17:43,595 --> 00:17:46,910
And actually why that
is such a problem in

362
00:17:46,910 --> 00:17:50,795
real life is actually
not to do with lexers.

363
00:17:50,795 --> 00:17:53,180
Yes, regular
expressions are used as

364
00:17:53,180 --> 00:17:55,040
the basic tool for implementing

365
00:17:55,040 --> 00:17:57,185
lexers. But regular expressions,

366
00:17:57,185 --> 00:18:00,065
of course, used in
a much wider area.

367
00:18:00,065 --> 00:18:03,770
And they especially used for
network intrusion detection.

368
00:18:03,770 --> 00:18:06,590
Remember, say you're having to

369
00:18:06,590 --> 00:18:10,130
administer a big network
and you only want to let

370
00:18:10,130 --> 00:18:13,640
in packets which you think are OK

371
00:18:13,640 --> 00:18:14,930
and you want to keep out

372
00:18:14,930 --> 00:18:17,645
any package which might
hack into your network.

373
00:18:17,645 --> 00:18:22,670
So what they have is they
have suites of thousands and

374
00:18:22,670 --> 00:18:25,745
sometimes even more
regular expressions which

375
00:18:25,745 --> 00:18:27,755
check whether this package

376
00:18:27,755 --> 00:18:30,065
satisfies some patterns or not.

377
00:18:30,065 --> 00:18:31,460
And in this case it will be left

378
00:18:31,460 --> 00:18:34,205
out or it will be let in.

379
00:18:34,205 --> 00:18:36,335
And with networks,

380
00:18:36,335 --> 00:18:39,080
the problem is that our
hardware is already

381
00:18:39,080 --> 00:18:43,190
so fast that the regular
expressions

382
00:18:43,190 --> 00:18:45,169
really become a bottleneck.

383
00:18:45,169 --> 00:18:47,060
Because what do you do if now is

384
00:18:47,060 --> 00:18:49,880
suddenly a regular expression
takes too much time?

385
00:18:49,880 --> 00:18:52,670
Do you just stop the matching

386
00:18:52,670 --> 00:18:55,100
and let the package
in regardless?

387
00:18:55,100 --> 00:18:58,190
Or do you just hold
the network up

388
00:18:58,190 --> 00:19:01,715
and don't let anything in
until you decided that.

389
00:19:01,715 --> 00:19:04,895
So that's actually a
really hard problem.

390
00:19:04,895 --> 00:19:06,650
But the first time I came across

391
00:19:06,650 --> 00:19:09,965
that problem was actually
by this engineer.

392
00:19:09,965 --> 00:19:13,820
And it's always say that
Germans don't have any humor.

393
00:19:13,820 --> 00:19:16,985
But I found that
video quite funny.

394
00:19:16,985 --> 00:19:19,145
Maybe you have a
different opinion,

395
00:19:19,145 --> 00:19:21,095
but feel free to
have a look. 

396
00:19:21,095 --> 00:19:23,705
It explains exactly that problem.

397
00:19:23,705 --> 00:19:25,610
So in the next video,

398
00:19:25,610 --> 00:19:28,445
we will start to
implement this matcher.

399
00:19:28,445 --> 00:19:30,870
So I hope to see you there.