1
00:00:06,240 --> 00:00:11,050
Welcome back. This video
is about regular expressions.
2
00:00:11,050 --> 00:00:14,230
We want to use regular
expressions in our lexer.
3
00:00:14,230 --> 00:00:16,165
And the purpose of the
lexer is to find
4
00:00:16,165 --> 00:00:18,070
out where the words in
5
00:00:18,070 --> 00:00:21,070
our programs are. However
regular expressions
6
00:00:21,070 --> 00:00:23,875
are fundamental tool
in computer science.
7
00:00:23,875 --> 00:00:27,910
And I'm sure you've used them
already on several occasions.
8
00:00:27,910 --> 00:00:30,370
And one would expect that about
9
00:00:30,370 --> 00:00:31,750
regular expressions since they are
10
00:00:31,750 --> 00:00:33,850
so well-known and well studied,
11
00:00:33,850 --> 00:00:37,915
that everything under the
sun is known about them.
12
00:00:37,915 --> 00:00:41,080
But actually there's
still some surprising
13
00:00:41,080 --> 00:00:44,465
and interesting
problems with them.
14
00:00:44,465 --> 00:00:47,945
And I want to show you
them in this video.
15
00:00:47,945 --> 00:00:50,720
I'm sure you've seen
regular expressions
16
00:00:50,720 --> 00:00:52,445
many, many times before.
17
00:00:52,445 --> 00:00:55,100
But just to be on the same page,
18
00:00:55,100 --> 00:00:57,110
let me just recap them.
19
00:00:57,110 --> 00:00:59,210
So here in this line,
20
00:00:59,210 --> 00:01:01,790
there is a regular expression
which is supposed to
21
00:01:01,790 --> 00:01:05,285
recognize some form
of email addresses.
22
00:01:05,285 --> 00:01:07,745
So an e-mail address
23
00:01:07,745 --> 00:01:11,000
has part which is
before the @ symbol,
24
00:01:11,000 --> 00:01:13,400
which is the name of the person.
25
00:01:13,400 --> 00:01:16,880
And that can be
any number between
26
00:01:16,880 --> 00:01:20,195
0 and 9, and letters between a and z.
27
00:01:20,195 --> 00:01:24,155
Let's say we avoiding
here capital letters.
28
00:01:24,155 --> 00:01:26,045
There can be underscores.
29
00:01:26,045 --> 00:01:29,405
There can be a dot and
there can be hyphens.
30
00:01:29,405 --> 00:01:35,390
And after the @ symbol
comes the domain name.
31
00:01:35,390 --> 00:01:37,310
So as you can see here,
32
00:01:37,310 --> 00:01:40,640
we use things like star to
33
00:01:40,640 --> 00:01:44,314
match letters
zero or more times.
34
00:01:44,314 --> 00:01:45,985
Or we have a plus,
35
00:01:45,985 --> 00:01:47,420
which means you have to match
36
00:01:47,420 --> 00:01:52,489
at least once or more
times. Then we have.
37
00:01:52,489 --> 00:01:55,790
question mark, which says you
38
00:01:55,790 --> 00:01:59,105
match either it is there
or it is not there.
39
00:01:59,105 --> 00:02:01,340
You are also regular
expressions which
40
00:02:01,340 --> 00:02:03,755
match exactly n-times.
41
00:02:03,755 --> 00:02:08,720
Or this is a regular expression
for between n and m times.
42
00:02:08,720 --> 00:02:12,065
You can see in
this email address,
43
00:02:12,065 --> 00:02:13,730
the top-level domain
44
00:02:13,730 --> 00:02:16,130
name can be any letter
45
00:02:16,130 --> 00:02:19,265
between a to z,
and contain dots,
46
00:02:19,265 --> 00:02:22,340
but can only be two
characters long
47
00:02:22,340 --> 00:02:25,685
up till six characters
and not more.
48
00:02:25,685 --> 00:02:29,240
Then you also have
something like ranges.
49
00:02:29,240 --> 00:02:31,220
So you can see, letters between a
50
00:02:31,220 --> 00:02:33,635
and z and 0 to 9 and so on.
51
00:02:33,635 --> 00:02:36,545
Here you also have regular
expressions which can
52
00:02:36,545 --> 00:02:40,070
match something which
isn't in this range.
53
00:02:40,070 --> 00:02:42,560
So for example, if
you want for example match,
54
00:02:42,560 --> 00:02:44,030
letters but not numbers,
55
00:02:44,030 --> 00:02:45,800
you would say, well, if
56
00:02:45,800 --> 00:02:48,990
this is a number that
should not match.
57
00:02:49,090 --> 00:02:52,804
Typically you also
have these ranges.
58
00:02:52,804 --> 00:02:55,565
Lowercase letters,
capital letters.
59
00:02:55,565 --> 00:02:58,550
Then you have some
special regular expressions
60
00:02:58,550 --> 00:03:02,195
like this one is only
supposed to match digits.
61
00:03:02,195 --> 00:03:05,674
A dot is supposed to
match any character.
62
00:03:05,674 --> 00:03:07,370
And then they have also something
63
00:03:07,370 --> 00:03:09,800
called groups which
is supposed to be
64
00:03:09,800 --> 00:03:12,799
used when you are
trying to extract
65
00:03:12,799 --> 00:03:15,605
a string you've matched.
66
00:03:15,605 --> 00:03:19,925
Okay, so these are the
typical regular expressions.
67
00:03:19,925 --> 00:03:23,075
And here's a particular one
68
00:03:23,075 --> 00:03:25,820
trying to match something
69
00:03:25,820 --> 00:03:28,770
which resembles
an email address.
70
00:03:29,590 --> 00:03:33,065
Clearly that should be all easy.
71
00:03:33,065 --> 00:03:36,230
And our technology should
be on top of that.
72
00:03:36,230 --> 00:03:37,865
That we can take a
73
00:03:37,865 --> 00:03:41,015
regular expressions and
we can take a string,
74
00:03:41,015 --> 00:03:43,340
and we should have programs to
75
00:03:43,340 --> 00:03:45,680
decide whether this
string is matched
76
00:03:45,680 --> 00:03:50,330
by a regular expression or
not and should be easy-peasy, no?
77
00:03:50,330 --> 00:03:56,150
Well, let's have a
look at two examples.
78
00:03:56,150 --> 00:04:00,860
The first regular expression
is a star star b.
79
00:04:00,860 --> 00:04:02,990
And it is supposed
to match strings of
80
00:04:02,990 --> 00:04:05,825
the form 0 or more a's,
81
00:04:05,825 --> 00:04:10,385
followed by a b. The parentheses
you can ignore.
82
00:04:10,385 --> 00:04:11,990
And a star star
83
00:04:11,990 --> 00:04:14,120
also doesn't
make any difference
84
00:04:14,120 --> 00:04:16,505
to what kind of strings
that can be matched.
85
00:04:16,505 --> 00:04:21,635
It can only make 0 more
a's followed by a b.
86
00:04:21,635 --> 00:04:23,900
And the other regular expression
87
00:04:23,900 --> 00:04:26,990
is possibly a character a,
88
00:04:26,990 --> 00:04:32,930
n times, followed by character
a axactly n-times.
89
00:04:32,930 --> 00:04:35,570
And we will try out
90
00:04:35,570 --> 00:04:38,360
these two regular expressions
with strings of the form a,
91
00:04:38,360 --> 00:04:39,890
aa, and so on,
92
00:04:39,890 --> 00:04:45,770
and up to the length of n. And
93
00:04:45,770 --> 00:04:49,130
this regular expression should
actually not match any of
94
00:04:49,130 --> 00:04:53,315
the strings because the
final b is missing.
95
00:04:53,315 --> 00:04:56,150
But that is
okay. For example
96
00:04:56,150 --> 00:04:57,425
if you have a regular expression
97
00:04:57,425 --> 00:05:00,110
that is supposed to
check whether a string is
98
00:05:00,110 --> 00:05:01,490
an email address and the user
99
00:05:01,490 --> 00:05:03,380
gives some random
strings in there,
100
00:05:03,380 --> 00:05:06,545
then this regular expression
should not match that string.
101
00:05:06,545 --> 00:05:08,420
And for this regular expression
102
00:05:08,420 --> 00:05:11,195
you have to scratch a
little bit of your head,
103
00:05:11,195 --> 00:05:12,620
what it can actually match.
104
00:05:12,620 --> 00:05:14,720
But after a little bit
of head scratching,
105
00:05:14,720 --> 00:05:18,260
you find out can match
any string which is of
106
00:05:18,260 --> 00:05:22,580
the length n a's up
to 2n of a's.
107
00:05:22,580 --> 00:05:24,290
So anything in this range,
108
00:05:24,290 --> 00:05:27,185
this regular expression
can actually match.
109
00:05:27,185 --> 00:05:30,395
Okay, let's
take a random tool,
110
00:05:30,395 --> 00:05:32,630
maybe for example Python.
111
00:05:32,630 --> 00:05:35,240
So here's a little
Python program.
112
00:05:35,240 --> 00:05:38,690
It uses the library
function of Python to
113
00:05:38,690 --> 00:05:42,935
match the regular expressions of
a star star b.
114
00:05:42,935 --> 00:05:46,805
And we measure the time with longer
and longer strings of a's.
115
00:05:46,805 --> 00:05:48,770
And so conveniently we can give
116
00:05:48,770 --> 00:05:51,140
the number of a's here
on the command line.
117
00:05:51,140 --> 00:05:56,900
If I just call
this on the command line,
118
00:05:56,900 --> 00:05:59,900
Let's say we first
start with five a's.
119
00:05:59,900 --> 00:06:03,920
And I get also the times which
in this case is next to nothing.
120
00:06:03,920 --> 00:06:05,960
And here's the string
we just matched.
121
00:06:05,960 --> 00:06:07,640
And obviously the
regular expression
122
00:06:07,640 --> 00:06:09,110
did not match the string.
123
00:06:09,110 --> 00:06:11,255
That's indicated by this None.
124
00:06:11,255 --> 00:06:13,925
Let's take ten a's.
125
00:06:13,925 --> 00:06:16,490
It's also pretty quick.
126
00:06:16,490 --> 00:06:20,780
Fifteen a's, even quicker,
127
00:06:20,780 --> 00:06:23,180
but these times always need to
128
00:06:23,180 --> 00:06:25,820
be taken with a grain of salt.
129
00:06:25,820 --> 00:06:28,040
They are not 100
percent accurate.
130
00:06:28,040 --> 00:06:31,490
So 15 is also OK.
Let's take 20.
131
00:06:31,490 --> 00:06:36,965
Hmmm this already takes
double the time.
132
00:06:36,965 --> 00:06:42,440
Twenty-five. Then even longer.
133
00:06:42,440 --> 00:06:45,680
Okay, then suddenly
from 0.2 seconds,
134
00:06:45,680 --> 00:06:48,960
it now takes almost four seconds.
135
00:06:49,600 --> 00:06:54,890
Twenty-Six, this
136
00:06:54,890 --> 00:07:01,415
takes six seconds...
already double.
137
00:07:01,415 --> 00:07:07,229
Let's go to 28. That would be
...hmmm....hmmm
138
00:07:08,890 --> 00:07:11,840
You see the string
isn't very long,
139
00:07:11,840 --> 00:07:13,340
so that could be easily like
140
00:07:13,340 --> 00:07:16,070
just the size of
an email address.
141
00:07:16,070 --> 00:07:19,280
And the regular
expression matching
142
00:07:19,280 --> 00:07:22,550
engine in Python needs
quite a long time
143
00:07:22,550 --> 00:07:24,710
to find out that
this string of 28
144
00:07:24,710 --> 00:07:26,570
a's is actually not matched
145
00:07:26,570 --> 00:07:28,490
by that. You see it's
still not finished.
146
00:07:28,490 --> 00:07:32,900
I think it should take
approximately like 20 seconds.
147
00:07:32,900 --> 00:07:34,400
Okay. Already 30.
148
00:07:34,400 --> 00:07:36,530
And if we would try
149
00:07:36,530 --> 00:07:40,805
30, we would be already here
for more than a minute.
150
00:07:40,805 --> 00:07:43,940
And if I could use
something like 100,
151
00:07:43,940 --> 00:07:46,220
you remember if a doubling in
152
00:07:46,220 --> 00:07:48,770
each step or the second step,
153
00:07:48,770 --> 00:07:50,720
the story with the chess board,
154
00:07:50,720 --> 00:07:53,855
we probably would sit here
until the next century.
155
00:07:53,855 --> 00:07:56,820
So something strange here.
156
00:07:57,580 --> 00:08:01,355
Okay, that might be just
a problem of Python.
157
00:08:01,355 --> 00:08:02,990
Let's have a look at another
158
00:08:02,990 --> 00:08:04,985
regular expression
matching engine.
159
00:08:04,985 --> 00:08:06,890
This time from JavaScript,
160
00:08:06,890 --> 00:08:10,040
also are pretty well-known
programming language.
161
00:08:10,040 --> 00:08:13,610
So here you can see
it's still a star,
162
00:08:13,610 --> 00:08:16,235
star followed by b,
163
00:08:16,235 --> 00:08:18,920
by direct expression is
supposed to match that from
164
00:08:18,920 --> 00:08:21,830
the beginning of the
string up till the end.
165
00:08:21,830 --> 00:08:23,930
So there's not any difference
166
00:08:23,930 --> 00:08:26,150
in the strings this regular
expression matches.
167
00:08:26,150 --> 00:08:28,610
We'll just start at the
beginning of the string
168
00:08:28,610 --> 00:08:31,460
and finish at the
end of the string.
169
00:08:31,460 --> 00:08:35,285
And we again, we just use
repeated a's for that.
170
00:08:35,285 --> 00:08:38,195
And similarly, we can
171
00:08:38,195 --> 00:08:41,930
call it on the command line
and can do some timing.
172
00:08:41,930 --> 00:08:44,540
So ten a's is very good.
173
00:08:44,540 --> 00:08:46,340
Here's the string.
174
00:08:46,340 --> 00:08:48,320
It cannot match that string.
175
00:08:48,320 --> 00:08:50,525
And it's pretty fast.
176
00:08:50,525 --> 00:08:54,725
Twenty...also pretty fast.
177
00:08:54,725 --> 00:08:59,120
Twenty-five... Again,
178
00:08:59,120 --> 00:09:06,650
somehow is a kind of
threshold that is 25, 26.
179
00:09:06,650 --> 00:09:09,485
Suddenly it takes much longer.
180
00:09:09,485 --> 00:09:14,360
And it has essentially the
same problem as with Python.
181
00:09:14,360 --> 00:09:17,165
So you'll see in now from 26 on,
182
00:09:17,165 --> 00:09:19,250
the times always
double from
183
00:09:19,250 --> 00:09:21,860
three seconds to seven seconds.
184
00:09:21,860 --> 00:09:23,330
So you can imagine what that
185
00:09:23,330 --> 00:09:24,890
roughly takes when I put your
186
00:09:24,890 --> 00:09:30,230
27 and you see the
string isn't very long.
187
00:09:30,230 --> 00:09:32,165
It is just twenty-or-something a's.
188
00:09:32,165 --> 00:09:35,419
Imagine you have to
search a database
189
00:09:35,419 --> 00:09:38,720
with Gigabytes of data
190
00:09:38,720 --> 00:09:42,260
with these regular
expressions that would
191
00:09:42,260 --> 00:09:48,150
need years to go through with
these regular expressions.
192
00:09:48,630 --> 00:09:51,850
Okay, maybe the people in
193
00:09:51,850 --> 00:09:55,435
Python and JavaScript,
they're just idiots.
194
00:09:55,435 --> 00:09:58,180
Surely Java must do much better.
195
00:09:58,180 --> 00:10:01,045
So here's a program.
196
00:10:01,045 --> 00:10:03,415
You can see this again
197
00:10:03,415 --> 00:10:05,980
is the regular expression
and we just having
198
00:10:05,980 --> 00:10:08,320
some scaffolding to generate
199
00:10:08,320 --> 00:10:11,905
strings from 5 up till 28.
200
00:10:11,905 --> 00:10:14,305
And if we run that,
201
00:10:14,305 --> 00:10:16,660
actually does that automatically.
202
00:10:16,660 --> 00:10:19,900
So uphill 19, pretty fast,
203
00:10:19,900 --> 00:10:24,925
but then starting from
23, it is getting pretty slow.
204
00:10:24,925 --> 00:10:27,445
So the question is
what's going on?
205
00:10:27,445 --> 00:10:29,230
By the way, I'm not gloating here.
206
00:10:29,230 --> 00:10:33,755
Scala uses internally
the regular expression
207
00:10:33,755 --> 00:10:36,665
matching engine from Java.
208
00:10:36,665 --> 00:10:39,065
So would have exactly
the same problem.
209
00:10:39,065 --> 00:10:41,480
Also, I have been
here very careful,
210
00:10:41,480 --> 00:10:43,550
I'm using here Java 8,
211
00:10:43,550 --> 00:10:46,085
which nowadays is quite old.
212
00:10:46,085 --> 00:10:50,765
But you will see also
current Java versions.
213
00:10:50,765 --> 00:10:55,490
We will see we can out-compete
them by magnitudes.
214
00:10:55,490 --> 00:10:57,605
So I think I can
215
00:10:57,605 --> 00:10:59,165
now, just finish this here.
216
00:10:59,165 --> 00:11:04,025
You see the problem.
Just for completeness sake.
217
00:11:04,025 --> 00:11:07,010
Here is a Ruby program.
218
00:11:07,010 --> 00:11:09,935
This is using the other
regular expression.
219
00:11:09,935 --> 00:11:12,935
In this case the
string should match.
220
00:11:12,935 --> 00:11:20,300
And again it tries out
strings between 1 and 30 here.
221
00:11:20,300 --> 00:11:23,450
That's a program actually
a former student produced.
222
00:11:23,450 --> 00:11:25,565
And you can see four a's
223
00:11:25,565 --> 00:11:29,780
of links up till 20
a's is pretty fast.
224
00:11:29,780 --> 00:11:32,495
But then starting at 26,
225
00:11:32,495 --> 00:11:35,285
it's getting really slow.
226
00:11:35,285 --> 00:11:37,100
So in this case,
remember the string
227
00:11:37,100 --> 00:11:38,870
is actually matched by
the regular expression.
228
00:11:38,870 --> 00:11:40,130
So it has nothing to do
229
00:11:40,130 --> 00:11:41,540
with a regular
expression actually
230
00:11:41,540 --> 00:11:45,485
matches a string or does
not match a string.
231
00:11:45,485 --> 00:11:48,260
I admit though these
regular expressions
232
00:11:48,260 --> 00:11:49,610
are carefully chosen,
233
00:11:49,610 --> 00:11:52,250
as you will see later on.
234
00:11:52,250 --> 00:11:55,620
I also just stop that here.
235
00:11:55,710 --> 00:12:00,985
Okay, this slide collects
the information about times.
236
00:12:00,985 --> 00:12:03,400
On the right-hand side will
237
00:12:03,400 --> 00:12:05,860
be our regular expression matcher,
238
00:12:05,860 --> 00:12:08,290
which we implement next week.
239
00:12:08,290 --> 00:12:10,795
On the left-hand side,
are these times by
240
00:12:10,795 --> 00:12:14,260
various other regular
expression matching engines?
241
00:12:14,260 --> 00:12:17,809
On the top is this
regular expression.
242
00:12:19,080 --> 00:12:23,335
Possible a n-times a n-times.
243
00:12:23,335 --> 00:12:26,890
And on the lower
is (a*)* b.
244
00:12:26,890 --> 00:12:30,370
And the x-axis show here
245
00:12:30,370 --> 00:12:35,335
the length of the
string. How many a's.
246
00:12:35,335 --> 00:12:38,925
And on the y-axis is the time
247
00:12:38,925 --> 00:12:41,660
they need to decide whether
248
00:12:41,660 --> 00:12:44,615
the string is matched by
the regular expression or not.
249
00:12:44,615 --> 00:12:46,415
So you can see here, Python,
250
00:12:46,415 --> 00:12:47,945
Java 8 and JavaScript,
251
00:12:47,945 --> 00:12:52,250
they max out approximately
at between 25 and 30.
252
00:12:52,250 --> 00:12:53,900
Because then it takes already
253
00:12:53,900 --> 00:12:55,160
a half a minute to
254
00:12:55,160 --> 00:12:57,410
decide whether the string
is matched or not.
255
00:12:57,410 --> 00:13:00,815
And similarly, in
the other example,
256
00:13:00,815 --> 00:13:03,830
Python and Ruby max out
257
00:13:03,830 --> 00:13:07,220
at a similar kind of
length of the strings.
258
00:13:07,220 --> 00:13:10,400
Because then they use also
half a minute to decide
259
00:13:10,400 --> 00:13:13,940
whether this regular expression
actually matches the string.
260
00:13:13,940 --> 00:13:16,790
Contrast that with
261
00:13:16,790 --> 00:13:19,235
the regular expression matcher
262
00:13:19,235 --> 00:13:21,470
which we're going to implement.
263
00:13:21,470 --> 00:13:25,040
This can match
approximately 10 thousand
264
00:13:25,040 --> 00:13:30,065
a's in this example and
needs less than ten seconds.
265
00:13:30,065 --> 00:13:32,285
Actually, there will be
two versions of that.
266
00:13:32,285 --> 00:13:34,850
The first version will be
also relatively slow.
267
00:13:34,850 --> 00:13:36,410
But the second version,
268
00:13:36,410 --> 00:13:38,240
in contrast to Python,
269
00:13:38,240 --> 00:13:40,295
Ruby, we'll be blindingly fast.
270
00:13:40,295 --> 00:13:42,380
And in the second example,
271
00:13:42,380 --> 00:13:45,740
you have to be careful
about the x-axis because
272
00:13:45,740 --> 00:13:49,385
that means four times
ten to the power six.
273
00:13:49,385 --> 00:13:51,695
It's actually 4 million a's.
274
00:13:51,695 --> 00:13:55,100
So our regular
expression matcher needs
275
00:13:55,100 --> 00:13:57,635
less than ten seconds to
276
00:13:57,635 --> 00:14:00,725
match a string of length
of 4 million a's.
277
00:14:00,725 --> 00:14:04,430
Contrast that Python, Java 8,
278
00:14:04,430 --> 00:14:06,770
and JavaScript need half a minute
279
00:14:06,770 --> 00:14:09,905
already for a string
of length just 30.
280
00:14:09,905 --> 00:14:12,365
I was very careful with Java 8.
281
00:14:12,365 --> 00:14:15,725
Yes, Java 9 and above,
282
00:14:15,725 --> 00:14:17,180
they already have
283
00:14:17,180 --> 00:14:19,610
a much better regular
expression matching engine,
284
00:14:19,610 --> 00:14:22,805
but still we will be running
circles around them.
285
00:14:22,805 --> 00:14:27,050
with this data.
I call this slide:
286
00:14:27,050 --> 00:14:29,675
Why bother with
regular expressions?
287
00:14:29,675 --> 00:14:33,515
But you can probably
see these are
288
00:14:33,515 --> 00:14:34,910
abysmal times by
289
00:14:34,910 --> 00:14:38,015
the existing regular
expression matching engines.
290
00:14:38,015 --> 00:14:40,070
And it's actually
surprising that after
291
00:14:40,070 --> 00:14:42,695
one lecture we can already
do substantially better.
292
00:14:42,695 --> 00:14:47,495
And if you don't believe
in the times, I gave here,
293
00:14:47,495 --> 00:14:50,090
please feel free to
play on your own
294
00:14:50,090 --> 00:14:52,865
with the examples
I uploaded on KEATS.
295
00:14:52,865 --> 00:14:55,235
These are exactly the programs
296
00:14:55,235 --> 00:14:57,470
I used here in the examples.
297
00:14:57,470 --> 00:14:59,255
So feel free.
298
00:14:59,255 --> 00:15:01,970
You might however now think, hmm.
299
00:15:01,970 --> 00:15:05,449
These are two very
well chosen examples,
300
00:15:05,449 --> 00:15:07,145
and I admit that's true,
301
00:15:07,145 --> 00:15:09,410
and such problems never
302
00:15:09,410 --> 00:15:12,540
cause any problems
in real life.
303
00:15:13,300 --> 00:15:15,980
Regular expressions are used very
304
00:15:15,980 --> 00:15:19,415
frequently and they
do cause problems.
305
00:15:19,415 --> 00:15:21,410
So here's my first example from
306
00:15:21,410 --> 00:15:23,885
a company called Cloudflare.
307
00:15:23,885 --> 00:15:27,560
This is a huge hosting company
308
00:15:27,560 --> 00:15:30,935
which hosts very
well-known web pages.
309
00:15:30,935 --> 00:15:34,970
And they really try hard
to have no outage at all.
310
00:15:34,970 --> 00:15:37,340
And they manage
that for six years.
311
00:15:37,340 --> 00:15:39,320
But then a regular expression,
312
00:15:39,320 --> 00:15:41,180
actually this one, caused
a problem and you
313
00:15:41,180 --> 00:15:43,265
can see they're also
two stars
314
00:15:43,265 --> 00:15:44,630
at the end.
315
00:15:44,630 --> 00:15:46,955
And because of that string needed
316
00:15:46,955 --> 00:15:49,865
too much time to be matched.
317
00:15:49,865 --> 00:15:50,990
And because of that,
318
00:15:50,990 --> 00:15:52,430
they had some outage for,
319
00:15:52,430 --> 00:15:54,125
I think several hours,
320
00:15:54,125 --> 00:15:57,920
actually in their malware
detection subsystem.
321
00:15:57,920 --> 00:16:02,060
And the second example
comes from 2016,
322
00:16:02,060 --> 00:16:04,040
where Stack Exchange,
I guess you know
323
00:16:04,040 --> 00:16:06,650
this webpage, had
also an outage for
324
00:16:06,650 --> 00:16:08,390
I think at least an hour.
325
00:16:08,390 --> 00:16:13,070
Because a regular expression,
needed to format posts,
326
00:16:13,070 --> 00:16:15,575
needed too much time to
327
00:16:15,575 --> 00:16:19,010
recognize whether this post
should be accepted or not.
328
00:16:19,010 --> 00:16:23,390
And again, there was a
similar kind of problem.
329
00:16:23,390 --> 00:16:24,950
And you can read
the stories behind
330
00:16:24,950 --> 00:16:28,080
that on these two given links.
331
00:16:28,720 --> 00:16:31,730
When I looked at
this the first time,
332
00:16:31,730 --> 00:16:34,175
what surprised me is
that theoreticians,
333
00:16:34,175 --> 00:16:37,520
who sometimes dedicate their
life to regular expressions
334
00:16:37,520 --> 00:16:39,440
and know really a lot about
335
00:16:39,440 --> 00:16:41,690
them, didn't know
anything about this.
336
00:16:41,690 --> 00:16:43,610
But engineers, they
already created
337
00:16:43,610 --> 00:16:46,160
a name for that:
Regular Expression
338
00:16:46,160 --> 00:16:47,975
Denial of Service Attack.
339
00:16:47,975 --> 00:16:49,745
Because what you can,
340
00:16:49,745 --> 00:16:51,230
what can happen now is that
341
00:16:51,230 --> 00:16:54,920
attackers look for
certain strings
342
00:16:54,920 --> 00:16:56,780
that make your regular expression
343
00:16:56,780 --> 00:16:59,105
matching engine topple over.
344
00:16:59,105 --> 00:17:01,370
And these kind of
345
00:17:01,370 --> 00:17:04,160
regular expressions are called
Evil Regular Expressions.
346
00:17:04,160 --> 00:17:06,350
And actually there are
quite a number of them.
347
00:17:06,350 --> 00:17:08,495
So you seen this one,
348
00:17:08,495 --> 00:17:11,255
the first one, and the
second one already.
349
00:17:11,255 --> 00:17:13,400
But there are many, many more.
350
00:17:13,400 --> 00:17:15,620
And you can easily have in
351
00:17:15,620 --> 00:17:18,560
your program one of
these regular expressions.
352
00:17:18,560 --> 00:17:21,830
And then you have the
problem that if you do have
353
00:17:21,830 --> 00:17:23,240
this regular expression and
354
00:17:23,240 --> 00:17:25,640
somebody finds the
corresponding string,
355
00:17:25,640 --> 00:17:29,945
which make the regular
matching engine topple over,
356
00:17:29,945 --> 00:17:31,820
then you have a problem
357
00:17:31,820 --> 00:17:34,295
because your webpage is
probably not available.
358
00:17:34,295 --> 00:17:36,140
This phenomenon is also sometimes
359
00:17:36,140 --> 00:17:39,350
called
catastrophic backtracking.
360
00:17:39,350 --> 00:17:43,595
In lecture three, we will
look at this more carefully.
361
00:17:43,595 --> 00:17:46,910
And actually why that
is such a problem in
362
00:17:46,910 --> 00:17:50,795
real life is actually
not to do with lexers.
363
00:17:50,795 --> 00:17:53,180
Yes, regular
expressions are used as
364
00:17:53,180 --> 00:17:55,040
the basic tool for implementing
365
00:17:55,040 --> 00:17:57,185
lexers. But regular expressions,
366
00:17:57,185 --> 00:18:00,065
of course, used in
a much wider area.
367
00:18:00,065 --> 00:18:03,770
And they especially used for
network intrusion detection.
368
00:18:03,770 --> 00:18:06,590
Remember, say you're having to
369
00:18:06,590 --> 00:18:10,130
administer a big network
and you only want to let
370
00:18:10,130 --> 00:18:13,640
in packets which you think are OK
371
00:18:13,640 --> 00:18:14,930
and you want to keep out
372
00:18:14,930 --> 00:18:17,645
any package which might
hack into your network.
373
00:18:17,645 --> 00:18:22,670
So what they have is they
have suites of thousands and
374
00:18:22,670 --> 00:18:25,745
sometimes even more
regular expressions which
375
00:18:25,745 --> 00:18:27,755
check whether this package
376
00:18:27,755 --> 00:18:30,065
satisfies some patterns or not.
377
00:18:30,065 --> 00:18:31,460
And in this case it will be left
378
00:18:31,460 --> 00:18:34,205
out or it will be let in.
379
00:18:34,205 --> 00:18:36,335
And with networks,
380
00:18:36,335 --> 00:18:39,080
the problem is that our
hardware is already
381
00:18:39,080 --> 00:18:43,190
so fast that the regular
expressions
382
00:18:43,190 --> 00:18:45,169
really become a bottleneck.
383
00:18:45,169 --> 00:18:47,060
Because what do you do if now is
384
00:18:47,060 --> 00:18:49,880
suddenly a regular expression
takes too much time?
385
00:18:49,880 --> 00:18:52,670
Do you just stop the matching
386
00:18:52,670 --> 00:18:55,100
and let the package
in regardless?
387
00:18:55,100 --> 00:18:58,190
Or do you just hold
the network up
388
00:18:58,190 --> 00:19:01,715
and don't let anything in
until you decided that.
389
00:19:01,715 --> 00:19:04,895
So that's actually a
really hard problem.
390
00:19:04,895 --> 00:19:06,650
But the first time I came across
391
00:19:06,650 --> 00:19:09,965
that problem was actually
by this engineer.
392
00:19:09,965 --> 00:19:13,820
And it's always say that
Germans don't have any humor.
393
00:19:13,820 --> 00:19:16,985
But I found that
video quite funny.
394
00:19:16,985 --> 00:19:19,145
Maybe you have a
different opinion,
395
00:19:19,145 --> 00:19:21,095
but feel free to
have a look.
396
00:19:21,095 --> 00:19:23,705
It explains exactly that problem.
397
00:19:23,705 --> 00:19:25,610
So in the next video,
398
00:19:25,610 --> 00:19:28,445
we will start to
implement this matcher.
399
00:19:28,445 --> 00:19:30,870
So I hope to see you there.