1+ −
00:00:06,240 --> 00:00:11,050+ −
Welcome back. This video+ −
is about regular expressions.+ −
+ −
2+ −
00:00:11,050 --> 00:00:14,230+ −
We want to use regular+ −
expressions in our lexer.+ −
+ −
3+ −
00:00:14,230 --> 00:00:16,165+ −
And the purpose of the+ −
lexer is to find+ −
+ −
4+ −
00:00:16,165 --> 00:00:18,070+ −
out where the words in+ −
+ −
5+ −
00:00:18,070 --> 00:00:21,070+ −
our programs are. However+ −
regular expressions+ −
+ −
6+ −
00:00:21,070 --> 00:00:23,875+ −
are fundamental tool+ −
in computer science.+ −
+ −
7+ −
00:00:23,875 --> 00:00:27,910+ −
And I'm sure you've used them+ −
already on several occasions.+ −
+ −
8+ −
00:00:27,910 --> 00:00:30,370+ −
And one would expect that about+ −
+ −
9+ −
00:00:30,370 --> 00:00:31,750+ −
regular expressions since they are+ −
+ −
10+ −
00:00:31,750 --> 00:00:33,850+ −
so well-known and well studied,+ −
+ −
11+ −
00:00:33,850 --> 00:00:37,915+ −
that everything under the+ −
sun is known about them.+ −
+ −
12+ −
00:00:37,915 --> 00:00:41,080+ −
But actually there's+ −
still some surprising+ −
+ −
13+ −
00:00:41,080 --> 00:00:44,465+ −
and interesting+ −
problems with them.+ −
+ −
14+ −
00:00:44,465 --> 00:00:47,945+ −
And I want to show you+ −
them in this video.+ −
+ −
15+ −
00:00:47,945 --> 00:00:50,720+ −
I'm sure you've seen+ −
regular expressions+ −
+ −
16+ −
00:00:50,720 --> 00:00:52,445+ −
many, many times before.+ −
+ −
17+ −
00:00:52,445 --> 00:00:55,100+ −
But just to be on the same page,+ −
+ −
18+ −
00:00:55,100 --> 00:00:57,110+ −
let me just recap them.+ −
+ −
19+ −
00:00:57,110 --> 00:00:59,210+ −
So here in this line,+ −
+ −
20+ −
00:00:59,210 --> 00:01:01,790+ −
there is a regular expression+ −
which is supposed to+ −
+ −
21+ −
00:01:01,790 --> 00:01:05,285+ −
recognize some form+ −
of email addresses.+ −
+ −
22+ −
00:01:05,285 --> 00:01:07,745+ −
So an e-mail address+ −
+ −
23+ −
00:01:07,745 --> 00:01:11,000+ −
has part which is+ −
before the @ symbol,+ −
+ −
24+ −
00:01:11,000 --> 00:01:13,400+ −
which is the name of the person.+ −
+ −
25+ −
00:01:13,400 --> 00:01:16,880+ −
And that can be+ −
any number between+ −
+ −
26+ −
00:01:16,880 --> 00:01:20,195+ −
0 and 9, and letters between a and z.+ −
+ −
27+ −
00:01:20,195 --> 00:01:24,155+ −
Let's say we avoiding+ −
here capital letters.+ −
+ −
28+ −
00:01:24,155 --> 00:01:26,045+ −
There can be underscores.+ −
+ −
29+ −
00:01:26,045 --> 00:01:29,405+ −
There can be a dot and+ −
there can be hyphens.+ −
+ −
30+ −
00:01:29,405 --> 00:01:35,390+ −
And after the @ symbol+ −
comes the domain name.+ −
+ −
31+ −
00:01:35,390 --> 00:01:37,310+ −
So as you can see here,+ −
+ −
32+ −
00:01:37,310 --> 00:01:40,640+ −
we use things like star to+ −
+ −
33+ −
00:01:40,640 --> 00:01:44,314+ −
match letters+ −
zero or more times.+ −
+ −
34+ −
00:01:44,314 --> 00:01:45,985+ −
Or we have a plus,+ −
+ −
35+ −
00:01:45,985 --> 00:01:47,420+ −
which means you have to match+ −
+ −
36+ −
00:01:47,420 --> 00:01:52,489+ −
at least once or more+ −
times. Then we have.+ −
+ −
37+ −
00:01:52,489 --> 00:01:55,790+ −
question mark, which says you+ −
+ −
38+ −
00:01:55,790 --> 00:01:59,105+ −
match either it is there+ −
or it ss not there.+ −
+ −
39+ −
00:01:59,105 --> 00:02:01,340+ −
You are also regular+ −
expressions which+ −
+ −
40+ −
00:02:01,340 --> 00:02:03,755+ −
match exactly n-times.+ −
+ −
41+ −
00:02:03,755 --> 00:02:08,720+ −
Or this is a regular expression+ −
for between n and m times.+ −
+ −
42+ −
00:02:08,720 --> 00:02:12,065+ −
You can see in+ −
this email address,+ −
+ −
43+ −
00:02:12,065 --> 00:02:13,730+ −
the top-level domain+ −
+ −
44+ −
00:02:13,730 --> 00:02:16,130+ −
name can be any letter + −
+ −
45+ −
00:02:16,130 --> 00:02:19,265+ −
between a to z,+ −
and contain dots,+ −
+ −
46+ −
00:02:19,265 --> 00:02:22,340+ −
but can only be two+ −
characters long+ −
+ −
47+ −
00:02:22,340 --> 00:02:25,685+ −
up till six characters+ −
and not more.+ −
+ −
48+ −
00:02:25,685 --> 00:02:29,240+ −
Then you also have+ −
something like ranges.+ −
+ −
49+ −
00:02:29,240 --> 00:02:31,220+ −
So you can see, letters between a+ −
+ −
50+ −
00:02:31,220 --> 00:02:33,635+ −
and z and 0 to 9 and so on.+ −
+ −
51+ −
00:02:33,635 --> 00:02:36,545+ −
Here you also have regular+ −
expression which can+ −
+ −
52+ −
00:02:36,545 --> 00:02:40,070+ −
match something which+ −
isn't in this range.+ −
+ −
53+ −
00:02:40,070 --> 00:02:42,560+ −
So for example, if+ −
you want for example match,+ −
+ −
54+ −
00:02:42,560 --> 00:02:44,030+ −
letters but not numbers,+ −
+ −
55+ −
00:02:44,030 --> 00:02:45,800+ −
you would say, well, if+ −
+ −
56+ −
00:02:45,800 --> 00:02:48,990+ −
this is a number that+ −
should not match.+ −
+ −
57+ −
00:02:49,090 --> 00:02:52,804+ −
Typically you also+ −
have these ranges.+ −
+ −
58+ −
00:02:52,804 --> 00:02:55,565+ −
Lowercase letters,+ −
capital letters.+ −
+ −
59+ −
00:02:55,565 --> 00:02:58,550+ −
Then you have some+ −
special regular expressions+ −
+ −
60+ −
00:02:58,550 --> 00:03:02,195+ −
like this one is only+ −
supposed to match digits.+ −
+ −
61+ −
00:03:02,195 --> 00:03:05,674+ −
A dot is supposed to+ −
match any character.+ −
+ −
62+ −
00:03:05,674 --> 00:03:07,370+ −
And then they have also something+ −
+ −
63+ −
00:03:07,370 --> 00:03:09,800+ −
called groups which+ −
is supposed to be+ −
+ −
64+ −
00:03:09,800 --> 00:03:12,799+ −
used when you are+ −
trying to extract+ −
+ −
65+ −
00:03:12,799 --> 00:03:15,605+ −
a string you've matched.+ −
+ −
66+ −
00:03:15,605 --> 00:03:19,925+ −
Okay, so these are the+ −
typical regular expressions.+ −
+ −
67+ −
00:03:19,925 --> 00:03:23,075+ −
And here's a particular one.+ −
+ −
68+ −
00:03:23,075 --> 00:03:25,820+ −
Trying to match something+ −
+ −
69+ −
00:03:25,820 --> 00:03:28,770+ −
which resembles+ −
an email address.+ −
+ −
70+ −
00:03:29,590 --> 00:03:33,065+ −
Clearly that should be all easy.+ −
+ −
71+ −
00:03:33,065 --> 00:03:36,230+ −
And our technology should+ −
be on top of that.+ −
+ −
72+ −
00:03:36,230 --> 00:03:37,865+ −
That we can take a+ −
+ −
73+ −
00:03:37,865 --> 00:03:41,015+ −
regular expressions and+ −
we can take a string,+ −
+ −
74+ −
00:03:41,015 --> 00:03:43,340+ −
and we should have programs to+ −
+ −
75+ −
00:03:43,340 --> 00:03:45,680+ −
decide whether this+ −
string is matched+ −
+ −
76+ −
00:03:45,680 --> 00:03:50,330+ −
by a regular expression or+ −
not and should be easy-peasy, no?+ −
+ −
77+ −
00:03:50,330 --> 00:03:56,150+ −
Well, let's have a+ −
look at two examples.+ −
+ −
78+ −
00:03:56,150 --> 00:04:00,860+ −
The first regular expression+ −
is a star star b.+ −
+ −
79+ −
00:04:00,860 --> 00:04:02,990+ −
And it is supposed+ −
to match strings of+ −
+ −
80+ −
00:04:02,990 --> 00:04:05,825+ −
the form 0 or more a's,+ −
+ −
81+ −
00:04:05,825 --> 00:04:10,385+ −
followed by a b. The parentheses+ −
you can ignore.+ −
+ −
82+ −
00:04:10,385 --> 00:04:11,990+ −
And a star star+ −
+ −
83+ −
00:04:11,990 --> 00:04:14,120+ −
also doesn't+ −
make any difference+ −
+ −
84+ −
00:04:14,120 --> 00:04:16,505+ −
to what kind of strings+ −
that can be matched.+ −
+ −
85+ −
00:04:16,505 --> 00:04:21,635+ −
It can only make 0 more+ −
a's followed by a b.+ −
+ −
86+ −
00:04:21,635 --> 00:04:23,900+ −
And the other regular expression+ −
+ −
87+ −
00:04:23,900 --> 00:04:26,990+ −
is possibly a character a,+ −
+ −
88+ −
00:04:26,990 --> 00:04:32,930+ −
n times, followed by character+ −
a axactly n-times.+ −
+ −
89+ −
00:04:32,930 --> 00:04:35,570+ −
And we will try out+ −
+ −
90+ −
00:04:35,570 --> 00:04:38,360+ −
these two regular expressions+ −
with strings of the form a,+ −
+ −
91+ −
00:04:38,360 --> 00:04:39,890+ −
aa, and so on,+ −
+ −
92+ −
00:04:39,890 --> 00:04:45,770+ −
and up to the length of n. And+ −
+ −
93+ −
00:04:45,770 --> 00:04:49,130+ −
this regular expression should+ −
actually not match any of+ −
+ −
94+ −
00:04:49,130 --> 00:04:53,315+ −
the strings because the+ −
final b is missing.+ −
+ −
95+ −
00:04:53,315 --> 00:04:56,150+ −
But that is+ −
okay. For example+ −
+ −
96+ −
00:04:56,150 --> 00:04:57,425+ −
if you have a regular expression+ −
+ −
97+ −
00:04:57,425 --> 00:05:00,110+ −
that is supposed to+ −
check whether a string is+ −
+ −
98+ −
00:05:00,110 --> 00:05:01,490+ −
an email address and the user+ −
+ −
99+ −
00:05:01,490 --> 00:05:03,380+ −
gives some random+ −
strings in there,+ −
+ −
100+ −
00:05:03,380 --> 00:05:06,545+ −
then this regular expression+ −
should not match that string.+ −
+ −
101+ −
00:05:06,545 --> 00:05:08,420+ −
And for this regular expression+ −
+ −
102+ −
00:05:08,420 --> 00:05:11,195+ −
you have to scratch a+ −
little bit of your head,+ −
+ −
103+ −
00:05:11,195 --> 00:05:12,620+ −
what it can actually match.+ −
+ −
104+ −
00:05:12,620 --> 00:05:14,720+ −
But after a little bit+ −
of head scratching,+ −
+ −
105+ −
00:05:14,720 --> 00:05:18,260+ −
you find out can match+ −
any string which is of+ −
+ −
106+ −
00:05:18,260 --> 00:05:22,580+ −
the length n a's up+ −
to 2n of a's.+ −
+ −
107+ −
00:05:22,580 --> 00:05:24,290+ −
So anything in this range,+ −
+ −
108+ −
00:05:24,290 --> 00:05:27,185+ −
this regular expression+ −
can actually match.+ −
+ −
109+ −
00:05:27,185 --> 00:05:30,395+ −
Okay, let's+ −
take a random tool,+ −
+ −
110+ −
00:05:30,395 --> 00:05:32,630+ −
maybe for example Python.+ −
+ −
111+ −
00:05:32,630 --> 00:05:35,240+ −
So here's a little+ −
Python program.+ −
+ −
112+ −
00:05:35,240 --> 00:05:38,690+ −
It uses the library+ −
function of Python to+ −
+ −
113+ −
00:05:38,690 --> 00:05:42,935+ −
match the regular expressions of+ −
a star star b.+ −
+ −
114+ −
00:05:42,935 --> 00:05:46,805+ −
And we measure time with longer+ −
and longer strings of a.+ −
+ −
115+ −
00:05:46,805 --> 00:05:48,770+ −
And so conveniently we can give+ −
+ −
116+ −
00:05:48,770 --> 00:05:51,140+ −
the number of a's here+ −
on the command line.+ −
+ −
117+ −
00:05:51,140 --> 00:05:56,900+ −
If I just call+ −
this on the command line,+ −
+ −
118+ −
00:05:56,900 --> 00:05:59,900+ −
Let's say we first+ −
start with five a's.+ −
+ −
119+ −
00:05:59,900 --> 00:06:03,920+ −
And I get also the times which+ −
in this case is next to nothing.+ −
+ −
120+ −
00:06:03,920 --> 00:06:05,960+ −
And here's the string+ −
we just matched.+ −
+ −
121+ −
00:06:05,960 --> 00:06:07,640+ −
And obviously the+ −
regular expression+ −
+ −
122+ −
00:06:07,640 --> 00:06:09,110+ −
did not match the string.+ −
+ −
123+ −
00:06:09,110 --> 00:06:11,255+ −
That's indicated by this none.+ −
+ −
124+ −
00:06:11,255 --> 00:06:13,925+ −
Let's take ten a's.+ −
+ −
125+ −
00:06:13,925 --> 00:06:16,490+ −
It's also pretty quick.+ −
+ −
126+ −
00:06:16,490 --> 00:06:20,780+ −
Fifteen a's, even quicker,+ −
+ −
127+ −
00:06:20,780 --> 00:06:23,180+ −
but these times always need to+ −
+ −
128+ −
00:06:23,180 --> 00:06:25,820+ −
be taken with a grain of salt.+ −
+ −
129+ −
00:06:25,820 --> 00:06:28,040+ −
They are not 100+ −
percent accurate.+ −
+ −
130+ −
00:06:28,040 --> 00:06:31,490+ −
So 15 is also a let's take+ −
+ −
131+ −
00:06:31,490 --> 00:06:36,965+ −
28th notes already+ −
double the time.+ −
+ −
132+ −
00:06:36,965 --> 00:06:42,440+ −
Twenty-five longer.+ −
+ −
133+ −
00:06:42,440 --> 00:06:45,680+ −
Okay, that suddenly+ −
from 02 seconds,+ −
+ −
134+ −
00:06:45,680 --> 00:06:48,960+ −
it takes almost four seconds.+ −
+ −
135+ −
00:06:49,600 --> 00:06:54,890+ −
Six this+ −
+ −
136+ −
00:06:54,890 --> 00:07:01,415+ −
takes six seconds+ −
already Double, okay?+ −
+ −
137+ −
00:07:01,415 --> 00:07:07,229+ −
Go to 28. That would be now.+ −
+ −
138+ −
00:07:08,890 --> 00:07:11,840+ −
You see the string+ −
isn't very long,+ −
+ −
139+ −
00:07:11,840 --> 00:07:13,340+ −
so that could be easily like+ −
+ −
140+ −
00:07:13,340 --> 00:07:16,070+ −
just the size of+ −
an email address.+ −
+ −
141+ −
00:07:16,070 --> 00:07:19,280+ −
And the regular+ −
expression matching+ −
+ −
142+ −
00:07:19,280 --> 00:07:22,550+ −
engine in Python needs+ −
quite a long time+ −
+ −
143+ −
00:07:22,550 --> 00:07:24,710+ −
to find out that+ −
this string of 28+ −
+ −
144+ −
00:07:24,710 --> 00:07:26,570+ −
AES is actually not much+ −
+ −
145+ −
00:07:26,570 --> 00:07:28,490+ −
by that you see it's+ −
still not finished.+ −
+ −
146+ −
00:07:28,490 --> 00:07:32,900+ −
I think it should take+ −
approximately like 20 seconds.+ −
+ −
147+ −
00:07:32,900 --> 00:07:34,400+ −
Okay. Already 30.+ −
+ −
148+ −
00:07:34,400 --> 00:07:36,530+ −
And if we would try+ −
+ −
149+ −
00:07:36,530 --> 00:07:40,805+ −
30 would be already+ −
more than a minute.+ −
+ −
150+ −
00:07:40,805 --> 00:07:43,940+ −
And if I could read+ −
something like hundreds,+ −
+ −
151+ −
00:07:43,940 --> 00:07:46,220+ −
you remember if a doubling in+ −
+ −
152+ −
00:07:46,220 --> 00:07:48,770+ −
each step or the second step,+ −
+ −
153+ −
00:07:48,770 --> 00:07:50,720+ −
the story with the chess board,+ −
+ −
154+ −
00:07:50,720 --> 00:07:53,855+ −
we probably would sit here+ −
until the next century.+ −
+ −
155+ −
00:07:53,855 --> 00:07:56,820+ −
So something strange here.+ −
+ −
156+ −
00:07:57,580 --> 00:08:01,355+ −
Okay, that might be just+ −
a problem of Python.+ −
+ −
157+ −
00:08:01,355 --> 00:08:02,990+ −
Let's have a look at another+ −
+ −
158+ −
00:08:02,990 --> 00:08:04,985+ −
regular expression+ −
matching engine.+ −
+ −
159+ −
00:08:04,985 --> 00:08:06,890+ −
This time from JavaScript,+ −
+ −
160+ −
00:08:06,890 --> 00:08:10,040+ −
also are pretty well-known+ −
programming language.+ −
+ −
161+ −
00:08:10,040 --> 00:08:13,610+ −
So here you can see+ −
it's still a star,+ −
+ −
162+ −
00:08:13,610 --> 00:08:16,235+ −
star followed by b,+ −
+ −
163+ −
00:08:16,235 --> 00:08:18,920+ −
by direct expression is+ −
supposed to match that from+ −
+ −
164+ −
00:08:18,920 --> 00:08:21,830+ −
the beginning of the+ −
string up till the end.+ −
+ −
165+ −
00:08:21,830 --> 00:08:23,930+ −
So there's not any difference+ −
+ −
166+ −
00:08:23,930 --> 00:08:26,150+ −
in the strings this work+ −
expression matches.+ −
+ −
167+ −
00:08:26,150 --> 00:08:28,610+ −
We'll just start at the+ −
beginning of the string+ −
+ −
168+ −
00:08:28,610 --> 00:08:31,460+ −
and finish at the+ −
end of the string.+ −
+ −
169+ −
00:08:31,460 --> 00:08:35,285+ −
And we again, we just use+ −
repeated A's for that.+ −
+ −
170+ −
00:08:35,285 --> 00:08:38,195+ −
And similarly, we can+ −
+ −
171+ −
00:08:38,195 --> 00:08:41,930+ −
call it on the command line+ −
and can do some timing.+ −
+ −
172+ −
00:08:41,930 --> 00:08:44,540+ −
So ten SBA, good.+ −
+ −
173+ −
00:08:44,540 --> 00:08:46,340+ −
Here's the string.+ −
+ −
174+ −
00:08:46,340 --> 00:08:48,320+ −
It cannot match that string.+ −
+ −
175+ −
00:08:48,320 --> 00:08:50,525+ −
And it's pretty fast.+ −
+ −
176+ −
00:08:50,525 --> 00:08:54,725+ −
Friendly. Although pretty fast.+ −
+ −
177+ −
00:08:54,725 --> 00:08:59,120+ −
Five, again,+ −
+ −
178+ −
00:08:59,120 --> 00:09:06,650+ −
somehow is kind of+ −
threshold that is 25, 26.+ −
+ −
179+ −
00:09:06,650 --> 00:09:09,485+ −
Suddenly it takes much longer.+ −
+ −
180+ −
00:09:09,485 --> 00:09:14,360+ −
And it has essentially the+ −
same problem as with Python.+ −
+ −
181+ −
00:09:14,360 --> 00:09:17,165+ −
So you'll see in now from 26 on,+ −
+ −
182+ −
00:09:17,165 --> 00:09:19,250+ −
the Times has always+ −
doubling from+ −
+ −
183+ −
00:09:19,250 --> 00:09:21,860+ −
three seconds to seven seconds.+ −
+ −
184+ −
00:09:21,860 --> 00:09:23,330+ −
So you can imagine what that+ −
+ −
185+ −
00:09:23,330 --> 00:09:24,890+ −
roughly takes when I put your+ −
+ −
186+ −
00:09:24,890 --> 00:09:30,230+ −
27 and you see the+ −
string isn't very long.+ −
+ −
187+ −
00:09:30,230 --> 00:09:32,165+ −
Let's choose twenties or maize.+ −
+ −
188+ −
00:09:32,165 --> 00:09:35,419+ −
Imagine you have to+ −
search a database+ −
+ −
189+ −
00:09:35,419 --> 00:09:38,720+ −
with kilobytes of data.+ −
+ −
190+ −
00:09:38,720 --> 00:09:42,260+ −
This, these regular+ −
expressions that would years+ −
+ −
191+ −
00:09:42,260 --> 00:09:48,150+ −
need years to go through with+ −
these regular expressions.+ −
+ −
192+ −
00:09:48,630 --> 00:09:51,850+ −
Okay, maybe the people in+ −
+ −
193+ −
00:09:51,850 --> 00:09:55,435+ −
Python and JavaScript,+ −
they're just idiots.+ −
+ −
194+ −
00:09:55,435 --> 00:09:58,180+ −
Surely Java must do much better.+ −
+ −
195+ −
00:09:58,180 --> 00:10:01,045+ −
So here's a program.+ −
+ −
196+ −
00:10:01,045 --> 00:10:03,415+ −
You can see this again+ −
+ −
197+ −
00:10:03,415 --> 00:10:05,980+ −
is the reg expression+ −
and we just having+ −
+ −
198+ −
00:10:05,980 --> 00:10:08,320+ −
some scaffolding to generate+ −
+ −
199+ −
00:10:08,320 --> 00:10:11,905+ −
strings from five up till 28.+ −
+ −
200+ −
00:10:11,905 --> 00:10:14,305+ −
And if we run that,+ −
+ −
201+ −
00:10:14,305 --> 00:10:16,660+ −
actually does that automatically.+ −
+ −
202+ −
00:10:16,660 --> 00:10:19,900+ −
So uphill 19, pretty fast,+ −
+ −
203+ −
00:10:19,900 --> 00:10:24,925+ −
but then starting from+ −
23, skidding pretty slow.+ −
+ −
204+ −
00:10:24,925 --> 00:10:27,445+ −
So the question is+ −
what's going on?+ −
+ −
205+ −
00:10:27,445 --> 00:10:29,230+ −
By the way, I'm not quoting here.+ −
+ −
206+ −
00:10:29,230 --> 00:10:33,755+ −
Scala, using internally+ −
the regular expression+ −
+ −
207+ −
00:10:33,755 --> 00:10:36,665+ −
matching engine from Java.+ −
+ −
208+ −
00:10:36,665 --> 00:10:39,065+ −
So would have exactly+ −
the same problem.+ −
+ −
209+ −
00:10:39,065 --> 00:10:41,480+ −
Also, I have been+ −
here very careful,+ −
+ −
210+ −
00:10:41,480 --> 00:10:43,550+ −
I'm using here Java 8,+ −
+ −
211+ −
00:10:43,550 --> 00:10:46,085+ −
which nowadays is quite old.+ −
+ −
212+ −
00:10:46,085 --> 00:10:50,765+ −
But you will see also+ −
current Java versions.+ −
+ −
213+ −
00:10:50,765 --> 00:10:55,490+ −
We will see we can out-compete+ −
them by magnitudes.+ −
+ −
214+ −
00:10:55,490 --> 00:10:57,605+ −
So I think I can that.+ −
+ −
215+ −
00:10:57,605 --> 00:10:59,165+ −
Now, just finish here.+ −
+ −
216+ −
00:10:59,165 --> 00:11:04,025+ −
You see the problem. Just+ −
for completeness sake.+ −
+ −
217+ −
00:11:04,025 --> 00:11:07,010+ −
Here is a Ruby program.+ −
+ −
218+ −
00:11:07,010 --> 00:11:09,935+ −
This is using the other+ −
regular expression.+ −
+ −
219+ −
00:11:09,935 --> 00:11:12,935+ −
In this case the+ −
string should match.+ −
+ −
220+ −
00:11:12,935 --> 00:11:20,300+ −
And again it tries out+ −
strings between 130 here.+ −
+ −
221+ −
00:11:20,300 --> 00:11:23,450+ −
That's a program actually+ −
a former student produced.+ −
+ −
222+ −
00:11:23,450 --> 00:11:25,565+ −
And you can see four a's+ −
+ −
223+ −
00:11:25,565 --> 00:11:29,780+ −
of links up till 20+ −
AES is pretty fast.+ −
+ −
224+ −
00:11:29,780 --> 00:11:32,495+ −
But then starting at 26,+ −
+ −
225+ −
00:11:32,495 --> 00:11:35,285+ −
it's getting really slow.+ −
+ −
226+ −
00:11:35,285 --> 00:11:37,100+ −
So in this case,+ −
remember the string+ −
+ −
227+ −
00:11:37,100 --> 00:11:38,870+ −
is actually matched by+ −
the regular expression.+ −
+ −
228+ −
00:11:38,870 --> 00:11:40,130+ −
So it has nothing to do+ −
+ −
229+ −
00:11:40,130 --> 00:11:41,540+ −
with a regular+ −
expression actually+ −
+ −
230+ −
00:11:41,540 --> 00:11:45,485+ −
matches a string or does+ −
not match a string.+ −
+ −
231+ −
00:11:45,485 --> 00:11:48,260+ −
I admit though these+ −
regular expressions+ −
+ −
232+ −
00:11:48,260 --> 00:11:49,610+ −
are carefully chosen,+ −
+ −
233+ −
00:11:49,610 --> 00:11:52,250+ −
as you will see later on.+ −
+ −
234+ −
00:11:52,250 --> 00:11:55,620+ −
Hey, I also just stop that here.+ −
+ −
235+ −
00:11:55,710 --> 00:12:00,985+ −
Okay, this slight collect+ −
this information about times.+ −
+ −
236+ −
00:12:00,985 --> 00:12:03,400+ −
On the right hand side will+ −
+ −
237+ −
00:12:03,400 --> 00:12:05,860+ −
be our regular expression mantra,+ −
+ −
238+ −
00:12:05,860 --> 00:12:08,290+ −
which we implement next week.+ −
+ −
239+ −
00:12:08,290 --> 00:12:10,795+ −
On the left-hand side,+ −
are these times by+ −
+ −
240+ −
00:12:10,795 --> 00:12:14,260+ −
barriers than regular+ −
expression matching engines?+ −
+ −
241+ −
00:12:14,260 --> 00:12:17,809+ −
On the top is this+ −
regular expression.+ −
+ −
242+ −
00:12:19,080 --> 00:12:23,335+ −
Possible a n times a n times.+ −
+ −
243+ −
00:12:23,335 --> 00:12:26,890+ −
And on the lowest+ −
is a star, star b.+ −
+ −
244+ −
00:12:26,890 --> 00:12:30,370+ −
And the x-axis show here+ −
+ −
245+ −
00:12:30,370 --> 00:12:35,335+ −
the length of the+ −
string. How many a's.+ −
+ −
246+ −
00:12:35,335 --> 00:12:38,925+ −
And on the y axis is the time.+ −
+ −
247+ −
00:12:38,925 --> 00:12:41,660+ −
They need to decide whether+ −
+ −
248+ −
00:12:41,660 --> 00:12:44,615+ −
the string is matched by+ −
the rate expression or not.+ −
+ −
249+ −
00:12:44,615 --> 00:12:46,415+ −
So you can see here, Python,+ −
+ −
250+ −
00:12:46,415 --> 00:12:47,945+ −
Java eight in JavaScript,+ −
+ −
251+ −
00:12:47,945 --> 00:12:52,250+ −
they max out approximately+ −
at between 2530.+ −
+ −
252+ −
00:12:52,250 --> 00:12:53,900+ −
The kristin, it takes already+ −
+ −
253+ −
00:12:53,900 --> 00:12:55,160+ −
a half a minute to+ −
+ −
254+ −
00:12:55,160 --> 00:12:57,410+ −
decide whether the string+ −
is matched or not.+ −
+ −
255+ −
00:12:57,410 --> 00:13:00,815+ −
And similarly, in+ −
the other example,+ −
+ −
256+ −
00:13:00,815 --> 00:13:03,830+ −
Python and derived Ruby max out+ −
+ −
257+ −
00:13:03,830 --> 00:13:07,220+ −
at a similar kind of+ −
length of the strings.+ −
+ −
258+ −
00:13:07,220 --> 00:13:10,400+ −
Because then they use also+ −
half a minute to decide+ −
+ −
259+ −
00:13:10,400 --> 00:13:13,940+ −
whether this rec expression+ −
actually matches the string.+ −
+ −
260+ −
00:13:13,940 --> 00:13:16,790+ −
Contrast that with+ −
the reg expression+ −
+ −
261+ −
00:13:16,790 --> 00:13:19,235+ −
which we are regular+ −
expression mantra,+ −
+ −
262+ −
00:13:19,235 --> 00:13:21,470+ −
which we're going to implement.+ −
+ −
263+ −
00:13:21,470 --> 00:13:25,040+ −
This can match+ −
approximately 10 thousand+ −
+ −
264+ −
00:13:25,040 --> 00:13:30,065+ −
a's in this example and+ −
needs less than ten seconds.+ −
+ −
265+ −
00:13:30,065 --> 00:13:32,285+ −
Actually, there will be+ −
two versions of that.+ −
+ −
266+ −
00:13:32,285 --> 00:13:34,850+ −
First version may be+ −
also relatively slow.+ −
+ −
267+ −
00:13:34,850 --> 00:13:36,410+ −
But the second version,+ −
+ −
268+ −
00:13:36,410 --> 00:13:38,240+ −
in contrast to Python,+ −
+ −
269+ −
00:13:38,240 --> 00:13:40,295+ −
Ruby, we'll be blindingly fast.+ −
+ −
270+ −
00:13:40,295 --> 00:13:42,380+ −
And in the second example,+ −
+ −
271+ −
00:13:42,380 --> 00:13:45,740+ −
you have to be careful+ −
about the x axis because+ −
+ −
272+ −
00:13:45,740 --> 00:13:49,385+ −
that means four times+ −
ten to the power six.+ −
+ −
273+ −
00:13:49,385 --> 00:13:51,695+ −
It's actually 4 million A's.+ −
+ −
274+ −
00:13:51,695 --> 00:13:55,100+ −
So our regular+ −
expression match or need+ −
+ −
275+ −
00:13:55,100 --> 00:13:57,635+ −
less than ten seconds to+ −
+ −
276+ −
00:13:57,635 --> 00:14:00,725+ −
match a string of length+ −
of 4 million A's.+ −
+ −
277+ −
00:14:00,725 --> 00:14:04,430+ −
Contrast that Python, Java eight,+ −
+ −
278+ −
00:14:04,430 --> 00:14:06,770+ −
and JavaScript need half a minute+ −
+ −
279+ −
00:14:06,770 --> 00:14:09,905+ −
already for a string+ −
of length just 30,+ −
+ −
280+ −
00:14:09,905 --> 00:14:12,365+ −
unless you're very+ −
careful with Java eight.+ −
+ −
281+ −
00:14:12,365 --> 00:14:15,725+ −
Yes, Java nine and above,+ −
+ −
282+ −
00:14:15,725 --> 00:14:17,180+ −
they already have+ −
+ −
283+ −
00:14:17,180 --> 00:14:19,610+ −
a much better regular+ −
expression matching engine,+ −
+ −
284+ −
00:14:19,610 --> 00:14:22,805+ −
but still we will be running+ −
circles around them.+ −
+ −
285+ −
00:14:22,805 --> 00:14:27,050+ −
It's this data. I+ −
call this slide.+ −
+ −
286+ −
00:14:27,050 --> 00:14:29,675+ −
Why bother with+ −
regular expressions?+ −
+ −
287+ −
00:14:29,675 --> 00:14:33,515+ −
But you can probably+ −
see these are+ −
+ −
288+ −
00:14:33,515 --> 00:14:34,910+ −
at least more times by+ −
+ −
289+ −
00:14:34,910 --> 00:14:38,015+ −
the existing regular+ −
expression matching engines.+ −
+ −
290+ −
00:14:38,015 --> 00:14:40,070+ −
And it's actually+ −
surprising that after+ −
+ −
291+ −
00:14:40,070 --> 00:14:42,695+ −
one lecture we can already+ −
do substantially better.+ −
+ −
292+ −
00:14:42,695 --> 00:14:47,495+ −
And if you don't believe+ −
in D times, I gave here,+ −
+ −
293+ −
00:14:47,495 --> 00:14:50,090+ −
please feel free to+ −
play on your own+ −
+ −
294+ −
00:14:50,090 --> 00:14:52,865+ −
with the examples+ −
I uploaded, Keats.+ −
+ −
295+ −
00:14:52,865 --> 00:14:55,235+ −
These are exactly the programs+ −
+ −
296+ −
00:14:55,235 --> 00:14:57,470+ −
are used here in the examples.+ −
+ −
297+ −
00:14:57,470 --> 00:14:59,255+ −
So feel free.+ −
+ −
298+ −
00:14:59,255 --> 00:15:01,970+ −
You might however now think, hmm.+ −
+ −
299+ −
00:15:01,970 --> 00:15:05,449+ −
These are two very+ −
well chosen examples.+ −
+ −
300+ −
00:15:05,449 --> 00:15:07,145+ −
And I admit that's true.+ −
+ −
301+ −
00:15:07,145 --> 00:15:09,410+ −
And such problem there never+ −
+ −
302+ −
00:15:09,410 --> 00:15:12,540+ −
causing any problems+ −
in real life.+ −
+ −
303+ −
00:15:13,300 --> 00:15:15,980+ −
Regular expressions are used very+ −
+ −
304+ −
00:15:15,980 --> 00:15:19,415+ −
frequently and they+ −
do cause problems.+ −
+ −
305+ −
00:15:19,415 --> 00:15:21,410+ −
So here's my first example from+ −
+ −
306+ −
00:15:21,410 --> 00:15:23,885+ −
a company called cloudflare.+ −
+ −
307+ −
00:15:23,885 --> 00:15:27,560+ −
This is a huge hosting company+ −
+ −
308+ −
00:15:27,560 --> 00:15:30,935+ −
which host very+ −
well-known web pages.+ −
+ −
309+ −
00:15:30,935 --> 00:15:34,970+ −
And they really try hard+ −
to have no outage at all.+ −
+ −
310+ −
00:15:34,970 --> 00:15:37,340+ −
And they manage+ −
that for six years.+ −
+ −
311+ −
00:15:37,340 --> 00:15:39,320+ −
But then a Rekha expression,+ −
+ −
312+ −
00:15:39,320 --> 00:15:41,180+ −
actually this one caused+ −
a problem and you+ −
+ −
313+ −
00:15:41,180 --> 00:15:43,265+ −
can see they're also+ −
like two stars.+ −
+ −
314+ −
00:15:43,265 --> 00:15:44,630+ −
They are at the end.+ −
+ −
315+ −
00:15:44,630 --> 00:15:46,955+ −
And because of that string needed+ −
+ −
316+ −
00:15:46,955 --> 00:15:49,865+ −
too much time to be matched.+ −
+ −
317+ −
00:15:49,865 --> 00:15:50,990+ −
And because of that,+ −
+ −
318+ −
00:15:50,990 --> 00:15:52,430+ −
they had some outage for,+ −
+ −
319+ −
00:15:52,430 --> 00:15:54,125+ −
I think several hours,+ −
+ −
320+ −
00:15:54,125 --> 00:15:57,920+ −
actually in their malware+ −
detection subsystem.+ −
+ −
321+ −
00:15:57,920 --> 00:16:02,060+ −
And the second example+ −
comes from 2016,+ −
+ −
322+ −
00:16:02,060 --> 00:16:04,040+ −
where Stack Exchange,+ −
I guess you know+ −
+ −
323+ −
00:16:04,040 --> 00:16:06,650+ −
this webpage had+ −
also an outage from,+ −
+ −
324+ −
00:16:06,650 --> 00:16:08,390+ −
I think at least an hour.+ −
+ −
325+ −
00:16:08,390 --> 00:16:13,070+ −
Because a regular expression+ −
then needed to format posts,+ −
+ −
326+ −
00:16:13,070 --> 00:16:15,575+ −
needed too much time to+ −
+ −
327+ −
00:16:15,575 --> 00:16:19,010+ −
recognize whether this post+ −
should be accepted or not.+ −
+ −
328+ −
00:16:19,010 --> 00:16:23,390+ −
And again, there was a+ −
semi kind of problem.+ −
+ −
329+ −
00:16:23,390 --> 00:16:24,950+ −
And you can read+ −
the stories behind+ −
+ −
330+ −
00:16:24,950 --> 00:16:28,080+ −
that on these two given links.+ −
+ −
331+ −
00:16:28,720 --> 00:16:31,730+ −
When I looked at+ −
this the first time,+ −
+ −
332+ −
00:16:31,730 --> 00:16:34,175+ −
what surprised me is+ −
that theoretician+ −
+ −
333+ −
00:16:34,175 --> 00:16:37,520+ −
who sometimes dedicate their+ −
life to regular expression.+ −
+ −
334+ −
00:16:37,520 --> 00:16:39,440+ −
And no really a lot about+ −
+ −
335+ −
00:16:39,440 --> 00:16:41,690+ −
them didn't know+ −
anything about this.+ −
+ −
336+ −
00:16:41,690 --> 00:16:43,610+ −
But engineers, they+ −
already created+ −
+ −
337+ −
00:16:43,610 --> 00:16:46,160+ −
a name for that+ −
regular expression,+ −
+ −
338+ −
00:16:46,160 --> 00:16:47,975+ −
denial of service attack.+ −
+ −
339+ −
00:16:47,975 --> 00:16:49,745+ −
Because what you can,+ −
+ −
340+ −
00:16:49,745 --> 00:16:51,230+ −
what can happen now is that+ −
+ −
341+ −
00:16:51,230 --> 00:16:54,920+ −
attackers look for+ −
certain strings.+ −
+ −
342+ −
00:16:54,920 --> 00:16:56,780+ −
You make your regular expression+ −
+ −
343+ −
00:16:56,780 --> 00:16:59,105+ −
matching engine topple over.+ −
+ −
344+ −
00:16:59,105 --> 00:17:01,370+ −
And these kind of expressions,+ −
+ −
345+ −
00:17:01,370 --> 00:17:04,160+ −
regular expressions called+ −
Eve of reg expression.+ −
+ −
346+ −
00:17:04,160 --> 00:17:06,350+ −
And actually there are+ −
quite a number of them.+ −
+ −
347+ −
00:17:06,350 --> 00:17:08,495+ −
So you seen this one,+ −
+ −
348+ −
00:17:08,495 --> 00:17:11,255+ −
the first one, and the+ −
second one already.+ −
+ −
349+ −
00:17:11,255 --> 00:17:13,400+ −
But there are many, many more.+ −
+ −
350+ −
00:17:13,400 --> 00:17:15,620+ −
And you can easily have in+ −
+ −
351+ −
00:17:15,620 --> 00:17:18,560+ −
your program one of+ −
these reg expression.+ −
+ −
352+ −
00:17:18,560 --> 00:17:21,830+ −
And then you have the+ −
problem that if you do have+ −
+ −
353+ −
00:17:21,830 --> 00:17:23,240+ −
this regular expression and+ −
+ −
354+ −
00:17:23,240 --> 00:17:25,640+ −
somebody finds the+ −
corresponding string,+ −
+ −
355+ −
00:17:25,640 --> 00:17:29,945+ −
which make the records+ −
matching engine topple over,+ −
+ −
356+ −
00:17:29,945 --> 00:17:31,820+ −
then you have a problem+ −
+ −
357+ −
00:17:31,820 --> 00:17:34,295+ −
because your webpage is+ −
probably not variable.+ −
+ −
358+ −
00:17:34,295 --> 00:17:36,140+ −
This is also sometimes called+ −
+ −
359+ −
00:17:36,140 --> 00:17:39,350+ −
this phenomenon,+ −
catastrophic backtracking.+ −
+ −
360+ −
00:17:39,350 --> 00:17:43,595+ −
In lecture three, we will+ −
look at this more carefully.+ −
+ −
361+ −
00:17:43,595 --> 00:17:46,910+ −
And actually why that+ −
is such a problem in+ −
+ −
362+ −
00:17:46,910 --> 00:17:50,795+ −
real life is actually+ −
not to do with Lexus.+ −
+ −
363+ −
00:17:50,795 --> 00:17:53,180+ −
Yes, regular+ −
expressions are used as+ −
+ −
364+ −
00:17:53,180 --> 00:17:55,040+ −
the basic tool for implementing+ −
+ −
365+ −
00:17:55,040 --> 00:17:57,185+ −
like source bad reg expressions,+ −
+ −
366+ −
00:17:57,185 --> 00:18:00,065+ −
of course, used in+ −
a much wider area.+ −
+ −
367+ −
00:18:00,065 --> 00:18:03,770+ −
And they especially used for+ −
network intrusion detection.+ −
+ −
368+ −
00:18:03,770 --> 00:18:06,590+ −
Remember, you having to+ −
+ −
369+ −
00:18:06,590 --> 00:18:10,130+ −
administer a big network+ −
and you only want to let+ −
+ −
370+ −
00:18:10,130 --> 00:18:13,640+ −
in packets which you think are K+ −
+ −
371+ −
00:18:13,640 --> 00:18:14,930+ −
and you want to keep out+ −
+ −
372+ −
00:18:14,930 --> 00:18:17,645+ −
any package which might+ −
hack into your network.+ −
+ −
373+ −
00:18:17,645 --> 00:18:22,670+ −
So what they have is they+ −
have suites of thousands and+ −
+ −
374+ −
00:18:22,670 --> 00:18:25,745+ −
sometimes even more+ −
regular expressions which+ −
+ −
375+ −
00:18:25,745 --> 00:18:27,755+ −
check whether this package+ −
+ −
376+ −
00:18:27,755 --> 00:18:30,065+ −
satisfies some patterns or not.+ −
+ −
377+ −
00:18:30,065 --> 00:18:31,460+ −
And in this case it will be left+ −
+ −
378+ −
00:18:31,460 --> 00:18:34,205+ −
out or it will be let in.+ −
+ −
379+ −
00:18:34,205 --> 00:18:36,335+ −
And with networks,+ −
+ −
380+ −
00:18:36,335 --> 00:18:39,080+ −
the problem is that our+ −
hardware is already+ −
+ −
381+ −
00:18:39,080 --> 00:18:43,190+ −
so fast that the reg expressions+ −
+ −
382+ −
00:18:43,190 --> 00:18:45,169+ −
really become a bottleneck.+ −
+ −
383+ −
00:18:45,169 --> 00:18:47,060+ −
Because what do you do if now is+ −
+ −
384+ −
00:18:47,060 --> 00:18:49,880+ −
suddenly a reg expression+ −
takes too much time+ −
+ −
385+ −
00:18:49,880 --> 00:18:52,670+ −
to just stop the matching+ −
+ −
386+ −
00:18:52,670 --> 00:18:55,100+ −
and let the package+ −
in regardless?+ −
+ −
387+ −
00:18:55,100 --> 00:18:58,190+ −
Or do you just hold+ −
the network up+ −
+ −
388+ −
00:18:58,190 --> 00:19:01,715+ −
and don't let anything in+ −
until you decided that.+ −
+ −
389+ −
00:19:01,715 --> 00:19:04,895+ −
So that's actually a+ −
really hard problem.+ −
+ −
390+ −
00:19:04,895 --> 00:19:06,650+ −
But the first time I came across+ −
+ −
391+ −
00:19:06,650 --> 00:19:09,965+ −
that problem was actually+ −
by this engineer.+ −
+ −
392+ −
00:19:09,965 --> 00:19:13,820+ −
And it's always say that+ −
Germans don't have any Yammer.+ −
+ −
393+ −
00:19:13,820 --> 00:19:16,985+ −
But I found that+ −
video quite funny.+ −
+ −
394+ −
00:19:16,985 --> 00:19:19,145+ −
Maybe you have a+ −
different opinion,+ −
+ −
395+ −
00:19:19,145 --> 00:19:21,095+ −
but feel free to+ −
have a look which+ −
+ −
396+ −
00:19:21,095 --> 00:19:23,705+ −
explains exactly that problem.+ −
+ −
397+ −
00:19:23,705 --> 00:19:25,610+ −
So in the next video,+ −
+ −
398+ −
00:19:25,610 --> 00:19:28,445+ −
we will start to+ −
implement this matcher.+ −
+ −
399+ −
00:19:28,445 --> 00:19:30,870+ −
So I hope to see you there.+ −