1+ −
00:00:06,350 --> 00:00:10,305+ −
Welcome come back. I+ −
can hear you saying,+ −
+ −
2+ −
00:00:10,305 --> 00:00:12,000+ −
yes, you tried it out on+ −
+ −
3+ −
00:00:12,000 --> 00:00:14,760+ −
one example and you+ −
were much better.+ −
+ −
4+ −
00:00:14,760 --> 00:00:17,415+ −
But how about on other examples?+ −
+ −
5+ −
00:00:17,415 --> 00:00:21,265+ −
Specifically, we had two+ −
evil regular expressions.+ −
+ −
6+ −
00:00:21,265 --> 00:00:23,480+ −
How about the other case?+ −
+ −
7+ −
00:00:23,480 --> 00:00:27,480+ −
Well, let's have a look at+ −
that other case in this video.+ −
+ −
8+ −
00:00:29,140 --> 00:00:32,705+ −
So I'm back here+ −
in this re2.sc file.+ −
+ −
9+ −
00:00:32,705 --> 00:00:35,180+ −
And here's this test+ −
case which run quite+ −
+ −
10+ −
00:00:35,180 --> 00:00:39,665+ −
nicely for strings between 0 and 1000.+ −
+ −
11+ −
00:00:39,665 --> 00:00:42,725+ −
Here is the other test case.+ −
+ −
12+ −
00:00:42,725 --> 00:00:44,090+ −
I still run it only+ −
+ −
13+ −
00:00:44,090 --> 00:00:48,470+ −
for relatively small+ −
strings between 0 and 20.+ −
+ −
14+ −
00:00:48,470 --> 00:00:50,240+ −
And let's see what it says.+ −
+ −
15+ −
00:00:50,240 --> 00:00:51,800+ −
So I'm running the test in+ −
+ −
16+ −
00:00:51,800 --> 00:00:57,320+ −
the Amoonite REPL and it + −
doesn't look too bad.+ −
+ −
17+ −
00:00:57,320 --> 00:01:01,160+ −
But this might be because+ −
20 is not generous enough.+ −
+ −
18+ −
00:01:01,160 --> 00:01:03,620+ −
So let's try it with 30.+ −
+ −
19+ −
00:01:03,620 --> 00:01:06,530+ −
Let's run this test again.+ −
+ −
20+ −
00:01:06,530 --> 00:01:11,075+ −
And 20 is quite okay.+ −
+ −
21+ −
00:01:11,075 --> 00:01:15,965+ −
22, okay, that's now+ −
almost ten times as much.+ −
+ −
22+ −
00:01:15,965 --> 00:01:18,995+ −
And then the next+ −
one would be 24.+ −
+ −
23+ −
00:01:18,995 --> 00:01:23,615+ −
And we're waiting and waiting.+ −
+ −
24+ −
00:01:23,615 --> 00:01:26,410+ −
And we are waiting.+ −
+ −
25+ −
00:01:26,410 --> 00:01:29,300+ −
Actually, it isn't+ −
calculated at all.+ −
+ −
26+ −
00:01:29,300 --> 00:01:31,399+ −
It run out of memory.+ −
+ −
27+ −
00:01:31,399 --> 00:01:33,905+ −
So here is something going on,+ −
+ −
28+ −
00:01:33,905 --> 00:01:37,640+ −
which is definitely bad and we+ −
have to have a look at that.+ −
+ −
29+ −
00:01:37,640 --> 00:01:40,640+ −
It's always+ −
instructive with+ −
+ −
30+ −
00:01:40,640 --> 00:01:43,460+ −
this algorithm to+ −
look at the sizes of+ −
+ −
31+ −
00:01:43,460 --> 00:01:45,695+ −
the regular expressions+ −
we calculate.+ −
+ −
32+ −
00:01:45,695 --> 00:01:49,625+ −
The EVIL2 is this+ −
a star, star b.+ −
+ −
33+ −
00:01:49,625 --> 00:01:51,800+ −
So there's nothing we+ −
can compress there.+ −
+ −
34+ −
00:01:51,800 --> 00:01:55,490+ −
It has just stars and+ −
sequences and nothing else.+ −
+ −
35+ −
00:01:55,490 --> 00:01:58,430+ −
And so it's not that we+ −
can play the same trick+ −
+ −
36+ −
00:01:58,430 --> 00:02:01,805+ −
of trying to introduce+ −
an explicit constructor.+ −
+ −
37+ −
00:02:01,805 --> 00:02:04,190+ −
But now let's have a+ −
look at the derivatives.+ −
+ −
38+ −
00:02:04,190 --> 00:02:05,600+ −
As you can see here.+ −
+ −
39+ −
00:02:05,600 --> 00:02:08,420+ −
If I take this EVIL2 regular+ −
+ −
40+ −
00:02:08,420 --> 00:02:09,935+ −
expression and then build+ −
+ −
41+ −
00:02:09,935 --> 00:02:12,470+ −
longer and+ −
longer derivatives,+ −
+ −
42+ −
00:02:12,470 --> 00:02:14,090+ −
you can see this grows.+ −
+ −
43+ −
00:02:14,090 --> 00:02:16,055+ −
And as you see in this line,+ −
+ −
44+ −
00:02:16,055 --> 00:02:17,270+ −
if I'm trying to take+ −
+ −
45+ −
00:02:17,270 --> 00:02:20,180+ −
the derivative of a+ −
quite large string,+ −
+ −
46+ −
00:02:20,180 --> 00:02:21,680+ −
it is quite quick.+ −
+ −
47+ −
00:02:21,680 --> 00:02:26,870+ −
But the size of the+ −
regular expression,+ −
+ −
48+ −
00:02:26,870 --> 00:02:28,310+ −
the number of nodes,+ −
+ −
49+ −
00:02:28,310 --> 00:02:30,815+ −
is also like nearly+ −
eight millions.+ −
+ −
50+ −
00:02:30,815 --> 00:02:33,845+ −
And so even if that calculates+ −
relatively quickly,+ −
+ −
51+ −
00:02:33,845 --> 00:02:37,865+ −
that is the reason why at 24,+ −
+ −
52+ −
00:02:37,865 --> 00:02:39,560+ −
it just runs out of memory.+ −
+ −
53+ −
00:02:39,560 --> 00:02:42,785+ −
This regular expression+ −
just grew too much.+ −
+ −
54+ −
00:02:42,785 --> 00:02:46,520+ −
So we somehow need+ −
to still compress+ −
+ −
55+ −
00:02:46,520 --> 00:02:49,655+ −
this regular expression+ −
and make it not+ −
+ −
56+ −
00:02:49,655 --> 00:02:52,700+ −
grow so much when we+ −
build the derivative.+ −
+ −
57+ −
00:02:52,700 --> 00:02:54,020+ −
So we have to look at+ −
+ −
58+ −
00:02:54,020 --> 00:02:57,050+ −
where does that grow+ −
actually come from.+ −
+ −
59+ −
00:02:57,050 --> 00:03:00,170+ −
Let's look at the+ −
derivative operation+ −
+ −
60+ −
00:03:00,170 --> 00:03:01,895+ −
again in more detail.+ −
+ −
61+ −
00:03:01,895 --> 00:03:03,740+ −
When we introduced+ −
it, we looked at+ −
+ −
62+ −
00:03:03,740 --> 00:03:06,560+ −
this example of a+ −
regulat expression r,+ −
+ −
63+ −
00:03:06,560 --> 00:03:11,465+ −
which is a star of some+ −
alternative and some sequence.+ −
+ −
64+ −
00:03:11,465 --> 00:03:13,130+ −
And we can build now+ −
+ −
65+ −
00:03:13,130 --> 00:03:15,800+ −
the derivative of this+ −
r according to a,+ −
+ −
66+ −
00:03:15,800 --> 00:03:18,965+ −
b, and c and see+ −
what it calculates.+ −
+ −
67+ −
00:03:18,965 --> 00:03:21,935+ −
And you see in case of a,+ −
+ −
68+ −
00:03:21,935 --> 00:03:26,570+ −
it's like one times b plus+ −
0 and then followed by r,+ −
+ −
69+ −
00:03:26,570 --> 00:03:29,015+ −
which is this whole+ −
regular expression again.+ −
+ −
70+ −
00:03:29,015 --> 00:03:30,935+ −
So you can also see+ −
+ −
71+ −
00:03:30,935 --> 00:03:34,745+ −
the derivative operation+ −
introduces a lot of junk.+ −
+ −
72+ −
00:03:34,745 --> 00:03:38,165+ −
This plus 0 isn't+ −
really necessary.+ −
+ −
73+ −
00:03:38,165 --> 00:03:42,455+ −
And this times 1 could+ −
be also thrown away.+ −
+ −
74+ −
00:03:42,455 --> 00:03:43,685+ −
So in the first case,+ −
+ −
75+ −
00:03:43,685 --> 00:03:48,110+ −
actually we could just have+ −
one times b is b plus 0,+ −
+ −
76+ −
00:03:48,110 --> 00:03:49,580+ −
it still b,+ −
+ −
77+ −
00:03:49,580 --> 00:03:54,905+ −
so it would be b followed by+ −
r. And in this second case,+ −
+ −
78+ −
00:03:54,905 --> 00:03:57,515+ −
0 times b, that will be just 0.+ −
+ −
79+ −
00:03:57,515 --> 00:03:59,270+ −
Then 0 plus one is+ −
+ −
80+ −
00:03:59,270 --> 00:04:05,330+ −
1 ... times r would be just+ −
r. And in the last case,+ −
+ −
81+ −
00:04:05,330 --> 00:04:12,155+ −
0 times b would be 0. 0 plus+ −
0 is 0. 0 times r would be 0.+ −
+ −
82+ −
00:04:12,155 --> 00:04:17,540+ −
Okay? So we could throw out+ −
all these useless zeros and+ −
+ −
83+ −
00:04:17,540 --> 00:04:20,960+ −
ones because actually+ −
we have to throw+ −
+ −
84+ −
00:04:20,960 --> 00:04:24,845+ −
them out because over time+ −
they just accumulate.+ −
+ −
85+ −
00:04:24,845 --> 00:04:27,035+ −
And then we build+ −
the next derivative.+ −
+ −
86+ −
00:04:27,035 --> 00:04:31,130+ −
And the 0s wouldn't go away by+ −
building a new derivative.+ −
+ −
87+ −
00:04:31,130 --> 00:04:34,310+ −
So we need to somehow+ −
a mechanism to+ −
+ −
88+ −
00:04:34,310 --> 00:04:39,120+ −
delete the junk from+ −
these derivatives.+ −
+ −
89+ −
00:04:39,280 --> 00:04:41,900+ −
But how to delete junk?+ −
+ −
90+ −
00:04:41,900 --> 00:04:43,370+ −
We already know we have+ −
+ −
91+ −
00:04:43,370 --> 00:04:47,825+ −
the simplification rules+ −
which say like if r plus 0,+ −
+ −
92+ −
00:04:47,825 --> 00:04:53,000+ −
I can just replace by+ −
r and the other ones.+ −
+ −
93+ −
00:04:53,000 --> 00:04:55,760+ −
And so what I now+ −
need to do is in+ −
+ −
94+ −
00:04:55,760 --> 00:04:58,715+ −
my algorithm when I+ −
built the derivative,+ −
+ −
95+ −
00:04:58,715 --> 00:05:01,415+ −
this might have+ −
introduced some junk.+ −
+ −
96+ −
00:05:01,415 --> 00:05:02,960+ −
And I now have to have+ −
+ −
97+ −
00:05:02,960 --> 00:05:06,590+ −
essentially an additional function+ −
+ −
98+ −
00:05:06,590 --> 00:05:08,945+ −
which deletes this junk again.+ −
+ −
99+ −
00:05:08,945 --> 00:05:10,490+ −
And in doing so,+ −
+ −
100+ −
00:05:10,490 --> 00:05:13,340+ −
keeps the regular expression,+ −
+ −
101+ −
00:05:13,340 --> 00:05:16,730+ −
relatively small, because that+ −
+ −
102+ −
00:05:16,730 --> 00:05:19,535+ −
is the Achilles' heel+ −
of this algorithm.+ −
+ −
103+ −
00:05:19,535 --> 00:05:22,565+ −
If this regular expression+ −
grows too much,+ −
+ −
104+ −
00:05:22,565 --> 00:05:25,070+ −
then all these functions+ −
will slow down.+ −
+ −
105+ −
00:05:25,070 --> 00:05:26,360+ −
So the purpose of+ −
+ −
106+ −
00:05:26,360 --> 00:05:30,455+ −
the simplification function+ −
is to after the derivative,+ −
+ −
107+ −
00:05:30,455 --> 00:05:33,080+ −
compress again this+ −
regular expression+ −
+ −
108+ −
00:05:33,080 --> 00:05:35,570+ −
and then do the next derivative.+ −
+ −
109+ −
00:05:35,570 --> 00:05:39,815+ −
So if we introduce that+ −
additional functions simp,+ −
+ −
110+ −
00:05:39,815 --> 00:05:42,440+ −
which essentially+ −
just goes through+ −
+ −
111+ −
00:05:42,440 --> 00:05:46,040+ −
this regular expression and+ −
deletes all this junk,+ −
+ −
112+ −
00:05:46,040 --> 00:05:50,045+ −
then we should be in a+ −
much better position.+ −
+ −
113+ −
00:05:50,045 --> 00:05:52,490+ −
As you can see on this slide,+ −
+ −
114+ −
00:05:52,490 --> 00:05:54,440+ −
the simplification+ −
function can actually+ −
+ −
115+ −
00:05:54,440 --> 00:05:56,855+ −
be implemented very easily.+ −
+ −
116+ −
00:05:56,855 --> 00:05:59,750+ −
However, there are few+ −
interesting points.+ −
+ −
117+ −
00:05:59,750 --> 00:06:02,720+ −
First of all, there+ −
are only two cases.+ −
+ −
118+ −
00:06:02,720 --> 00:06:05,060+ −
We only simplify when we have+ −
+ −
119+ −
00:06:05,060 --> 00:06:08,255+ −
an alternative or a sequence.+ −
+ −
120+ −
00:06:08,255 --> 00:06:12,859+ −
So for example, we will+ −
never simplify under star.+ −
+ −
121+ −
00:06:12,859 --> 00:06:15,950+ −
If you look at the+ −
derivative operation+ −
+ −
122+ −
00:06:15,950 --> 00:06:17,825+ −
and you scratch your+ −
head a little bit,+ −
+ −
123+ −
00:06:17,825 --> 00:06:20,180+ −
we'll find out why+ −
is that the case.+ −
+ −
124+ −
00:06:20,180 --> 00:06:22,145+ −
It's essentially a waste of time.+ −
+ −
125+ −
00:06:22,145 --> 00:06:25,505+ −
So you would not+ −
simplify under star.+ −
+ −
126+ −
00:06:25,505 --> 00:06:27,650+ −
You only simplify if you have+ −
+ −
127+ −
00:06:27,650 --> 00:06:30,530+ −
an alternative and the sequence.+ −
+ −
128+ −
00:06:30,530 --> 00:06:34,880+ −
Now, however, there+ −
is one small point+ −
+ −
129+ −
00:06:34,880 --> 00:06:39,785+ −
you have to be aware of.+ −
These simplification rules+ −
+ −
130+ −
00:06:39,785 --> 00:06:42,740+ −
they need to be essentially+ −
applied backwards.+ −
+ −
131+ −
00:06:42,740 --> 00:06:45,650+ −
So I have to first essentially+ −
go to the leaves of+ −
+ −
132+ −
00:06:45,650 --> 00:06:49,085+ −
this regular expression and+ −
then have to find out,+ −
+ −
133+ −
00:06:49,085 --> 00:06:51,170+ −
can I apply the simplification?+ −
+ −
134+ −
00:06:51,170 --> 00:06:52,670+ −
And then if yes,+ −
+ −
135+ −
00:06:52,670 --> 00:06:55,430+ −
I apply the simplification+ −
and I look at the next step,+ −
+ −
136+ −
00:06:55,430 --> 00:06:57,815+ −
can I now simplify+ −
something more?+ −
+ −
137+ −
00:06:57,815 --> 00:06:59,390+ −
The reason is how+ −
+ −
138+ −
00:06:59,390 --> 00:07:03,125+ −
the simplification+ −
rules are formulated.+ −
+ −
139+ −
00:07:03,125 --> 00:07:05,300+ −
They might fire in+ −
+ −
140+ −
00:07:05,300 --> 00:07:08,765+ −
a higher level if something+ −
simplifies below.+ −
+ −
141+ −
00:07:08,765 --> 00:07:14,315+ −
So we have to essentially+ −
simplify from the inside out.+ −
+ −
142+ −
00:07:14,315 --> 00:07:16,850+ −
And in this+ −
simplification function,+ −
+ −
143+ −
00:07:16,850 --> 00:07:20,885+ −
what that means we're+ −
matching this regular expression.+ −
+ −
144+ −
00:07:20,885 --> 00:07:23,120+ −
We test whether it's+ −
an alternative or+ −
+ −
145+ −
00:07:23,120 --> 00:07:26,345+ −
a sequence. Only then we+ −
actually go into action.+ −
+ −
146+ −
00:07:26,345 --> 00:07:28,385+ −
Once we know+ −
+ −
147+ −
00:07:28,385 --> 00:07:30,530+ −
in case of an alternative,+ −
+ −
148+ −
00:07:30,530 --> 00:07:32,120+ −
what are the two components?+ −
+ −
149+ −
00:07:32,120 --> 00:07:35,765+ −
We first, simplify each component,+ −
+ −
150+ −
00:07:35,765 --> 00:07:39,095+ −
okay, and then we+ −
get a result back.+ −
+ −
151+ −
00:07:39,095 --> 00:07:41,180+ −
And we check whether+ −
+ −
152+ −
00:07:41,180 --> 00:07:45,679+ −
this simplification of+ −
r1 resulted into a 0.+ −
+ −
153+ −
00:07:45,679 --> 00:07:47,884+ −
Then because it's an alternative+ −
+ −
154+ −
00:07:47,884 --> 00:07:49,190+ −
then I can just replace it+ −
+ −
155+ −
00:07:49,190 --> 00:07:53,375+ −
by r2s, which a simplified+ −
version of r2.+ −
+ −
156+ −
00:07:53,375 --> 00:07:58,820+ −
If it came back r2s+ −
is actually 0,+ −
+ −
157+ −
00:07:58,820 --> 00:08:00,410+ −
then I can replace it by+ −
+ −
158+ −
00:08:00,410 --> 00:08:03,785+ −
the simplified version of a r1.+ −
+ −
159+ −
00:08:03,785 --> 00:08:07,460+ −
If I simplify both+ −
alternatives and it+ −
+ −
160+ −
00:08:07,460 --> 00:08:11,180+ −
happens that the simplified+ −
versions are the same,+ −
+ −
161+ −
00:08:11,180 --> 00:08:15,170+ −
then I can throw away+ −
one of the alternatives.+ −
+ −
162+ −
00:08:15,170 --> 00:08:18,080+ −
Otherwise, I just have to+ −
keep the alternatives,+ −
+ −
163+ −
00:08:18,080 --> 00:08:21,155+ −
but the simplified components.+ −
+ −
164+ −
00:08:21,155 --> 00:08:23,915+ −
In the sequence case it is+ −
pretty much the same.+ −
+ −
165+ −
00:08:23,915 --> 00:08:27,950+ −
I first have to check what+ −
does it simplify underneath.+ −
+ −
166+ −
00:08:27,950 --> 00:08:31,385+ −
So I call first simplify+ −
and then have a look...+ −
+ −
167+ −
00:08:31,385 --> 00:08:33,020+ −
Does it matches 0 one of + −
+ −
168+ −
00:08:33,020 --> 00:08:36,035+ −
the simplifications,+ −
then I just return 0.+ −
+ −
169+ −
00:08:36,035 --> 00:08:38,330+ −
Or if the other component is 0,+ −
+ −
170+ −
00:08:38,330 --> 00:08:40,535+ −
I can also return a 0.+ −
+ −
171+ −
00:08:40,535 --> 00:08:43,310+ −
If it's one, then I keep+ −
the other component.+ −
+ −
172+ −
00:08:43,310 --> 00:08:45,920+ −
And if the rs2 is one,+ −
+ −
173+ −
00:08:45,920 --> 00:08:47,615+ −
and I keep the first component.+ −
+ −
174+ −
00:08:47,615 --> 00:08:50,359+ −
And otherwise it's+ −
still the sequence.+ −
+ −
175+ −
00:08:50,359 --> 00:08:53,540+ −
And in all the other cases I+ −
don't have to do anything.+ −
+ −
176+ −
00:08:53,540 --> 00:08:55,700+ −
If it's just a plain+ −
0. I leave it in.+ −
+ −
177+ −
00:08:55,700 --> 00:08:57,860+ −
If it's a plain+ −
1, I leave it in.+ −
+ −
178+ −
00:08:57,860 --> 00:09:00,170+ −
And if something is under a star,+ −
+ −
179+ −
00:09:00,170 --> 00:09:02,840+ −
I don't open up this+ −
star and simplify it.+ −
+ −
180+ −
00:09:02,840 --> 00:09:06,680+ −
I just apply that to be+ −
as quick as possible.+ −
+ −
181+ −
00:09:06,680 --> 00:09:10,280+ −
Let's see whether this has+ −
any effect on our code.+ −
+ −
182+ −
00:09:10,280 --> 00:09:12,980+ −
So I'm now in the+ −
file re3.sc,+ −
+ −
183+ −
00:09:12,980 --> 00:09:17,405+ −
and it's the same as re2.sc.+ −
+ −
184+ −
00:09:17,405 --> 00:09:20,885+ −
It still has this+ −
explicit and n-times case,+ −
+ −
185+ −
00:09:20,885 --> 00:09:24,665+ −
nullable and derivative are+ −
still the same.+ −
+ −
186+ −
00:09:24,665 --> 00:09:28,610+ −
Except now we have this+ −
simplification function as well.+ −
+ −
187+ −
00:09:28,610 --> 00:09:29,960+ −
And when we apply+ −
+ −
188+ −
00:09:29,960 --> 00:09:33,725+ −
the derivative and after+ −
we apply the derivative,+ −
+ −
189+ −
00:09:33,725 --> 00:09:35,870+ −
we simplify it to throw away+ −
+ −
190+ −
00:09:35,870 --> 00:09:39,050+ −
all the junk this+ −
derivative introduced.+ −
+ −
191+ −
00:09:39,050 --> 00:09:41,510+ −
Otherwise everything+ −
stays the same.+ −
+ −
192+ −
00:09:41,510 --> 00:09:43,580+ −
You still have this expansion+ −
+ −
193+ −
00:09:43,580 --> 00:09:45,515+ −
of the optional regular expression.+ −
+ −
194+ −
00:09:45,515 --> 00:09:49,670+ −
Here are our two EVIL regular+ −
expressions we use as a test.+ −
+ −
195+ −
00:09:49,670 --> 00:09:52,460+ −
And here's now this test case,+ −
+ −
196+ −
00:09:52,460 --> 00:09:55,835+ −
the first one and we're+ −
actually now trying it+ −
+ −
197+ −
00:09:55,835 --> 00:10:00,515+ −
out for strings between+ −
0 and 9000 a's.+ −
+ −
198+ −
00:10:00,515 --> 00:10:08,285+ −
So let's see. So also the + −
simplification makes + −
+ −
199+ −
00:10:08,285 --> 00:10:10,655+ −
actually this case faster.+ −
+ −
200+ −
00:10:10,655 --> 00:10:16,260+ −
So we can now run strings+ −
up to 9000.+ −
+ −
201+ −
00:10:16,260 --> 00:10:19,360+ −
And we don't actually+ −
sweat about this at all.+ −
+ −
202+ −
00:10:19,360 --> 00:10:24,745+ −
For I have to say my laptop+ −
is now starting to run its fan.+ −
+ −
203+ −
00:10:24,745 --> 00:10:28,525+ −
And also, because the times+ −
are relatively small,+ −
+ −
204+ −
00:10:28,525 --> 00:10:30,610+ −
I'm actually running+ −
each test three+ −
+ −
205+ −
00:10:30,610 --> 00:10:32,785+ −
times and then take the average,+ −
+ −
206+ −
00:10:32,785 --> 00:10:34,720+ −
which I didn't do before,+ −
+ −
207+ −
00:10:34,720 --> 00:10:37,720+ −
just to be a tiny bit+ −
more accurate.+ −
+ −
208+ −
00:10:37,720 --> 00:10:42,025+ −
So three seconds for a+ −
string of 9000 a's.+ −
+ −
209+ −
00:10:42,025 --> 00:10:44,980+ −
And now the more+ −
interesting question is,+ −
+ −
210+ −
00:10:44,980 --> 00:10:46,240+ −
for the second case,+ −
+ −
211+ −
00:10:46,240 --> 00:10:48,625+ −
this a star, star, b.+ −
+ −
212+ −
00:10:48,625 --> 00:10:51,250+ −
And you can already see+ −
on these numbers...+ −
+ −
213+ −
00:10:51,250 --> 00:10:53,260+ −
we are really ambitious.+ −
+ −
214+ −
00:10:53,260 --> 00:10:57,860+ −
And let's see how our+ −
program is coping with that.+ −
+ −
215+ −
00:11:02,610 --> 00:11:07,780+ −
Again. No sweat, at+ −
least not on my part.+ −
+ −
216+ −
00:11:07,780 --> 00:11:10,480+ −
The laptop has to+ −
calculate quite a bit.+ −
+ −
217+ −
00:11:10,480 --> 00:11:12,940+ −
But this is now a string of+ −
+ −
218+ −
00:11:12,940 --> 00:11:16,539+ −
3 million a's and+ −
it needs a second.+ −
+ −
219+ −
00:11:16,539 --> 00:11:20,680+ −
And let's see how far we get.+ −
+ −
220+ −
00:11:20,680 --> 00:11:25,280+ −
4 million a's around two seconds.+ −
+ −
221+ −
00:11:27,030 --> 00:11:29,050+ −
So say it again,+ −
+ −
222+ −
00:11:29,050 --> 00:11:31,690+ −
I'm actually slowing it down+ −
+ −
223+ −
00:11:31,690 --> 00:11:34,150+ −
here artificially with running the test+ −
+ −
224+ −
00:11:34,150 --> 00:11:36,895+ −
three times and then+ −
take the average.+ −
+ −
225+ −
00:11:36,895 --> 00:11:42,749+ −
But it seems to be something+ −
less than five seconds.+ −
+ −
226+ −
00:11:42,749 --> 00:11:48,185+ −
And remember that is a+ −
string of 6 million a's.+ −
+ −
227+ −
00:11:48,185 --> 00:11:51,170+ −
Let's just have a+ −
look at the graphs.+ −
+ −
228+ −
00:11:51,170 --> 00:11:56,105+ −
So the simplification also+ −
made our first case faster.+ −
+ −
229+ −
00:11:56,105 --> 00:11:58,880+ −
So earlier without+ −
simplification,+ −
+ −
230+ −
00:11:58,880 --> 00:12:00,710+ −
where we just have+ −
this explicit + −
+ −
231+ −
00:12:00,710 --> 00:12:03,050+ −
n-times. That is this graph.+ −
+ −
232+ −
00:12:03,050 --> 00:12:05,210+ −
And now we can go up to+ −
+ −
233+ −
00:12:05,210 --> 00:12:09,620+ −
10 thousand and be still+ −
under five seconds.+ −
+ −
234+ −
00:12:09,620 --> 00:12:12,300+ −
So that is good news.+ −
+ −
235+ −
00:12:13,270 --> 00:12:16,745+ −
In the other example, remember,+ −
+ −
236+ −
00:12:16,745 --> 00:12:19,220+ −
the existing regular+ −
expression matchers in+ −
+ −
237+ −
00:12:19,220 --> 00:12:22,880+ −
Java 8, Python,+ −
and JavaScript.+ −
+ −
238+ −
00:12:22,880 --> 00:12:26,030+ −
And thanks to the+ −
student we also + −
+ −
239+ −
00:12:26,030 --> 00:12:27,935+ −
have a graph for Swift.+ −
+ −
240+ −
00:12:27,935 --> 00:12:29,750+ −
They're pretty atrocious.+ −
+ −
241+ −
00:12:29,750 --> 00:12:33,320+ −
They need like for 30 a's,+ −
+ −
242+ −
00:12:33,320 --> 00:12:37,490+ −
something like on the+ −
magnitude of 30 seconds.+ −
+ −
243+ −
00:12:37,490 --> 00:12:39,740+ −
As I said already,+ −
+ −
244+ −
00:12:39,740 --> 00:12:42,665+ −
Java 9 is slightly better.+ −
+ −
245+ −
00:12:42,665 --> 00:12:44,870+ −
Java 9 and above, + −
+ −
246+ −
00:12:44,870 --> 00:12:46,220+ −
if you try that example,+ −
+ −
247+ −
00:12:46,220 --> 00:12:48,665+ −
the same example in Java 9,+ −
+ −
248+ −
00:12:48,665 --> 00:12:51,230+ −
you would be able to+ −
process something+ −
+ −
249+ −
00:12:51,230 --> 00:12:54,215+ −
like 40 thousand a's+ −
in half a minute.+ −
+ −
250+ −
00:12:54,215 --> 00:12:57,740+ −
I have to admit I'm not+ −
100% sure what they+ −
+ −
251+ −
00:12:57,740 --> 00:13:03,575+ −
did to improve the+ −
performance between Java 8 and 9.+ −
+ −
252+ −
00:13:03,575 --> 00:13:05,510+ −
I know they did something on+ −
+ −
253+ −
00:13:05,510 --> 00:13:07,790+ −
their regular expression+ −
matching engine,+ −
+ −
254+ −
00:13:07,790 --> 00:13:09,770+ −
but I haven't really looked+ −
+ −
255+ −
00:13:09,770 --> 00:13:12,230+ −
at what precisely they've done.+ −
+ −
256+ −
00:13:12,230 --> 00:13:17,945+ −
But that's still not+ −
really a competition for us.+ −
+ −
257+ −
00:13:17,945 --> 00:13:20,915+ −
So our third version,+ −
+ −
258+ −
00:13:20,915 --> 00:13:24,335+ −
in this example, a star, star b.+ −
+ −
259+ −
00:13:24,335 --> 00:13:28,445+ −
We said for something like+ −
6 million a's we need 15 secs.+ −
+ −
260+ −
00:13:28,445 --> 00:13:31,760+ −
And again, I think these+ −
are slightly older times,+ −
+ −
261+ −
00:13:31,760 --> 00:13:33,770+ −
so here it needs 20 seconds.+ −
+ −
262+ −
00:13:33,770 --> 00:13:37,250+ −
I think we had something+ −
like below five seconds.+ −
+ −
263+ −
00:13:37,250 --> 00:13:40,865+ −
But again, you can see+ −
the trends. We run.+ −
+ −
264+ −
00:13:40,865 --> 00:13:42,590+ −
circles around them.+ −
+ −
265+ −
00:13:42,590 --> 00:13:46,530+ −
And that's quite nice.+ −
+ −
266+ −
00:13:46,570 --> 00:13:49,774+ −
So what's good about+ −
this algorithm+ −
+ −
267+ −
00:13:49,774 --> 00:13:51,605+ −
by Brzozowski?+ −
+ −
268+ −
00:13:51,605 --> 00:13:54,500+ −
Well, first, it extends+ −
+ −
269+ −
00:13:54,500 --> 00:13:57,800+ −
actually to quite a number+ −
of regular expressions.+ −
+ −
270+ −
00:13:57,800 --> 00:13:59,900+ −
So we saw in this example+ −
+ −
271+ −
00:13:59,900 --> 00:14:03,950+ −
the n-times regular expression.+ −
Is a little bit of work, but+ −
+ −
272+ −
00:14:03,950 --> 00:14:05,405+ −
we could extend that.+ −
+ −
273+ −
00:14:05,405 --> 00:14:08,480+ −
It also extends to the+ −
not regular expression,+ −
+ −
274+ −
00:14:08,480 --> 00:14:10,820+ −
which I explain on+ −
the next slide.+ −
+ −
275+ −
00:14:10,820 --> 00:14:15,290+ −
It's very easy to implement+ −
in a functional language.+ −
+ −
276+ −
00:14:15,290 --> 00:14:16,610+ −
If you don't buy+ −
+ −
277+ −
00:14:16,610 --> 00:14:20,675+ −
all that functional+ −
programming language stuff,+ −
+ −
278+ −
00:14:20,675 --> 00:14:22,955+ −
you still have to agree.+ −
+ −
279+ −
00:14:22,955 --> 00:14:25,715+ −
It's quite beautiful in Scala.+ −
+ −
280+ −
00:14:25,715 --> 00:14:28,160+ −
And you could also+ −
easily implemented+ −
+ −
281+ −
00:14:28,160 --> 00:14:31,760+ −
in the C language or in Python.+ −
+ −
282+ −
00:14:31,760 --> 00:14:34,250+ −
There's really no+ −
problem with that.+ −
+ −
283+ −
00:14:34,250 --> 00:14:38,390+ −
The algorithm is actually+ −
quite old already. + −
+ −
284+ −
00:14:38,390 --> 00:14:44,899+ −
Brzozowski wrote it down in+ −
1964 and his PhD thesis.+ −
+ −
285+ −
00:14:44,899 --> 00:14:49,460+ −
Somehow it was forgotten during+ −
+ −
286+ −
00:14:49,460 --> 00:14:54,095+ −
the next decades and only+ −
recently has been rediscovered.+ −
+ −
287+ −
00:14:54,095 --> 00:14:57,065+ −
At the moment, we are+ −
only solved the problem+ −
+ −
288+ −
00:14:57,065 --> 00:15:02,240+ −
of given a regular expression,+ −
givven a string, does+ −
+ −
289+ −
00:15:02,240 --> 00:15:05,550+ −
the regular expression+ −
match the string yes or no.+ −
+ −
290+ −
00:15:05,550 --> 00:15:08,740+ −
We want to of course, use+ −
it as part of our lexer.+ −
+ −
291+ −
00:15:08,740 --> 00:15:12,370+ −
And there we have to do+ −
something more complicated.+ −
+ −
292+ −
00:15:12,370 --> 00:15:15,384+ −
We have to essentially+ −
generate tokens.+ −
+ −
293+ −
00:15:15,384 --> 00:15:18,670+ −
And this will still take+ −
a little bit of work.+ −
+ −
294+ −
00:15:18,670 --> 00:15:22,045+ −
And that's actually relatively+ −
recent work by somebody+ −
+ −
295+ −
00:15:22,045 --> 00:15:26,110+ −
called Sulzmann and+ −
the co-worker called Lu.+ −
+ −
296+ −
00:15:26,110 --> 00:15:30,880+ −
And what I personally+ −
also find very satisfying+ −
+ −
297+ −
00:15:30,880 --> 00:15:32,695+ −
about this algorithm is+ −
+ −
298+ −
00:15:32,695 --> 00:15:36,040+ −
that we can actually+ −
prove that it's correct.+ −
+ −
299+ −
00:15:36,040 --> 00:15:37,735+ −
You might remember I did+ −
+ −
300+ −
00:15:37,735 --> 00:15:41,500+ −
quite some interesting+ −
transformations+ −
+ −
301+ −
00:15:41,500 --> 00:15:44,830+ −
when I calculated the derivative.+ −
+ −
302+ −
00:15:44,830 --> 00:15:48,270+ −
How can I be sure that+ −
this is all correct?+ −
+ −
303+ −
00:15:48,270 --> 00:15:50,270+ −
Actually, I can now go away and+ −
+ −
304+ −
00:15:50,270 --> 00:15:52,865+ −
prove the correctness+ −
of this algorithm.+ −
+ −
305+ −
00:15:52,865 --> 00:15:56,645+ −
Does it really satisfy+ −
the specification?+ −
+ −
306+ −
00:15:56,645 --> 00:15:58,550+ −
Is it really true that if a string+ −
+ −
307+ −
00:15:58,550 --> 00:16:00,440+ −
is in the language+ −
of a regular expression.+ −
+ −
308+ −
00:16:00,440 --> 00:16:03,050+ −
Does that matter? I would say yes.+ −
+ −
309+ −
00:16:03,050 --> 00:16:07,070+ −
However, I leave that+ −
for another video.+ −
+ −
310+ −
00:16:07,070 --> 00:16:10,580+ −
What I also like about+ −
this algorithm that can be+ −
+ −
311+ −
00:16:10,580 --> 00:16:13,775+ −
actually extended to quite a+ −
number of regular expressions.+ −
+ −
312+ −
00:16:13,775 --> 00:16:17,810+ −
So this is the NOT regular+ −
expression that is+ −
+ −
313+ −
00:16:17,810 --> 00:16:19,760+ −
supposed to match strings which+ −
+ −
314+ −
00:16:19,760 --> 00:16:22,235+ −
this regular expression cannot match.+ −
+ −
315+ −
00:16:22,235 --> 00:16:24,245+ −
So here's an example.+ −
+ −
316+ −
00:16:24,245 --> 00:16:28,640+ −
If you're in the business of+ −
trying to find out what+ −
+ −
317+ −
00:16:28,640 --> 00:16:33,530+ −
are comments in languages like+ −
Java or C. Then you know,+ −
+ −
318+ −
00:16:33,530 --> 00:16:35,060+ −
they always start with a slash,+ −
+ −
319+ −
00:16:35,060 --> 00:16:36,245+ −
then comes a star,+ −
+ −
320+ −
00:16:36,245 --> 00:16:38,240+ −
then comes an arbitrary string.+ −
+ −
321+ −
00:16:38,240 --> 00:16:41,195+ −
And then they finish+ −
with a slash and a star.+ −
+ −
322+ −
00:16:41,195 --> 00:16:44,330+ −
And how you want to specify+ −
that is something like this.+ −
+ −
323+ −
00:16:44,330 --> 00:16:45,530+ −
You want to say, OK,+ −
+ −
324+ −
00:16:45,530 --> 00:16:48,245+ −
a comment starts with+ −
a slash and a star.+ −
+ −
325+ −
00:16:48,245 --> 00:16:51,410+ −
Then it comes any+ −
string in between.+ −
+ −
326+ −
00:16:51,410 --> 00:16:55,340+ −
But this string+ −
in-between cannot contain+ −
+ −
327+ −
00:16:55,340 --> 00:16:58,280+ −
any star and slash+ −
because that would then+ −
+ −
328+ −
00:16:58,280 --> 00:17:01,415+ −
indicate there's the+ −
end already there.+ −
+ −
329+ −
00:17:01,415 --> 00:17:03,680+ −
And then after you+ −
have such a string+ −
+ −
330+ −
00:17:03,680 --> 00:17:06,410+ −
which doesn't+ −
contain a star and a slash,+ −
+ −
331+ −
00:17:06,410 --> 00:17:11,180+ −
then at the end you want to+ −
match a star and a slash.+ −
+ −
332+ −
00:17:11,180 --> 00:17:13,460+ −
So for these kind of comments,+ −
+ −
333+ −
00:17:13,460 --> 00:17:15,995+ −
this regular expression is+ −
actually quite useful.+ −
+ −
334+ −
00:17:15,995 --> 00:17:19,430+ −
And if you ever seen+ −
how to do the negation,+ −
+ −
335+ −
00:17:19,430 --> 00:17:21,995+ −
for this kind of regular+ −
expression with automata,+ −
+ −
336+ −
00:17:21,995 --> 00:17:24,710+ −
you will know that's+ −
a bit of a pain.+ −
+ −
337+ −
00:17:24,710 --> 00:17:26,675+ −
But for the Brzozowski,+ −
+ −
338+ −
00:17:26,675 --> 00:17:28,370+ −
it's no pain at all.+ −
+ −
339+ −
00:17:28,370 --> 00:17:30,710+ −
The meaning of this+ −
regular expression+ −
+ −
340+ −
00:17:30,710 --> 00:17:32,255+ −
would be something like that.+ −
+ −
341+ −
00:17:32,255 --> 00:17:35,540+ −
If I have all the+ −
strings there are,+ −
+ −
342+ −
00:17:35,540 --> 00:17:38,390+ −
and I take away all the strings,+ −
+ −
343+ −
00:17:38,390 --> 00:17:40,055+ −
this r can match.+ −
+ −
344+ −
00:17:40,055 --> 00:17:42,080+ −
The remaining strings are+ −
+ −
345+ −
00:17:42,080 --> 00:17:44,630+ −
what this regular expression+ −
is supposed to match.+ −
+ −
346+ −
00:17:44,630 --> 00:17:47,165+ −
So I can specify+ −
what the meaning is.+ −
+ −
347+ −
00:17:47,165 --> 00:17:49,760+ −
And then it's also very easy to+ −
+ −
348+ −
00:17:49,760 --> 00:17:52,174+ −
say what is nullable+ −
and derivative.+ −
+ −
349+ −
00:17:52,174 --> 00:17:54,140+ −
So for nullable, it would be+ −
+ −
350+ −
00:17:54,140 --> 00:17:56,570+ −
just a test whether r+ −
is nullable.+ −
+ −
351+ −
00:17:56,570 --> 00:17:58,985+ −
And I just take the+ −
negation of that.+ −
+ −
352+ −
00:17:58,985 --> 00:18:00,680+ −
And then I have to build+ −
+ −
353+ −
00:18:00,680 --> 00:18:03,620+ −
the derivative of this+ −
NOT regular expression.+ −
+ −
354+ −
00:18:03,620 --> 00:18:05,420+ −
Then I just have to move+ −
+ −
355+ −
00:18:05,420 --> 00:18:07,325+ −
this ....+ −
+ −
356+ −
00:18:07,325 --> 00:18:10,070+ −
derivative inside+ −
the regular expression+ −
+ −
357+ −
00:18:10,070 --> 00:18:12,575+ −
and keep the NOT on+ −
the outermost level.+ −
+ −
358+ −
00:18:12,575 --> 00:18:15,515+ −
So there's again no+ −
pain involved in+ −
+ −
359+ −
00:18:15,515 --> 00:18:19,130+ −
adding this regular expression+ −
to the algorithm.+ −
+ −
360+ −
00:18:19,130 --> 00:18:22,100+ −
We made it to the end of+ −
this video and we made+ −
+ −
361+ −
00:18:22,100 --> 00:18:24,739+ −
it also to the end+ −
of this algorithm.+ −
+ −
362+ −
00:18:24,739 --> 00:18:27,620+ −
I can see to loose strands.+ −
+ −
363+ −
00:18:27,620 --> 00:18:29,420+ −
One is we implemented+ −
+ −
364+ −
00:18:29,420 --> 00:18:32,855+ −
this algorithm for the+ −
basic regular expressions.+ −
+ −
365+ −
00:18:32,855 --> 00:18:38,705+ −
And we also added the + −
n-times out of necessity.+ −
+ −
366+ −
00:18:38,705 --> 00:18:41,120+ −
And I also showed+ −
you how to implement+ −
+ −
367+ −
00:18:41,120 --> 00:18:44,840+ −
the NOT regular+ −
expression on a slide.+ −
+ −
368+ −
00:18:44,840 --> 00:18:48,995+ −
But your task in the+ −
coursework actually is+ −
+ −
369+ −
00:18:48,995 --> 00:18:52,970+ −
to extend that to some of+ −
the extended regular expressions.+ −
+ −
370+ −
00:18:52,970 --> 00:18:57,260+ −
So especially these bounded+ −
repetitions like from 0 to+ −
+ −
371+ −
00:18:57,260 --> 00:19:01,550+ −
n-times or between n and m times.+ −
+ −
372+ −
00:19:01,550 --> 00:19:04,325+ −
And I think also+ −
plus and option.+ −
+ −
373+ −
00:19:04,325 --> 00:19:07,220+ −
The other loose strand is,+ −
+ −
374+ −
00:19:07,220 --> 00:19:09,200+ −
remember I did this wild+ −
+ −
375+ −
00:19:09,200 --> 00:19:11,645+ −
calculations with+ −
regular expressions.+ −
+ −
376+ −
00:19:11,645 --> 00:19:13,205+ −
Why on earth+ −
+ −
377+ −
00:19:13,205 --> 00:19:14,480+ −
is that all correct?+ −
+ −
378+ −
00:19:14,480 --> 00:19:16,355+ −
Why on earth should+ −
this algorithm+ −
+ −
379+ −
00:19:16,355 --> 00:19:18,575+ −
actually satisfy+ −
our specification,+ −
+ −
380+ −
00:19:18,575 --> 00:19:20,450+ −
which we set out+ −
at the beginning.+ −
+ −
381+ −
00:19:20,450 --> 00:19:25,410+ −
So that needs to be looked at+ −
more closely. Bye for now.+ −