1+ −
00:00:05,880 --> 00:00:09,700+ −
Welcome back.+ −
Remember this slide.+ −
+ −
2+ −
00:00:09,700 --> 00:00:11,500+ −
This slide said what is+ −
+ −
3+ −
00:00:11,500 --> 00:00:14,500+ −
our regular expression matcher+ −
actually supposed to do?+ −
+ −
4+ −
00:00:14,500 --> 00:00:16,570+ −
It will take two arguments,+ −
+ −
5+ −
00:00:16,570 --> 00:00:18,670+ −
a regular expression r and a string s.+ −
+ −
6+ −
00:00:18,670 --> 00:00:21,580+ −
And it's supposed to say yes,+ −
+ −
7+ −
00:00:21,580 --> 00:00:23,440+ −
the regular expression matches+ −
+ −
8+ −
00:00:23,440 --> 00:00:26,920+ −
the string if and only+ −
if the string is in+ −
+ −
9+ −
00:00:26,920 --> 00:00:28,855+ −
the language of r.+ −
+ −
10+ −
00:00:28,855 --> 00:00:32,410+ −
And if the string is not+ −
in the language of r,+ −
+ −
11+ −
00:00:32,410 --> 00:00:35,515+ −
then our algorithm has to say no.+ −
+ −
12+ −
00:00:35,515 --> 00:00:37,210+ −
And we can't use+ −
+ −
13+ −
00:00:37,210 --> 00:00:39,565+ −
this specification+ −
directly because remember,+ −
+ −
14+ −
00:00:39,565 --> 00:00:43,305+ −
this L sometimes+ −
produces infinite sets.+ −
+ −
15+ −
00:00:43,305 --> 00:00:47,585+ −
And we can't test whether a+ −
string is in infinite set,+ −
+ −
16+ −
00:00:47,585 --> 00:00:50,090+ −
at least not easily.+ −
+ −
17+ −
00:00:50,090 --> 00:00:52,340+ −
And so what we have to do+ −
+ −
18+ −
00:00:52,340 --> 00:00:54,260+ −
instead is we have+ −
to be a bit more+ −
+ −
19+ −
00:00:54,260 --> 00:00:57,050+ −
clever and implement+ −
some operations+ −
+ −
20+ −
00:00:57,050 --> 00:00:59,284+ −
on regular expressions instead,+ −
+ −
21+ −
00:00:59,284 --> 00:01:03,275+ −
because regular expressions+ −
are always finite trees.+ −
+ −
22+ −
00:01:03,275 --> 00:01:05,870+ −
I should say the+ −
person who has been+ −
+ −
23+ −
00:01:05,870 --> 00:01:08,150+ −
clever is called Brzozowski.+ −
+ −
24+ −
00:01:08,150 --> 00:01:11,435+ −
It's his algorithm+ −
I'm introducing here.+ −
+ −
25+ −
00:01:11,435 --> 00:01:15,515+ −
And his algorithm consists+ −
of two functions.+ −
+ −
26+ −
00:01:15,515 --> 00:01:17,840+ −
One is called+ −
nullable and it takes+ −
+ −
27+ −
00:01:17,840 --> 00:01:20,104+ −
a regular expression as argument.+ −
+ −
28+ −
00:01:20,104 --> 00:01:24,155+ −
And the idea is that+ −
this function nullable is+ −
+ −
29+ −
00:01:24,155 --> 00:01:26,450+ −
testing whether+ −
the regular expression+ −
+ −
30+ −
00:01:26,450 --> 00:01:27,950+ −
can match the empty string.+ −
+ −
31+ −
00:01:27,950 --> 00:01:30,305+ −
So 0 cannot match any string.+ −
+ −
32+ −
00:01:30,305 --> 00:01:33,275+ −
So it cannot match the+ −
empty string either.+ −
+ −
33+ −
00:01:33,275 --> 00:01:35,465+ −
So that's defined as false.+ −
+ −
34+ −
00:01:35,465 --> 00:01:37,775+ −
This regular expression,+ −
the whole purpose+ −
+ −
35+ −
00:01:37,775 --> 00:01:39,680+ −
is that it can match+ −
the empty string.+ −
+ −
36+ −
00:01:39,680 --> 00:01:41,645+ −
So it's defined as true.+ −
+ −
37+ −
00:01:41,645 --> 00:01:44,599+ −
If a regular expression+ −
can match a character,+ −
+ −
38+ −
00:01:44,599 --> 00:01:47,045+ −
then it cannot match+ −
the empty string.+ −
+ −
39+ −
00:01:47,045 --> 00:01:49,445+ −
So that is defined as false.+ −
+ −
40+ −
00:01:49,445 --> 00:01:53,180+ −
If an alternative can+ −
match the empty string,+ −
+ −
41+ −
00:01:53,180 --> 00:01:56,960+ −
then either r1 can+ −
match the empty string,+ −
+ −
42+ −
00:01:56,960 --> 00:01:59,720+ −
or r2 can match the empty string.+ −
+ −
43+ −
00:01:59,720 --> 00:02:03,110+ −
So either nullable+ −
of r1 has to hold,+ −
+ −
44+ −
00:02:03,110 --> 00:02:06,945+ −
or nullable of r2 has to hold.+ −
+ −
45+ −
00:02:06,945 --> 00:02:09,925+ −
In this sequence, it's+ −
the other way around.+ −
+ −
46+ −
00:02:09,925 --> 00:02:12,790+ −
If this regular expression can+ −
match the empty string,+ −
+ −
47+ −
00:02:12,790 --> 00:02:16,615+ −
then r1 has to be able to+ −
match the empty string,+ −
+ −
48+ −
00:02:16,615 --> 00:02:20,290+ −
and r2 has to be able to+ −
match the empty string.+ −
+ −
49+ −
00:02:20,290 --> 00:02:22,555+ −
So here it's just the opposite.+ −
+ −
50+ −
00:02:22,555 --> 00:02:25,660+ −
In one case it is or and+ −
the other case it is and.+ −
+ −
51+ −
00:02:25,660 --> 00:02:27,970+ −
The star regular+ −
expression can match+ −
+ −
52+ −
00:02:27,970 --> 00:02:30,445+ −
always the empty string.+ −
So that is true.+ −
+ −
53+ −
00:02:30,445 --> 00:02:33,340+ −
So this is a simple+ −
recursive function+ −
+ −
54+ −
00:02:33,340 --> 00:02:37,179+ −
and it should not be too+ −
difficult to implement.+ −
+ −
55+ −
00:02:37,179 --> 00:02:42,025+ −
Okay, this nullable function+ −
that is easy-peasy.+ −
+ −
56+ −
00:02:42,025 --> 00:02:44,604+ −
The second function, however,+ −
+ −
57+ −
00:02:44,604 --> 00:02:49,155+ −
is a bit more involved and+ −
that's just to be expected.+ −
+ −
58+ −
00:02:49,155 --> 00:02:53,075+ −
Remember people working in+ −
this area already for decades.+ −
+ −
59+ −
00:02:53,075 --> 00:02:57,305+ −
If they have some problems+ −
with runtime, for example,+ −
+ −
60+ −
00:02:57,305 --> 00:02:58,940+ −
we can't expect that + −
+ −
61+ −
00:02:58,940 --> 00:03:03,260+ −
a simple fix will solve all+ −
the problems in the world.+ −
+ −
62+ −
00:03:03,260 --> 00:03:06,530+ −
So I admit the second+ −
function is a bit more+ −
+ −
63+ −
00:03:06,530 --> 00:03:10,085+ −
complicated and make sure+ −
that you understand it.+ −
+ −
64+ −
00:03:10,085 --> 00:03:12,140+ −
And it also just+ −
shows this Brzozowski+ −
+ −
65+ −
00:03:12,140 --> 00:03:14,345+ −
is a very clever guy.+ −
+ −
66+ −
00:03:14,345 --> 00:03:15,800+ −
Actually, I have to say he was+ −
+ −
67+ −
00:03:15,800 --> 00:03:17,720+ −
a clever guy because+ −
I think he either+ −
+ −
68+ −
00:03:17,720 --> 00:03:21,650+ −
died last year or+ −
the year before.+ −
+ −
69+ −
00:03:21,650 --> 00:03:25,505+ −
And he came up with this+ −
algorithm already in 1964.+ −
+ −
70+ −
00:03:25,505 --> 00:03:27,260+ −
It somehow got lost,+ −
+ −
71+ −
00:03:27,260 --> 00:03:30,305+ −
but has been rediscovered+ −
in the last ten years.+ −
+ −
72+ −
00:03:30,305 --> 00:03:34,685+ −
So the idea of the second+ −
function is the following.+ −
+ −
73+ −
00:03:34,685 --> 00:03:38,120+ −
Imagine you have a+ −
regular expression and it can+ −
+ −
74+ −
00:03:38,120 --> 00:03:41,930+ −
match a string of the+ −
form c followed by s.+ −
+ −
75+ −
00:03:41,930 --> 00:03:44,810+ −
So the c is the first+ −
character of that string.+ −
+ −
76+ −
00:03:44,810 --> 00:03:48,605+ −
So I imagine that r can+ −
match this kind of string.+ −
+ −
77+ −
00:03:48,605 --> 00:03:50,330+ −
The question is now,+ −
+ −
78+ −
00:03:50,330 --> 00:03:52,400+ −
what would a regular+ −
expression look+ −
+ −
79+ −
00:03:52,400 --> 00:03:54,695+ −
like that can match just+ −
+ −
80+ −
00:03:54,695 --> 00:03:57,140+ −
s. You probably remember+ −
+ −
81+ −
00:03:57,140 --> 00:03:59,300+ −
that from the semantic+ −
derivative,+ −
+ −
82+ −
00:03:59,300 --> 00:04:00,860+ −
there was also+ −
something of chopping+ −
+ −
83+ −
00:04:00,860 --> 00:04:02,210+ −
off the first character.+ −
+ −
84+ −
00:04:02,210 --> 00:04:04,940+ −
Here it's the same.+ −
If a regular expression+ −
+ −
85+ −
00:04:04,940 --> 00:04:07,835+ −
can match a string+ −
starting with a c,+ −
+ −
86+ −
00:04:07,835 --> 00:04:11,090+ −
we're looking for a regular+ −
expression which can match+ −
+ −
87+ −
00:04:11,090 --> 00:04:15,215+ −
the rest of the string where+ −
the c has been chopped off.+ −
+ −
88+ −
00:04:15,215 --> 00:04:17,690+ −
And this operation will be called+ −
+ −
89+ −
00:04:17,690 --> 00:04:21,049+ −
the derivative of a+ −
regular expression.+ −
+ −
90+ −
00:04:21,049 --> 00:04:22,205+ −
And it will also take+ −
+ −
91+ −
00:04:22,205 --> 00:04:25,460+ −
a character as argument+ −
and a regular expression.+ −
+ −
92+ −
00:04:25,460 --> 00:04:28,730+ −
And in contrast to+ −
the semantic + −
+ −
93+ −
00:04:28,730 --> 00:04:31,310+ −
derivative, which works+ −
+ −
94+ −
00:04:31,310 --> 00:04:34,430+ −
over languages or+ −
sets of strings,+ −
+ −
95+ −
00:04:34,430 --> 00:04:39,620+ −
this derivative works+ −
over regular expressions.+ −
+ −
96+ −
00:04:39,620 --> 00:04:41,330+ −
Okay, here's this function.+ −
+ −
97+ −
00:04:41,330 --> 00:04:43,970+ −
It's defined recursively over+ −
+ −
98+ −
00:04:43,970 --> 00:04:46,370+ −
the structure of regular expressions.+ −
+ −
99+ −
00:04:46,370 --> 00:04:48,814+ −
The first argument+ −
is the character,+ −
+ −
100+ −
00:04:48,814 --> 00:04:52,700+ −
and the second one is+ −
a regular expression.+ −
+ −
101+ −
00:04:52,700 --> 00:04:56,510+ −
And remember the idea+ −
is we're looking for+ −
+ −
102+ −
00:04:56,510 --> 00:05:00,680+ −
a regular expression that+ −
can match everything+ −
+ −
103+ −
00:05:00,680 --> 00:05:03,125+ −
this regular expression+ −
as argument can match+ −
+ −
104+ −
00:05:03,125 --> 00:05:07,040+ −
except for the c. So now let's+ −
look at this first case.+ −
+ −
105+ −
00:05:07,040 --> 00:05:10,550+ −
So this reg expression+ −
cannot match any string.+ −
+ −
106+ −
00:05:10,550 --> 00:05:14,270+ −
So it certainly cannot+ −
match any string starting+ −
+ −
107+ −
00:05:14,270 --> 00:05:16,910+ −
a c. So we have to look+ −
+ −
108+ −
00:05:16,910 --> 00:05:20,090+ −
for a regular expression which+ −
can not match anything.+ −
+ −
109+ −
00:05:20,090 --> 00:05:22,700+ −
Well, that's the purpose+ −
of this regular expression,+ −
+ −
110+ −
00:05:22,700 --> 00:05:24,815+ −
so we define it as 0.+ −
+ −
111+ −
00:05:24,815 --> 00:05:28,130+ −
In the next case,+ −
this regular expression+ −
+ −
112+ −
00:05:28,130 --> 00:05:30,440+ −
can match the empty string,+ −
+ −
113+ −
00:05:30,440 --> 00:05:33,440+ −
but it cannot match+ −
any string that starts+ −
+ −
114+ −
00:05:33,440 --> 00:05:36,350+ −
with c. So also in this case,+ −
+ −
115+ −
00:05:36,350 --> 00:05:39,560+ −
we just define it as+ −
the regular expression,+ −
+ −
116+ −
00:05:39,560 --> 00:05:41,615+ −
which cannot match anything.+ −
+ −
117+ −
00:05:41,615 --> 00:05:45,170+ −
In the next case, this+ −
c is the argument to+ −
+ −
118+ −
00:05:45,170 --> 00:05:48,335+ −
the derivative and this d+ −
is the regular expression.+ −
+ −
119+ −
00:05:48,335 --> 00:05:50,225+ −
So now there are two cases.+ −
+ −
120+ −
00:05:50,225 --> 00:05:53,525+ −
If this c and this d+ −
is actually equal,+ −
+ −
121+ −
00:05:53,525 --> 00:05:55,970+ −
Thaat means this regular+ −
expression can match+ −
+ −
122+ −
00:05:55,970 --> 00:05:59,240+ −
a string which just contains c.+ −
+ −
123+ −
00:05:59,240 --> 00:06:01,505+ −
So if we strip that off,+ −
+ −
124+ −
00:06:01,505 --> 00:06:04,790+ −
what remains is+ −
the empty string.+ −
+ −
125+ −
00:06:04,790 --> 00:06:06,890+ −
What is a regular expression+ −
+ −
126+ −
00:06:06,890 --> 00:06:09,170+ −
look like that can+ −
match the empty string.+ −
+ −
127+ −
00:06:09,170 --> 00:06:11,915+ −
Well, that's the+ −
purpose of the 1.+ −
+ −
128+ −
00:06:11,915 --> 00:06:15,440+ −
And if c is not equal to d,+ −
+ −
129+ −
00:06:15,440 --> 00:06:17,630+ −
then this regular expression+ −
+ −
130+ −
00:06:17,630 --> 00:06:19,220+ −
cannot match anything+ −
which starts with+ −
+ −
131+ −
00:06:19,220 --> 00:06:23,780+ −
a c. So again it will+ −
be defined as just 0.+ −
+ −
132+ −
00:06:23,780 --> 00:06:29,390+ −
Now, the alternative case,+ −
+ −
133+ −
00:06:29,390 --> 00:06:31,970+ −
if this regular expression can+ −
+ −
134+ −
00:06:31,970 --> 00:06:36,050+ −
match strings+ −
starting with c,+ −
+ −
135+ −
00:06:36,050 --> 00:06:40,820+ −
then it can either be+ −
matched by the r1+ −
+ −
136+ −
00:06:40,820 --> 00:06:44,495+ −
or it can be matched by the r2.+ −
+ −
137+ −
00:06:44,495 --> 00:06:50,090+ −
If the r1 can match c+ −
and then following a string,+ −
+ −
138+ −
00:06:50,090 --> 00:06:53,975+ −
then we just have to recursively+ −
call the derivative for+ −
+ −
139+ −
00:06:53,975 --> 00:06:56,570+ −
r1 to get a regular expression+ −
+ −
140+ −
00:06:56,570 --> 00:06:59,135+ −
that can match the+ −
rest of the string.+ −
+ −
141+ −
00:06:59,135 --> 00:07:02,690+ −
And the same we can do with r2.+ −
+ −
142+ −
00:07:02,690 --> 00:07:06,110+ −
You can find a regular expression+ −
which can match everything+ −
+ −
143+ −
00:07:06,110 --> 00:07:07,850+ −
this r2 can match,+ −
+ −
144+ −
00:07:07,850 --> 00:07:09,710+ −
starting with c, but+ −
+ −
145+ −
00:07:09,710 --> 00:07:12,590+ −
we are chopping off this c. + −
+ −
146+ −
00:07:12,590 --> 00:07:16,370+ −
So now if you have to find+ −
the regular expression,+ −
+ −
147+ −
00:07:16,370 --> 00:07:20,030+ −
which can match all the strings+ −
where c is chopped off,+ −
+ −
148+ −
00:07:20,030 --> 00:07:22,295+ −
then we just have to+ −
take the alternative+ −
+ −
149+ −
00:07:22,295 --> 00:07:24,965+ −
of these two derivatives.+ −
+ −
150+ −
00:07:24,965 --> 00:07:29,390+ −
Now to sequence case.+ −
+ −
151+ −
00:07:29,390 --> 00:07:33,005+ −
This sequence case is the+ −
most complicated one.+ −
+ −
152+ −
00:07:33,005 --> 00:07:35,180+ −
And let's look first at+ −
+ −
153+ −
00:07:35,180 --> 00:07:39,335+ −
the second case, in the+ −
else branch.+ −
+ −
154+ −
00:07:39,335 --> 00:07:42,830+ −
So if this+ −
regular expression+ −
+ −
155+ −
00:07:42,830 --> 00:07:46,145+ −
can match a string+ −
starting with c,+ −
+ −
156+ −
00:07:46,145 --> 00:07:48,155+ −
then the following+ −
must have happened.+ −
+ −
157+ −
00:07:48,155 --> 00:07:51,905+ −
First, the r1 must have matched+ −
a string starting with c+ −
+ −
158+ −
00:07:51,905 --> 00:07:56,420+ −
and then anything coming+ −
afterwards with r2.+ −
+ −
159+ −
00:07:56,420 --> 00:07:59,660+ −
So in this case I only+ −
+ −
160+ −
00:07:59,660 --> 00:08:03,260+ −
have to call this+ −
derivative for this r1,+ −
+ −
161+ −
00:08:03,260 --> 00:08:06,830+ −
and find the regular expression+ −
which can match everything+ −
+ −
162+ −
00:08:06,830 --> 00:08:11,555+ −
this r1 can match except+ −
for this c chopped off.+ −
+ −
163+ −
00:08:11,555 --> 00:08:15,830+ −
And I have to build the+ −
sequence derivative of that.+ −
+ −
164+ −
00:08:15,830 --> 00:08:18,260+ −
So that's what the else branch says:+ −
+ −
165+ −
00:08:18,260 --> 00:08:21,860+ −
I take the+ −
derivative of r1 and I+ −
+ −
166+ −
00:08:21,860 --> 00:08:23,480+ −
put the r2 on+ −
+ −
167+ −
00:08:23,480 --> 00:08:25,865+ −
the back because that's+ −
the rest of the string.+ −
+ −
168+ −
00:08:25,865 --> 00:08:29,240+ −
So that's the only+ −
case we have to consider+ −
+ −
169+ −
00:08:29,240 --> 00:08:32,750+ −
this sequence case+ −
except if not the+ −
+ −
170+ −
00:08:32,750 --> 00:08:37,895+ −
r1 can match the+ −
c and something else.+ −
+ −
171+ −
00:08:37,895 --> 00:08:42,965+ −
But if r1 is matching+ −
actually the empty string.+ −
+ −
172+ −
00:08:42,965 --> 00:08:48,890+ −
In this case actually the r2+ −
is in charge of matching+ −
+ −
173+ −
00:08:48,890 --> 00:08:51,590+ −
the string starting with c. So in+ −
+ −
174+ −
00:08:51,590 --> 00:08:55,490+ −
this case we have to call+ −
the derivative for r2.+ −
+ −
175+ −
00:08:55,490 --> 00:08:57,875+ −
So that's why we have+ −
these two cases.+ −
+ −
176+ −
00:08:57,875 --> 00:09:00,455+ −
So if r1 is nullable,+ −
+ −
177+ −
00:09:00,455 --> 00:09:03,245+ −
then it can match+ −
the empty string.+ −
+ −
178+ −
00:09:03,245 --> 00:09:05,330+ −
And we have to+ −
consider the case that+ −
+ −
179+ −
00:09:05,330 --> 00:09:08,045+ −
this r2 is matching a string+ −
+ −
180+ −
00:09:08,045 --> 00:09:10,700+ −
starting with c. And so we have+ −
+ −
181+ −
00:09:10,700 --> 00:09:14,210+ −
to call the derivative+ −
for this r2 in this case.+ −
+ −
182+ −
00:09:14,210 --> 00:09:18,680+ −
Otherwise, the r1 will+ −
be in charge of matching+ −
+ −
183+ −
00:09:18,680 --> 00:09:20,840+ −
a part of that+ −
string starting with+ −
+ −
184+ −
00:09:20,840 --> 00:09:24,695+ −
c. And I have to call the+ −
derivative on this r1.+ −
+ −
185+ −
00:09:24,695 --> 00:09:30,670+ −
I hope this makes sense.+ −
+ −
186+ −
00:09:30,670 --> 00:09:34,150+ −
Go over that and+ −
also the handouts+ −
+ −
187+ −
00:09:34,150 --> 00:09:37,465+ −
again. That case+ −
is really subtle.+ −
+ −
188+ −
00:09:37,465 --> 00:09:40,945+ −
And how do I remember this case?+ −
+ −
189+ −
00:09:40,945 --> 00:09:42,430+ −
Well, I know it's+ −
+ −
190+ −
00:09:42,430 --> 00:09:45,310+ −
an if condition and the+ −
condition is nullable,+ −
+ −
191+ −
00:09:45,310 --> 00:09:48,160+ −
then I will always remember+ −
+ −
192+ −
00:09:48,160 --> 00:09:53,275+ −
the else branch where it pushes+ −
the derivative over the r1.+ −
+ −
193+ −
00:09:53,275 --> 00:09:55,780+ −
So I usually write+ −
down the else branch first,+ −
+ −
194+ −
00:09:55,780 --> 00:09:59,650+ −
and then construct+ −
the then branch by saying,+ −
+ −
195+ −
00:09:59,650 --> 00:10:01,525+ −
well, I just repeat this.+ −
+ −
196+ −
00:10:01,525 --> 00:10:03,760+ −
And I have to remember+ −
in that case I+ −
+ −
197+ −
00:10:03,760 --> 00:10:06,580+ −
have to build also+ −
derivative of r2.+ −
+ −
198+ −
00:10:06,580 --> 00:10:12,695+ −
Finally, the star case.+ −
+ −
199+ −
00:10:12,695 --> 00:10:15,665+ −
So here again+ −
we're looking for+ −
+ −
200+ −
00:10:15,665 --> 00:10:17,300+ −
a regular expression which+ −
+ −
201+ −
00:10:17,300 --> 00:10:19,745+ −
can match the string+ −
starting with c,+ −
+ −
202+ −
00:10:19,745 --> 00:10:22,355+ −
except that the c has+ −
been chopped off.+ −
+ −
203+ −
00:10:22,355 --> 00:10:28,640+ −
So if r* has to match+ −
a string starting with c,+ −
+ −
204+ −
00:10:28,640 --> 00:10:32,735+ −
then at least we need one+ −
copy of this r. + −
+ −
205+ −
00:10:32,735 --> 00:10:34,310+ −
So at least one copy of+ −
+ −
206+ −
00:10:34,310 --> 00:10:37,010+ −
this r has matched+ −
something which starts with+ −
+ −
207+ −
00:10:37,010 --> 00:10:38,870+ −
a c and then afterwards+ −
+ −
208+ −
00:10:38,870 --> 00:10:41,570+ −
come 0 more copies+ −
of this r*.+ −
+ −
209+ −
00:10:41,570 --> 00:10:45,530+ −
What we do there+ −
is we are trying to find+ −
+ −
210+ −
00:10:45,530 --> 00:10:47,960+ −
the regular expression+ −
which can match+ −
+ −
211+ −
00:10:47,960 --> 00:10:50,915+ −
this first part of the string.+ −
+ −
212+ −
00:10:50,915 --> 00:10:53,255+ −
However, where the+ −
c is chopped off.+ −
+ −
213+ −
00:10:53,255 --> 00:10:55,130+ −
And then we just say, well,+ −
+ −
214+ −
00:10:55,130 --> 00:10:57,050+ −
the rest has to be+ −
matched again with+ −
+ −
215+ −
00:10:57,050 --> 00:10:59,030+ −
0 or more copies of+ −
+ −
216+ −
00:10:59,030 --> 00:11:02,600+ −
this r. So that's why+ −
it's defined like this.+ −
+ −
217+ −
00:11:02,600 --> 00:11:09,050+ −
Okay? ...As said, please take care+ −
with this definition.+ −
+ −
218+ −
00:11:09,050 --> 00:11:11,435+ −
That's not so simple.+ −
+ −
219+ −
00:11:11,435 --> 00:11:13,250+ −
Once you get a hang of it,+ −
+ −
220+ −
00:11:13,250 --> 00:11:15,170+ −
however, it makes perfect sense.+ −
+ −
221+ −
00:11:15,170 --> 00:11:17,210+ −
So let me explain it in+ −
+ −
222+ −
00:11:17,210 --> 00:11:20,825+ −
different ways in+ −
the next slides.+ −
+ −
223+ −
00:11:20,825 --> 00:11:24,695+ −
Let's look+ −
first at some examples.+ −
+ −
224+ −
00:11:24,695 --> 00:11:27,140+ −
So here is a regular expression,+ −
+ −
225+ −
00:11:27,140 --> 00:11:29,390+ −
r. And let's have+ −
+ −
226+ −
00:11:29,390 --> 00:11:32,450+ −
a look at these three+ −
derivatives according to a,+ −
+ −
227+ −
00:11:32,450 --> 00:11:38,405+ −
b, and c. And we shall do+ −
the one for a. + −
+ −
228+ −
00:11:38,405 --> 00:11:42,379+ −
So here is our regular expression+ −
+ −
229+ −
00:11:42,379 --> 00:11:45,334+ −
and I was very generous+ −
with parentheses.+ −
+ −
230+ −
00:11:45,334 --> 00:11:48,140+ −
And the outermost is a star.+ −
+ −
231+ −
00:11:48,140 --> 00:11:52,550+ −
So we now build the+ −
derivative according to a,+ −
+ −
232+ −
00:11:52,550 --> 00:11:55,474+ −
the character a, of+ −
that regular expression.+ −
+ −
233+ −
00:11:55,474 --> 00:11:57,380+ −
So the first thing we+ −
+ −
234+ −
00:11:57,380 --> 00:11:59,555+ −
have to analyse is the case star.+ −
+ −
235+ −
00:11:59,555 --> 00:12:04,370+ −
So here's the regular expression,+ −
which we are looking at.+ −
+ −
236+ −
00:12:04,370 --> 00:12:09,170+ −
This r and the outermost+ −
constructor is this star.+ −
+ −
237+ −
00:12:09,170 --> 00:12:11,510+ −
If you go back to the definition,+ −
+ −
238+ −
00:12:11,510 --> 00:12:13,625+ −
I hope you have it next to you,+ −
+ −
239+ −
00:12:13,625 --> 00:12:16,340+ −
then this star-case is defined+ −
+ −
240+ −
00:12:16,340 --> 00:12:20,000+ −
as you're taking just the+ −
inside of the star+ −
+ −
241+ −
00:12:20,000 --> 00:12:23,030+ −
and apply this derivative and+ −
+ −
242+ −
00:12:23,030 --> 00:12:26,765+ −
leave the r on the+ −
outside at the end.+ −
+ −
243+ −
00:12:26,765 --> 00:12:29,990+ −
Ok. So that's the first step.+ −
+ −
244+ −
00:12:29,990 --> 00:12:32,030+ −
Now we have to analyze+ −
+ −
245+ −
00:12:32,030 --> 00:12:36,035+ −
the derivative according to+ −
a of this regular expression,+ −
+ −
246+ −
00:12:36,035 --> 00:12:38,000+ −
which is an alternative.+ −
+ −
247+ −
00:12:38,000 --> 00:12:39,665+ −
So the outermost is a plus.+ −
+ −
248+ −
00:12:39,665 --> 00:12:41,375+ −
Ok, that's very easy again,+ −
+ −
249+ −
00:12:41,375 --> 00:12:45,185+ −
we just have to push the+ −
derivative into each component,+ −
+ −
250+ −
00:12:45,185 --> 00:12:47,705+ −
into the a followed by b.+ −
+ −
251+ −
00:12:47,705 --> 00:12:49,145+ −
And into the second+ −
+ −
252+ −
00:12:49,145 --> 00:12:51,185+ −
alternative, into b. + −
+ −
253+ −
00:12:51,185 --> 00:12:56,030+ −
We take the derivative+ −
of each according to a.+ −
+ −
254+ −
00:12:56,030 --> 00:13:00,635+ −
Now this one is a sequence+ −
regular expression.+ −
+ −
255+ −
00:13:00,635 --> 00:13:02,210+ −
This was the complicated case.+ −
+ −
256+ −
00:13:02,210 --> 00:13:04,160+ −
First of all+ −
we have to test is+ −
+ −
257+ −
00:13:04,160 --> 00:13:07,910+ −
the first component+ −
nullable of this sequence?+ −
+ −
258+ −
00:13:07,910 --> 00:13:09,200+ −
Well, that is a,+ −
+ −
259+ −
00:13:09,200 --> 00:13:12,740+ −
in this case, a on its+ −
own is not nullable.+ −
+ −
260+ −
00:13:12,740 --> 00:13:14,210+ −
So we are the easy case,+ −
+ −
261+ −
00:13:14,210 --> 00:13:17,000+ −
we only have a single derivative+ −
+ −
262+ −
00:13:17,000 --> 00:13:19,370+ −
pushed in the first component.+ −
+ −
263+ −
00:13:19,370 --> 00:13:25,160+ −
So we have the derivative+ −
of a with the character a.+ −
+ −
264+ −
00:13:25,160 --> 00:13:27,920+ −
Okay, that's now+ −
the character case.+ −
+ −
265+ −
00:13:27,920 --> 00:13:29,720+ −
And in this case the character+ −
+ −
266+ −
00:13:29,720 --> 00:13:31,715+ −
and the regular expression agree.+ −
+ −
267+ −
00:13:31,715 --> 00:13:33,890+ −
So it's defined as one.+ −
+ −
268+ −
00:13:33,890 --> 00:13:37,550+ −
Ok? In the other case,+ −
+ −
269+ −
00:13:37,550 --> 00:13:39,710+ −
the regular expression is b,+ −
+ −
270+ −
00:13:39,710 --> 00:13:41,675+ −
but the characters a.+ −
+ −
271+ −
00:13:41,675 --> 00:13:46,385+ −
So in this case+ −
it's defined as 0.+ −
+ −
272+ −
00:13:46,385 --> 00:13:50,630+ −
Okay? So that's what the+ −
derivative would be.+ −
+ −
273+ −
00:13:50,630 --> 00:13:52,160+ −
This r is there+ −
+ −
274+ −
00:13:52,160 --> 00:13:55,280+ −
because originally we+ −
started with a star.+ −
+ −
275+ −
00:13:55,280 --> 00:13:58,295+ −
This regular expression is that star.+ −
+ −
276+ −
00:13:58,295 --> 00:14:02,780+ −
So the derivative+ −
according to a+ −
+ −
277+ −
00:14:02,780 --> 00:14:07,610+ −
of that regular expression+ −
is this expression.+ −
+ −
278+ −
00:14:07,610 --> 00:14:10,970+ −
We just have to+ −
substitute this r back in.+ −
+ −
279+ −
00:14:10,970 --> 00:14:13,745+ −
Just coming back to this slide.+ −
+ −
280+ −
00:14:13,745 --> 00:14:16,160+ −
So far, we only analyzes+ −
+ −
281+ −
00:14:16,160 --> 00:14:19,505+ −
the derivative according+ −
to a single character.+ −
+ −
282+ −
00:14:19,505 --> 00:14:23,960+ −
But we can also very easily+ −
extend that to whole strings.+ −
+ −
283+ −
00:14:23,960 --> 00:14:26,360+ −
So if you build the+ −
derivative according+ −
+ −
284+ −
00:14:26,360 --> 00:14:27,905+ −
to the empty string,+ −
+ −
285+ −
00:14:27,905 --> 00:14:30,875+ −
we just return the+ −
regular expression.+ −
+ −
286+ −
00:14:30,875 --> 00:14:35,585+ −
If we have a string+ −
starting with character c,+ −
+ −
287+ −
00:14:35,585 --> 00:14:37,850+ −
remember that can+ −
be any character,+ −
+ −
288+ −
00:14:37,850 --> 00:14:42,170+ −
then we build first the simple+ −
derivative according to+ −
+ −
289+ −
00:14:42,170 --> 00:14:44,360+ −
that first character and+ −
+ −
290+ −
00:14:44,360 --> 00:14:46,925+ −
continue with the+ −
rest of the string.+ −
+ −
291+ −
00:14:46,925 --> 00:14:50,615+ −
So here you see again,+ −
my personal convention:+ −
+ −
292+ −
00:14:50,615 --> 00:14:54,365+ −
everything which works on+ −
lists has this s at the end.+ −
+ −
293+ −
00:14:54,365 --> 00:14:57,125+ −
So this function is+ −
for single characters.+ −
+ −
294+ −
00:14:57,125 --> 00:14:59,179+ −
This one is for strings,+ −
+ −
295+ −
00:14:59,179 --> 00:15:02,450+ −
but it uses the one+ −
for the character.+ −
+ −
296+ −
00:15:02,450 --> 00:15:04,025+ −
Essentially what it does is+ −
+ −
297+ −
00:15:04,025 --> 00:15:06,185+ −
it chops off the first character,+ −
+ −
298+ −
00:15:06,185 --> 00:15:09,800+ −
builds the derivative, then+ −
chops off the next character,+ −
+ −
299+ −
00:15:09,800 --> 00:15:13,760+ −
builds the derivative of+ −
the result, and so on.+ −
+ −
300+ −
00:15:13,760 --> 00:15:17,000+ −
Having this function,+ −
we can actually now+ −
+ −
301+ −
00:15:17,000 --> 00:15:20,600+ −
state what the algorithm+ −
is, the complete algorithm.+ −
+ −
302+ −
00:15:20,600 --> 00:15:23,465+ −
So the Brzozowski matcher+ −
+ −
303+ −
00:15:23,465 --> 00:15:24,860+ −
takes a regular expression as+ −
+ −
304+ −
00:15:24,860 --> 00:15:26,915+ −
argument and a+ −
string as argument.+ −
+ −
305+ −
00:15:26,915 --> 00:15:30,920+ −
And is supposed to say yes if+ −
the regular expression matches+ −
+ −
306+ −
00:15:30,920 --> 00:15:33,560+ −
the string or no+ −
+ −
307+ −
00:15:33,560 --> 00:15:36,065+ −
if the regular expression does+ −
not match the string.+ −
+ −
308+ −
00:15:36,065 --> 00:15:37,715+ −
And how does it do that?+ −
+ −
309+ −
00:15:37,715 --> 00:15:42,560+ −
Well, it takes this string+ −
s and this regular expression,+ −
+ −
310+ −
00:15:42,560 --> 00:15:43,925+ −
and it first builds+ −
+ −
311+ −
00:15:43,925 --> 00:15:48,845+ −
successive derivatives until+ −
that string is exhausted.+ −
+ −
312+ −
00:15:48,845 --> 00:15:52,115+ −
Then you have+ −
a final derivative,+ −
+ −
313+ −
00:15:52,115 --> 00:15:53,839+ −
a final regular expression+ −
+ −
314+ −
00:15:53,839 --> 00:15:55,370+ −
and you test whether+ −
+ −
315+ −
00:15:55,370 --> 00:15:57,920+ −
this regular expression can+ −
match the empty string.+ −
+ −
316+ −
00:15:57,920 --> 00:16:01,370+ −
If yes, then the+ −
original regular expression,+ −
+ −
317+ −
00:16:01,370 --> 00:16:03,245+ −
this r, can match the string.+ −
+ −
318+ −
00:16:03,245 --> 00:16:05,210+ −
If no, if it cannot match+ −
+ −
319+ −
00:16:05,210 --> 00:16:08,030+ −
the final derivative+ −
with the empty string,+ −
+ −
320+ −
00:16:08,030 --> 00:16:10,280+ −
then no this+ −
regular expression r+ −
+ −
321+ −
00:16:10,280 --> 00:16:12,905+ −
cannot match that string.+ −
+ −
322+ −
00:16:12,905 --> 00:16:16,010+ −
I know it looks+ −
very anticlimactic,+ −
+ −
323+ −
00:16:16,010 --> 00:16:19,625+ −
but that's actually the+ −
beauty of this algorithm,+ −
+ −
324+ −
00:16:19,625 --> 00:16:22,760+ −
that it's not that complicated.+ −
+ −
325+ −
00:16:22,760 --> 00:16:25,340+ −
So how does the algorithm work+ −
+ −
326+ −
00:16:25,340 --> 00:16:27,634+ −
in a concrete example?+ −
+ −
327+ −
00:16:27,634 --> 00:16:31,580+ −
Imagine you have a string, abc+ −
+ −
328+ −
00:16:31,580 --> 00:16:34,370+ −
and you have a regular+ −
expression, say r1.+ −
+ −
329+ −
00:16:34,370 --> 00:16:37,520+ −
And you want to find out+ −
whether this r1 can match+ −
+ −
330+ −
00:16:37,520 --> 00:16:41,300+ −
that string abc or not.+ −
How do you do that?+ −
+ −
331+ −
00:16:41,300 --> 00:16:45,140+ −
Well, you would first+ −
take this r1 and you+ −
+ −
332+ −
00:16:45,140 --> 00:16:47,150+ −
build the derivative according+ −
+ −
333+ −
00:16:47,150 --> 00:16:49,880+ −
to the first character, the a.+ −
+ −
334+ −
00:16:49,880 --> 00:16:53,015+ −
Okay? You get the derivative,+ −
+ −
335+ −
00:16:53,015 --> 00:16:55,294+ −
which I call here r2.+ −
+ −
336+ −
00:16:55,294 --> 00:16:58,640+ −
Then you take the next+ −
character, the b.+ −
+ −
337+ −
00:16:58,640 --> 00:17:04,535+ −
You now build the derivative+ −
according to b of this r2.+ −
+ −
338+ −
00:17:04,535 --> 00:17:07,460+ −
Okay? So you take the+ −
result of the first step,+ −
+ −
339+ −
00:17:07,460 --> 00:17:09,530+ −
you feed it into the second step,+ −
+ −
340+ −
00:17:09,530 --> 00:17:11,810+ −
and you take the+ −
second character.+ −
+ −
341+ −
00:17:11,810 --> 00:17:17,075+ −
Then you do this also with c.+ −
So you get a derivative r3,+ −
+ −
342+ −
00:17:17,075 --> 00:17:22,460+ −
and you build the derivative+ −
of r3 according to c,+ −
+ −
343+ −
00:17:22,460 --> 00:17:24,185+ −
you get an r4.+ −
+ −
344+ −
00:17:24,185 --> 00:17:26,300+ −
Okay, so that's the+ −
final derivative.+ −
+ −
345+ −
00:17:26,300 --> 00:17:27,530+ −
The string is exhausted.+ −
+ −
346+ −
00:17:27,530 --> 00:17:29,570+ −
We build derivatives+ −
according to a, b,+ −
+ −
347+ −
00:17:29,570 --> 00:17:34,610+ −
and c. Now we just test whether+ −
this r4 is nullable.+ −
+ −
348+ −
00:17:34,610 --> 00:17:37,175+ −
If it says yes,+ −
+ −
349+ −
00:17:37,175 --> 00:17:41,510+ −
then the regular expression matcher+ −
will just say true, yes,+ −
+ −
350+ −
00:17:41,510 --> 00:17:43,340+ −
this original regular expression,+ −
+ −
351+ −
00:17:43,340 --> 00:17:47,270+ −
the r1, will be able to+ −
match that string abc.+ −
+ −
352+ −
00:17:47,270 --> 00:17:50,585+ −
And if this test returns false,+ −
+ −
353+ −
00:17:50,585 --> 00:17:53,015+ −
then the algorithm says false.+ −
+ −
354+ −
00:17:53,015 --> 00:17:56,975+ −
This regular expression will+ −
not match the string.+ −
+ −
355+ −
00:17:56,975 --> 00:18:00,260+ −
Ok, you might ask:+ −
Why on earth does+ −
+ −
356+ −
00:18:00,260 --> 00:18:02,960+ −
that algorithm+ −
actually work?+ −
+ −
357+ −
00:18:02,960 --> 00:18:06,515+ −
Here's anather explanation+ −
for why it works.+ −
+ −
358+ −
00:18:06,515 --> 00:18:10,190+ −
Imagine you have a regular+ −
expression r1, okay?+ −
+ −
359+ −
00:18:10,190 --> 00:18:13,220+ −
And you have a string abc,+ −
+ −
360+ −
00:18:13,220 --> 00:18:14,270+ −
and you want to find out+ −
+ −
361+ −
00:18:14,270 --> 00:18:17,180+ −
whether r1 can+ −
match that string.+ −
+ −
362+ −
00:18:17,180 --> 00:18:18,799+ −
And for the moment,+ −
+ −
363+ −
00:18:18,799 --> 00:18:22,610+ −
let's assume that it+ −
can match that string.+ −
+ −
364+ −
00:18:22,610 --> 00:18:26,315+ −
Ok? So the language L of+ −
+ −
365+ −
00:18:26,315 --> 00:18:30,185+ −
r1 will actually+ −
contain that string,+ −
+ −
366+ −
00:18:30,185 --> 00:18:31,805+ −
otherwise it wouldn't match that.+ −
+ −
367+ −
00:18:31,805 --> 00:18:36,710+ −
Okay? So abc is in+ −
this language, okay?+ −
+ −
368+ −
00:18:36,710 --> 00:18:39,785+ −
If I now take the+ −
semantic derivative,+ −
+ −
369+ −
00:18:39,785 --> 00:18:43,145+ −
that means I look at all+ −
the strings in this L(r1)+ −
+ −
370+ −
00:18:43,145 --> 00:18:46,820+ −
and filter out+ −
+ −
371+ −
00:18:46,820 --> 00:18:48,740+ −
all the ones which+ −
do not start with+ −
+ −
372+ −
00:18:48,740 --> 00:18:51,260+ −
an a, I discharge them.+ −
+ −
373+ −
00:18:51,260 --> 00:18:54,545+ −
And I only look at the ones+ −
which start with an a.+ −
+ −
374+ −
00:18:54,545 --> 00:18:56,465+ −
And of those strings,+ −
+ −
375+ −
00:18:56,465 --> 00:18:58,475+ −
I chop off this a.+ −
+ −
376+ −
00:18:58,475 --> 00:19:01,025+ −
So after this+ −
semantic derivative,+ −
+ −
377+ −
00:19:01,025 --> 00:19:05,735+ −
this set of strings will+ −
contain just bc.+ −
+ −
378+ −
00:19:05,735 --> 00:19:12,830+ −
Ok. Now if I build the next+ −
semantic derivative of that,+ −
+ −
379+ −
00:19:12,830 --> 00:19:14,345+ −
then I would look at+ −
+ −
380+ −
00:19:14,345 --> 00:19:16,850+ −
all the strings which+ −
start with a b,+ −
+ −
381+ −
00:19:16,850 --> 00:19:21,350+ −
and forget about everything+ −
else. Of the remaining ones+ −
+ −
382+ −
00:19:21,350 --> 00:19:27,905+ −
I know they start with b.+ −
I just chop of the b. Ok.+ −
+ −
383+ −
00:19:27,905 --> 00:19:31,655+ −
So in this whole set here,+ −
+ −
384+ −
00:19:31,655 --> 00:19:33,785+ −
in this whole set here,+ −
+ −
385+ −
00:19:33,785 --> 00:19:39,030+ −
there will be now a string+ −
which is just c. Okay?+ −
+ −
386+ −
00:19:39,190 --> 00:19:44,420+ −
Then I built the third+ −
semantic derivative+ −
+ −
387+ −
00:19:44,420 --> 00:19:47,300+ −
because I want to find out+ −
whether abc is matched.+ −
+ −
388+ −
00:19:47,300 --> 00:19:50,540+ −
Okay? So now I look+ −
at all the strings in+ −
+ −
389+ −
00:19:50,540 --> 00:19:52,820+ −
here and look at+ −
+ −
390+ −
00:19:52,820 --> 00:19:55,340+ −
them whether they start+ −
with a c. If yes,+ −
+ −
391+ −
00:19:55,340 --> 00:19:56,885+ −
I chop off the c.+ −
+ −
392+ −
00:19:56,885 --> 00:19:59,120+ −
And put in what is remaining.+ −
+ −
393+ −
00:19:59,120 --> 00:20:00,425+ −
So in this case,+ −
+ −
394+ −
00:20:00,425 --> 00:20:02,510+ −
if I have the string c+ −
+ −
395+ −
00:20:02,510 --> 00:20:04,550+ −
in this language and+ −
I chop off this c,+ −
+ −
396+ −
00:20:04,550 --> 00:20:07,700+ −
what is remaining+ −
is the empty string.+ −
+ −
397+ −
00:20:07,700 --> 00:20:09,695+ −
So we have to check of+ −
+ −
398+ −
00:20:09,695 --> 00:20:14,510+ −
that language whether it+ −
contains the empty string.+ −
+ −
399+ −
00:20:14,510 --> 00:20:18,800+ −
If yes, then the+ −
original r1 can match+ −
+ −
400+ −
00:20:18,800 --> 00:20:21,050+ −
this abc because this abc+ −
+ −
401+ −
00:20:21,050 --> 00:20:24,119+ −
must have been in this language.+ −
+ −
402+ −
00:20:24,130 --> 00:20:28,565+ −
And if in the end there wasn't+ −
the empty string, then+ −
+ −
403+ −
00:20:28,565 --> 00:20:33,575+ −
this abs was not in+ −
this language of r1.+ −
+ −
404+ −
00:20:33,575 --> 00:20:36,260+ −
And so the lexer must have,+ −
+ −
405+ −
00:20:36,260 --> 00:20:38,880+ −
or the matcher must have failed.+ −
+ −
406+ −
00:20:39,040 --> 00:20:42,530+ −
The clever bit is that here+ −
+ −
407+ −
00:20:42,530 --> 00:20:45,530+ −
the explanation is for languages.+ −
+ −
408+ −
00:20:45,530 --> 00:20:49,835+ −
Remember, this+ −
semantic derivative+ −
+ −
409+ −
00:20:49,835 --> 00:20:53,450+ −
works over languages and they+ −
sometimes can be in finite.+ −
+ −
410+ −
00:20:53,450 --> 00:20:55,730+ −
So that's not really+ −
an algorithm.+ −
+ −
411+ −
00:20:55,730 --> 00:20:58,880+ −
That's just+ −
explaining the idea. + −
+ −
412+ −
00:20:58,880 --> 00:21:02,525+ −
What Brzozowski+ −
achieved was that he+ −
+ −
413+ −
00:21:02,525 --> 00:21:06,440+ −
now works with this derivative+ −
regular expressions and+ −
+ −
414+ −
00:21:06,440 --> 00:21:10,715+ −
somehow imitates what+ −
happens on these languages.+ −
+ −
415+ −
00:21:10,715 --> 00:21:14,135+ −
Because remember if you+ −
have a regular expression,+ −
+ −
416+ −
00:21:14,135 --> 00:21:17,405+ −
and you want to test+ −
whether it can match abc,+ −
+ −
417+ −
00:21:17,405 --> 00:21:22,550+ −
then you take first+ −
derivative according to a.+ −
+ −
418+ −
00:21:22,550 --> 00:21:25,760+ −
So you will get a regular+ −
expression which can match b+ −
+ −
419+ −
00:21:25,760 --> 00:21:29,464+ −
and c, if r could match abc.+ −
+ −
420+ −
00:21:29,464 --> 00:21:31,430+ −
So after the first derivative,+ −
+ −
421+ −
00:21:31,430 --> 00:21:33,620+ −
you will get a regular expression+ −
which can match b and+ −
+ −
422+ −
00:21:33,620 --> 00:21:37,070+ −
c. If you take the+ −
second derivative,+ −
+ −
423+ −
00:21:37,070 --> 00:21:41,225+ −
you will get a regular expression+ −
which can match c alone.+ −
+ −
424+ −
00:21:41,225 --> 00:21:44,180+ −
And if you take the+ −
final derivative,+ −
+ −
425+ −
00:21:44,180 --> 00:21:46,070+ −
then you will get+ −
+ −
426+ −
00:21:46,070 --> 00:21:48,260+ −
a regular expression which hopefully+ −
+ −
427+ −
00:21:48,260 --> 00:21:49,715+ −
can match the empty string.+ −
+ −
428+ −
00:21:49,715 --> 00:21:53,780+ −
If it does, then this+ −
r can match the abc.+ −
+ −
429+ −
00:21:53,780 --> 00:21:55,655+ −
And if it doesn't, then+ −
+ −
430+ −
00:21:55,655 --> 00:21:58,680+ −
abc couldn't be+ −
matched by this r.+ −
+ −
431+ −
00:21:58,900 --> 00:22:02,990+ −
Okay, let's have a look+ −
how this pans out in code.+ −
+ −
432+ −
00:22:02,990 --> 00:22:06,050+ −
Here's the file re1.sc.+ −
+ −
433+ −
00:22:06,050 --> 00:22:07,940+ −
It's also uploaded on KEATS,+ −
+ −
434+ −
00:22:07,940 --> 00:22:10,625+ −
so you can see exactly+ −
what I'm doing.+ −
+ −
435+ −
00:22:10,625 --> 00:22:13,970+ −
And actually you already saw+ −
that file because I showed you+ −
+ −
436+ −
00:22:13,970 --> 00:22:15,710+ −
how my regular expressions are+ −
+ −
437+ −
00:22:15,710 --> 00:22:17,960+ −
defined, with the+ −
abstract classes here.+ −
+ −
438+ −
00:22:17,960 --> 00:22:21,155+ −
And here, the six cases+ −
for 0, 1, character,+ −
+ −
439+ −
00:22:21,155 --> 00:22:23,540+ −
alternative, + −
sequence and star.+ −
+ −
440+ −
00:22:23,540 --> 00:22:26,705+ −
Ok. So the first+ −
function nullable,+ −
+ −
441+ −
00:22:26,705 --> 00:22:28,760+ −
the simple one, takes+ −
+ −
442+ −
00:22:28,760 --> 00:22:32,120+ −
a regular expression as+ −
argument and returns a boolean.+ −
+ −
443+ −
00:22:32,120 --> 00:22:34,280+ −
And then with this+ −
pattern matching,+ −
+ −
444+ −
00:22:34,280 --> 00:22:37,040+ −
we just go through+ −
all these six cases+ −
+ −
445+ −
00:22:37,040 --> 00:22:38,900+ −
0 is defined as false.+ −
+ −
446+ −
00:22:38,900 --> 00:22:43,234+ −
One is defined as true.+ −
Character, for any character,+ −
+ −
447+ −
00:22:43,234 --> 00:22:45,455+ −
this nullable will return false.+ −
+ −
448+ −
00:22:45,455 --> 00:22:47,540+ −
The alternative is defined here,+ −
+ −
449+ −
00:22:47,540 --> 00:22:50,000+ −
so that's the or in Scala.+ −
+ −
450+ −
00:22:50,000 --> 00:22:52,700+ −
And for the sequence,+ −
that's the and.+ −
+ −
451+ −
00:22:52,700 --> 00:22:56,180+ −
And this star, no matter+ −
what the regular expression is,+ −
+ −
452+ −
00:22:56,180 --> 00:22:59,540+ −
it will always match the+ −
empty string, so true.+ −
+ −
453+ −
00:22:59,540 --> 00:23:02,225+ −
So nullable is very easy.+ −
+ −
454+ −
00:23:02,225 --> 00:23:07,430+ −
The derivative is also not+ −
so much more complicated.+ −
+ −
455+ −
00:23:07,430 --> 00:23:08,974+ −
It takes two arguments,+ −
+ −
456+ −
00:23:08,974 --> 00:23:11,810+ −
a character and the+ −
regular expression,+ −
+ −
457+ −
00:23:11,810 --> 00:23:14,405+ −
and returns a regular expression.+ −
+ −
458+ −
00:23:14,405 --> 00:23:16,340+ −
So now we just have to analyze+ −
+ −
459+ −
00:23:16,340 --> 00:23:18,890+ −
what's that regular+ −
expression looks like.+ −
+ −
460+ −
00:23:18,890 --> 00:23:23,870+ −
If it's 0, we return+ −
0; if it's 1 we return 0.+ −
+ −
461+ −
00:23:23,870 --> 00:23:26,990+ −
If the character... If the+ −
regular expressions character+ −
+ −
462+ −
00:23:26,990 --> 00:23:30,260+ −
d, we test whether it's+ −
equal to this character+ −
+ −
463+ −
00:23:30,260 --> 00:23:32,225+ −
we want to take the+ −
derivative off.+ −
+ −
464+ −
00:23:32,225 --> 00:23:36,185+ −
If yes, return 1, otherwise 0.+ −
+ −
465+ −
00:23:36,185 --> 00:23:38,600+ −
The alternative which has pushed+ −
+ −
466+ −
00:23:38,600 --> 00:23:39,860+ −
the derivative under+ −
+ −
467+ −
00:23:39,860 --> 00:23:42,710+ −
this alternative,+ −
that's the easy one.+ −
+ −
468+ −
00:23:42,710 --> 00:23:44,690+ −
Here's the sequence case where we+ −
+ −
469+ −
00:23:44,690 --> 00:23:47,015+ −
first have to test+ −
if it's nullable,+ −
+ −
470+ −
00:23:47,015 --> 00:23:49,040+ −
If it's not the have to push+ −
+ −
471+ −
00:23:49,040 --> 00:23:52,160+ −
the derivative over+ −
the first, the r1.+ −
+ −
472+ −
00:23:52,160 --> 00:23:56,135+ −
And otherwise if it is nullable,+ −
we have two cases.+ −
+ −
473+ −
00:23:56,135 --> 00:24:00,605+ −
Either you have to push+ −
it over the r1 or r2.+ −
+ −
474+ −
00:24:00,605 --> 00:24:03,860+ −
And a star case, this one.+ −
+ −
475+ −
00:24:03,860 --> 00:24:07,160+ −
We just build the sequence+ −
of the derivative of+ −
+ −
476+ −
00:24:07,160 --> 00:24:11,600+ −
r1 and append the+ −
r1 as a sequence,+ −
+ −
477+ −
00:24:11,600 --> 00:24:14,195+ −
put the star at the end.+ −
+ −
478+ −
00:24:14,195 --> 00:24:19,430+ −
Okay, so here's the+ −
function for strings.+ −
+ −
479+ −
00:24:19,430 --> 00:24:21,740+ −
So a list of characters.+ −
+ −
480+ −
00:24:21,740 --> 00:24:23,870+ −
This function takes an s,+ −
+ −
481+ −
00:24:23,870 --> 00:24:26,480+ −
a list of characters as argument+ −
and a regular expression+ −
+ −
482+ −
00:24:26,480 --> 00:24:29,360+ −
as argument and returns+ −
a regular expression.+ −
+ −
483+ −
00:24:29,360 --> 00:24:31,460+ −
And we just have to+ −
pattern match what+ −
+ −
484+ −
00:24:31,460 --> 00:24:34,055+ −
that string looks like+ −
or this list looks like.+ −
+ −
485+ −
00:24:34,055 --> 00:24:35,360+ −
If it's the empty list,+ −
+ −
486+ −
00:24:35,360 --> 00:24:38,030+ −
it just immediately+ −
return the r. If+ −
+ −
487+ −
00:24:38,030 --> 00:24:42,020+ −
it's something of the+ −
form c followed by s,+ −
+ −
488+ −
00:24:42,020 --> 00:24:45,050+ −
then we build first the+ −
derivative according+ −
+ −
489+ −
00:24:45,050 --> 00:24:48,345+ −
to c. And then+ −
the result of that,+ −
+ −
490+ −
00:24:48,345 --> 00:24:50,090+ −
we recursively call+ −
+ −
491+ −
00:24:50,090 --> 00:24:53,675+ −
the derivative+ −
according to s. Okay?+ −
+ −
492+ −
00:24:53,675 --> 00:24:56,060+ −
And now the main matcher,+ −
+ −
493+ −
00:24:56,060 --> 00:24:59,360+ −
it takes a regular+ −
expression and takes+ −
+ −
494+ −
00:24:59,360 --> 00:25:02,870+ −
a string and returns a+ −
boolean, true or false.+ −
+ −
495+ −
00:25:02,870 --> 00:25:05,705+ −
And it first builds+ −
the derivative.+ −
+ −
496+ −
00:25:05,705 --> 00:25:07,430+ −
The only thing I have to do here+ −
+ −
497+ −
00:25:07,430 --> 00:25:09,410+ −
because I'm implementing+ −
it in Scala,+ −
+ −
498+ −
00:25:09,410 --> 00:25:15,064+ −
I have to coerce between strings+ −
and lists of characters.+ −
+ −
499+ −
00:25:15,064 --> 00:25:16,580+ −
So he, I take first+ −
+ −
500+ −
00:25:16,580 --> 00:25:20,810+ −
the toList of the s. That+ −
gives me a list of characters.+ −
+ −
501+ −
00:25:20,810 --> 00:25:23,135+ −
Then I build this derivative+ −
+ −
502+ −
00:25:23,135 --> 00:25:26,045+ −
with the s, because I'm+ −
looking at strings.+ −
+ −
503+ −
00:25:26,045 --> 00:25:28,160+ −
And then at the end,+ −
+ −
504+ −
00:25:28,160 --> 00:25:33,050+ −
built the nullable of+ −
the final derivative.+ −
+ −
505+ −
00:25:33,050 --> 00:25:37,310+ −
The beauty of all this+ −
is that in Scala,+ −
+ −
506+ −
00:25:37,310 --> 00:25:40,085+ −
these definition which+ −
I had on the slides+ −
+ −
507+ −
00:25:40,085 --> 00:25:43,835+ −
go almost one-to-one into code.+ −
+ −
508+ −
00:25:43,835 --> 00:25:45,605+ −
And that's of course,+ −
+ −
509+ −
00:25:45,605 --> 00:25:47,480+ −
makes it very easy+ −
to implement in+ −
+ −
510+ −
00:25:47,480 --> 00:25:49,730+ −
a functional language like Scala.+ −
+ −
511+ −
00:25:49,730 --> 00:25:52,865+ −
We can also now try+ −
out some examples.+ −
+ −
512+ −
00:25:52,865 --> 00:25:56,960+ −
This was the example+ −
I had on the slide.+ −
+ −
513+ −
00:25:56,960 --> 00:25:58,370+ −
And we could now calculate+ −
+ −
514+ −
00:25:58,370 --> 00:26:00,305+ −
what's the derivative+ −
according to a,+ −
+ −
515+ −
00:26:00,305 --> 00:26:02,720+ −
b, and c of this+ −
regular expression.+ −
+ −
516+ −
00:26:02,720 --> 00:26:07,040+ −
And I hope you didn't+ −
make any mistake.+ −
+ −
517+ −
00:26:07,040 --> 00:26:09,260+ −
Now of course we want+ −
to see whether we+ −
+ −
518+ −
00:26:09,260 --> 00:26:11,390+ −
do any better with+ −
this algorithm...+ −
+ −
519+ −
00:26:11,390 --> 00:26:13,715+ −
than Python and Ruby.+ −
+ −
520+ −
00:26:13,715 --> 00:26:16,070+ −
And let's first have a+ −
look at this example.+ −
+ −
521+ −
00:26:16,070 --> 00:26:18,079+ −
So this regular expression.+ −
+ −
522+ −
00:26:18,079 --> 00:26:19,880+ −
The problem is that in+ −
+ −
523+ −
00:26:19,880 --> 00:26:22,070+ −
our regular expression+ −
matcher so far we have+ −
+ −
524+ −
00:26:22,070 --> 00:26:24,530+ −
the sequence rregular+ −
expression, but we+ −
+ −
525+ −
00:26:24,530 --> 00:26:27,200+ −
don't have this optional+ −
regular expression.+ −
+ −
526+ −
00:26:27,200 --> 00:26:30,800+ −
And we don't have this n+ −
times regular expression,+ −
+ −
527+ −
00:26:30,800 --> 00:26:36,605+ −
but we can quickly implement+ −
that in our implementation.+ −
+ −
528+ −
00:26:36,605 --> 00:26:40,549+ −
So if you want to build the+ −
optional regular expression,+ −
+ −
529+ −
00:26:40,549 --> 00:26:41,870+ −
which is defined here,+ −
+ −
530+ −
00:26:41,870 --> 00:26:44,570+ −
a little function which+ −
takes a reg expression and+ −
+ −
531+ −
00:26:44,570 --> 00:26:47,360+ −
returns the alternative of r and 1.+ −
+ −
532+ −
00:26:47,360 --> 00:26:49,624+ −
So it essentially+ −
takes the definition+ −
+ −
533+ −
00:26:49,624 --> 00:26:53,240+ −
of optional r and+ −
replaces it by that.+ −
+ −
534+ −
00:26:53,240 --> 00:26:56,150+ −
The n-times what we+ −
essentially do there is+ −
+ −
535+ −
00:26:56,150 --> 00:27:01,535+ −
we built n copies of this r. Ok?+ −
+ −
536+ −
00:27:01,535 --> 00:27:04,745+ −
So if this n-times was 0,+ −
+ −
537+ −
00:27:04,745 --> 00:27:06,245+ −
we just return one.+ −
+ −
538+ −
00:27:06,245 --> 00:27:11,570+ −
If it's one, then we return+ −
just the regular expression.+ −
+ −
539+ −
00:27:11,570 --> 00:27:15,575+ −
And if it's something+ −
bigger than one,+ −
+ −
540+ −
00:27:15,575 --> 00:27:18,560+ −
then we just built the+ −
sequence of one of+ −
+ −
541+ −
00:27:18,560 --> 00:27:20,270+ −
these copies and call+ −
+ −
542+ −
00:27:20,270 --> 00:27:22,925+ −
this function again+ −
with n - 1.+ −
+ −
543+ −
00:27:22,925 --> 00:27:26,330+ −
So we just build now n-copies of that.+ −
+ −
544+ −
00:27:26,330 --> 00:27:30,710+ −
Okay, so we can look+ −
at our first example.+ −
+ −
545+ −
00:27:30,710 --> 00:27:32,629+ −
This is the regular expression,+ −
+ −
546+ −
00:27:32,629 --> 00:27:36,035+ −
and I call that here+ −
evil regular expression1.+ −
+ −
547+ −
00:27:36,035 --> 00:27:37,670+ −
It doesn't look as beautiful+ −
+ −
548+ −
00:27:37,670 --> 00:27:39,140+ −
as what we write down on paper.+ −
+ −
549+ −
00:27:39,140 --> 00:27:41,240+ −
We will actually look+ −
at this later on+ −
+ −
550+ −
00:27:41,240 --> 00:27:43,640+ −
if this can be improved.+ −
But here it is.+ −
+ −
551+ −
00:27:43,640 --> 00:27:45,724+ −
Here's the regular expression,+ −
+ −
552+ −
00:27:45,724 --> 00:27:49,520+ −
and here's a little function+ −
which measures the time.+ −
+ −
553+ −
00:27:49,520 --> 00:27:53,180+ −
And we can now start testing+ −
+ −
554+ −
00:27:53,180 --> 00:27:57,845+ −
this regular expression with+ −
strings just containing a's.+ −
+ −
555+ −
00:27:57,845 --> 00:28:00,020+ −
And we are first a bit cautious,+ −
+ −
556+ −
00:28:00,020 --> 00:28:04,985+ −
we test it between 0 and 20,+ −
and we count by two.+ −
+ −
557+ −
00:28:04,985 --> 00:28:08,330+ −
And I actually will not+ −
start that within Scala,+ −
+ −
558+ −
00:28:08,330 --> 00:28:12,965+ −
but actually use Ammonite.+ −
+ −
559+ −
00:28:12,965 --> 00:28:15,305+ −
And that's out.+ −
+ −
560+ −
00:28:15,305 --> 00:28:17,269+ −
And that calculates+ −
+ −
561+ −
00:28:17,269 --> 00:28:20,675+ −
for 18 a's. It's pretty fast.+ −
+ −
562+ −
00:28:20,675 --> 00:28:24,815+ −
But for 20 it already+ −
needs 2 seconds.+ −
+ −
563+ −
00:28:24,815 --> 00:28:28,265+ −
And you can see+ −
actually this jump from+ −
+ −
564+ −
00:28:28,265 --> 00:28:32,825+ −
18 to 20 in the runtime+ −
is pretty bad too.+ −
+ −
565+ −
00:28:32,825 --> 00:28:37,460+ −
And if we actually measure+ −
it more accurately,+ −
+ −
566+ −
00:28:37,460 --> 00:28:39,770+ −
then we get this picture.+ −
+ −
567+ −
00:28:39,770 --> 00:28:41,255+ −
And what turns out,+ −
+ −
568+ −
00:28:41,255 --> 00:28:45,830+ −
we are actually worse than Python+ −
and Ruby in this example.+ −
+ −
569+ −
00:28:45,830 --> 00:28:49,230+ −
So I guess that's a fail.+ −