1
00:00:09,990 --> 00:00:13,465
Welcome back to this
week's lecture.
2
00:00:13,465 --> 00:00:16,450
The task this week is to actually
3
00:00:16,450 --> 00:00:20,320
implement a regular
expression matcher.
4
00:00:20,320 --> 00:00:22,510
And we want to be a bit
5
00:00:22,510 --> 00:00:25,315
faster than the regular
expression matcher
6
00:00:25,315 --> 00:00:29,380
in Python, Ruby,
Javascript, and Java.
7
00:00:29,380 --> 00:00:31,960
Remember this regular expression
8
00:00:31,960 --> 00:00:35,785
and strings which are
just a number of a's,
9
00:00:35,785 --> 00:00:39,670
these regular expression matchers
where abysmally slow.
10
00:00:39,670 --> 00:00:43,170
They could only match
approximately 30 a's in
11
00:00:43,170 --> 00:00:45,665
something like half a minute.
12
00:00:45,665 --> 00:00:49,460
What we like to do with
our regular expression matcher.
13
00:00:49,460 --> 00:00:51,890
We will try a few times,
14
00:00:51,890 --> 00:00:55,505
but our second trial will already
be much, much better.
15
00:00:55,505 --> 00:00:58,400
It will probably
match around maybe
16
00:00:58,400 --> 00:01:02,030
thousand a's in something
in half a minute.
17
00:01:02,030 --> 00:01:03,920
But our third version in
18
00:01:03,920 --> 00:01:06,230
comparison will be
blindingly fast.
19
00:01:06,230 --> 00:01:08,180
And we'll be able to match
20
00:01:08,180 --> 00:01:10,145
strings of length 10,000 a's
21
00:01:10,145 --> 00:01:12,230
and will not require
22
00:01:12,230 --> 00:01:14,975
more than five seconds.
So let's go ahead with that.
23
00:01:14,975 --> 00:01:18,095
We will first implement
24
00:01:18,095 --> 00:01:19,880
our regular expression
matcher only
25
00:01:19,880 --> 00:01:22,069
for the basic
regular expressions.
26
00:01:22,069 --> 00:01:25,430
So we will have only six
cases to think about.
27
00:01:25,430 --> 00:01:27,620
This will simplify matters at first.
28
00:01:27,620 --> 00:01:30,350
Later we can
think about how to
29
00:01:30,350 --> 00:01:34,100
extend that to the extended
regular expressions.
30
00:01:34,100 --> 00:01:37,625
Unfortunately, before we can
start our implementation,
31
00:01:37,625 --> 00:01:39,290
we have to discuss
32
00:01:39,290 --> 00:01:42,470
when two regular
expressions are equivalent.
33
00:01:42,470 --> 00:01:46,595
And we use here this notation
of the triple equals.
34
00:01:46,595 --> 00:01:48,215
We say a regular expression
35
00:01:48,215 --> 00:01:50,180
r1 and r2 are
36
00:01:50,180 --> 00:01:56,059
equivalent if and only
if the language of r1 is
37
00:01:56,059 --> 00:01:59,360
equal to the language of r2.
38
00:01:59,360 --> 00:02:02,915
Sounds complicated,
but essentially means
39
00:02:02,915 --> 00:02:04,700
if the two regular expressions can
40
00:02:04,700 --> 00:02:06,950
match exactly the same strings,
41
00:02:06,950 --> 00:02:09,800
then we want to regard
them as equivalent.
42
00:02:09,800 --> 00:02:14,300
This equivalence justifies
why we can be sloppy
43
00:02:14,300 --> 00:02:16,040
with parentheses.
44
00:02:16,040 --> 00:02:23,870
For example, if we have
(r1 + r2) + r3,
45
00:02:23,870 --> 00:02:32,270
then this will be equivalent
to r1 + (r2 + r3).
46
00:02:32,270 --> 00:02:34,910
Remember, regular
expressions are trees,
47
00:02:34,910 --> 00:02:37,940
so these are two different
regular expressions,
48
00:02:37,940 --> 00:02:41,540
but they can match
the same strings.
49
00:02:41,540 --> 00:02:45,695
Because, well, if we take
here the meaning of that,
50
00:02:45,695 --> 00:02:54,350
that would be just L(r1 + r2 + r3),
51
00:02:54,350 --> 00:03:04,100
which is equal to L(r1 + r2) u L(r3).
52
00:03:04,100 --> 00:03:10,595
And that is equal to of
53
00:03:10,595 --> 00:03:15,710
L(r1) u L(r2) u L(r3).
54
00:03:15,710 --> 00:03:17,930
And if you do the
same calculation
55
00:03:17,930 --> 00:03:19,445
for that regular expression,
56
00:03:19,445 --> 00:03:22,940
you will find out the
two sets are the same.
57
00:03:22,940 --> 00:03:26,195
So both regular expressions
can match the same strings.
58
00:03:26,195 --> 00:03:28,805
So even if they're different
regular expressions,
59
00:03:28,805 --> 00:03:31,220
we can regard them as eqivalent.
60
00:03:31,220 --> 00:03:33,290
And so that's why we can forget
61
00:03:33,290 --> 00:03:35,270
about writing these parentheses.
62
00:03:35,270 --> 00:03:40,205
Let's have a look
at some more examples.
63
00:03:40,205 --> 00:03:41,870
So the first one here,
64
00:03:41,870 --> 00:03:43,205
that is now clear.
65
00:03:43,205 --> 00:03:45,200
We did this calculation already
66
00:03:45,200 --> 00:03:47,480
for arbitrary regular expressions.
67
00:03:47,480 --> 00:03:49,520
Here it is for
concrete ones a, b and c.
68
00:03:49,520 --> 00:03:52,690
Here on the left-hand side,
69
00:03:52,690 --> 00:03:54,895
the regular expression
has the same meaning
70
00:03:54,895 --> 00:03:56,620
as on the right-hand side.
71
00:03:56,620 --> 00:03:58,420
So they are equivalent.
72
00:03:58,420 --> 00:04:01,375
We can ignore the parentheses.
73
00:04:01,375 --> 00:04:03,220
If I have a choice,
74
00:04:03,220 --> 00:04:06,610
but the choice is
the same a or a,
75
00:04:06,610 --> 00:04:09,265
then this is
equivalent to just a.
76
00:04:09,265 --> 00:04:13,075
So the same choice doesn't
introduce anything more.
77
00:04:13,075 --> 00:04:15,370
In the next case, if I have
78
00:04:15,370 --> 00:04:19,450
a regular expression
which can match a or b,
79
00:04:19,450 --> 00:04:23,860
that can match the same
strings as b or a.
80
00:04:23,860 --> 00:04:27,325
So you have a kind of
commutativity for the plus,
81
00:04:27,325 --> 00:04:29,485
which is like on natural numbers.
82
00:04:29,485 --> 00:04:32,880
But you would not have a
communitivity for the sequence:
83
00:04:32,880 --> 00:04:37,070
if you have
first a and then b,
84
00:04:37,070 --> 00:04:40,850
that's not the same as
matching first b and then a.
85
00:04:40,850 --> 00:04:42,800
So for the sequence ...
86
00:04:42,800 --> 00:04:44,615
they are not equivalent.
87
00:04:44,615 --> 00:04:49,475
However, for the sequence I
can, like for the plus, ignore
88
00:04:49,475 --> 00:04:51,245
prarentheses.
89
00:04:51,245 --> 00:04:55,070
If I have the parentheses
and this way,
90
00:04:55,070 --> 00:04:57,785
or I have it in this way.
91
00:04:57,785 --> 00:05:00,605
These are two different
regular expressions,
92
00:05:00,605 --> 00:05:02,120
but they have the same meaning.
93
00:05:02,120 --> 00:05:03,815
So they are equivalent.
94
00:05:03,815 --> 00:05:05,780
And so I can leave out parentheses
95
00:05:05,780 --> 00:05:09,170
for times as well.
96
00:05:09,170 --> 00:05:12,185
The next one is a slightly
more interesting one.
97
00:05:12,185 --> 00:05:15,020
If I have a regular
expression which can match
98
00:05:15,020 --> 00:05:19,655
c first followed by (a + b),
99
00:05:19,655 --> 00:05:25,355
then this is the same as
first c followed by a,
100
00:05:25,355 --> 00:05:28,640
or c followed by b.
101
00:05:28,640 --> 00:05:31,265
So that's the kind of
distributivity law
102
00:05:31,265 --> 00:05:33,545
on regular expressions.
103
00:05:33,545 --> 00:05:36,350
But they're also regular expressions
which are not equivalent.
104
00:05:36,350 --> 00:05:38,990
If I have the regular
expression which can
105
00:05:38,990 --> 00:05:42,800
match the string containing
exactly two a's.
106
00:05:42,800 --> 00:05:44,240
That is not equivalent
107
00:05:44,240 --> 00:05:46,730
to the regular
expression which can just match
108
00:05:46,730 --> 00:05:49,250
a single a. And similarly
109
00:05:49,250 --> 00:05:51,680
in this case, if I can match
110
00:05:51,680 --> 00:05:56,075
a or the string b followed by c,
111
00:05:56,075 --> 00:05:59,825
that is not the same as a or b,
112
00:05:59,825 --> 00:06:02,000
followed by a or c.
113
00:06:02,000 --> 00:06:05,990
I'll let you think about
why is that not equivalent.
114
00:06:05,990 --> 00:06:08,060
Essentially you
have to find out what's
115
00:06:08,060 --> 00:06:10,475
the meaning of both
regular expressions.
116
00:06:10,475 --> 00:06:14,090
And are they the
same sets or not?
117
00:06:14,090 --> 00:06:17,435
There are again some
interesting corner cases.
118
00:06:17,435 --> 00:06:20,540
If I have a regular expression
that can match a,
119
00:06:20,540 --> 00:06:23,450
but followed by the regular
expression which cannot match
120
00:06:23,450 --> 00:06:25,670
anything, that is not equivalent
121
00:06:25,670 --> 00:06:29,255
to the regular
expression which can match a.
122
00:06:29,255 --> 00:06:31,340
Again here you have
to do the calculation
123
00:06:31,340 --> 00:06:32,915
to convince you of that.
124
00:06:32,915 --> 00:06:35,840
Similarly, if I have a regular
expression which can
125
00:06:35,840 --> 00:06:38,750
match a or the empty string,
126
00:06:38,750 --> 00:06:40,640
this is not equivalent to
127
00:06:40,640 --> 00:06:43,355
the regular expression
which can just match a.
128
00:06:43,355 --> 00:06:46,925
Here are some interesting
ones with zeros and ones.
129
00:06:46,925 --> 00:06:48,860
So if I have the regular expression
130
00:06:48,860 --> 00:06:50,975
that can match the empty string,
131
00:06:50,975 --> 00:06:53,540
this will be actually equivalent to
132
00:06:53,540 --> 00:06:56,435
the regular expression
which can match nothing,
133
00:06:56,435 --> 00:06:59,405
but star of that.
134
00:06:59,405 --> 00:07:01,970
Remember the star
always introduces,
135
00:07:01,970 --> 00:07:04,370
no matter what, the empty string.
136
00:07:04,370 --> 00:07:06,170
So this regular expression can match
137
00:07:06,170 --> 00:07:08,930
the empty string,
just like the 1.
138
00:07:08,930 --> 00:07:12,125
So these two expressions
are not the same,
139
00:07:12,125 --> 00:07:14,210
but they are equivalent.
140
00:07:14,210 --> 00:07:16,700
And it doesn't matter
whether you take
141
00:07:16,700 --> 00:07:20,090
the empty string to the star,
142
00:07:20,090 --> 00:07:23,855
because if I can match 0 or
more times the empty string,
143
00:07:23,855 --> 00:07:27,450
that will be equivalent to
just the empty string.
144
00:07:27,520 --> 00:07:32,510
Slightly similar to the
third case is this one.
145
00:07:32,510 --> 00:07:35,720
Zero star is not equivalent to 0
146
00:07:35,720 --> 00:07:40,025
because that can match
exactly the empty string.
147
00:07:40,025 --> 00:07:42,740
This cannot match anything.
148
00:07:42,740 --> 00:07:44,839
While the previous
149
00:07:44,839 --> 00:07:47,540
equivalences are all very
instructive to look at,
150
00:07:47,540 --> 00:07:49,910
these are the
equivalences we need
151
00:07:49,910 --> 00:07:52,685
in our regular expression matchers.
152
00:07:52,685 --> 00:07:54,350
And they are defined for
153
00:07:54,350 --> 00:07:58,160
all regular expressions r.
So r plus 0,
154
00:07:58,160 --> 00:08:00,470
no matter what r looks
like is equivalent
155
00:08:00,470 --> 00:08:02,945
to just r. Similarly
156
00:08:02,945 --> 00:08:05,930
0 plus r is also
equivalent to r.
157
00:08:05,930 --> 00:08:08,525
The order of this
choice doesn't matter;
158
00:08:08,525 --> 00:08:11,495
r followed by 1,
159
00:08:11,495 --> 00:08:14,030
that's the sequence
regular expression, is
160
00:08:14,030 --> 00:08:16,760
equivalent to r. The
sam, r followed by
161
00:08:16,760 --> 00:08:19,010
r is equivalent to r.
162
00:08:19,010 --> 00:08:20,990
And r followed by
163
00:08:20,990 --> 00:08:23,390
the regular expression which
cannot match anything,
164
00:08:23,390 --> 00:08:27,455
is equivalent to just 0.
165
00:08:27,455 --> 00:08:30,740
And 0 followed by r is also equivalent to 0.
166
00:08:30,740 --> 00:08:33,615
And if you have the
choice of r plus r,
167
00:08:33,615 --> 00:08:37,210
that is also
equivalent to just r.
168
00:08:37,210 --> 00:08:40,270
What we're going to do
with these equivalences is
169
00:08:40,270 --> 00:08:42,820
that we regard them as
simplification rules.
170
00:08:42,820 --> 00:08:43,930
So whenever we see
171
00:08:43,930 --> 00:08:46,465
this regular expression
in our algorithm,
172
00:08:46,465 --> 00:08:48,580
we will replace it by
the right-hand side.
173
00:08:48,580 --> 00:08:51,715
So we will orient
these equivalences.
174
00:08:51,715 --> 00:08:53,650
If we see, this regular expression,
175
00:08:53,650 --> 00:08:55,225
we will replace it by that one.
176
00:08:55,225 --> 00:08:58,945
And it will not matter
because the left-hand sides
177
00:08:58,945 --> 00:09:01,255
can match exactly
the same strings
178
00:09:01,255 --> 00:09:03,475
as the right-hand sides.
179
00:09:03,475 --> 00:09:06,370
Here two quick examples.
180
00:09:06,370 --> 00:09:08,680
The first one, let's
assume you have
181
00:09:08,680 --> 00:09:10,270
a regular expression r and
182
00:09:10,270 --> 00:09:11,905
there is something
in front of it.
183
00:09:11,905 --> 00:09:13,720
This "something in front of it"
184
00:09:13,720 --> 00:09:15,870
can be simplified as follows.
185
00:09:15,870 --> 00:09:20,000
This 1 times b
can be simplified to b.
186
00:09:20,000 --> 00:09:23,555
Then this b plus 0 can
be simplified to b.
187
00:09:23,555 --> 00:09:25,490
And assuming that nothing can
188
00:09:25,490 --> 00:09:27,470
be simplified inside this r,
189
00:09:27,470 --> 00:09:28,700
then this would be
190
00:09:28,700 --> 00:09:33,050
the simplified version
of this regular expression.
191
00:09:33,050 --> 00:09:34,820
And because the simplification
192
00:09:34,820 --> 00:09:36,965
rules preserve what can be matched,
193
00:09:36,965 --> 00:09:39,080
we can be sure that
this regular expression
194
00:09:39,080 --> 00:09:41,255
can match exactly the strings
195
00:09:41,255 --> 00:09:43,940
this regular expression can match.
196
00:09:43,940 --> 00:09:45,740
Here is the other example.
197
00:09:45,740 --> 00:09:49,550
This has just a tiny change
that this becomes here as 0.
198
00:09:49,550 --> 00:09:54,485
Then 0 times b can be
replaced by just 0.
199
00:09:54,485 --> 00:09:56,705
Then they are actually two
rules which could apply
200
00:09:56,705 --> 00:09:59,014
either that we have
the same choice
201
00:09:59,014 --> 00:10:01,160
or we can just say something plus
202
00:10:01,160 --> 00:10:04,100
0 will always go to something.
203
00:10:04,100 --> 00:10:06,245
So we can simplify it to that.
204
00:10:06,245 --> 00:10:08,510
And then we have
0 times r again,
205
00:10:08,510 --> 00:10:10,460
and that can be simplified to 0.
206
00:10:10,460 --> 00:10:12,080
So actually what we find out with
207
00:10:12,080 --> 00:10:14,645
this calculation is that
this regular expression,
208
00:10:14,645 --> 00:10:16,835
even if it looks
quite complicated,
209
00:10:16,835 --> 00:10:19,295
actually doesn't
match any string.
210
00:10:19,295 --> 00:10:21,290
Because it matches exactly
211
00:10:21,290 --> 00:10:23,420
those string this regular
expression can match.
212
00:10:23,420 --> 00:10:26,285
And this reg expression
cannot match any.
213
00:10:26,285 --> 00:10:29,930
We need one more
operation on languages.
214
00:10:29,930 --> 00:10:31,700
I call this operation
215
00:10:31,700 --> 00:10:35,165
the semantic derivative
of a language.
216
00:10:35,165 --> 00:10:37,775
This operation takes
two arguments.
217
00:10:37,775 --> 00:10:40,670
It takes a character
which we take
218
00:10:40,670 --> 00:10:43,925
the semantic derivative
according to,
219
00:10:43,925 --> 00:10:45,980
and it takes a language.
220
00:10:45,980 --> 00:10:49,579
And what it does is it
looks into this language
221
00:10:49,579 --> 00:10:51,365
and looks for all the strings
222
00:10:51,365 --> 00:10:53,735
that start with this character,
223
00:10:53,735 --> 00:10:56,405
let's say c, and then
224
00:10:56,405 --> 00:11:00,920
just strips off that c.
So here's an example.
225
00:11:00,920 --> 00:11:02,975
Suppose you have a language A,
226
00:11:02,975 --> 00:11:04,460
with the strings
227
00:11:04,460 --> 00:11:06,965
foo, bar and frak.
228
00:11:06,965 --> 00:11:10,835
And you take the semantic
derivative according to f,
229
00:11:10,835 --> 00:11:14,450
then we discard all the
strings which do not
230
00:11:14,450 --> 00:11:18,320
start with an f. So
bar will be discarded.
231
00:11:18,320 --> 00:11:22,850
Foo and frac start with
an f. So we keep them
232
00:11:22,850 --> 00:11:25,265
but we strip off the first f.
233
00:11:25,265 --> 00:11:29,045
So here we have only oo and rak.
234
00:11:29,045 --> 00:11:31,609
If you take the
semantic derivative
235
00:11:31,609 --> 00:11:33,830
of that language according to b,
236
00:11:33,830 --> 00:11:37,190
then we will discard foo and
frak because they don't
237
00:11:37,190 --> 00:11:40,940
start with b, and just keep bar.
238
00:11:40,940 --> 00:11:44,585
But again, we have to
strip off this first b.
239
00:11:44,585 --> 00:11:49,305
And if you take the semantic
derivative according to a,
240
00:11:49,305 --> 00:11:52,585
then none of these
strings start with a.
241
00:11:52,585 --> 00:11:56,800
So this will be defined
as just the empty set.
242
00:11:56,800 --> 00:11:59,695
You can slightly
generalized this
243
00:11:59,695 --> 00:12:02,560
semantic derivative
to also strings.
244
00:12:02,560 --> 00:12:05,170
So imagine you have
a language A and you
245
00:12:05,170 --> 00:12:08,274
build the semantic derivative
according to a string s.
246
00:12:08,274 --> 00:12:10,990
Then you look in
this language and
247
00:12:10,990 --> 00:12:13,420
you look for all the
strings that start with
248
00:12:13,420 --> 00:12:19,555
this s. But you strip
off that s. Ok?
249
00:12:19,555 --> 00:12:23,830
So if you have a string
starting with the s,
250
00:12:23,830 --> 00:12:26,065
then you keep that string,
251
00:12:26,065 --> 00:12:27,490
but you keep only the rest...
252
00:12:27,490 --> 00:12:28,810
what's coming after this s.
253
00:12:28,810 --> 00:12:32,010
That is here indicated
with this s'.
254
00:12:32,010 --> 00:12:35,330
I also have this convention,
this personal convention
255
00:12:35,330 --> 00:12:40,055
I have to say, that everything
which works on lists,
256
00:12:40,055 --> 00:12:42,185
remember strings are
lists of characters.
257
00:12:42,185 --> 00:12:46,655
I just put there an 's'. So
here's the one for characters.
258
00:12:46,655 --> 00:12:48,680
I just call it Der with a capital
259
00:12:48,680 --> 00:12:51,740
D. And I call that Ders,
260
00:12:51,740 --> 00:12:54,350
because that works over strings.
261
00:12:54,350 --> 00:12:55,865
And you can see how it would
262
00:12:55,865 --> 00:12:59,750
be defined if you only take this
as a primitive operation.
263
00:12:59,750 --> 00:13:01,340
We would just need to iterate
264
00:13:01,340 --> 00:13:04,050
that essentially for this Ders.
265
00:13:04,060 --> 00:13:07,970
Ok, we now have all
the theory in place.
266
00:13:07,970 --> 00:13:11,345
And we can finally start
implementing our algorithm.
267
00:13:11,345 --> 00:13:14,580
That's when we'll do
in the next video.