1
00:00:06,710 --> 00:00:09,225
Thanks for tuning in again.
2
00:00:09,225 --> 00:00:11,640
In this video, we want to specify
3
00:00:11,640 --> 00:00:14,370
what problem our regular
expression matcher
4
00:00:14,370 --> 00:00:16,155
is actually supposed to solve.
5
00:00:16,155 --> 00:00:18,900
The reason is that
we know that some of
6
00:00:18,900 --> 00:00:21,585
the existing regular
expression matching engines
7
00:00:21,585 --> 00:00:25,200
are not just abysmally
slow in some examples,
8
00:00:25,200 --> 00:00:27,105
as you've seen in the
previous video,
9
00:00:27,105 --> 00:00:30,570
but also produce sometimes
incorrect results.
10
00:00:30,570 --> 00:00:33,330
In order to avoid
this with our matcher,
11
00:00:33,330 --> 00:00:35,325
we need to somehow explain
12
00:00:35,325 --> 00:00:39,255
precisely what is the problem
our algorithm solves.
13
00:00:39,255 --> 00:00:41,935
This will require
a bit of theory, but
14
00:00:41,935 --> 00:00:45,335
I hope it is nevertheless
a bit of fun.
15
00:00:45,335 --> 00:00:47,915
First, we have to specify
16
00:00:47,915 --> 00:00:50,585
what we mean by a
regular expression.
17
00:00:50,585 --> 00:00:53,210
You've seen earlier some
examples. They were
18
00:00:53,210 --> 00:00:56,060
actually taken or
inspired by what
19
00:00:56,060 --> 00:00:58,850
is available in standard
regular expression matching
20
00:00:58,850 --> 00:01:02,330
engines, like star, plus and n-times.
21
00:01:02,330 --> 00:01:05,690
But for many tasks,
for our algorithm,
22
00:01:05,690 --> 00:01:10,174
we will focus only what I call
basic regular expressions.
23
00:01:10,174 --> 00:01:11,840
Since I'm lazy, I will call
24
00:01:11,840 --> 00:01:13,550
these basic regular expressions
25
00:01:13,550 --> 00:01:15,485
just as regular expressions.
26
00:01:15,485 --> 00:01:17,405
And the ones you've seen earlier
27
00:01:17,405 --> 00:01:19,400
as extended regular expressions.
28
00:01:19,400 --> 00:01:22,940
So the basic regulare expressions,
or just regular expressions,
29
00:01:22,940 --> 00:01:25,280
they will have characters.
30
00:01:25,280 --> 00:01:27,170
So you can match any character,
31
00:01:27,170 --> 00:01:31,370
a,b,c to z or 0 to 9.
32
00:01:31,370 --> 00:01:35,525
Any Ascii character. 'c' here
is just a representative.
33
00:01:35,525 --> 00:01:38,825
So we can match
single characters.
34
00:01:38,825 --> 00:01:42,440
Then we can match alternatives.
35
00:01:42,440 --> 00:01:44,930
That means a string
is either matched
36
00:01:44,930 --> 00:01:46,730
by the regular expression r1
37
00:01:46,730 --> 00:01:49,324
or by the regular expression r2.
38
00:01:49,324 --> 00:01:52,790
And for the
alternative we write +.
39
00:01:52,790 --> 00:01:55,175
Then we also have sequence.
40
00:01:55,175 --> 00:01:57,410
This sequence regular
expression essentially
41
00:01:57,410 --> 00:01:59,915
says that a string needs to be matched
42
00:01:59,915 --> 00:02:02,210
the first part by
the regular expression r1
43
00:02:02,210 --> 00:02:06,275
and then the second
part by the r2.
44
00:02:06,275 --> 00:02:10,190
And then we have also the
star regular expression,
45
00:02:10,190 --> 00:02:12,980
which says the regular
expression needs to match
46
00:02:12,980 --> 00:02:16,520
the string with zero
or more copies.
47
00:02:16,520 --> 00:02:18,140
And then we also have some
48
00:02:18,140 --> 00:02:20,060
slightly strange
regular expressions.
49
00:02:20,060 --> 00:02:22,505
We have the regular expression 1,
50
00:02:22,505 --> 00:02:25,910
which can only match
the empty string.
51
00:02:25,910 --> 00:02:29,075
I'm using here the
notation 1 for that
52
00:02:29,075 --> 00:02:31,340
and in my writing I will always
53
00:02:31,340 --> 00:02:33,440
make sure that for the
regular expression
54
00:02:33,440 --> 00:02:35,765
I will write the
1 in a bold font.
55
00:02:35,765 --> 00:02:38,510
So whenever you see
a 1 in bold font,
56
00:02:38,510 --> 00:02:40,395
this is not the 1, but
57
00:02:40,395 --> 00:02:44,300
the regular expression which
can match the empty string.
58
00:02:44,300 --> 00:02:48,050
And we also have the
regular expression 0,
59
00:02:48,050 --> 00:02:50,315
which cannot match
anything at all.
60
00:02:50,315 --> 00:02:51,695
You might think, well,
61
00:02:51,695 --> 00:02:54,635
that's not much use if it cannot
match anything at all,
62
00:02:54,635 --> 00:02:58,130
but you will see why that
one is important later on.
63
00:02:58,130 --> 00:03:00,785
So our basic regular expressions,
64
00:03:00,785 --> 00:03:02,375
they will be 0,
65
00:03:02,375 --> 00:03:08,390
1, characters, alternatives,
sequences and stars.
66
00:03:08,390 --> 00:03:12,170
And these are all the
basic regular expressions.
67
00:03:12,170 --> 00:03:16,280
If this definition is a
bit too abstract for you,
68
00:03:16,280 --> 00:03:18,560
we can also look at
the concrete code,
69
00:03:18,560 --> 00:03:23,060
how that would pan out when
actually writing some Scala.
70
00:03:23,060 --> 00:03:28,040
I promised you, I show
you always my code in Scala.
71
00:03:28,040 --> 00:03:29,480
So here you would have
72
00:03:29,480 --> 00:03:32,885
first an abstract class
for regular expressions.
73
00:03:32,885 --> 00:03:37,580
Then you have one regular
expression for 0,
74
00:03:37,580 --> 00:03:41,540
one regular expression for 1,
75
00:03:41,540 --> 00:03:42,875
one regular expression, which
takes an argument,
76
00:03:42,875 --> 00:03:45,050
the character you want to match,
77
00:03:45,050 --> 00:03:47,915
the characters a,b, c and so on.
78
00:03:47,915 --> 00:03:50,945
Then we have an alternative
regular expression,
79
00:03:50,945 --> 00:03:53,480
which takes the first
alternative and
80
00:03:53,480 --> 00:03:56,435
the second alternative
as arguments.
81
00:03:56,435 --> 00:03:59,690
And we have a sequence
regular expression. Again,
82
00:03:59,690 --> 00:04:01,850
which takes the
first component and
83
00:04:01,850 --> 00:04:04,730
the second component
as two arguments.
84
00:04:04,730 --> 00:04:07,249
And we have the star
regular expression,
85
00:04:07,249 --> 00:04:10,880
which just take one regular
expression as argument.
86
00:04:10,880 --> 00:04:16,115
And all these reg expressions
extend our abstract class.
87
00:04:16,115 --> 00:04:20,300
For whatever I do in
this module here I have
88
00:04:20,300 --> 00:04:23,300
the convention that all
the regular expressions
89
00:04:23,300 --> 00:04:25,550
are written with capital letters.
90
00:04:25,550 --> 00:04:26,885
As you can see that here,
91
00:04:26,885 --> 00:04:31,685
O, 1, character, these will be
always regular expressions.
92
00:04:31,685 --> 00:04:34,370
They have all capital letters.
93
00:04:34,370 --> 00:04:36,484
Let's for a moment,
94
00:04:36,484 --> 00:04:38,720
play around with this definition.
95
00:04:38,720 --> 00:04:41,945
I'm using here the
Ammonite REPL.
96
00:04:41,945 --> 00:04:46,950
And I can evaluate
this definition.
97
00:04:53,430 --> 00:04:55,810
And now I can start to
98
00:04:55,810 --> 00:04:58,570
define particular
regular expressions.
99
00:04:58,570 --> 00:05:00,340
For example, if I need
a regular expression
100
00:05:00,340 --> 00:05:02,860
which can recognise
the character a,
101
00:05:02,860 --> 00:05:06,025
then I would write
something like this.
102
00:05:06,025 --> 00:05:08,710
So this regular expression
takes an argument,
103
00:05:08,710 --> 00:05:13,615
the character 'a' to specify
which character to match.
104
00:05:13,615 --> 00:05:16,945
We do this obviously also with 'b'.
105
00:05:16,945 --> 00:05:19,405
And I can do that with
106
00:05:19,405 --> 00:05:22,975
'c'. So now we have three
regular expressions.
107
00:05:22,975 --> 00:05:25,570
If you look very carefully
at this definition,
108
00:05:25,570 --> 00:05:27,070
you can actually see
109
00:05:27,070 --> 00:05:29,940
these regular
expressions are trees.
110
00:05:29,940 --> 00:05:33,365
So no matter what we
write down on paper,
111
00:05:33,365 --> 00:05:36,755
they are behind the
scenes always trees.
112
00:05:36,755 --> 00:05:40,010
And you can see that
actually in this definition.
113
00:05:40,010 --> 00:05:44,330
If you define two regular
expressions r1 and r2.
114
00:05:44,330 --> 00:05:49,310
They are essentially
the alternative of a, b and c.
115
00:05:49,310 --> 00:05:52,760
Then this regular expression
116
00:05:52,760 --> 00:05:54,710
can match either the character
117
00:05:54,710 --> 00:05:57,980
a or the character b
or the character c.
118
00:05:57,980 --> 00:06:01,640
And the same for the
regular expression r2.
119
00:06:01,640 --> 00:06:03,875
So let me just evaluate that.
120
00:06:03,875 --> 00:06:05,690
And even though these are
121
00:06:05,690 --> 00:06:07,175
two regular expressions
which can match
122
00:06:07,175 --> 00:06:11,750
exactly the same things,
they a different trees.
123
00:06:11,750 --> 00:06:14,195
So if I ask Scala,
124
00:06:14,195 --> 00:06:16,460
are these trees different?
125
00:06:16,460 --> 00:06:19,250
Or ask if they're
126
00:06:19,250 --> 00:06:21,865
the same, then Scala will say No,
127
00:06:21,865 --> 00:06:25,440
they actually different trees.
128
00:06:25,450 --> 00:06:28,459
Let's come back to
this definition.
129
00:06:28,459 --> 00:06:31,760
If we want to write down
regular expressions on paper,
130
00:06:31,760 --> 00:06:33,620
then we want to be sloppy as
131
00:06:33,620 --> 00:06:35,750
mathematicians rather than as
132
00:06:35,750 --> 00:06:37,745
precise as computer scientists.
133
00:06:37,745 --> 00:06:40,490
So when we want to write down
a regular expression which can
134
00:06:40,490 --> 00:06:43,955
either match the character
a or the character b,
135
00:06:43,955 --> 00:06:49,130
then we would write down
something like this, a plus b.
136
00:06:49,130 --> 00:06:51,170
And if you want to have
the regular expression
137
00:06:51,170 --> 00:06:52,625
which can either match
138
00:06:52,625 --> 00:06:55,925
the character a or b or c,
139
00:06:55,925 --> 00:06:58,340
we will write
something like this.
140
00:06:58,340 --> 00:07:01,370
But of course behind the
scenes, these are trees.
141
00:07:01,370 --> 00:07:04,460
So we should have written
them with parentheses.
142
00:07:04,460 --> 00:07:06,440
And you can see
actually, there are two
143
00:07:06,440 --> 00:07:08,990
regular expressions I
could have written down.
144
00:07:08,990 --> 00:07:11,270
They're different.
145
00:07:11,270 --> 00:07:12,710
Just by convention,
146
00:07:12,710 --> 00:07:15,575
we on't write these parentheses.
147
00:07:15,575 --> 00:07:18,740
And that is similar with sequences.
148
00:07:18,740 --> 00:07:20,000
If I want to write down
149
00:07:20,000 --> 00:07:22,955
the regular expression which
can match first an 'a',
150
00:07:22,955 --> 00:07:25,010
then a 'b', and then a 'c',
151
00:07:25,010 --> 00:07:28,160
then I would write down
something like this.
152
00:07:28,160 --> 00:07:32,120
Just, there are again
153
00:07:32,120 --> 00:07:35,735
two regular expressions I
could have written down.
154
00:07:35,735 --> 00:07:38,480
Again by convention we don't
155
00:07:38,480 --> 00:07:40,670
write these parentheses though.
156
00:07:40,670 --> 00:07:42,350
However, sometimes we have to be
157
00:07:42,350 --> 00:07:43,940
very careful with parentheses,
158
00:07:43,940 --> 00:07:47,195
especially with star.
159
00:07:47,195 --> 00:07:50,525
Because this regular expression
is definitely not
160
00:07:50,525 --> 00:07:54,900
the same as this regular expression.
161
00:07:56,100 --> 00:07:59,410
The first one here can match
162
00:07:59,410 --> 00:08:03,610
any strings containing a or b's.
163
00:08:03,610 --> 00:08:05,860
While this regular expression can
164
00:08:05,860 --> 00:08:07,945
only match the single character
165
00:08:07,945 --> 00:08:13,300
a or any string
containing only b's.
166
00:08:13,300 --> 00:08:15,265
So to make the difference clear,
167
00:08:15,265 --> 00:08:20,065
in this example, we would have
to use the parentheses.
168
00:08:20,065 --> 00:08:23,140
There's one more issue
with this definition.
169
00:08:23,140 --> 00:08:26,635
Why do we focus on these
basic regular expressions?
170
00:08:26,635 --> 00:08:28,660
Why don't we also include
171
00:08:28,660 --> 00:08:31,285
the ones from the
extended regular expressions.
172
00:08:31,285 --> 00:08:33,055
The answers very easy.
173
00:08:33,055 --> 00:08:35,680
These basic regular
expressions can be used
174
00:08:35,680 --> 00:08:38,370
to represent also
the extended ones.
175
00:08:38,370 --> 00:08:40,220
Let me give you some examples.
176
00:08:40,220 --> 00:08:44,225
If I have a regular
expression r+, for example,
177
00:08:44,225 --> 00:08:46,280
then the meaning
was I have to use
178
00:08:46,280 --> 00:08:49,115
at least one or more copies
179
00:08:49,115 --> 00:08:51,200
of this r to
match a string.
180
00:08:51,200 --> 00:08:53,810
Well, one or more copies
can be represented by
181
00:08:53,810 --> 00:08:58,385
the basic ones as just
r followed by r*.
182
00:08:58,385 --> 00:09:01,760
Meaning I have to use one
copy of r, followed by
183
00:09:01,760 --> 00:09:05,150
0 or more copies of r.
184
00:09:05,150 --> 00:09:07,895
Similarly, if I have the optional
regular expression,
185
00:09:07,895 --> 00:09:10,715
which is supposed to
match a string
186
00:09:10,715 --> 00:09:13,865
by using r, or match
the empty string.
187
00:09:13,865 --> 00:09:19,295
Then this can be obviously
defined as r + 1.
188
00:09:19,295 --> 00:09:23,945
So here is the bold
regular expression 1,
189
00:09:23,945 --> 00:09:26,180
which means it either can
190
00:09:26,180 --> 00:09:28,205
recognize whatever
r can recognize,
191
00:09:28,205 --> 00:09:30,470
or it can recognize
the empty string.
192
00:09:30,470 --> 00:09:35,150
And if I have ranges, like a
to z, then I can define
193
00:09:35,150 --> 00:09:41,135
that as a + b + c + ...
and so on until z.
194
00:09:41,135 --> 00:09:45,920
Maybe this definition is not
good in terms of runtime,
195
00:09:45,920 --> 00:09:47,960
but in terms of just being able
196
00:09:47,960 --> 00:09:50,780
to recognize strings
or match strings,
197
00:09:50,780 --> 00:09:54,680
the basic regular expressions
will be just sufficient.
198
00:09:54,680 --> 00:09:56,690
Unfortunately, we
also need to have
199
00:09:56,690 --> 00:09:58,850
a quick chat about strings.
200
00:09:58,850 --> 00:10:02,255
In Scala, it's crystal
clear what a string is.
201
00:10:02,255 --> 00:10:05,480
There's a separate datatype
which is called string.
202
00:10:05,480 --> 00:10:07,895
So here, for example,
is a string.
203
00:10:07,895 --> 00:10:09,200
And as you can see,
204
00:10:09,200 --> 00:10:11,105
it is of the type string.
205
00:10:11,105 --> 00:10:13,985
And the empty string
will be just that.
206
00:10:13,985 --> 00:10:16,160
However, when we write things down on
207
00:10:16,160 --> 00:10:18,320
paper and think
about our algorithm,
208
00:10:18,320 --> 00:10:22,790
we want to think of strings
as lists of characters.
209
00:10:22,790 --> 00:10:26,070
So more something like this.
210
00:10:27,070 --> 00:10:31,745
You can see here, this is actually
a list of characters.
211
00:10:31,745 --> 00:10:35,150
And the two operations
we need are taking
212
00:10:35,150 --> 00:10:37,280
the head of this list and
213
00:10:37,280 --> 00:10:39,770
the rest of the list
or tail of the list.
214
00:10:39,770 --> 00:10:41,720
That's why we want
to regard them as
215
00:10:41,720 --> 00:10:45,260
lists rather than strings.
216
00:10:45,260 --> 00:10:48,200
So if I'm using a
string like this,
217
00:10:48,200 --> 00:10:51,935
then on paper I always will
write something like that.
218
00:10:51,935 --> 00:10:54,575
Or since I'm lazy, just that.
219
00:10:54,575 --> 00:10:56,675
And for the empty string,
220
00:10:56,675 --> 00:10:59,210
I will write either
the empty list, with
221
00:10:59,210 --> 00:11:03,920
two brackets or,
being lazy, just that.
222
00:11:03,920 --> 00:11:06,620
Actually there is one
more operation we need on
223
00:11:06,620 --> 00:11:09,410
strings and that
is concatenation.
224
00:11:09,410 --> 00:11:11,255
If you have a string s1,
225
00:11:11,255 --> 00:11:14,510
string s2, and put an
at symbol in between,
226
00:11:14,510 --> 00:11:18,050
that means we want to
concatenate both strings.
227
00:11:18,050 --> 00:11:22,625
So foo concatenated with
bar, would be foobar.
228
00:11:22,625 --> 00:11:25,085
And any string concatenated with
229
00:11:25,085 --> 00:11:27,950
the empty string
is left untouched.
230
00:11:27,950 --> 00:11:31,310
So baz concatenated with
231
00:11:31,310 --> 00:11:33,545
the empty string, is just baz.
232
00:11:33,545 --> 00:11:37,295
So that's like if we have
strings as lists of characters,
233
00:11:37,295 --> 00:11:39,755
that will be just list append.
234
00:11:39,755 --> 00:11:41,480
In the next video,
235
00:11:41,480 --> 00:11:43,160
we will use these definitions
236
00:11:43,160 --> 00:11:45,050
and introduce the notion of what
237
00:11:45,050 --> 00:11:46,850
a language is and
238
00:11:46,850 --> 00:11:49,920
what the meaning of a
regular expression is.