1
00:00:06,350 --> 00:00:10,305
They come back. I
can hear you saying,
2
00:00:10,305 --> 00:00:12,000
yes, you tried it out on
3
00:00:12,000 --> 00:00:14,760
one example and you
were much better.
4
00:00:14,760 --> 00:00:17,415
But how about on other examples?
5
00:00:17,415 --> 00:00:21,265
Specifically, we had two
regular expressions.
6
00:00:21,265 --> 00:00:23,480
How about the other case?
7
00:00:23,480 --> 00:00:27,480
Where let's have a look at
that other case in this video.
8
00:00:29,140 --> 00:00:32,705
So I'm back here
in this Reto file.
9
00:00:32,705 --> 00:00:35,180
And here's this test
case which run quite
10
00:00:35,180 --> 00:00:39,665
nicely for strings between 01000.
11
00:00:39,665 --> 00:00:42,725
Here is the other test case.
12
00:00:42,725 --> 00:00:44,090
I still run it only
13
00:00:44,090 --> 00:00:48,470
for relatively small
strings between 020.
14
00:00:48,470 --> 00:00:50,240
And let's see what it say.
15
00:00:50,240 --> 00:00:51,800
So I'm running the test in
16
00:00:51,800 --> 00:00:57,320
the ammonoids repo and
doesn't look too bad.
17
00:00:57,320 --> 00:01:01,160
But this might be because
20 is not general enough.
18
00:01:01,160 --> 00:01:03,620
So let's try it with 30.
19
00:01:03,620 --> 00:01:06,530
Let's run this test again.
20
00:01:06,530 --> 00:01:11,075
And 20 is quite okay.
21
00:01:11,075 --> 00:01:15,965
22, okay, that's now
almost ten times as much.
22
00:01:15,965 --> 00:01:18,995
And then the next
one would be 24.
23
00:01:18,995 --> 00:01:23,615
And we're waiting and waiting.
24
00:01:23,615 --> 00:01:26,410
And we are waiting.
25
00:01:26,410 --> 00:01:29,300
Actually, it isn't
calculated at all.
26
00:01:29,300 --> 00:01:31,399
It run out of memory.
27
00:01:31,399 --> 00:01:33,905
So here is something going on,
28
00:01:33,905 --> 00:01:37,640
which is Daphne bad and we
have to have a look at that.
29
00:01:37,640 --> 00:01:40,640
Okay? It's always
instructive with
30
00:01:40,640 --> 00:01:43,460
this algorithm to
look at the sizes of
31
00:01:43,460 --> 00:01:45,695
the record expressions
we calculate
32
00:01:45,695 --> 00:01:49,625
the evil to Is this
a star, star B.
33
00:01:49,625 --> 00:01:51,800
So there's nothing we
can compress there.
34
00:01:51,800 --> 00:01:55,490
It has just stars and
sequences and nothing else.
35
00:01:55,490 --> 00:01:58,430
And so it's not that we
can play the same trick
36
00:01:58,430 --> 00:02:01,805
of trying to introduce
an explicit constructor.
37
00:02:01,805 --> 00:02:04,190
But now let's have a
look at the derivatives.
38
00:02:04,190 --> 00:02:05,600
As you can see here.
39
00:02:05,600 --> 00:02:08,420
If I take this evil to wreck
40
00:02:08,420 --> 00:02:09,935
expression and they built
41
00:02:09,935 --> 00:02:12,470
now longer and
longer derivatives,
42
00:02:12,470 --> 00:02:14,090
you can see this grows.
43
00:02:14,090 --> 00:02:16,055
And as you see in this line,
44
00:02:16,055 --> 00:02:17,270
if I'm trying to take
45
00:02:17,270 --> 00:02:20,180
the derivative of a
quite large string,
46
00:02:20,180 --> 00:02:21,680
it is quite quick.
47
00:02:21,680 --> 00:02:26,870
But the size of the
reg expression,
48
00:02:26,870 --> 00:02:28,310
the number of nodes,
49
00:02:28,310 --> 00:02:30,815
is also like nearly
eight millions.
50
00:02:30,815 --> 00:02:33,845
And so even if that calculates
relatively quickly,
51
00:02:33,845 --> 00:02:37,865
that is the reason why at 24,
52
00:02:37,865 --> 00:02:39,560
it just runs out of memory.
53
00:02:39,560 --> 00:02:42,785
This reg expression
just grew too much.
54
00:02:42,785 --> 00:02:46,520
So we somehow need
to still compress
55
00:02:46,520 --> 00:02:49,655
this record expression
and make it not
56
00:02:49,655 --> 00:02:52,700
grow so much when we
build derivative.
57
00:02:52,700 --> 00:02:54,020
So we have to look at
58
00:02:54,020 --> 00:02:57,050
where does that grow
actually come from.
59
00:02:57,050 --> 00:03:00,170
Let's look at the
derivative operation
60
00:03:00,170 --> 00:03:01,895
again in more detail.
61
00:03:01,895 --> 00:03:03,740
When we introduced
it, we looked at
62
00:03:03,740 --> 00:03:06,560
this example of a
wreck expression R,
63
00:03:06,560 --> 00:03:11,465
which is a star of some
alternative and some sequence.
64
00:03:11,465 --> 00:03:13,130
And we can build now
65
00:03:13,130 --> 00:03:15,800
the derivative of this
r according to a,
66
00:03:15,800 --> 00:03:18,965
b, and c and see
what it calculates.
67
00:03:18,965 --> 00:03:21,935
And you see in case of a,
68
00:03:21,935 --> 00:03:26,570
it's like one times b plus
0 and then followed by r,
69
00:03:26,570 --> 00:03:29,015
which is this whole
wreck expression again.
70
00:03:29,015 --> 00:03:30,935
So you can also see.
71
00:03:30,935 --> 00:03:34,745
Derivative operation
introduces a lot of junk.
72
00:03:34,745 --> 00:03:38,165
This plus 0 isn't
really necessary.
73
00:03:38,165 --> 00:03:42,455
And this times one could
be also thrown away.
74
00:03:42,455 --> 00:03:43,685
So in the first case,
75
00:03:43,685 --> 00:03:48,110
actually we could just have
one times b is b plus 0,
76
00:03:48,110 --> 00:03:49,580
it still be a,
77
00:03:49,580 --> 00:03:54,905
so it would be B followed by
R. And in this second case,
78
00:03:54,905 --> 00:03:57,515
C0 times b, that will be just 0.
79
00:03:57,515 --> 00:03:59,270
Then 0 plus one is
80
00:03:59,270 --> 00:04:05,330
11 times r would be just
r. And in the last case,
81
00:04:05,330 --> 00:04:12,155
C0 times B would be 00 plus
0 is 00 times r would be 0.
82
00:04:12,155 --> 00:04:17,540
Okay? So we could throw out
all these useless zeros and
83
00:04:17,540 --> 00:04:20,960
ones because actually
we have to throw
84
00:04:20,960 --> 00:04:24,845
them out because over time
they just accumulate.
85
00:04:24,845 --> 00:04:27,035
And then we build
the next derivative.
86
00:04:27,035 --> 00:04:31,130
0 wouldn't go away by
building annuity where tests.
87
00:04:31,130 --> 00:04:34,310
So we need to somehow
a mechanism to
88
00:04:34,310 --> 00:04:39,120
delete the junk from
these derivatives.
89
00:04:39,280 --> 00:04:41,900
But how to delete junk?
90
00:04:41,900 --> 00:04:43,370
We already know we have
91
00:04:43,370 --> 00:04:47,825
the simplification rules
which say like if r plus 0,
92
00:04:47,825 --> 00:04:53,000
I can just replace by
r and the other ones.
93
00:04:53,000 --> 00:04:55,760
And so what I now
need to do is in
94
00:04:55,760 --> 00:04:58,715
my algorithm when I
built the derivative,
95
00:04:58,715 --> 00:05:01,415
this might have
introduced some chunk.
96
00:05:01,415 --> 00:05:02,960
And I now have to have
97
00:05:02,960 --> 00:05:06,590
essentially a additional function
98
00:05:06,590 --> 00:05:08,945
which deletes this junk again.
99
00:05:08,945 --> 00:05:10,490
And in doing so,
100
00:05:10,490 --> 00:05:13,340
keep steer, Rekha expression,
101
00:05:13,340 --> 00:05:16,730
relatively small, pickers debts,
102
00:05:16,730 --> 00:05:19,535
the Achilles heel
of this algorithm.
103
00:05:19,535 --> 00:05:22,565
If this regular expression
grows too much,
104
00:05:22,565 --> 00:05:25,070
then all these functions
will slow down.
105
00:05:25,070 --> 00:05:26,360
So the purpose of
106
00:05:26,360 --> 00:05:30,455
the simplification function
is to after the derivative,
107
00:05:30,455 --> 00:05:33,080
compress again this
rec expression
108
00:05:33,080 --> 00:05:35,570
and then do the next derivative.
109
00:05:35,570 --> 00:05:39,815
So if we introduce that
additional functions simp,
110
00:05:39,815 --> 00:05:42,440
which essentially
just goes through
111
00:05:42,440 --> 00:05:46,040
this reg expression and
deletes all this junk,
112
00:05:46,040 --> 00:05:50,045
then we should be in a
much better position.
113
00:05:50,045 --> 00:05:52,490
As you can see on this slide,
114
00:05:52,490 --> 00:05:54,440
the simplification
function can actually
115
00:05:54,440 --> 00:05:56,855
be implemented very easily.
116
00:05:56,855 --> 00:05:59,750
However, there are few
interesting points.
117
00:05:59,750 --> 00:06:02,720
First of all, there
are only two cases.
118
00:06:02,720 --> 00:06:05,060
The only simplify when we have
119
00:06:05,060 --> 00:06:08,255
an alternative or a sequence.
120
00:06:08,255 --> 00:06:12,859
So for example, we will
never simplify under star.
121
00:06:12,859 --> 00:06:15,950
If you look at the
derivative operation
122
00:06:15,950 --> 00:06:17,825
and you scratch your
head a little bit,
123
00:06:17,825 --> 00:06:20,180
we'll find out why
is that the case.
124
00:06:20,180 --> 00:06:22,145
It's essentially a waste of time.
125
00:06:22,145 --> 00:06:25,505
So you would not
simplify under star.
126
00:06:25,505 --> 00:06:27,650
You only simplify if you have
127
00:06:27,650 --> 00:06:30,530
an alternative and the sequence.
128
00:06:30,530 --> 00:06:34,880
Now, however, there
is one small point.
129
00:06:34,880 --> 00:06:39,785
You have to be aware of
these simplification rules.
130
00:06:39,785 --> 00:06:42,740
They need to be essentially
applied backwards.
131
00:06:42,740 --> 00:06:45,650
So I have to first essentially
go to the leaves of
132
00:06:45,650 --> 00:06:49,085
this record expression and
then have to find out,
133
00:06:49,085 --> 00:06:51,170
can I apply simplification?
134
00:06:51,170 --> 00:06:52,670
And then if yes,
135
00:06:52,670 --> 00:06:55,430
I apply the simplification
and I look at the next step,
136
00:06:55,430 --> 00:06:57,815
can I now simplify
something more?
137
00:06:57,815 --> 00:06:59,390
The reason is how
138
00:06:59,390 --> 00:07:03,125
the simplification
rules are formulated.
139
00:07:03,125 --> 00:07:05,300
They might fire in
140
00:07:05,300 --> 00:07:08,765
a higher level if something
simplifies below.
141
00:07:08,765 --> 00:07:14,315
So we have to essentially
simplify from the inside out.
142
00:07:14,315 --> 00:07:16,850
And in this
simplification function,
143
00:07:16,850 --> 00:07:20,885
what that means is the
matching this reg expression.
144
00:07:20,885 --> 00:07:23,120
Be test whether it's
an alternative or
145
00:07:23,120 --> 00:07:26,345
a sequence only then we
actually go into action.
146
00:07:26,345 --> 00:07:28,385
Once we know.
147
00:07:28,385 --> 00:07:30,530
In case of an alternative,
148
00:07:30,530 --> 00:07:32,120
what are the two components?
149
00:07:32,120 --> 00:07:35,765
P first, simplify each component,
150
00:07:35,765 --> 00:07:39,095
okay, and then we
get a result back.
151
00:07:39,095 --> 00:07:41,180
And we check whether
152
00:07:41,180 --> 00:07:45,679
this simplification of
R1 resulted into a 0.
153
00:07:45,679 --> 00:07:47,884
Then because it's an alternative
154
00:07:47,884 --> 00:07:49,190
than I can just replace it
155
00:07:49,190 --> 00:07:53,375
by r is two a simplified
version of R2.
156
00:07:53,375 --> 00:07:58,820
If it came back r as
two is actually 0,
157
00:07:58,820 --> 00:08:00,410
then I can replace it by
158
00:08:00,410 --> 00:08:03,785
the simplified version of a warm.
159
00:08:03,785 --> 00:08:07,460
If I simplify both
alternatives and it
160
00:08:07,460 --> 00:08:11,180
happens that the simplified
versions are the same,
161
00:08:11,180 --> 00:08:15,170
next century I can throw away
one of the alternatives.
162
00:08:15,170 --> 00:08:18,080
Otherwise, I just have to
keep the alternatives,
163
00:08:18,080 --> 00:08:21,155
but the simplified components
164
00:08:21,155 --> 00:08:23,915
in the sequence is
pretty much the same.
165
00:08:23,915 --> 00:08:27,950
I first have to check what
does it simplify underneath.
166
00:08:27,950 --> 00:08:31,385
So I call first simplify
and then have a look.
167
00:08:31,385 --> 00:08:33,020
Does it matches C or one of
168
00:08:33,020 --> 00:08:36,035
the simplification
damage, just return 0.
169
00:08:36,035 --> 00:08:38,330
Or if the other component is 0,
170
00:08:38,330 --> 00:08:40,535
I can also return a 0.
171
00:08:40,535 --> 00:08:43,310
If it's one, then I keep
the other component.
172
00:08:43,310 --> 00:08:45,920
And if this iss two is one,
173
00:08:45,920 --> 00:08:47,615
and I keep the first component,
174
00:08:47,615 --> 00:08:50,359
and otherwise it's
still the sequence.
175
00:08:50,359 --> 00:08:53,540
And in all the other cases I
don't have to do anything.
176
00:08:53,540 --> 00:08:55,700
It's just a plain
0. I leave it in.
177
00:08:55,700 --> 00:08:57,860
If it's a plain
warm, I leave it in.
178
00:08:57,860 --> 00:09:00,170
And if something is under a star,
179
00:09:00,170 --> 00:09:02,840
I don't open up this
door and simplify it.
180
00:09:02,840 --> 00:09:06,680
I just apply that to be
as quick as possible.
181
00:09:06,680 --> 00:09:10,280
Let's see whether this has
any effect on our code.
182
00:09:10,280 --> 00:09:12,980
So I'm now in the
file read three,
183
00:09:12,980 --> 00:09:17,405
and it's the same as Reto.
184
00:09:17,405 --> 00:09:20,885
It still has this
explicit and Times case,
185
00:09:20,885 --> 00:09:24,665
nullable and derivative
still the same.
186
00:09:24,665 --> 00:09:28,610
Except now we have this
simplification function as well.
187
00:09:28,610 --> 00:09:29,960
And when we apply
188
00:09:29,960 --> 00:09:33,725
the derivative after
the apply deteriorated,
189
00:09:33,725 --> 00:09:35,870
simplify it to throw away
190
00:09:35,870 --> 00:09:39,050
all the junk this
derivative introduced.
191
00:09:39,050 --> 00:09:41,510
Otherwise everything
stays the same.
192
00:09:41,510 --> 00:09:43,580
You still have this expansion
193
00:09:43,580 --> 00:09:45,515
of the optional reg expression.
194
00:09:45,515 --> 00:09:49,670
Here, our two evil wreck
expressions we use as a test.
195
00:09:49,670 --> 00:09:52,460
And here's now this test case,
196
00:09:52,460 --> 00:09:55,835
the first one and the
actually now trying it
197
00:09:55,835 --> 00:10:00,515
out for strings between
09 thousand a's.
198
00:10:00,515 --> 00:10:08,285
So C. So also gets
simplification makes a,
199
00:10:08,285 --> 00:10:10,655
actually this case fast on.
200
00:10:10,655 --> 00:10:16,260
So we can now run strings
up to 9 thousand.
201
00:10:16,260 --> 00:10:19,360
And we don't actually
sweat about this at all.
202
00:10:19,360 --> 00:10:24,745
For I have to say my laptop
is now starting its fan.
203
00:10:24,745 --> 00:10:28,525
And also, because the times
are relatively small,
204
00:10:28,525 --> 00:10:30,610
I'm actually running
each test three
205
00:10:30,610 --> 00:10:32,785
times and then take the average,
206
00:10:32,785 --> 00:10:34,720
which I didn't do before,
207
00:10:34,720 --> 00:10:37,720
just to be a tiny bit
more accurate map.
208
00:10:37,720 --> 00:10:42,025
So three seconds for a
string of 9 thousand a's.
209
00:10:42,025 --> 00:10:44,980
And now the more
interesting question is,
210
00:10:44,980 --> 00:10:46,240
for the second case,
211
00:10:46,240 --> 00:10:48,625
this E star, star b.
212
00:10:48,625 --> 00:10:51,250
And you can already see
on these numbers RB.
213
00:10:51,250 --> 00:10:53,260
And now you're really ambitious.
214
00:10:53,260 --> 00:10:57,860
And let's see how our
program is coping with that.
215
00:11:02,610 --> 00:11:07,780
Again. No sweat, at
least not on my part.
216
00:11:07,780 --> 00:11:10,480
The laptop thefts to
calculate quite a bit.
217
00:11:10,480 --> 00:11:12,940
But this is now a string of
218
00:11:12,940 --> 00:11:16,539
3 million A's and
it needs a second.
219
00:11:16,539 --> 00:11:20,680
And let's see how far we get.
220
00:11:20,680 --> 00:11:25,280
4 million a around two seconds.
221
00:11:27,030 --> 00:11:29,050
So say it again,
222
00:11:29,050 --> 00:11:31,690
I'm actually slowing it down.
223
00:11:31,690 --> 00:11:34,150
He artificially run the test
224
00:11:34,150 --> 00:11:36,895
three times and then
take the average.
225
00:11:36,895 --> 00:11:42,749
But it seems to be something
less than five seconds.
226
00:11:42,749 --> 00:11:48,185
And remember that is a
string of 6 million A's.
227
00:11:48,185 --> 00:11:51,170
Let's just have a
look at the graphs.
228
00:11:51,170 --> 00:11:56,105
So the simplification also
made of first case faster.
229
00:11:56,105 --> 00:11:58,880
So earlier without
simplification,
230
00:11:58,880 --> 00:12:00,710
where we just have
this explicit and
231
00:12:00,710 --> 00:12:03,050
times that at this graph.
232
00:12:03,050 --> 00:12:05,210
And now we can go up to
233
00:12:05,210 --> 00:12:09,620
10 thousand and be still
under five seconds.
234
00:12:09,620 --> 00:12:12,300
So that is good news.
235
00:12:13,270 --> 00:12:16,745
In the other example, remember,
236
00:12:16,745 --> 00:12:19,220
the existing regular
expression matches in
237
00:12:19,220 --> 00:12:22,880
Java eight, Python,
and JavaScript.
238
00:12:22,880 --> 00:12:26,030
And thanks to the
student be Ozma,
239
00:12:26,030 --> 00:12:27,935
half a graph for Swift.
240
00:12:27,935 --> 00:12:29,750
They're pretty atrocious.
241
00:12:29,750 --> 00:12:33,320
They need like for 30 Ace,
242
00:12:33,320 --> 00:12:37,490
something like on the
magnitude of thirty-seconds.
243
00:12:37,490 --> 00:12:39,740
As I said already,
244
00:12:39,740 --> 00:12:42,665
Java nine is slightly better.
245
00:12:42,665 --> 00:12:44,870
Java nine and above this,
246
00:12:44,870 --> 00:12:46,220
if you try that example,
247
00:12:46,220 --> 00:12:48,665
the same exponential and nine,
248
00:12:48,665 --> 00:12:51,230
you would be able to
process something
249
00:12:51,230 --> 00:12:54,215
like 40 thousand A's
in half a minute.
250
00:12:54,215 --> 00:12:57,740
I have to admit I'm not
a 100% sure what they
251
00:12:57,740 --> 00:13:03,575
did to improve the
performance between Java 89.
252
00:13:03,575 --> 00:13:05,510
I know they did something on
253
00:13:05,510 --> 00:13:07,790
their regular expression
matching engine,
254
00:13:07,790 --> 00:13:09,770
but I haven't really looked
255
00:13:09,770 --> 00:13:12,230
at what precisely they've done.
256
00:13:12,230 --> 00:13:17,945
But that's still not
really competition fas.
257
00:13:17,945 --> 00:13:20,915
So our third version,
258
00:13:20,915 --> 00:13:24,335
in this example, a star, star b.
259
00:13:24,335 --> 00:13:28,445
We say it's something like
We need 6 million A's.
260
00:13:28,445 --> 00:13:31,760
And again, I think these
are slightly older times,
261
00:13:31,760 --> 00:13:33,770
so he had needs 20 seconds.
262
00:13:33,770 --> 00:13:37,250
I think we had something
like below five seconds.
263
00:13:37,250 --> 00:13:40,865
But again, you can see
the trends. We rants.
264
00:13:40,865 --> 00:13:42,590
Circles around them.
265
00:13:42,590 --> 00:13:46,530
And that's quite nice.
266
00:13:46,570 --> 00:13:49,774
So what's good about
this algorithm?
267
00:13:49,774 --> 00:13:51,605
By pressure of ski?
268
00:13:51,605 --> 00:13:54,500
Well, first, it extends
269
00:13:54,500 --> 00:13:57,800
actually to quite a number
of regular expressions.
270
00:13:57,800 --> 00:13:59,900
So we saw in this example
271
00:13:59,900 --> 00:14:03,950
the End Times regular expression
is a little bit of work.
272
00:14:03,950 --> 00:14:05,405
We could extend that.
273
00:14:05,405 --> 00:14:08,480
It also extends to the
not reg expression,
274
00:14:08,480 --> 00:14:10,820
which I explain on
the next slide.
275
00:14:10,820 --> 00:14:15,290
It's very easy to implement
in a functional language.
276
00:14:15,290 --> 00:14:16,610
If you don't buy
277
00:14:16,610 --> 00:14:20,675
all that functional
programming language stuff,
278
00:14:20,675 --> 00:14:22,955
you still have to agree.
279
00:14:22,955 --> 00:14:25,715
Quite beautiful in Scala.
280
00:14:25,715 --> 00:14:28,160
And you could also
easily implemented
281
00:14:28,160 --> 00:14:31,760
in C language or in Python.
282
00:14:31,760 --> 00:14:34,250
There's really no
problem with that.
283
00:14:34,250 --> 00:14:38,390
Say the algorithm's actually
quite old already brush-off,
284
00:14:38,390 --> 00:14:44,899
ski wrote it down in
1964 and his PhD thesis.
285
00:14:44,899 --> 00:14:49,460
Somehow it was forgotten during
286
00:14:49,460 --> 00:14:54,095
the next decades and only
recently has been rediscovered.
287
00:14:54,095 --> 00:14:57,065
At the moment, we are
only solve the problem
288
00:14:57,065 --> 00:15:02,240
of Gmail reg expression
gamma string deaths,
289
00:15:02,240 --> 00:15:05,550
the regular expression
match the string yes or no.
290
00:15:05,550 --> 00:15:08,740
The want to of course, use
it as part of our lexicon.
291
00:15:08,740 --> 00:15:12,370
And there we have to do
something more complicated.
292
00:15:12,370 --> 00:15:15,384
We have to essentially
generate tokens.
293
00:15:15,384 --> 00:15:18,670
And this will still take
a little bit of work.
294
00:15:18,670 --> 00:15:22,045
And that's actually relatively
recent work by somebody
295
00:15:22,045 --> 00:15:26,110
called suits Month and
the co-worker called Lou.
296
00:15:26,110 --> 00:15:30,880
And what I personally
also find very satisfying
297
00:15:30,880 --> 00:15:32,695
about this algorithm is
298
00:15:32,695 --> 00:15:36,040
that we can actually
prove that it's correct.
299
00:15:36,040 --> 00:15:37,735
You might remember I did
300
00:15:37,735 --> 00:15:41,500
quite some interesting
transformations
301
00:15:41,500 --> 00:15:44,830
when I calculated the derivative.
302
00:15:44,830 --> 00:15:48,270
How can I be sure that
this is all correct?
303
00:15:48,270 --> 00:15:50,270
Actually, I can now go away and
304
00:15:50,270 --> 00:15:52,865
prove the correctness
of this algorithm.
305
00:15:52,865 --> 00:15:56,645
Does it really satisfy
the specification?
306
00:15:56,645 --> 00:15:58,550
Is really fs string
307
00:15:58,550 --> 00:16:00,440
is in the language
of a reg expression.
308
00:16:00,440 --> 00:16:03,050
Does that matter? Vd say yes.
309
00:16:03,050 --> 00:16:07,070
However, I leave that
for another video.
310
00:16:07,070 --> 00:16:10,580
What I also like about
this algorithm that can be
311
00:16:10,580 --> 00:16:13,775
actually extended to quite a
number of rec expressions.
312
00:16:13,775 --> 00:16:17,810
So this is t not reg
expression that is
313
00:16:17,810 --> 00:16:19,760
opposed to match strings which
314
00:16:19,760 --> 00:16:22,235
this reg expression cannot match.
315
00:16:22,235 --> 00:16:24,245
So here's an example.
316
00:16:24,245 --> 00:16:28,640
You're in the business of
trying to find out what
317
00:16:28,640 --> 00:16:33,530
our comments in languages like
Java or C. Then you know,
318
00:16:33,530 --> 00:16:35,060
they always start with a slash,
319
00:16:35,060 --> 00:16:36,245
then comes a star,
320
00:16:36,245 --> 00:16:38,240
then comes an arbitrary string.
321
00:16:38,240 --> 00:16:41,195
And then they finish
with a slash and a star.
322
00:16:41,195 --> 00:16:44,330
And how you want to specify
that is something like this.
323
00:16:44,330 --> 00:16:45,530
You want to say, OK,
324
00:16:45,530 --> 00:16:48,245
a comment starts with
a slash and a star.
325
00:16:48,245 --> 00:16:51,410
Then it comes any
string in between.
326
00:16:51,410 --> 00:16:55,340
But this string
in-between cannot contain
327
00:16:55,340 --> 00:16:58,280
any star and slash
because that would then
328
00:16:58,280 --> 00:17:01,415
indicate there's the
end already there.
329
00:17:01,415 --> 00:17:03,680
And then after you
have such a string
330
00:17:03,680 --> 00:17:06,410
which doesn't
contain as standard,
331
00:17:06,410 --> 00:17:11,180
then at the end you want to
match a star and a slash.
332
00:17:11,180 --> 00:17:13,460
So for these kind of comments,
333
00:17:13,460 --> 00:17:15,995
this reg expression is
actually quite useful.
334
00:17:15,995 --> 00:17:19,430
And if you ever seen
how to do the negation,
335
00:17:19,430 --> 00:17:21,995
this kinda break
expression with automata?
336
00:17:21,995 --> 00:17:24,710
You will know that's
a bit of a pain,
337
00:17:24,710 --> 00:17:26,675
but for the Borowski,
338
00:17:26,675 --> 00:17:28,370
it's no pain at all.
339
00:17:28,370 --> 00:17:30,710
The meaning of this
reg expression
340
00:17:30,710 --> 00:17:32,255
would be something like that.
341
00:17:32,255 --> 00:17:35,540
If I have all the
strings there are,
342
00:17:35,540 --> 00:17:38,390
and I take away all the strings,
343
00:17:38,390 --> 00:17:40,055
this arc and match.
344
00:17:40,055 --> 00:17:42,080
The remaining strings are
345
00:17:42,080 --> 00:17:44,630
what this reg expression
is supposed to match.
346
00:17:44,630 --> 00:17:47,165
So I can specify
what the meaning is.
347
00:17:47,165 --> 00:17:49,760
And then it's also very easy to
348
00:17:49,760 --> 00:17:52,174
say what is nullable
and derivative.
349
00:17:52,174 --> 00:17:54,140
So for nullable, it would be
350
00:17:54,140 --> 00:17:56,570
just a test where
the eyes nullable.
351
00:17:56,570 --> 00:17:58,985
And I just take the
negation of that.
352
00:17:58,985 --> 00:18:00,680
And men I have to build
353
00:18:00,680 --> 00:18:03,620
the derivative of this
not reg expression.
354
00:18:03,620 --> 00:18:05,420
Then I just have to move
355
00:18:05,420 --> 00:18:07,325
this permutation does derivative,
356
00:18:07,325 --> 00:18:10,070
derivative inside
the wreck expression
357
00:18:10,070 --> 00:18:12,575
and keep the not on
the outermost level.
358
00:18:12,575 --> 00:18:15,515
So there's again no
pain involved in
359
00:18:15,515 --> 00:18:19,130
adding this reg expression
to the algorithm.
360
00:18:19,130 --> 00:18:22,100
We made it to the end of
this video and we made
361
00:18:22,100 --> 00:18:24,739
it also to the end
of this algorithm.
362
00:18:24,739 --> 00:18:27,620
I can see to loose trends.
363
00:18:27,620 --> 00:18:29,420
One is we implemented
364
00:18:29,420 --> 00:18:32,855
this algorithm for the
basic regular expressions.
365
00:18:32,855 --> 00:18:38,705
And we also added the end
times out of necessity.
366
00:18:38,705 --> 00:18:41,120
And I also showed
you how to implement
367
00:18:41,120 --> 00:18:44,840
the not regular
expression on a slide.
368
00:18:44,840 --> 00:18:48,995
But your task in the
coursework actually is
369
00:18:48,995 --> 00:18:52,970
to extend that to some of
the extended reg expression.
370
00:18:52,970 --> 00:18:57,260
So especially these bounded
repetitions like from 0 to
371
00:18:57,260 --> 00:19:01,550
n times or between n and m times.
372
00:19:01,550 --> 00:19:04,325
And I think also
plus and D option.
373
00:19:04,325 --> 00:19:07,220
The other loose trend is,
374
00:19:07,220 --> 00:19:09,200
remember I did this while
375
00:19:09,200 --> 00:19:11,645
calculations with
regular expressions.
376
00:19:11,645 --> 00:19:13,205
Why on earth?
377
00:19:13,205 --> 00:19:14,480
Is that all correct?
378
00:19:14,480 --> 00:19:16,355
Why on earth should
this algorithm
379
00:19:16,355 --> 00:19:18,575
actually satisfy
our specification,
380
00:19:18,575 --> 00:19:20,450
which we set out
at the beginning.
381
00:19:20,450 --> 00:19:25,410
So that needs to be looked at
more closely. Bye for now.