--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/videos/01-evilregexes.srt Wed Sep 23 11:34:43 2020 +0100
@@ -0,0 +1,1813 @@
+1
+00:00:06,240 --> 00:00:11,050
+Welcome back. This video
+is about regular expressions.
+
+2
+00:00:11,050 --> 00:00:14,230
+We want to use regular
+expressions in our lexer.
+
+3
+00:00:14,230 --> 00:00:16,165
+And the purpose of the
+lexer is to find
+
+4
+00:00:16,165 --> 00:00:18,070
+out where the words in
+
+5
+00:00:18,070 --> 00:00:21,070
+our programs are. However
+regular expressions
+
+6
+00:00:21,070 --> 00:00:23,875
+are fundamental tool
+in computer science.
+
+7
+00:00:23,875 --> 00:00:27,910
+And I'm sure you've used them
+already on several occasions.
+
+8
+00:00:27,910 --> 00:00:30,370
+And one would expect that about
+
+9
+00:00:30,370 --> 00:00:31,750
+regular expressions since they are
+
+10
+00:00:31,750 --> 00:00:33,850
+so well-known and well studied,
+
+11
+00:00:33,850 --> 00:00:37,915
+that everything under the
+sun is known about them.
+
+12
+00:00:37,915 --> 00:00:41,080
+But actually there's
+still some surprising
+
+13
+00:00:41,080 --> 00:00:44,465
+and interesting
+problems with them.
+
+14
+00:00:44,465 --> 00:00:47,945
+And I want to show you
+them in this video.
+
+15
+00:00:47,945 --> 00:00:50,720
+I'm sure you've seen
+regular expressions
+
+16
+00:00:50,720 --> 00:00:52,445
+many, many times before.
+
+17
+00:00:52,445 --> 00:00:55,100
+But just to be on the same page,
+
+18
+00:00:55,100 --> 00:00:57,110
+let me just recap them.
+
+19
+00:00:57,110 --> 00:00:59,210
+So here in this line,
+
+20
+00:00:59,210 --> 00:01:01,790
+there is a regular expression
+which is supposed to
+
+21
+00:01:01,790 --> 00:01:05,285
+recognize some form
+of email addresses.
+
+22
+00:01:05,285 --> 00:01:07,745
+So an e-mail address
+
+23
+00:01:07,745 --> 00:01:11,000
+has part which is
+before the @ symbol,
+
+24
+00:01:11,000 --> 00:01:13,400
+which is the name of the person.
+
+25
+00:01:13,400 --> 00:01:16,880
+And that can be
+any number between
+
+26
+00:01:16,880 --> 00:01:20,195
+0 and 9, and letters between a and z.
+
+27
+00:01:20,195 --> 00:01:24,155
+Let's say we avoiding
+here capital letters.
+
+28
+00:01:24,155 --> 00:01:26,045
+There can be underscores.
+
+29
+00:01:26,045 --> 00:01:29,405
+There can be a dot and
+there can be hyphens.
+
+30
+00:01:29,405 --> 00:01:35,390
+And after the @ symbol
+comes the domain name.
+
+31
+00:01:35,390 --> 00:01:37,310
+So as you can see here,
+
+32
+00:01:37,310 --> 00:01:40,640
+we use things like star to
+
+33
+00:01:40,640 --> 00:01:44,314
+match letters
+zero or more times.
+
+34
+00:01:44,314 --> 00:01:45,985
+Or we have a plus,
+
+35
+00:01:45,985 --> 00:01:47,420
+which means you have to match
+
+36
+00:01:47,420 --> 00:01:52,489
+at least once or more
+times. Then we have.
+
+37
+00:01:52,489 --> 00:01:55,790
+question mark, which says you
+
+38
+00:01:55,790 --> 00:01:59,105
+match either it is there
+or it ss not there.
+
+39
+00:01:59,105 --> 00:02:01,340
+You are also regular
+expressions which
+
+40
+00:02:01,340 --> 00:02:03,755
+match exactly n-times.
+
+41
+00:02:03,755 --> 00:02:08,720
+Or this is a regular expression
+for between n and m times.
+
+42
+00:02:08,720 --> 00:02:12,065
+You can see in
+this email address,
+
+43
+00:02:12,065 --> 00:02:13,730
+the top-level domain
+
+44
+00:02:13,730 --> 00:02:16,130
+name can be any letter
+
+45
+00:02:16,130 --> 00:02:19,265
+between a to z,
+and contain dots,
+
+46
+00:02:19,265 --> 00:02:22,340
+but can only be two
+characters long
+
+47
+00:02:22,340 --> 00:02:25,685
+up till six characters
+and not more.
+
+48
+00:02:25,685 --> 00:02:29,240
+Then you also have
+something like ranges.
+
+49
+00:02:29,240 --> 00:02:31,220
+So you can see, letters between a
+
+50
+00:02:31,220 --> 00:02:33,635
+and z and 0 to 9 and so on.
+
+51
+00:02:33,635 --> 00:02:36,545
+Here you also have regular
+expression which can
+
+52
+00:02:36,545 --> 00:02:40,070
+match something which
+isn't in this range.
+
+53
+00:02:40,070 --> 00:02:42,560
+So for example, if
+you want for example match,
+
+54
+00:02:42,560 --> 00:02:44,030
+letters but not numbers,
+
+55
+00:02:44,030 --> 00:02:45,800
+you would say, well, if
+
+56
+00:02:45,800 --> 00:02:48,990
+this is a number that
+should not match.
+
+57
+00:02:49,090 --> 00:02:52,804
+Typically you also
+have these ranges.
+
+58
+00:02:52,804 --> 00:02:55,565
+Lowercase letters,
+capital letters.
+
+59
+00:02:55,565 --> 00:02:58,550
+Then you have some
+special regular expressions
+
+60
+00:02:58,550 --> 00:03:02,195
+like this one is only
+supposed to match digits.
+
+61
+00:03:02,195 --> 00:03:05,674
+A dot is supposed to
+match any character.
+
+62
+00:03:05,674 --> 00:03:07,370
+And then they have also something
+
+63
+00:03:07,370 --> 00:03:09,800
+called groups which
+is supposed to be
+
+64
+00:03:09,800 --> 00:03:12,799
+used when you are
+trying to extract
+
+65
+00:03:12,799 --> 00:03:15,605
+a string you've matched.
+
+66
+00:03:15,605 --> 00:03:19,925
+Okay, so these are the
+typical regular expressions.
+
+67
+00:03:19,925 --> 00:03:23,075
+And here's a particular one.
+
+68
+00:03:23,075 --> 00:03:25,820
+Trying to match something
+
+69
+00:03:25,820 --> 00:03:28,770
+which resembles
+an email address.
+
+70
+00:03:29,590 --> 00:03:33,065
+Clearly that should be all easy.
+
+71
+00:03:33,065 --> 00:03:36,230
+And our technology should
+be on top of that.
+
+72
+00:03:36,230 --> 00:03:37,865
+That we can take a
+
+73
+00:03:37,865 --> 00:03:41,015
+regular expressions and
+we can take a string,
+
+74
+00:03:41,015 --> 00:03:43,340
+and we should have programs to
+
+75
+00:03:43,340 --> 00:03:45,680
+decide whether this
+string is matched
+
+76
+00:03:45,680 --> 00:03:50,330
+by a regular expression or
+not and should be easy-peasy, no?
+
+77
+00:03:50,330 --> 00:03:56,150
+Well, let's have a
+look at two examples.
+
+78
+00:03:56,150 --> 00:04:00,860
+The first regular expression
+is a star star b.
+
+79
+00:04:00,860 --> 00:04:02,990
+And it is supposed
+to match strings of
+
+80
+00:04:02,990 --> 00:04:05,825
+the form 0 or more a's,
+
+81
+00:04:05,825 --> 00:04:10,385
+followed by a b. The parentheses
+you can ignore.
+
+82
+00:04:10,385 --> 00:04:11,990
+And a star star
+
+83
+00:04:11,990 --> 00:04:14,120
+also doesn't
+make any difference
+
+84
+00:04:14,120 --> 00:04:16,505
+to what kind of strings
+that can be matched.
+
+85
+00:04:16,505 --> 00:04:21,635
+It can only make 0 more
+a's followed by a b.
+
+86
+00:04:21,635 --> 00:04:23,900
+And the other regular expression
+
+87
+00:04:23,900 --> 00:04:26,990
+is possibly a character a,
+
+88
+00:04:26,990 --> 00:04:32,930
+n times, followed by character
+a axactly n-times.
+
+89
+00:04:32,930 --> 00:04:35,570
+And we will try out
+
+90
+00:04:35,570 --> 00:04:38,360
+these two regular expressions
+with strings of the form a,
+
+91
+00:04:38,360 --> 00:04:39,890
+aa, and so on,
+
+92
+00:04:39,890 --> 00:04:45,770
+and up to the length of n. And
+
+93
+00:04:45,770 --> 00:04:49,130
+this regular expression should
+actually not match any of
+
+94
+00:04:49,130 --> 00:04:53,315
+the strings because the
+final b is missing.
+
+95
+00:04:53,315 --> 00:04:56,150
+But that is
+okay. For example
+
+96
+00:04:56,150 --> 00:04:57,425
+if you have a regular expression
+
+97
+00:04:57,425 --> 00:05:00,110
+that is supposed to
+check whether a string is
+
+98
+00:05:00,110 --> 00:05:01,490
+an email address and the user
+
+99
+00:05:01,490 --> 00:05:03,380
+gives some random
+strings in there,
+
+100
+00:05:03,380 --> 00:05:06,545
+then this regular expression
+should not match that string.
+
+101
+00:05:06,545 --> 00:05:08,420
+And for this regular expression
+
+102
+00:05:08,420 --> 00:05:11,195
+you have to scratch a
+little bit of your head,
+
+103
+00:05:11,195 --> 00:05:12,620
+what it can actually match.
+
+104
+00:05:12,620 --> 00:05:14,720
+But after a little bit
+of head scratching,
+
+105
+00:05:14,720 --> 00:05:18,260
+you find out can match
+any string which is of
+
+106
+00:05:18,260 --> 00:05:22,580
+the length n a's up
+to 2n of a's.
+
+107
+00:05:22,580 --> 00:05:24,290
+So anything in this range,
+
+108
+00:05:24,290 --> 00:05:27,185
+this regular expression
+can actually match.
+
+109
+00:05:27,185 --> 00:05:30,395
+Okay, let's
+take a random tool,
+
+110
+00:05:30,395 --> 00:05:32,630
+maybe for example Python.
+
+111
+00:05:32,630 --> 00:05:35,240
+So here's a little
+Python program.
+
+112
+00:05:35,240 --> 00:05:38,690
+It uses the library
+function of Python to
+
+113
+00:05:38,690 --> 00:05:42,935
+match the regular expressions of
+a star star b.
+
+114
+00:05:42,935 --> 00:05:46,805
+And we measure time with longer
+and longer strings of a.
+
+115
+00:05:46,805 --> 00:05:48,770
+And so conveniently we can give
+
+116
+00:05:48,770 --> 00:05:51,140
+the number of a's here
+on the command line.
+
+117
+00:05:51,140 --> 00:05:56,900
+If I just call
+this on the command line,
+
+118
+00:05:56,900 --> 00:05:59,900
+Let's say we first
+start with five a's.
+
+119
+00:05:59,900 --> 00:06:03,920
+And I get also the times which
+in this case is next to nothing.
+
+120
+00:06:03,920 --> 00:06:05,960
+And here's the string
+we just matched.
+
+121
+00:06:05,960 --> 00:06:07,640
+And obviously the
+regular expression
+
+122
+00:06:07,640 --> 00:06:09,110
+did not match the string.
+
+123
+00:06:09,110 --> 00:06:11,255
+That's indicated by this none.
+
+124
+00:06:11,255 --> 00:06:13,925
+Let's take ten a's.
+
+125
+00:06:13,925 --> 00:06:16,490
+It's also pretty quick.
+
+126
+00:06:16,490 --> 00:06:20,780
+Fifteen a's, even quicker,
+
+127
+00:06:20,780 --> 00:06:23,180
+but these times always need to
+
+128
+00:06:23,180 --> 00:06:25,820
+be taken with a grain of salt.
+
+129
+00:06:25,820 --> 00:06:28,040
+They are not 100
+percent accurate.
+
+130
+00:06:28,040 --> 00:06:31,490
+So 15 is also a let's take
+
+131
+00:06:31,490 --> 00:06:36,965
+28th notes already
+double the time.
+
+132
+00:06:36,965 --> 00:06:42,440
+Twenty-five longer.
+
+133
+00:06:42,440 --> 00:06:45,680
+Okay, that suddenly
+from 02 seconds,
+
+134
+00:06:45,680 --> 00:06:48,960
+it takes almost four seconds.
+
+135
+00:06:49,600 --> 00:06:54,890
+Six this
+
+136
+00:06:54,890 --> 00:07:01,415
+takes six seconds
+already Double, okay?
+
+137
+00:07:01,415 --> 00:07:07,229
+Go to 28. That would be now.
+
+138
+00:07:08,890 --> 00:07:11,840
+You see the string
+isn't very long,
+
+139
+00:07:11,840 --> 00:07:13,340
+so that could be easily like
+
+140
+00:07:13,340 --> 00:07:16,070
+just the size of
+an email address.
+
+141
+00:07:16,070 --> 00:07:19,280
+And the regular
+expression matching
+
+142
+00:07:19,280 --> 00:07:22,550
+engine in Python needs
+quite a long time
+
+143
+00:07:22,550 --> 00:07:24,710
+to find out that
+this string of 28
+
+144
+00:07:24,710 --> 00:07:26,570
+AES is actually not much
+
+145
+00:07:26,570 --> 00:07:28,490
+by that you see it's
+still not finished.
+
+146
+00:07:28,490 --> 00:07:32,900
+I think it should take
+approximately like 20 seconds.
+
+147
+00:07:32,900 --> 00:07:34,400
+Okay. Already 30.
+
+148
+00:07:34,400 --> 00:07:36,530
+And if we would try
+
+149
+00:07:36,530 --> 00:07:40,805
+30 would be already
+more than a minute.
+
+150
+00:07:40,805 --> 00:07:43,940
+And if I could read
+something like hundreds,
+
+151
+00:07:43,940 --> 00:07:46,220
+you remember if a doubling in
+
+152
+00:07:46,220 --> 00:07:48,770
+each step or the second step,
+
+153
+00:07:48,770 --> 00:07:50,720
+the story with the chess board,
+
+154
+00:07:50,720 --> 00:07:53,855
+we probably would sit here
+until the next century.
+
+155
+00:07:53,855 --> 00:07:56,820
+So something strange here.
+
+156
+00:07:57,580 --> 00:08:01,355
+Okay, that might be just
+a problem of Python.
+
+157
+00:08:01,355 --> 00:08:02,990
+Let's have a look at another
+
+158
+00:08:02,990 --> 00:08:04,985
+regular expression
+matching engine.
+
+159
+00:08:04,985 --> 00:08:06,890
+This time from JavaScript,
+
+160
+00:08:06,890 --> 00:08:10,040
+also are pretty well-known
+programming language.
+
+161
+00:08:10,040 --> 00:08:13,610
+So here you can see
+it's still a star,
+
+162
+00:08:13,610 --> 00:08:16,235
+star followed by b,
+
+163
+00:08:16,235 --> 00:08:18,920
+by direct expression is
+supposed to match that from
+
+164
+00:08:18,920 --> 00:08:21,830
+the beginning of the
+string up till the end.
+
+165
+00:08:21,830 --> 00:08:23,930
+So there's not any difference
+
+166
+00:08:23,930 --> 00:08:26,150
+in the strings this work
+expression matches.
+
+167
+00:08:26,150 --> 00:08:28,610
+We'll just start at the
+beginning of the string
+
+168
+00:08:28,610 --> 00:08:31,460
+and finish at the
+end of the string.
+
+169
+00:08:31,460 --> 00:08:35,285
+And we again, we just use
+repeated A's for that.
+
+170
+00:08:35,285 --> 00:08:38,195
+And similarly, we can
+
+171
+00:08:38,195 --> 00:08:41,930
+call it on the command line
+and can do some timing.
+
+172
+00:08:41,930 --> 00:08:44,540
+So ten SBA, good.
+
+173
+00:08:44,540 --> 00:08:46,340
+Here's the string.
+
+174
+00:08:46,340 --> 00:08:48,320
+It cannot match that string.
+
+175
+00:08:48,320 --> 00:08:50,525
+And it's pretty fast.
+
+176
+00:08:50,525 --> 00:08:54,725
+Friendly. Although pretty fast.
+
+177
+00:08:54,725 --> 00:08:59,120
+Five, again,
+
+178
+00:08:59,120 --> 00:09:06,650
+somehow is kind of
+threshold that is 25, 26.
+
+179
+00:09:06,650 --> 00:09:09,485
+Suddenly it takes much longer.
+
+180
+00:09:09,485 --> 00:09:14,360
+And it has essentially the
+same problem as with Python.
+
+181
+00:09:14,360 --> 00:09:17,165
+So you'll see in now from 26 on,
+
+182
+00:09:17,165 --> 00:09:19,250
+the Times has always
+doubling from
+
+183
+00:09:19,250 --> 00:09:21,860
+three seconds to seven seconds.
+
+184
+00:09:21,860 --> 00:09:23,330
+So you can imagine what that
+
+185
+00:09:23,330 --> 00:09:24,890
+roughly takes when I put your
+
+186
+00:09:24,890 --> 00:09:30,230
+27 and you see the
+string isn't very long.
+
+187
+00:09:30,230 --> 00:09:32,165
+Let's choose twenties or maize.
+
+188
+00:09:32,165 --> 00:09:35,419
+Imagine you have to
+search a database
+
+189
+00:09:35,419 --> 00:09:38,720
+with kilobytes of data.
+
+190
+00:09:38,720 --> 00:09:42,260
+This, these regular
+expressions that would years
+
+191
+00:09:42,260 --> 00:09:48,150
+need years to go through with
+these regular expressions.
+
+192
+00:09:48,630 --> 00:09:51,850
+Okay, maybe the people in
+
+193
+00:09:51,850 --> 00:09:55,435
+Python and JavaScript,
+they're just idiots.
+
+194
+00:09:55,435 --> 00:09:58,180
+Surely Java must do much better.
+
+195
+00:09:58,180 --> 00:10:01,045
+So here's a program.
+
+196
+00:10:01,045 --> 00:10:03,415
+You can see this again
+
+197
+00:10:03,415 --> 00:10:05,980
+is the reg expression
+and we just having
+
+198
+00:10:05,980 --> 00:10:08,320
+some scaffolding to generate
+
+199
+00:10:08,320 --> 00:10:11,905
+strings from five up till 28.
+
+200
+00:10:11,905 --> 00:10:14,305
+And if we run that,
+
+201
+00:10:14,305 --> 00:10:16,660
+actually does that automatically.
+
+202
+00:10:16,660 --> 00:10:19,900
+So uphill 19, pretty fast,
+
+203
+00:10:19,900 --> 00:10:24,925
+but then starting from
+23, skidding pretty slow.
+
+204
+00:10:24,925 --> 00:10:27,445
+So the question is
+what's going on?
+
+205
+00:10:27,445 --> 00:10:29,230
+By the way, I'm not quoting here.
+
+206
+00:10:29,230 --> 00:10:33,755
+Scala, using internally
+the regular expression
+
+207
+00:10:33,755 --> 00:10:36,665
+matching engine from Java.
+
+208
+00:10:36,665 --> 00:10:39,065
+So would have exactly
+the same problem.
+
+209
+00:10:39,065 --> 00:10:41,480
+Also, I have been
+here very careful,
+
+210
+00:10:41,480 --> 00:10:43,550
+I'm using here Scala aid,
+
+211
+00:10:43,550 --> 00:10:46,085
+which nowadays is quite old.
+
+212
+00:10:46,085 --> 00:10:50,765
+But you will see also
+current Java versions.
+
+213
+00:10:50,765 --> 00:10:55,490
+We will see we can out-compete
+them by magnitudes.
+
+214
+00:10:55,490 --> 00:10:57,605
+So I think I can that.
+
+215
+00:10:57,605 --> 00:10:59,165
+Now, just finish here.
+
+216
+00:10:59,165 --> 00:11:04,025
+You see the problem. Just
+for completeness sake.
+
+217
+00:11:04,025 --> 00:11:07,010
+Here is a Ruby program.
+
+218
+00:11:07,010 --> 00:11:09,935
+This is using the other
+regular expression.
+
+219
+00:11:09,935 --> 00:11:12,935
+In this case the
+string should match.
+
+220
+00:11:12,935 --> 00:11:20,300
+And again it tries out
+strings between 130 here.
+
+221
+00:11:20,300 --> 00:11:23,450
+That's a program actually
+a former student produced.
+
+222
+00:11:23,450 --> 00:11:25,565
+And you can see four a's
+
+223
+00:11:25,565 --> 00:11:29,780
+of links up till 20
+AES is pretty fast.
+
+224
+00:11:29,780 --> 00:11:32,495
+But then starting at 26,
+
+225
+00:11:32,495 --> 00:11:35,285
+it's getting really slow.
+
+226
+00:11:35,285 --> 00:11:37,100
+So in this case,
+remember the string
+
+227
+00:11:37,100 --> 00:11:38,870
+is actually matched by
+the regular expression.
+
+228
+00:11:38,870 --> 00:11:40,130
+So it has nothing to do
+
+229
+00:11:40,130 --> 00:11:41,540
+with a regular
+expression actually
+
+230
+00:11:41,540 --> 00:11:45,485
+matches a string or does
+not match a string.
+
+231
+00:11:45,485 --> 00:11:48,260
+I admit though these
+regular expressions
+
+232
+00:11:48,260 --> 00:11:49,610
+are carefully chosen,
+
+233
+00:11:49,610 --> 00:11:52,250
+as you will see later on.
+
+234
+00:11:52,250 --> 00:11:55,620
+Hey, I also just stop that here.
+
+235
+00:11:55,710 --> 00:12:00,985
+Okay, this slight collect
+this information about times.
+
+236
+00:12:00,985 --> 00:12:03,400
+On the right hand side will
+
+237
+00:12:03,400 --> 00:12:05,860
+be our regular expression mantra,
+
+238
+00:12:05,860 --> 00:12:08,290
+which we implement next week.
+
+239
+00:12:08,290 --> 00:12:10,795
+On the left-hand side,
+are these times by
+
+240
+00:12:10,795 --> 00:12:14,260
+barriers than regular
+expression matching engines?
+
+241
+00:12:14,260 --> 00:12:17,809
+On the top is this
+regular expression.
+
+242
+00:12:19,080 --> 00:12:23,335
+Possible a n times a n times.
+
+243
+00:12:23,335 --> 00:12:26,890
+And on the lowest
+is a star, star b.
+
+244
+00:12:26,890 --> 00:12:30,370
+And the x-axis show here
+
+245
+00:12:30,370 --> 00:12:35,335
+the length of the
+string. How many a's.
+
+246
+00:12:35,335 --> 00:12:38,925
+And on the y axis is the time.
+
+247
+00:12:38,925 --> 00:12:41,660
+They need to decide whether
+
+248
+00:12:41,660 --> 00:12:44,615
+the string is matched by
+the rate expression or not.
+
+249
+00:12:44,615 --> 00:12:46,415
+So you can see here, Python,
+
+250
+00:12:46,415 --> 00:12:47,945
+Java eight in JavaScript,
+
+251
+00:12:47,945 --> 00:12:52,250
+they max out approximately
+at between 2530.
+
+252
+00:12:52,250 --> 00:12:53,900
+The kristin, it takes already
+
+253
+00:12:53,900 --> 00:12:55,160
+a half a minute to
+
+254
+00:12:55,160 --> 00:12:57,410
+decide whether the string
+is matched or not.
+
+255
+00:12:57,410 --> 00:13:00,815
+And similarly, in
+the other example,
+
+256
+00:13:00,815 --> 00:13:03,830
+Python and derived Ruby max out
+
+257
+00:13:03,830 --> 00:13:07,220
+at a similar kind of
+length of the strings.
+
+258
+00:13:07,220 --> 00:13:10,400
+Because then they use also
+half a minute to decide
+
+259
+00:13:10,400 --> 00:13:13,940
+whether this rec expression
+actually matches the string.
+
+260
+00:13:13,940 --> 00:13:16,790
+Contrast that with
+the reg expression
+
+261
+00:13:16,790 --> 00:13:19,235
+which we are regular
+expression mantra,
+
+262
+00:13:19,235 --> 00:13:21,470
+which we're going to implement.
+
+263
+00:13:21,470 --> 00:13:25,040
+This can match
+approximately 10 thousand
+
+264
+00:13:25,040 --> 00:13:30,065
+a's in this example and
+needs less than ten seconds.
+
+265
+00:13:30,065 --> 00:13:32,285
+Actually, there will be
+two versions of that.
+
+266
+00:13:32,285 --> 00:13:34,850
+First version may be
+also relatively slow.
+
+267
+00:13:34,850 --> 00:13:36,410
+But the second version,
+
+268
+00:13:36,410 --> 00:13:38,240
+in contrast to Python,
+
+269
+00:13:38,240 --> 00:13:40,295
+Ruby, we'll be blindingly fast.
+
+270
+00:13:40,295 --> 00:13:42,380
+And in the second example,
+
+271
+00:13:42,380 --> 00:13:45,740
+you have to be careful
+about the x axis because
+
+272
+00:13:45,740 --> 00:13:49,385
+that means four times
+ten to the power six.
+
+273
+00:13:49,385 --> 00:13:51,695
+It's actually 4 million A's.
+
+274
+00:13:51,695 --> 00:13:55,100
+So our regular
+expression match or need
+
+275
+00:13:55,100 --> 00:13:57,635
+less than ten seconds to
+
+276
+00:13:57,635 --> 00:14:00,725
+match a string of length
+of 4 million A's.
+
+277
+00:14:00,725 --> 00:14:04,430
+Contrast that Python, Java eight,
+
+278
+00:14:04,430 --> 00:14:06,770
+and JavaScript need half a minute
+
+279
+00:14:06,770 --> 00:14:09,905
+already for a string
+of length just 30,
+
+280
+00:14:09,905 --> 00:14:12,365
+unless you're very
+careful with Java eight.
+
+281
+00:14:12,365 --> 00:14:15,725
+Yes, Java nine and above,
+
+282
+00:14:15,725 --> 00:14:17,180
+they already have
+
+283
+00:14:17,180 --> 00:14:19,610
+a much better regular
+expression matching engine,
+
+284
+00:14:19,610 --> 00:14:22,805
+but still we will be running
+circles around them.
+
+285
+00:14:22,805 --> 00:14:27,050
+It's this data. I
+call this slide.
+
+286
+00:14:27,050 --> 00:14:29,675
+Why bother with
+regular expressions?
+
+287
+00:14:29,675 --> 00:14:33,515
+But you can probably
+see these are
+
+288
+00:14:33,515 --> 00:14:34,910
+at least more times by
+
+289
+00:14:34,910 --> 00:14:38,015
+the existing regular
+expression matching engines.
+
+290
+00:14:38,015 --> 00:14:40,070
+And it's actually
+surprising that after
+
+291
+00:14:40,070 --> 00:14:42,695
+one lecture we can already
+do substantially better.
+
+292
+00:14:42,695 --> 00:14:47,495
+And if you don't believe
+in D times, I gave here,
+
+293
+00:14:47,495 --> 00:14:50,090
+please feel free to
+play on your own
+
+294
+00:14:50,090 --> 00:14:52,865
+with the examples
+I uploaded, Keats.
+
+295
+00:14:52,865 --> 00:14:55,235
+These are exactly the programs
+
+296
+00:14:55,235 --> 00:14:57,470
+are used here in the examples.
+
+297
+00:14:57,470 --> 00:14:59,255
+So feel free.
+
+298
+00:14:59,255 --> 00:15:01,970
+You might however now think, hmm.
+
+299
+00:15:01,970 --> 00:15:05,449
+These are two very
+well chosen examples.
+
+300
+00:15:05,449 --> 00:15:07,145
+And I admit that's true.
+
+301
+00:15:07,145 --> 00:15:09,410
+And such problem there never
+
+302
+00:15:09,410 --> 00:15:12,540
+causing any problems
+in real life.
+
+303
+00:15:13,300 --> 00:15:15,980
+Regular expressions are used very
+
+304
+00:15:15,980 --> 00:15:19,415
+frequently and they
+do cause problems.
+
+305
+00:15:19,415 --> 00:15:21,410
+So here's my first example from
+
+306
+00:15:21,410 --> 00:15:23,885
+a company called cloudflare.
+
+307
+00:15:23,885 --> 00:15:27,560
+This is a huge hosting company
+
+308
+00:15:27,560 --> 00:15:30,935
+which host very
+well-known web pages.
+
+309
+00:15:30,935 --> 00:15:34,970
+And they really try hard
+to have no outage at all.
+
+310
+00:15:34,970 --> 00:15:37,340
+And they manage
+that for six years.
+
+311
+00:15:37,340 --> 00:15:39,320
+But then a Rekha expression,
+
+312
+00:15:39,320 --> 00:15:41,180
+actually this one caused
+a problem and you
+
+313
+00:15:41,180 --> 00:15:43,265
+can see they're also
+like two stars.
+
+314
+00:15:43,265 --> 00:15:44,630
+They are at the end.
+
+315
+00:15:44,630 --> 00:15:46,955
+And because of that string needed
+
+316
+00:15:46,955 --> 00:15:49,865
+too much time to be matched.
+
+317
+00:15:49,865 --> 00:15:50,990
+And because of that,
+
+318
+00:15:50,990 --> 00:15:52,430
+they had some outage for,
+
+319
+00:15:52,430 --> 00:15:54,125
+I think several hours,
+
+320
+00:15:54,125 --> 00:15:57,920
+actually in their malware
+detection subsystem.
+
+321
+00:15:57,920 --> 00:16:02,060
+And the second example
+comes from 2016,
+
+322
+00:16:02,060 --> 00:16:04,040
+where Stack Exchange,
+I guess you know
+
+323
+00:16:04,040 --> 00:16:06,650
+this webpage had
+also an outage from,
+
+324
+00:16:06,650 --> 00:16:08,390
+I think at least an hour.
+
+325
+00:16:08,390 --> 00:16:13,070
+Because a regular expression
+then needed to format posts,
+
+326
+00:16:13,070 --> 00:16:15,575
+needed too much time to
+
+327
+00:16:15,575 --> 00:16:19,010
+recognize whether this post
+should be accepted or not.
+
+328
+00:16:19,010 --> 00:16:23,390
+And again, there was a
+semi kind of problem.
+
+329
+00:16:23,390 --> 00:16:24,950
+And you can read
+the stories behind
+
+330
+00:16:24,950 --> 00:16:28,080
+that on these two given links.
+
+331
+00:16:28,720 --> 00:16:31,730
+When I looked at
+this the first time,
+
+332
+00:16:31,730 --> 00:16:34,175
+what surprised me is
+that theoretician
+
+333
+00:16:34,175 --> 00:16:37,520
+who sometimes dedicate their
+life to regular expression.
+
+334
+00:16:37,520 --> 00:16:39,440
+And no really a lot about
+
+335
+00:16:39,440 --> 00:16:41,690
+them didn't know
+anything about this.
+
+336
+00:16:41,690 --> 00:16:43,610
+But engineers, they
+already created
+
+337
+00:16:43,610 --> 00:16:46,160
+a name for that
+regular expression,
+
+338
+00:16:46,160 --> 00:16:47,975
+denial of service attack.
+
+339
+00:16:47,975 --> 00:16:49,745
+Because what you can,
+
+340
+00:16:49,745 --> 00:16:51,230
+what can happen now is that
+
+341
+00:16:51,230 --> 00:16:54,920
+attackers look for
+certain strings.
+
+342
+00:16:54,920 --> 00:16:56,780
+You make your regular expression
+
+343
+00:16:56,780 --> 00:16:59,105
+matching engine topple over.
+
+344
+00:16:59,105 --> 00:17:01,370
+And these kind of expressions,
+
+345
+00:17:01,370 --> 00:17:04,160
+regular expressions called
+Eve of reg expression.
+
+346
+00:17:04,160 --> 00:17:06,350
+And actually there are
+quite a number of them.
+
+347
+00:17:06,350 --> 00:17:08,495
+So you seen this one,
+
+348
+00:17:08,495 --> 00:17:11,255
+the first one, and the
+second one already.
+
+349
+00:17:11,255 --> 00:17:13,400
+But there are many, many more.
+
+350
+00:17:13,400 --> 00:17:15,620
+And you can easily have in
+
+351
+00:17:15,620 --> 00:17:18,560
+your program one of
+these reg expression.
+
+352
+00:17:18,560 --> 00:17:21,830
+And then you have the
+problem that if you do have
+
+353
+00:17:21,830 --> 00:17:23,240
+this regular expression and
+
+354
+00:17:23,240 --> 00:17:25,640
+somebody finds the
+corresponding string,
+
+355
+00:17:25,640 --> 00:17:29,945
+which make the records
+matching engine topple over,
+
+356
+00:17:29,945 --> 00:17:31,820
+then you have a problem
+
+357
+00:17:31,820 --> 00:17:34,295
+because your webpage is
+probably not variable.
+
+358
+00:17:34,295 --> 00:17:36,140
+This is also sometimes called
+
+359
+00:17:36,140 --> 00:17:39,350
+this phenomenon,
+catastrophic backtracking.
+
+360
+00:17:39,350 --> 00:17:43,595
+In lecture three, we will
+look at this more carefully.
+
+361
+00:17:43,595 --> 00:17:46,910
+And actually why that
+is such a problem in
+
+362
+00:17:46,910 --> 00:17:50,795
+real life is actually
+not to do with Lexus.
+
+363
+00:17:50,795 --> 00:17:53,180
+Yes, regular
+expressions are used as
+
+364
+00:17:53,180 --> 00:17:55,040
+the basic tool for implementing
+
+365
+00:17:55,040 --> 00:17:57,185
+like source bad reg expressions,
+
+366
+00:17:57,185 --> 00:18:00,065
+of course, used in
+a much wider area.
+
+367
+00:18:00,065 --> 00:18:03,770
+And they especially used for
+network intrusion detection.
+
+368
+00:18:03,770 --> 00:18:06,590
+Remember, you having to
+
+369
+00:18:06,590 --> 00:18:10,130
+administer a big network
+and you only want to let
+
+370
+00:18:10,130 --> 00:18:13,640
+in packets which you think are K
+
+371
+00:18:13,640 --> 00:18:14,930
+and you want to keep out
+
+372
+00:18:14,930 --> 00:18:17,645
+any package which might
+hack into your network.
+
+373
+00:18:17,645 --> 00:18:22,670
+So what they have is they
+have suites of thousands and
+
+374
+00:18:22,670 --> 00:18:25,745
+sometimes even more
+regular expressions which
+
+375
+00:18:25,745 --> 00:18:27,755
+check whether this package
+
+376
+00:18:27,755 --> 00:18:30,065
+satisfies some patterns or not.
+
+377
+00:18:30,065 --> 00:18:31,460
+And in this case it will be left
+
+378
+00:18:31,460 --> 00:18:34,205
+out or it will be let in.
+
+379
+00:18:34,205 --> 00:18:36,335
+And with networks,
+
+380
+00:18:36,335 --> 00:18:39,080
+the problem is that our
+hardware is already
+
+381
+00:18:39,080 --> 00:18:43,190
+so fast that the reg expressions
+
+382
+00:18:43,190 --> 00:18:45,169
+really become a bottleneck.
+
+383
+00:18:45,169 --> 00:18:47,060
+Because what do you do if now is
+
+384
+00:18:47,060 --> 00:18:49,880
+suddenly a reg expression
+takes too much time
+
+385
+00:18:49,880 --> 00:18:52,670
+to just stop the matching
+
+386
+00:18:52,670 --> 00:18:55,100
+and let the package
+in regardless?
+
+387
+00:18:55,100 --> 00:18:58,190
+Or do you just hold
+the network up
+
+388
+00:18:58,190 --> 00:19:01,715
+and don't let anything in
+until you decided that.
+
+389
+00:19:01,715 --> 00:19:04,895
+So that's actually a
+really hard problem.
+
+390
+00:19:04,895 --> 00:19:06,650
+But the first time I came across
+
+391
+00:19:06,650 --> 00:19:09,965
+that problem was actually
+by this engineer.
+
+392
+00:19:09,965 --> 00:19:13,820
+And it's always say that
+Germans don't have any Yammer.
+
+393
+00:19:13,820 --> 00:19:16,985
+But I found that
+video quite funny.
+
+394
+00:19:16,985 --> 00:19:19,145
+Maybe you have a
+different opinion,
+
+395
+00:19:19,145 --> 00:19:21,095
+but feel free to
+have a look which
+
+396
+00:19:21,095 --> 00:19:23,705
+explains exactly that problem.
+
+397
+00:19:23,705 --> 00:19:25,610
+So in the next video,
+
+398
+00:19:25,610 --> 00:19:28,445
+we will start to
+implement this matcher.
+
+399
+00:19:28,445 --> 00:19:30,870
+So I hope to see you there.