--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/videos/01-evilregexes.srt Wed Sep 23 11:34:43 2020 +0100
@@ -0,0 +1,1813 @@
+1
+00:00:06,240 --> 00:00:11,050
+Welcome back. This video
+is about regular expressions.
+
+2
+00:00:11,050 --> 00:00:14,230
+We want to use regular
+expressions in our lexer.
+
+3
+00:00:14,230 --> 00:00:16,165
+And the purpose of the
+lexer is to find
+
+4
+00:00:16,165 --> 00:00:18,070
+out where the words in
+
+5
+00:00:18,070 --> 00:00:21,070
+our programs are. However
+regular expressions
+
+6
+00:00:21,070 --> 00:00:23,875
+are fundamental tool
+in computer science.
+
+7
+00:00:23,875 --> 00:00:27,910
+And I'm sure you've used them
+already on several occasions.
+
+8
+00:00:27,910 --> 00:00:30,370
+And one would expect that about
+
+9
+00:00:30,370 --> 00:00:31,750
+regular expressions since they are
+
+10
+00:00:31,750 --> 00:00:33,850
+so well-known and well studied,
+
+11
+00:00:33,850 --> 00:00:37,915
+that everything under the
+sun is known about them.
+
+12
+00:00:37,915 --> 00:00:41,080
+But actually there's
+still some surprising
+
+13
+00:00:41,080 --> 00:00:44,465
+and interesting
+problems with them.
+
+14
+00:00:44,465 --> 00:00:47,945
+And I want to show you
+them in this video.
+
+15
+00:00:47,945 --> 00:00:50,720
+I'm sure you've seen
+regular expressions
+
+16
+00:00:50,720 --> 00:00:52,445
+many, many times before.
+
+17
+00:00:52,445 --> 00:00:55,100
+But just to be on the same page,
+
+18
+00:00:55,100 --> 00:00:57,110
+let me just recap them.
+
+19
+00:00:57,110 --> 00:00:59,210
+So here in this line,
+
+20
+00:00:59,210 --> 00:01:01,790
+there is a regular expression
+which is supposed to
+
+21
+00:01:01,790 --> 00:01:05,285
+recognize some form
+of email addresses.
+
+22
+00:01:05,285 --> 00:01:07,745
+So an e-mail address
+
+23
+00:01:07,745 --> 00:01:11,000
+has part which is
+before the @ symbol,
+
+24
+00:01:11,000 --> 00:01:13,400
+which is the name of the person.
+
+25
+00:01:13,400 --> 00:01:16,880
+And that can be
+any number between
+
+26
+00:01:16,880 --> 00:01:20,195
+0 and 9, and letters between a and z.
+
+27
+00:01:20,195 --> 00:01:24,155
+Let's say we avoiding
+here capital letters.
+
+28
+00:01:24,155 --> 00:01:26,045
+There can be underscores.
+
+29
+00:01:26,045 --> 00:01:29,405
+There can be a dot and
+there can be hyphens.
+
+30
+00:01:29,405 --> 00:01:35,390
+And after the @ symbol
+comes the domain name.
+
+31
+00:01:35,390 --> 00:01:37,310
+So as you can see here,
+
+32
+00:01:37,310 --> 00:01:40,640
+we use things like star to
+
+33
+00:01:40,640 --> 00:01:44,314
+match letters
+zero or more times.
+
+34
+00:01:44,314 --> 00:01:45,985
+Or we have a plus,
+
+35
+00:01:45,985 --> 00:01:47,420
+which means you have to match
+
+36
+00:01:47,420 --> 00:01:52,489
+at least once or more
+times. Then we have.
+
+37
+00:01:52,489 --> 00:01:55,790
+question mark, which says you
+
+38
+00:01:55,790 --> 00:01:59,105
+match either it is there
+or it ss not there.
+
+39
+00:01:59,105 --> 00:02:01,340
+You are also regular
+expressions which
+
+40
+00:02:01,340 --> 00:02:03,755
+match exactly n-times.
+
+41
+00:02:03,755 --> 00:02:08,720
+Or this is a regular expression
+for between n and m times.
+
+42
+00:02:08,720 --> 00:02:12,065
+You can see in
+this email address,
+
+43
+00:02:12,065 --> 00:02:13,730
+the top-level domain
+
+44
+00:02:13,730 --> 00:02:16,130
+name can be any letter
+
+45
+00:02:16,130 --> 00:02:19,265
+between a to z,
+and contain dots,
+
+46
+00:02:19,265 --> 00:02:22,340
+but can only be two
+characters long
+
+47
+00:02:22,340 --> 00:02:25,685
+up till six characters
+and not more.
+
+48
+00:02:25,685 --> 00:02:29,240
+Then you also have
+something like ranges.
+
+49
+00:02:29,240 --> 00:02:31,220
+So you can see, letters between a
+
+50
+00:02:31,220 --> 00:02:33,635
+and z and 0 to 9 and so on.
+
+51
+00:02:33,635 --> 00:02:36,545
+Here you also have regular
+expression which can
+
+52
+00:02:36,545 --> 00:02:40,070
+match something which
+isn't in this range.
+
+53
+00:02:40,070 --> 00:02:42,560
+So for example, if
+you want for example match,
+
+54
+00:02:42,560 --> 00:02:44,030
+letters but not numbers,
+
+55
+00:02:44,030 --> 00:02:45,800
+you would say, well, if
+
+56
+00:02:45,800 --> 00:02:48,990
+this is a number that
+should not match.
+
+57
+00:02:49,090 --> 00:02:52,804
+Typically you also
+have these ranges.
+
+58
+00:02:52,804 --> 00:02:55,565
+Lowercase letters,
+capital letters.
+
+59
+00:02:55,565 --> 00:02:58,550
+Then you have some
+special regular expressions
+
+60
+00:02:58,550 --> 00:03:02,195
+like this one is only
+supposed to match digits.
+
+61
+00:03:02,195 --> 00:03:05,674
+A dot is supposed to
+match any character.
+
+62
+00:03:05,674 --> 00:03:07,370
+And then they have also something
+
+63
+00:03:07,370 --> 00:03:09,800
+called groups which
+is supposed to be
+
+64
+00:03:09,800 --> 00:03:12,799
+used when you are
+trying to extract
+
+65
+00:03:12,799 --> 00:03:15,605
+a string you've matched.
+
+66
+00:03:15,605 --> 00:03:19,925
+Okay, so these are the
+typical regular expressions.
+
+67
+00:03:19,925 --> 00:03:23,075
+And here's a particular one.
+
+68
+00:03:23,075 --> 00:03:25,820
+Trying to match something
+
+69
+00:03:25,820 --> 00:03:28,770
+which resembles
+an email address.
+
+70
+00:03:29,590 --> 00:03:33,065
+Clearly that should be all easy.
+
+71
+00:03:33,065 --> 00:03:36,230
+And our technology should
+be on top of that.
+
+72
+00:03:36,230 --> 00:03:37,865
+That we can take a
+
+73
+00:03:37,865 --> 00:03:41,015
+regular expressions and
+we can take a string,
+
+74
+00:03:41,015 --> 00:03:43,340
+and we should have programs to
+
+75
+00:03:43,340 --> 00:03:45,680
+decide whether this
+string is matched
+
+76
+00:03:45,680 --> 00:03:50,330
+by a regular expression or
+not and should be easy-peasy, no?
+
+77
+00:03:50,330 --> 00:03:56,150
+Well, let's have a
+look at two examples.
+
+78
+00:03:56,150 --> 00:04:00,860
+The first regular expression
+is a star star b.
+
+79
+00:04:00,860 --> 00:04:02,990
+And it is supposed
+to match strings of
+
+80
+00:04:02,990 --> 00:04:05,825
+the form 0 or more a's,
+
+81
+00:04:05,825 --> 00:04:10,385
+followed by a b. The parentheses
+you can ignore.
+
+82
+00:04:10,385 --> 00:04:11,990
+And a star star
+
+83
+00:04:11,990 --> 00:04:14,120
+also doesn't
+make any difference
+
+84
+00:04:14,120 --> 00:04:16,505
+to what kind of strings
+that can be matched.
+
+85
+00:04:16,505 --> 00:04:21,635
+It can only make 0 more
+a's followed by a b.
+
+86
+00:04:21,635 --> 00:04:23,900
+And the other regular expression
+
+87
+00:04:23,900 --> 00:04:26,990
+is possibly a character a,
+
+88
+00:04:26,990 --> 00:04:32,930
+n times, followed by character
+a axactly n-times.
+
+89
+00:04:32,930 --> 00:04:35,570
+And we will try out
+
+90
+00:04:35,570 --> 00:04:38,360
+these two regular expressions
+with strings of the form a,
+
+91
+00:04:38,360 --> 00:04:39,890
+aa, and so on,
+
+92
+00:04:39,890 --> 00:04:45,770
+and up to the length of n. And
+
+93
+00:04:45,770 --> 00:04:49,130
+this regular expression should
+actually not match any of
+
+94
+00:04:49,130 --> 00:04:53,315
+the strings because the
+final b is missing.
+
+95
+00:04:53,315 --> 00:04:56,150
+But that is
+okay. For example
+
+96
+00:04:56,150 --> 00:04:57,425
+if you have a regular expression
+
+97
+00:04:57,425 --> 00:05:00,110
+that is supposed to
+check whether a string is
+
+98
+00:05:00,110 --> 00:05:01,490
+an email address and the user
+
+99
+00:05:01,490 --> 00:05:03,380
+gives some random
+strings in there,
+
+100
+00:05:03,380 --> 00:05:06,545
+then this regular expression
+should not match that string.
+
+101
+00:05:06,545 --> 00:05:08,420
+And for this regular expression
+
+102
+00:05:08,420 --> 00:05:11,195
+you have to scratch a
+little bit of your head,
+
+103
+00:05:11,195 --> 00:05:12,620
+what it can actually match.
+
+104
+00:05:12,620 --> 00:05:14,720
+But after a little bit
+of head scratching,
+
+105
+00:05:14,720 --> 00:05:18,260
+you find out can match
+any string which is of
+
+106
+00:05:18,260 --> 00:05:22,580
+the length n a's up
+to 2n of a's.
+
+107
+00:05:22,580 --> 00:05:24,290
+So anything in this range,
+
+108
+00:05:24,290 --> 00:05:27,185
+this regular expression
+can actually match.
+
+109
+00:05:27,185 --> 00:05:30,395
+Okay, let's
+take a random tool,
+
+110
+00:05:30,395 --> 00:05:32,630
+maybe for example Python.
+
+111
+00:05:32,630 --> 00:05:35,240
+So here's a little
+Python program.
+
+112
+00:05:35,240 --> 00:05:38,690
+It uses the library
+function of Python to
+
+113
+00:05:38,690 --> 00:05:42,935
+match the regular expressions of
+a star star b.
+
+114
+00:05:42,935 --> 00:05:46,805
+And we measure time with longer
+and longer strings of a.
+
+115
+00:05:46,805 --> 00:05:48,770
+And so conveniently we can give
+
+116
+00:05:48,770 --> 00:05:51,140
+the number of a's here
+on the command line.
+
+117
+00:05:51,140 --> 00:05:56,900
+If I just call
+this on the command line,
+
+118
+00:05:56,900 --> 00:05:59,900
+Let's say we first
+start with five a's.
+
+119
+00:05:59,900 --> 00:06:03,920
+And I get also the times which
+in this case is next to nothing.
+
+120
+00:06:03,920 --> 00:06:05,960
+And here's the string
+we just matched.
+
+121
+00:06:05,960 --> 00:06:07,640
+And obviously the
+regular expression
+
+122
+00:06:07,640 --> 00:06:09,110
+did not match the string.
+
+123
+00:06:09,110 --> 00:06:11,255
+That's indicated by this none.
+
+124
+00:06:11,255 --> 00:06:13,925
+Let's take ten a's.
+
+125
+00:06:13,925 --> 00:06:16,490
+It's also pretty quick.
+
+126
+00:06:16,490 --> 00:06:20,780
+Fifteen a's, even quicker,
+
+127
+00:06:20,780 --> 00:06:23,180
+but these times always need to
+
+128
+00:06:23,180 --> 00:06:25,820
+be taken with a grain of salt.
+
+129
+00:06:25,820 --> 00:06:28,040
+They are not 100
+percent accurate.
+
+130
+00:06:28,040 --> 00:06:31,490
+So 15 is also a let's take
+
+131
+00:06:31,490 --> 00:06:36,965
+28th notes already
+double the time.
+
+132
+00:06:36,965 --> 00:06:42,440
+Twenty-five longer.
+
+133
+00:06:42,440 --> 00:06:45,680
+Okay, that suddenly
+from 02 seconds,
+
+134
+00:06:45,680 --> 00:06:48,960
+it takes almost four seconds.
+
+135
+00:06:49,600 --> 00:06:54,890
+Six this
+
+136
+00:06:54,890 --> 00:07:01,415
+takes six seconds
+already Double, okay?
+
+137
+00:07:01,415 --> 00:07:07,229
+Go to 28. That would be now.
+
+138
+00:07:08,890 --> 00:07:11,840
+You see the string
+isn't very long,
+
+139
+00:07:11,840 --> 00:07:13,340
+so that could be easily like
+
+140
+00:07:13,340 --> 00:07:16,070
+just the size of
+an email address.
+
+141
+00:07:16,070 --> 00:07:19,280
+And the regular
+expression matching
+
+142
+00:07:19,280 --> 00:07:22,550
+engine in Python needs
+quite a long time
+
+143
+00:07:22,550 --> 00:07:24,710
+to find out that
+this string of 28
+
+144
+00:07:24,710 --> 00:07:26,570
+AES is actually not much
+
+145
+00:07:26,570 --> 00:07:28,490
+by that you see it's
+still not finished.
+
+146
+00:07:28,490 --> 00:07:32,900
+I think it should take
+approximately like 20 seconds.
+
+147
+00:07:32,900 --> 00:07:34,400
+Okay. Already 30.
+
+148
+00:07:34,400 --> 00:07:36,530
+And if we would try
+
+149
+00:07:36,530 --> 00:07:40,805
+30 would be already
+more than a minute.
+
+150
+00:07:40,805 --> 00:07:43,940
+And if I could read
+something like hundreds,
+
+151
+00:07:43,940 --> 00:07:46,220
+you remember if a doubling in
+
+152
+00:07:46,220 --> 00:07:48,770
+each step or the second step,
+
+153
+00:07:48,770 --> 00:07:50,720
+the story with the chess board,
+
+154
+00:07:50,720 --> 00:07:53,855
+we probably would sit here
+until the next century.
+
+155
+00:07:53,855 --> 00:07:56,820
+So something strange here.
+
+156
+00:07:57,580 --> 00:08:01,355
+Okay, that might be just
+a problem of Python.
+
+157
+00:08:01,355 --> 00:08:02,990
+Let's have a look at another
+
+158
+00:08:02,990 --> 00:08:04,985
+regular expression
+matching engine.
+
+159
+00:08:04,985 --> 00:08:06,890
+This time from JavaScript,
+
+160
+00:08:06,890 --> 00:08:10,040
+also are pretty well-known
+programming language.
+
+161
+00:08:10,040 --> 00:08:13,610
+So here you can see
+it's still a star,
+
+162
+00:08:13,610 --> 00:08:16,235
+star followed by b,
+
+163
+00:08:16,235 --> 00:08:18,920
+by direct expression is
+supposed to match that from
+
+164
+00:08:18,920 --> 00:08:21,830
+the beginning of the
+string up till the end.
+
+165
+00:08:21,830 --> 00:08:23,930
+So there's not any difference
+
+166
+00:08:23,930 --> 00:08:26,150
+in the strings this work
+expression matches.
+
+167
+00:08:26,150 --> 00:08:28,610
+We'll just start at the
+beginning of the string
+
+168
+00:08:28,610 --> 00:08:31,460
+and finish at the
+end of the string.
+
+169
+00:08:31,460 --> 00:08:35,285
+And we again, we just use
+repeated A's for that.
+
+170
+00:08:35,285 --> 00:08:38,195
+And similarly, we can
+
+171
+00:08:38,195 --> 00:08:41,930
+call it on the command line
+and can do some timing.
+
+172
+00:08:41,930 --> 00:08:44,540
+So ten SBA, good.
+
+173
+00:08:44,540 --> 00:08:46,340
+Here's the string.
+
+174
+00:08:46,340 --> 00:08:48,320
+It cannot match that string.
+
+175
+00:08:48,320 --> 00:08:50,525
+And it's pretty fast.
+
+176
+00:08:50,525 --> 00:08:54,725
+Friendly. Although pretty fast.
+
+177
+00:08:54,725 --> 00:08:59,120
+Five, again,
+
+178
+00:08:59,120 --> 00:09:06,650
+somehow is kind of
+threshold that is 25, 26.
+
+179
+00:09:06,650 --> 00:09:09,485
+Suddenly it takes much longer.
+
+180
+00:09:09,485 --> 00:09:14,360
+And it has essentially the
+same problem as with Python.
+
+181
+00:09:14,360 --> 00:09:17,165
+So you'll see in now from 26 on,
+
+182
+00:09:17,165 --> 00:09:19,250
+the Times has always
+doubling from
+
+183
+00:09:19,250 --> 00:09:21,860
+three seconds to seven seconds.
+
+184
+00:09:21,860 --> 00:09:23,330
+So you can imagine what that
+
+185
+00:09:23,330 --> 00:09:24,890
+roughly takes when I put your
+
+186
+00:09:24,890 --> 00:09:30,230
+27 and you see the
+string isn't very long.
+
+187
+00:09:30,230 --> 00:09:32,165
+Let's choose twenties or maize.
+
+188
+00:09:32,165 --> 00:09:35,419
+Imagine you have to
+search a database
+
+189
+00:09:35,419 --> 00:09:38,720
+with kilobytes of data.
+
+190
+00:09:38,720 --> 00:09:42,260
+This, these regular
+expressions that would years
+
+191
+00:09:42,260 --> 00:09:48,150
+need years to go through with
+these regular expressions.
+
+192
+00:09:48,630 --> 00:09:51,850
+Okay, maybe the people in
+
+193
+00:09:51,850 --> 00:09:55,435
+Python and JavaScript,
+they're just idiots.
+
+194
+00:09:55,435 --> 00:09:58,180
+Surely Java must do much better.
+
+195
+00:09:58,180 --> 00:10:01,045
+So here's a program.
+
+196
+00:10:01,045 --> 00:10:03,415
+You can see this again
+
+197
+00:10:03,415 --> 00:10:05,980
+is the reg expression
+and we just having
+
+198
+00:10:05,980 --> 00:10:08,320
+some scaffolding to generate
+
+199
+00:10:08,320 --> 00:10:11,905
+strings from five up till 28.
+
+200
+00:10:11,905 --> 00:10:14,305
+And if we run that,
+
+201
+00:10:14,305 --> 00:10:16,660
+actually does that automatically.
+
+202
+00:10:16,660 --> 00:10:19,900
+So uphill 19, pretty fast,
+
+203
+00:10:19,900 --> 00:10:24,925
+but then starting from
+23, skidding pretty slow.
+
+204
+00:10:24,925 --> 00:10:27,445
+So the question is
+what's going on?
+
+205
+00:10:27,445 --> 00:10:29,230
+By the way, I'm not quoting here.
+
+206
+00:10:29,230 --> 00:10:33,755
+Scala, using internally
+the regular expression
+
+207
+00:10:33,755 --> 00:10:36,665
+matching engine from Java.
+
+208
+00:10:36,665 --> 00:10:39,065
+So would have exactly
+the same problem.
+
+209
+00:10:39,065 --> 00:10:41,480
+Also, I have been
+here very careful,
+
+210
+00:10:41,480 --> 00:10:43,550
+I'm using here Scala aid,
+
+211
+00:10:43,550 --> 00:10:46,085
+which nowadays is quite old.
+
+212
+00:10:46,085 --> 00:10:50,765
+But you will see also
+current Java versions.
+
+213
+00:10:50,765 --> 00:10:55,490
+We will see we can out-compete
+them by magnitudes.
+
+214
+00:10:55,490 --> 00:10:57,605
+So I think I can that.
+
+215
+00:10:57,605 --> 00:10:59,165
+Now, just finish here.
+
+216
+00:10:59,165 --> 00:11:04,025
+You see the problem. Just
+for completeness sake.
+
+217
+00:11:04,025 --> 00:11:07,010
+Here is a Ruby program.
+
+218
+00:11:07,010 --> 00:11:09,935
+This is using the other
+regular expression.
+
+219
+00:11:09,935 --> 00:11:12,935
+In this case the
+string should match.
+
+220
+00:11:12,935 --> 00:11:20,300
+And again it tries out
+strings between 130 here.
+
+221
+00:11:20,300 --> 00:11:23,450
+That's a program actually
+a former student produced.
+
+222
+00:11:23,450 --> 00:11:25,565
+And you can see four a's
+
+223
+00:11:25,565 --> 00:11:29,780
+of links up till 20
+AES is pretty fast.
+
+224
+00:11:29,780 --> 00:11:32,495
+But then starting at 26,
+
+225
+00:11:32,495 --> 00:11:35,285
+it's getting really slow.
+
+226
+00:11:35,285 --> 00:11:37,100
+So in this case,
+remember the string
+
+227
+00:11:37,100 --> 00:11:38,870
+is actually matched by
+the regular expression.
+
+228
+00:11:38,870 --> 00:11:40,130
+So it has nothing to do
+
+229
+00:11:40,130 --> 00:11:41,540
+with a regular
+expression actually
+
+230
+00:11:41,540 --> 00:11:45,485
+matches a string or does
+not match a string.
+
+231
+00:11:45,485 --> 00:11:48,260
+I admit though these
+regular expressions
+
+232
+00:11:48,260 --> 00:11:49,610
+are carefully chosen,
+
+233
+00:11:49,610 --> 00:11:52,250
+as you will see later on.
+
+234
+00:11:52,250 --> 00:11:55,620
+Hey, I also just stop that here.
+
+235
+00:11:55,710 --> 00:12:00,985
+Okay, this slight collect
+this information about times.
+
+236
+00:12:00,985 --> 00:12:03,400
+On the right hand side will
+
+237
+00:12:03,400 --> 00:12:05,860
+be our regular expression mantra,
+
+238
+00:12:05,860 --> 00:12:08,290
+which we implement next week.
+
+239
+00:12:08,290 --> 00:12:10,795
+On the left-hand side,
+are these times by
+
+240
+00:12:10,795 --> 00:12:14,260
+barriers than regular
+expression matching engines?
+
+241
+00:12:14,260 --> 00:12:17,809
+On the top is this
+regular expression.
+
+242
+00:12:19,080 --> 00:12:23,335
+Possible a n times a n times.
+
+243
+00:12:23,335 --> 00:12:26,890
+And on the lowest
+is a star, star b.
+
+244
+00:12:26,890 --> 00:12:30,370
+And the x-axis show here
+
+245
+00:12:30,370 --> 00:12:35,335
+the length of the
+string. How many a's.
+
+246
+00:12:35,335 --> 00:12:38,925
+And on the y axis is the time.
+
+247
+00:12:38,925 --> 00:12:41,660
+They need to decide whether
+
+248
+00:12:41,660 --> 00:12:44,615
+the string is matched by
+the rate expression or not.
+
+249
+00:12:44,615 --> 00:12:46,415
+So you can see here, Python,
+
+250
+00:12:46,415 --> 00:12:47,945
+Java eight in JavaScript,
+
+251
+00:12:47,945 --> 00:12:52,250
+they max out approximately
+at between 2530.
+
+252
+00:12:52,250 --> 00:12:53,900
+The kristin, it takes already
+
+253
+00:12:53,900 --> 00:12:55,160
+a half a minute to
+
+254
+00:12:55,160 --> 00:12:57,410
+decide whether the string
+is matched or not.
+
+255
+00:12:57,410 --> 00:13:00,815
+And similarly, in
+the other example,
+
+256
+00:13:00,815 --> 00:13:03,830
+Python and derived Ruby max out
+
+257
+00:13:03,830 --> 00:13:07,220
+at a similar kind of
+length of the strings.
+
+258
+00:13:07,220 --> 00:13:10,400
+Because then they use also
+half a minute to decide
+
+259
+00:13:10,400 --> 00:13:13,940
+whether this rec expression
+actually matches the string.
+
+260
+00:13:13,940 --> 00:13:16,790
+Contrast that with
+the reg expression
+
+261
+00:13:16,790 --> 00:13:19,235
+which we are regular
+expression mantra,
+
+262
+00:13:19,235 --> 00:13:21,470
+which we're going to implement.
+
+263
+00:13:21,470 --> 00:13:25,040
+This can match
+approximately 10 thousand
+
+264
+00:13:25,040 --> 00:13:30,065
+a's in this example and
+needs less than ten seconds.
+
+265
+00:13:30,065 --> 00:13:32,285
+Actually, there will be
+two versions of that.
+
+266
+00:13:32,285 --> 00:13:34,850
+First version may be
+also relatively slow.
+
+267
+00:13:34,850 --> 00:13:36,410
+But the second version,
+
+268
+00:13:36,410 --> 00:13:38,240
+in contrast to Python,
+
+269
+00:13:38,240 --> 00:13:40,295
+Ruby, we'll be blindingly fast.
+
+270
+00:13:40,295 --> 00:13:42,380
+And in the second example,
+
+271
+00:13:42,380 --> 00:13:45,740
+you have to be careful
+about the x axis because
+
+272
+00:13:45,740 --> 00:13:49,385
+that means four times
+ten to the power six.
+
+273
+00:13:49,385 --> 00:13:51,695
+It's actually 4 million A's.
+
+274
+00:13:51,695 --> 00:13:55,100
+So our regular
+expression match or need
+
+275
+00:13:55,100 --> 00:13:57,635
+less than ten seconds to
+
+276
+00:13:57,635 --> 00:14:00,725
+match a string of length
+of 4 million A's.
+
+277
+00:14:00,725 --> 00:14:04,430
+Contrast that Python, Java eight,
+
+278
+00:14:04,430 --> 00:14:06,770
+and JavaScript need half a minute
+
+279
+00:14:06,770 --> 00:14:09,905
+already for a string
+of length just 30,
+
+280
+00:14:09,905 --> 00:14:12,365
+unless you're very
+careful with Java eight.
+
+281
+00:14:12,365 --> 00:14:15,725
+Yes, Java nine and above,
+
+282
+00:14:15,725 --> 00:14:17,180
+they already have
+
+283
+00:14:17,180 --> 00:14:19,610
+a much better regular
+expression matching engine,
+
+284
+00:14:19,610 --> 00:14:22,805
+but still we will be running
+circles around them.
+
+285
+00:14:22,805 --> 00:14:27,050
+It's this data. I
+call this slide.
+
+286
+00:14:27,050 --> 00:14:29,675
+Why bother with
+regular expressions?
+
+287
+00:14:29,675 --> 00:14:33,515
+But you can probably
+see these are
+
+288
+00:14:33,515 --> 00:14:34,910
+at least more times by
+
+289
+00:14:34,910 --> 00:14:38,015
+the existing regular
+expression matching engines.
+
+290
+00:14:38,015 --> 00:14:40,070
+And it's actually
+surprising that after
+
+291
+00:14:40,070 --> 00:14:42,695
+one lecture we can already
+do substantially better.
+
+292
+00:14:42,695 --> 00:14:47,495
+And if you don't believe
+in D times, I gave here,
+
+293
+00:14:47,495 --> 00:14:50,090
+please feel free to
+play on your own
+
+294
+00:14:50,090 --> 00:14:52,865
+with the examples
+I uploaded, Keats.
+
+295
+00:14:52,865 --> 00:14:55,235
+These are exactly the programs
+
+296
+00:14:55,235 --> 00:14:57,470
+are used here in the examples.
+
+297
+00:14:57,470 --> 00:14:59,255
+So feel free.
+
+298
+00:14:59,255 --> 00:15:01,970
+You might however now think, hmm.
+
+299
+00:15:01,970 --> 00:15:05,449
+These are two very
+well chosen examples.
+
+300
+00:15:05,449 --> 00:15:07,145
+And I admit that's true.
+
+301
+00:15:07,145 --> 00:15:09,410
+And such problem there never
+
+302
+00:15:09,410 --> 00:15:12,540
+causing any problems
+in real life.
+
+303
+00:15:13,300 --> 00:15:15,980
+Regular expressions are used very
+
+304
+00:15:15,980 --> 00:15:19,415
+frequently and they
+do cause problems.
+
+305
+00:15:19,415 --> 00:15:21,410
+So here's my first example from
+
+306
+00:15:21,410 --> 00:15:23,885
+a company called cloudflare.
+
+307
+00:15:23,885 --> 00:15:27,560
+This is a huge hosting company
+
+308
+00:15:27,560 --> 00:15:30,935
+which host very
+well-known web pages.
+
+309
+00:15:30,935 --> 00:15:34,970
+And they really try hard
+to have no outage at all.
+
+310
+00:15:34,970 --> 00:15:37,340
+And they manage
+that for six years.
+
+311
+00:15:37,340 --> 00:15:39,320
+But then a Rekha expression,
+
+312
+00:15:39,320 --> 00:15:41,180
+actually this one caused
+a problem and you
+
+313
+00:15:41,180 --> 00:15:43,265
+can see they're also
+like two stars.
+
+314
+00:15:43,265 --> 00:15:44,630
+They are at the end.
+
+315
+00:15:44,630 --> 00:15:46,955
+And because of that string needed
+
+316
+00:15:46,955 --> 00:15:49,865
+too much time to be matched.
+
+317
+00:15:49,865 --> 00:15:50,990
+And because of that,
+
+318
+00:15:50,990 --> 00:15:52,430
+they had some outage for,
+
+319
+00:15:52,430 --> 00:15:54,125
+I think several hours,
+
+320
+00:15:54,125 --> 00:15:57,920
+actually in their malware
+detection subsystem.
+
+321
+00:15:57,920 --> 00:16:02,060
+And the second example
+comes from 2016,
+
+322
+00:16:02,060 --> 00:16:04,040
+where Stack Exchange,
+I guess you know
+
+323
+00:16:04,040 --> 00:16:06,650
+this webpage had
+also an outage from,
+
+324
+00:16:06,650 --> 00:16:08,390
+I think at least an hour.
+
+325
+00:16:08,390 --> 00:16:13,070
+Because a regular expression
+then needed to format posts,
+
+326
+00:16:13,070 --> 00:16:15,575
+needed too much time to
+
+327
+00:16:15,575 --> 00:16:19,010
+recognize whether this post
+should be accepted or not.
+
+328
+00:16:19,010 --> 00:16:23,390
+And again, there was a
+semi kind of problem.
+
+329
+00:16:23,390 --> 00:16:24,950
+And you can read
+the stories behind
+
+330
+00:16:24,950 --> 00:16:28,080
+that on these two given links.
+
+331
+00:16:28,720 --> 00:16:31,730
+When I looked at
+this the first time,
+
+332
+00:16:31,730 --> 00:16:34,175
+what surprised me is
+that theoretician
+
+333
+00:16:34,175 --> 00:16:37,520
+who sometimes dedicate their
+life to regular expression.
+
+334
+00:16:37,520 --> 00:16:39,440
+And no really a lot about
+
+335
+00:16:39,440 --> 00:16:41,690
+them didn't know
+anything about this.
+
+336
+00:16:41,690 --> 00:16:43,610
+But engineers, they
+already created
+
+337
+00:16:43,610 --> 00:16:46,160
+a name for that
+regular expression,
+
+338
+00:16:46,160 --> 00:16:47,975
+denial of service attack.
+
+339
+00:16:47,975 --> 00:16:49,745
+Because what you can,
+
+340
+00:16:49,745 --> 00:16:51,230
+what can happen now is that
+
+341
+00:16:51,230 --> 00:16:54,920
+attackers look for
+certain strings.
+
+342
+00:16:54,920 --> 00:16:56,780
+You make your regular expression
+
+343
+00:16:56,780 --> 00:16:59,105
+matching engine topple over.
+
+344
+00:16:59,105 --> 00:17:01,370
+And these kind of expressions,
+
+345
+00:17:01,370 --> 00:17:04,160
+regular expressions called
+Eve of reg expression.
+
+346
+00:17:04,160 --> 00:17:06,350
+And actually there are
+quite a number of them.
+
+347
+00:17:06,350 --> 00:17:08,495
+So you seen this one,
+
+348
+00:17:08,495 --> 00:17:11,255
+the first one, and the
+second one already.
+
+349
+00:17:11,255 --> 00:17:13,400
+But there are many, many more.
+
+350
+00:17:13,400 --> 00:17:15,620
+And you can easily have in
+
+351
+00:17:15,620 --> 00:17:18,560
+your program one of
+these reg expression.
+
+352
+00:17:18,560 --> 00:17:21,830
+And then you have the
+problem that if you do have
+
+353
+00:17:21,830 --> 00:17:23,240
+this regular expression and
+
+354
+00:17:23,240 --> 00:17:25,640
+somebody finds the
+corresponding string,
+
+355
+00:17:25,640 --> 00:17:29,945
+which make the records
+matching engine topple over,
+
+356
+00:17:29,945 --> 00:17:31,820
+then you have a problem
+
+357
+00:17:31,820 --> 00:17:34,295
+because your webpage is
+probably not variable.
+
+358
+00:17:34,295 --> 00:17:36,140
+This is also sometimes called
+
+359
+00:17:36,140 --> 00:17:39,350
+this phenomenon,
+catastrophic backtracking.
+
+360
+00:17:39,350 --> 00:17:43,595
+In lecture three, we will
+look at this more carefully.
+
+361
+00:17:43,595 --> 00:17:46,910
+And actually why that
+is such a problem in
+
+362
+00:17:46,910 --> 00:17:50,795
+real life is actually
+not to do with Lexus.
+
+363
+00:17:50,795 --> 00:17:53,180
+Yes, regular
+expressions are used as
+
+364
+00:17:53,180 --> 00:17:55,040
+the basic tool for implementing
+
+365
+00:17:55,040 --> 00:17:57,185
+like source bad reg expressions,
+
+366
+00:17:57,185 --> 00:18:00,065
+of course, used in
+a much wider area.
+
+367
+00:18:00,065 --> 00:18:03,770
+And they especially used for
+network intrusion detection.
+
+368
+00:18:03,770 --> 00:18:06,590
+Remember, you having to
+
+369
+00:18:06,590 --> 00:18:10,130
+administer a big network
+and you only want to let
+
+370
+00:18:10,130 --> 00:18:13,640
+in packets which you think are K
+
+371
+00:18:13,640 --> 00:18:14,930
+and you want to keep out
+
+372
+00:18:14,930 --> 00:18:17,645
+any package which might
+hack into your network.
+
+373
+00:18:17,645 --> 00:18:22,670
+So what they have is they
+have suites of thousands and
+
+374
+00:18:22,670 --> 00:18:25,745
+sometimes even more
+regular expressions which
+
+375
+00:18:25,745 --> 00:18:27,755
+check whether this package
+
+376
+00:18:27,755 --> 00:18:30,065
+satisfies some patterns or not.
+
+377
+00:18:30,065 --> 00:18:31,460
+And in this case it will be left
+
+378
+00:18:31,460 --> 00:18:34,205
+out or it will be let in.
+
+379
+00:18:34,205 --> 00:18:36,335
+And with networks,
+
+380
+00:18:36,335 --> 00:18:39,080
+the problem is that our
+hardware is already
+
+381
+00:18:39,080 --> 00:18:43,190
+so fast that the reg expressions
+
+382
+00:18:43,190 --> 00:18:45,169
+really become a bottleneck.
+
+383
+00:18:45,169 --> 00:18:47,060
+Because what do you do if now is
+
+384
+00:18:47,060 --> 00:18:49,880
+suddenly a reg expression
+takes too much time
+
+385
+00:18:49,880 --> 00:18:52,670
+to just stop the matching
+
+386
+00:18:52,670 --> 00:18:55,100
+and let the package
+in regardless?
+
+387
+00:18:55,100 --> 00:18:58,190
+Or do you just hold
+the network up
+
+388
+00:18:58,190 --> 00:19:01,715
+and don't let anything in
+until you decided that.
+
+389
+00:19:01,715 --> 00:19:04,895
+So that's actually a
+really hard problem.
+
+390
+00:19:04,895 --> 00:19:06,650
+But the first time I came across
+
+391
+00:19:06,650 --> 00:19:09,965
+that problem was actually
+by this engineer.
+
+392
+00:19:09,965 --> 00:19:13,820
+And it's always say that
+Germans don't have any Yammer.
+
+393
+00:19:13,820 --> 00:19:16,985
+But I found that
+video quite funny.
+
+394
+00:19:16,985 --> 00:19:19,145
+Maybe you have a
+different opinion,
+
+395
+00:19:19,145 --> 00:19:21,095
+but feel free to
+have a look which
+
+396
+00:19:21,095 --> 00:19:23,705
+explains exactly that problem.
+
+397
+00:19:23,705 --> 00:19:25,610
+So in the next video,
+
+398
+00:19:25,610 --> 00:19:28,445
+we will start to
+implement this matcher.
+
+399
+00:19:28,445 --> 00:19:30,870
+So I hope to see you there.
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/videos/01-intro.srt Wed Sep 23 11:34:43 2020 +0100
@@ -0,0 +1,2405 @@
+1
+00:00:16,280 --> 00:00:18,330
+A warm welcome to
+
+2
+00:00:18,330 --> 00:00:20,820
+the compilers and formal
+languages module?
+
+3
+00:00:20,820 --> 00:00:24,390
+My name is Christian Urban.
+Thank you for coming.
+
+4
+00:00:24,390 --> 00:00:27,644
+Compilers. I guess
+you use compilers
+
+5
+00:00:27,644 --> 00:00:31,680
+in your daily work, be it
+the C or Java compiler.
+
+6
+00:00:31,680 --> 00:00:34,245
+And you might be curious
+in what they do.
+
+7
+00:00:34,245 --> 00:00:35,700
+But you might also be
+
+8
+00:00:35,700 --> 00:00:38,130
+intimidated to look
+what they do underneath
+
+9
+00:00:38,130 --> 00:00:39,900
+the bonnet because they are
+
+10
+00:00:39,900 --> 00:00:42,520
+quite large software systems.
+
+11
+00:00:42,520 --> 00:00:46,130
+What I like to show you in
+this module is that you
+
+12
+00:00:46,130 --> 00:00:49,310
+do not need to be an
+Ãœber hacker to implement your own
+
+13
+00:00:49,310 --> 00:00:51,305
+compiler for your
+own language, say.
+
+14
+00:00:51,305 --> 00:00:54,155
+So that will be the main
+goal of this module.
+
+15
+00:00:54,155 --> 00:00:56,210
+You will implement
+your own compiler,
+
+16
+00:00:56,210 --> 00:00:58,310
+of course with my help.
+
+17
+00:00:58,310 --> 00:01:02,360
+What I personally like
+about compilers is that
+
+18
+00:01:02,360 --> 00:01:04,580
+the subject is a
+perfect combination
+
+19
+00:01:04,580 --> 00:01:06,350
+of theory and practice.
+
+20
+00:01:06,350 --> 00:01:07,790
+I like to hack things,
+
+21
+00:01:07,790 --> 00:01:10,595
+writing actual code,
+but I also like theory.
+
+22
+00:01:10,595 --> 00:01:13,040
+I want to understand what
+my code actually does.
+
+23
+00:01:13,040 --> 00:01:17,040
+So compilers are the
+perfect subject for me.
+
+24
+00:01:18,040 --> 00:01:20,809
+Let's have a look at the details.
+
+25
+00:01:20,809 --> 00:01:23,779
+Here's an airplane
+view of a compiler.
+
+26
+00:01:23,779 --> 00:01:25,850
+On the left-hand side you can see
+
+27
+00:01:25,850 --> 00:01:28,745
+the input program a
+developer would write.
+
+28
+00:01:28,745 --> 00:01:31,955
+And on the right-hand side is
+the output of the compiler,
+
+29
+00:01:31,955 --> 00:01:36,360
+the binary code the developer
+wants to actually run.
+
+30
+00:01:36,370 --> 00:01:40,340
+What makes such a project
+actually feasible is that
+
+31
+00:01:40,340 --> 00:01:44,165
+compilers fall neatly into
+separate components.
+
+32
+00:01:44,165 --> 00:01:47,210
+So you can see the lexer and the
+
+33
+00:01:47,210 --> 00:01:48,860
+parser, which are often called the
+
+34
+00:01:48,860 --> 00:01:50,990
+front end of the compiler.
+
+35
+00:01:50,990 --> 00:01:53,000
+And the code generation is
+
+36
+00:01:53,000 --> 00:01:55,700
+called the backend
+of the compiler.
+
+37
+00:01:55,700 --> 00:01:57,620
+And it so happens
+that we will spend
+
+38
+00:01:57,620 --> 00:01:59,930
+the first five weeks
+on the lexer and
+
+39
+00:01:59,930 --> 00:02:04,970
+the parser, and the next five
+weeks on the code generator.
+
+40
+00:02:04,970 --> 00:02:09,575
+The first component of the
+compiler is the lexer.
+
+41
+00:02:09,575 --> 00:02:14,480
+The purpose of the lexer
+is to scan a flat
+
+42
+00:02:14,480 --> 00:02:16,610
+string, which the
+input program is at
+
+43
+00:02:16,610 --> 00:02:19,145
+the beginning and separate out
+
+44
+00:02:19,145 --> 00:02:21,275
+where are the words?
+
+45
+00:02:21,275 --> 00:02:23,600
+You might think
+in Western languages,
+
+46
+00:02:23,600 --> 00:02:24,920
+that is very easy:
+
+47
+00:02:24,920 --> 00:02:26,690
+you just try to
+find out where are
+
+48
+00:02:26,690 --> 00:02:28,580
+the whitespaces
+and then you know,
+
+49
+00:02:28,580 --> 00:02:31,955
+where one word stops and
+where the next word begins.
+
+50
+00:02:31,955 --> 00:02:35,300
+But, actually, computer
+languages are more
+
+51
+00:02:35,300 --> 00:02:38,180
+like ancient languages
+that you see here.
+
+52
+00:02:38,180 --> 00:02:39,860
+For example, ancient Greek.
+
+53
+00:02:39,860 --> 00:02:42,065
+The writer wrote one word
+
+54
+00:02:42,065 --> 00:02:44,915
+after the next and not
+leaving any space.
+
+55
+00:02:44,915 --> 00:02:47,930
+So for example, in this
+very simple string here
+
+56
+00:02:47,930 --> 00:02:50,960
+read n, there is no space at all.
+
+57
+00:02:50,960 --> 00:02:53,540
+But the purpose of
+the lexer is to
+
+58
+00:02:53,540 --> 00:02:56,570
+separate it into five
+different components.
+
+59
+00:02:56,570 --> 00:02:59,450
+It has to first say,
+well there is a read,
+
+60
+00:02:59,450 --> 00:03:01,595
+then there is a left parenthesis,
+
+61
+00:03:01,595 --> 00:03:04,025
+then there's an
+identifier called n,
+
+62
+00:03:04,025 --> 00:03:07,715
+then there's a right parenthesis,
+and finally a semicolon.
+
+63
+00:03:07,715 --> 00:03:11,345
+And as you can see, there
+is no space here at all.
+
+64
+00:03:11,345 --> 00:03:14,150
+And so the task of finding where
+
+65
+00:03:14,150 --> 00:03:17,705
+are the words is actually
+quite involved.
+
+66
+00:03:17,705 --> 00:03:19,460
+Also the classification is
+
+67
+00:03:19,460 --> 00:03:22,055
+sometimes not so straightforward.
+
+68
+00:03:22,055 --> 00:03:24,350
+If, for example, a writer
+
+69
+00:03:24,350 --> 00:03:26,360
+wrote "if" on its own,
+
+70
+00:03:26,360 --> 00:03:29,000
+then this should be a
+keyword classified as
+
+71
+00:03:29,000 --> 00:03:32,615
+a keyword because it's
+from the if-then-else.
+
+72
+00:03:32,615 --> 00:03:36,800
+But if the developer wrote
+something longer like "iffoo",
+
+73
+00:03:36,800 --> 00:03:38,030
+then this might just be
+
+74
+00:03:38,030 --> 00:03:39,860
+a strange variable name
+and he needs to be
+
+75
+00:03:39,860 --> 00:03:44,250
+classified as a variable name.
+
+76
+00:03:45,220 --> 00:03:50,720
+The output of the lexer
+is a sequence of tokens.
+
+77
+00:03:50,720 --> 00:03:53,480
+These are essentially the words in
+
+78
+00:03:53,480 --> 00:03:56,405
+a sentence and their
+classification:
+
+79
+00:03:56,405 --> 00:03:59,540
+they are nouns,
+verbs or adjectives.
+
+80
+00:03:59,540 --> 00:04:02,885
+For us, of course, the
+classification would be keywords,
+
+81
+00:04:02,885 --> 00:04:06,005
+identifiers,
+operators, and so on.
+
+82
+00:04:06,005 --> 00:04:11,480
+And these tokens are the
+input for the parser,
+
+83
+00:04:11,480 --> 00:04:13,085
+the next component in
+
+84
+00:04:13,085 --> 00:04:17,240
+our compiler. The parser
+essentially takes this list
+
+85
+00:04:17,240 --> 00:04:23,300
+of tokens and transforms it
+into a abstract syntax tree.
+
+86
+00:04:23,300 --> 00:04:27,230
+That means we have now the
+sentence, we have the words.
+
+87
+00:04:27,230 --> 00:04:30,275
+We have to relate
+the words to each other
+
+88
+00:04:30,275 --> 00:04:33,680
+in order to find out what
+this sentence actually means.
+
+89
+00:04:33,680 --> 00:04:35,405
+In this case here,
+
+90
+00:04:35,405 --> 00:04:38,595
+we have to do a read...which
+variable is affected?
+
+91
+00:04:38,595 --> 00:04:45,040
+The variable n. Once we have
+the abstract syntax tree,
+
+92
+00:04:45,040 --> 00:04:49,225
+it can go to the next component,
+to the code generator.
+
+93
+00:04:49,225 --> 00:04:52,480
+Whilst it doesn't look
+like this in this picture,
+
+94
+00:04:52,480 --> 00:04:54,100
+the code generators is usually the
+
+95
+00:04:54,100 --> 00:04:56,080
+biggest part in a compiler.
+
+96
+00:04:56,080 --> 00:04:58,720
+And here we actually
+have to cut corners.
+
+97
+00:04:58,720 --> 00:05:02,470
+Instead of producing binary
+code, which can be run
+
+98
+00:05:02,470 --> 00:05:06,820
+directly on a CPU, like X86 or ARM,
+
+99
+00:05:06,820 --> 00:05:11,035
+we actually target the
+Java Virtual Machine, the JVM.
+
+100
+00:05:11,035 --> 00:05:13,600
+This is very similar
+to the Scala compiler,
+
+101
+00:05:13,600 --> 00:05:16,480
+for example, which
+produces JVM code.
+
+102
+00:05:16,480 --> 00:05:18,940
+Or, of course, also
+the Java compiler,
+
+103
+00:05:18,940 --> 00:05:20,845
+which produces JVM code.
+
+104
+00:05:20,845 --> 00:05:23,900
+So here's a typical
+example code which we're
+
+105
+00:05:23,900 --> 00:05:27,305
+going to produce: Something
+like variable 2
+
+106
+00:05:27,305 --> 00:05:30,545
+gets allocated in the stack.
+
+107
+00:05:30,545 --> 00:05:32,390
+You subtract ten from it.
+
+108
+00:05:32,390 --> 00:05:36,050
+You test whether the
+variable is 0 and if yes,
+
+109
+00:05:36,050 --> 00:05:40,170
+you jump somewhere and otherwise
+you reload the variable.
+
+110
+00:05:41,560 --> 00:05:45,935
+Even though we cut corners
+into could generate apart.
+
+111
+00:05:45,935 --> 00:05:48,575
+By producing code for the JVM,
+
+112
+00:05:48,575 --> 00:05:51,710
+we can still obtain quite
+impressive results.
+
+113
+00:05:51,710 --> 00:05:56,225
+Here's a program with
+a cubic runtime behaviour.
+
+114
+00:05:56,225 --> 00:05:59,525
+When it has to
+calculate with 400 units,
+
+115
+00:05:59,525 --> 00:06:02,165
+it might need five
+seconds to calculate.
+
+116
+00:06:02,165 --> 00:06:05,345
+But if it has to
+calculate with 1200 units,
+
+117
+00:06:05,345 --> 00:06:08,165
+it already needs more
+than two minutes.
+
+118
+00:06:08,165 --> 00:06:10,940
+Now these timings,
+the blue timings
+
+119
+00:06:10,940 --> 00:06:13,910
+are obtained with an
+interpreter of that program.
+
+120
+00:06:13,910 --> 00:06:16,265
+And you can see
+just next to it,
+
+121
+00:06:16,265 --> 00:06:18,050
+the red times.
+
+122
+00:06:18,050 --> 00:06:20,359
+They are obtained
+with the compiler
+
+123
+00:06:20,359 --> 00:06:23,135
+we're going to write
+in this module.
+
+124
+00:06:23,135 --> 00:06:25,100
+And there you can see, even
+
+125
+00:06:25,100 --> 00:06:29,000
+with 1200, the times get
+hardly off the x-axis.
+
+126
+00:06:29,000 --> 00:06:30,485
+So in this instance,
+
+127
+00:06:30,485 --> 00:06:32,630
+our compiler will
+have a speed up from
+
+128
+00:06:32,630 --> 00:06:37,589
+approximately 1 million in
+comparison to the interpreter.
+
+129
+00:06:38,350 --> 00:06:42,020
+This might be a fun task
+for your spare time.
+
+130
+00:06:42,020 --> 00:06:44,480
+This is a compiler explorer,
+which allows you to
+
+131
+00:06:44,480 --> 00:06:47,060
+write C code on the
+left-hand side.
+
+132
+00:06:47,060 --> 00:06:49,115
+It shows you on the
+right-hand side
+
+133
+00:06:49,115 --> 00:06:53,270
+the machine code an
+X86 CPU can run.
+
+134
+00:06:53,270 --> 00:06:56,060
+For example, it gives
+you these color scheme and
+
+135
+00:06:56,060 --> 00:06:59,000
+says that this addition
+in the num + num in
+
+136
+00:06:59,000 --> 00:07:01,580
+the return statement,
+translates to
+
+137
+00:07:01,580 --> 00:07:02,960
+essentially an addition of
+
+138
+00:07:02,960 --> 00:07:05,930
+an register in the machine code.
+
+139
+00:07:05,930 --> 00:07:08,480
+I think this compiler
+explorer also works for
+
+140
+00:07:08,480 --> 00:07:11,495
+the Haskell language and
+also produces ARM code,
+
+141
+00:07:11,495 --> 00:07:15,245
+not just code for the X86 CPUs.
+
+142
+00:07:15,245 --> 00:07:18,950
+As an aside, I also recommend
+to watch the movie of
+
+143
+00:07:18,950 --> 00:07:20,870
+this guy because that is
+
+144
+00:07:20,870 --> 00:07:22,940
+very much like how I worked
+at the beginning
+
+145
+00:07:22,940 --> 00:07:25,355
+when implementing our compiler.
+
+146
+00:07:25,355 --> 00:07:29,300
+I wrote some code which I
+knew how it should behave.
+
+147
+00:07:29,300 --> 00:07:30,725
+And then I just had a look
+
+148
+00:07:30,725 --> 00:07:32,900
+what another compiler produces.
+
+149
+00:07:32,900 --> 00:07:37,190
+And I imitated that code
+as much as I could.
+
+150
+00:07:37,190 --> 00:07:39,380
+Such a compiler explorer
+
+151
+00:07:39,380 --> 00:07:41,375
+also exists for
+the Java language.
+
+152
+00:07:41,375 --> 00:07:42,935
+Here's one where you can write
+
+153
+00:07:42,935 --> 00:07:44,915
+Java code on the left-hand side,
+
+154
+00:07:44,915 --> 00:07:47,930
+and on the right-hand
+side you get JVM code.
+
+155
+00:07:47,930 --> 00:07:50,255
+JVM code is a byte code
+
+156
+00:07:50,255 --> 00:07:53,405
+which cannot be run
+directly by the CPU,
+
+157
+00:07:53,405 --> 00:07:55,220
+but needs the Java
+Virtual Machine
+
+158
+00:07:55,220 --> 00:07:57,170
+to essentially interpret that.
+
+159
+00:07:57,170 --> 00:07:58,760
+This means it's not
+
+160
+00:07:58,760 --> 00:08:01,235
+the most efficient way
+how to run programs -
+
+161
+00:08:01,235 --> 00:08:02,780
+it would be much faster to
+
+162
+00:08:02,780 --> 00:08:05,285
+generate direct
+code for the CPU.
+
+163
+00:08:05,285 --> 00:08:07,700
+But by producing bytecode,
+
+164
+00:08:07,700 --> 00:08:11,435
+we still run the
+programs quite fast
+
+165
+00:08:11,435 --> 00:08:13,520
+and it also simplifies
+
+166
+00:08:13,520 --> 00:08:16,280
+a number of issues
+in our compiler.
+
+167
+00:08:16,280 --> 00:08:18,980
+One issue is about
+memory management.
+
+168
+00:08:18,980 --> 00:08:22,055
+We don't have to be concerned
+about register allocation,
+
+169
+00:08:22,055 --> 00:08:25,505
+which we would need
+to do in a real CPU.
+
+170
+00:08:25,505 --> 00:08:27,680
+This will be done by the JVM.
+
+171
+00:08:27,680 --> 00:08:29,750
+It's also much easier to produce
+
+172
+00:08:29,750 --> 00:08:33,635
+this bytecode rather than
+actual machine code.
+
+173
+00:08:33,635 --> 00:08:37,385
+I think it's now a good time
+to come to the question,
+
+174
+00:08:37,385 --> 00:08:39,950
+why on Earth studying compilers.
+
+175
+00:08:39,950 --> 00:08:42,650
+Compilers are such an
+established subject
+
+176
+00:08:42,650 --> 00:08:43,985
+in computer science.
+
+177
+00:08:43,985 --> 00:08:46,100
+Compilers do what they do.
+
+178
+00:08:46,100 --> 00:08:48,410
+Probably forrests have
+been killed by
+
+179
+00:08:48,410 --> 00:08:52,190
+all the books that have been
+published on compilers.
+
+180
+00:08:52,190 --> 00:08:56,690
+Why on Earth studying
+compilers in 2020? And even worse
+
+181
+00:08:56,690 --> 00:08:59,659
+why implementing
+your own compiler?
+
+182
+00:08:59,659 --> 00:09:02,720
+Well, a slightly humorous take on
+
+183
+00:09:02,720 --> 00:09:05,105
+that question is
+given by John Regehr,
+
+184
+00:09:05,105 --> 00:09:08,375
+who is a compiler hacker
+from the University of Utah.
+
+185
+00:09:08,375 --> 00:09:09,770
+Essentially what he says,
+
+186
+00:09:09,770 --> 00:09:12,110
+if you're a good compiler hacker,
+
+187
+00:09:12,110 --> 00:09:14,690
+you have no problems
+of finding a job.
+
+188
+00:09:14,690 --> 00:09:17,210
+He puts it as: It's effectively
+
+189
+00:09:17,210 --> 00:09:22,320
+a "Perpetual Employment Act"
+for solid compiler hackers.
+
+190
+00:09:22,990 --> 00:09:27,380
+Regehr gives two justifications
+for that statement.
+
+191
+00:09:27,380 --> 00:09:29,690
+First, he says
+hardware is getting
+
+192
+00:09:29,690 --> 00:09:32,585
+wierder, rather than
+getting clocked faster.
+
+193
+00:09:32,585 --> 00:09:34,520
+And that's definitely true.
+
+194
+00:09:34,520 --> 00:09:36,590
+My first computer many, many,
+
+195
+00:09:36,590 --> 00:09:40,040
+many years ago contained
+only a single core CPU.
+
+196
+00:09:40,040 --> 00:09:44,030
+And it was such a simple CPU
+that we wrote machine code
+
+197
+00:09:44,030 --> 00:09:46,220
+directly for that CPU in order to
+
+198
+00:09:46,220 --> 00:09:49,740
+run our programs as
+fast as possible.
+
+199
+00:09:50,260 --> 00:09:53,600
+In contrast, today, Regehr writes,
+
+200
+00:09:53,600 --> 00:09:57,005
+almost all processors are
+multi-core nowadays.
+
+201
+00:09:57,005 --> 00:09:59,870
+And it looks like there's
+an increasing asymmetry
+
+202
+00:09:59,870 --> 00:10:02,015
+in resources across cores.
+
+203
+00:10:02,015 --> 00:10:04,445
+Processors come
+with vector units,
+
+204
+00:10:04,445 --> 00:10:07,189
+crypto accelerators, etc.
+
+205
+00:10:07,189 --> 00:10:11,930
+We have TSPs, GPUs, ARM
+big,little, Xeon Phis,
+
+206
+00:10:11,930 --> 00:10:14,630
+and this only scratches the surface.
+
+207
+00:10:14,630 --> 00:10:17,255
+And that is really a
+problem for compilers,
+
+208
+00:10:17,255 --> 00:10:20,495
+because if we now
+have multi-core CPUs,
+
+209
+00:10:20,495 --> 00:10:23,180
+that means our programs need
+
+210
+00:10:23,180 --> 00:10:26,075
+to be scheduled
+over several CPUs.
+
+211
+00:10:26,075 --> 00:10:28,220
+But the developer, of course,
+
+212
+00:10:28,220 --> 00:10:30,545
+doesn't want to know
+anything about that.
+
+213
+00:10:30,545 --> 00:10:34,655
+Also, now we have more
+CPUs in a computer,
+
+214
+00:10:34,655 --> 00:10:37,400
+but they seem to also come
+with different resources.
+
+215
+00:10:37,400 --> 00:10:40,310
+So certain tasks in
+a program need to
+
+216
+00:10:40,310 --> 00:10:43,460
+be scheduled on some
+cores, but not on others.
+
+217
+00:10:43,460 --> 00:10:46,685
+We also have for a
+long time already GPUs,
+
+218
+00:10:46,685 --> 00:10:49,025
+which are highly specialized for
+
+219
+00:10:49,025 --> 00:10:53,240
+very parallel computations
+to do with graphics.
+
+220
+00:10:53,240 --> 00:10:56,015
+They at least in
+the past few years,
+
+221
+00:10:56,015 --> 00:10:59,360
+they could only do very
+special computations,
+
+222
+00:10:59,360 --> 00:11:02,255
+but very, very fast
+and highly parallel.
+
+223
+00:11:02,255 --> 00:11:05,539
+You couldn't use them for
+general-purpose calculations,
+
+224
+00:11:05,539 --> 00:11:10,205
+which you would usually
+use CPUs. Similarly, DSPs.
+
+225
+00:11:10,205 --> 00:11:14,075
+They are needed for all
+the signal processing
+
+226
+00:11:14,075 --> 00:11:16,040
+in mobile phones.
+
+227
+00:11:16,040 --> 00:11:20,780
+Without them, we just
+wouldn't have mobile phones.
+
+228
+00:11:20,780 --> 00:11:23,525
+The second reason Regehr gives is
+
+229
+00:11:23,525 --> 00:11:25,550
+that we are getting tired of
+
+230
+00:11:25,550 --> 00:11:27,620
+low-level languages and
+
+231
+00:11:27,620 --> 00:11:30,335
+their associated
+security disasters.
+
+232
+00:11:30,335 --> 00:11:32,435
+While at the beginning
+we were still
+
+233
+00:11:32,435 --> 00:11:34,760
+happy to write machine
+code directly,
+
+234
+00:11:34,760 --> 00:11:37,175
+nobody wants to do this nowadays.
+
+235
+00:11:37,175 --> 00:11:39,515
+He writes: :We want
+to write new code
+
+236
+00:11:39,515 --> 00:11:44,120
+to whatever extent possible in
+safer high-level languages.
+
+237
+00:11:44,120 --> 00:11:46,130
+Compilers are caught right
+
+238
+00:11:46,130 --> 00:11:48,365
+in the middle of these
+opposing trends:
+
+239
+00:11:48,365 --> 00:11:50,765
+one of their main jobs
+is to have bridged
+
+240
+00:11:50,765 --> 00:11:53,240
+the large and growing gap between
+
+241
+00:11:53,240 --> 00:11:55,460
+increasingly high-level languages
+
+242
+00:11:55,460 --> 00:11:58,565
+and increasingly
+wacky platforms.
+
+243
+00:11:58,565 --> 00:12:00,320
+So here you have it:
+
+244
+00:12:00,320 --> 00:12:02,750
+It's still interesting to study
+
+245
+00:12:02,750 --> 00:12:06,244
+compilers nowadays,
+especially nowadays.
+
+246
+00:12:06,244 --> 00:12:09,875
+Well, if you want to work
+on interesting problems,
+
+247
+00:12:09,875 --> 00:12:14,570
+then very often you have to
+know compilers. Here's one:
+
+248
+00:12:14,570 --> 00:12:16,940
+In the good old
+times when we were
+
+249
+00:12:16,940 --> 00:12:19,310
+still able to fly on holidays,
+
+250
+00:12:19,310 --> 00:12:23,300
+I'm sure you flew in
+a 777 Boeing airplane.
+
+251
+00:12:23,300 --> 00:12:25,850
+It's actually already
+a quite old airplane.
+
+252
+00:12:25,850 --> 00:12:28,295
+But they had the
+following problem.
+
+253
+00:12:28,295 --> 00:12:33,020
+What happens if a CPU
+calculates the wrong result?
+
+254
+00:12:33,020 --> 00:12:36,065
+And that's actually not
+such a hypothetical problem
+
+255
+00:12:36,065 --> 00:12:40,295
+because you remember
+the Pentium CPU
+
+256
+00:12:40,295 --> 00:12:43,655
+in around 2000
+contained a bug and it cost
+
+257
+00:12:43,655 --> 00:12:48,140
+a lot of money for Intel
+to replace this CPU.
+
+258
+00:12:48,140 --> 00:12:51,470
+What happens if an CPU calculates
+
+259
+00:12:51,470 --> 00:12:56,105
+the wrong result in
+some navigation data?
+
+260
+00:12:56,105 --> 00:12:58,520
+Do you just let the
+airplane crash?
+
+261
+00:12:58,520 --> 00:13:00,650
+Well, the engineers at Boeing
+
+262
+00:13:00,650 --> 00:13:02,675
+came up with the
+following solution:
+
+263
+00:13:02,675 --> 00:13:05,055
+They writing one program that
+
+264
+00:13:05,055 --> 00:13:08,690
+essentially controls
+how their airplane
+
+265
+00:13:08,690 --> 00:13:10,295
+is supposed to fly.
+
+266
+00:13:10,295 --> 00:13:13,040
+In a programming language
+called Ada that is
+
+267
+00:13:13,040 --> 00:13:15,770
+slightly obscure
+programming language
+
+268
+00:13:15,770 --> 00:13:18,650
+bat is very well-known in
+
+269
+00:13:18,650 --> 00:13:22,040
+areas where safety
+is really paramount.
+
+270
+00:13:22,040 --> 00:13:25,010
+And what they did is they
+took this one Ada program
+
+271
+00:13:25,010 --> 00:13:28,010
+and they essentially
+took three compilers,
+
+272
+00:13:28,010 --> 00:13:29,435
+independent compilers,
+
+273
+00:13:29,435 --> 00:13:32,045
+which compiled the program
+to different CPUs.
+
+274
+00:13:32,045 --> 00:13:33,815
+One CPU from Intel,
+
+275
+00:13:33,815 --> 00:13:37,235
+one CPU for Motorola,
+and one from AMD.
+
+276
+00:13:37,235 --> 00:13:38,930
+And these are quite different
+
+277
+00:13:38,930 --> 00:13:40,955
+CPUs. Also some of them
+
+278
+00:13:40,955 --> 00:13:42,755
+have quite different philosophies
+
+279
+00:13:42,755 --> 00:13:45,245
+on how they do their calculations.
+
+280
+00:13:45,245 --> 00:13:47,330
+Now what they could do is, they
+
+281
+00:13:47,330 --> 00:13:50,690
+could put these three computers
+on a single board and
+
+282
+00:13:50,690 --> 00:13:54,335
+could now run all their
+calculations in parallel.
+
+283
+00:13:54,335 --> 00:13:56,690
+One with an Intel
+CPU, one with
+
+284
+00:13:56,690 --> 00:14:00,245
+Motorola, and one with a
+Risc computers say.
+
+285
+00:14:00,245 --> 00:14:02,795
+And then they could
+compare the results.
+
+286
+00:14:02,795 --> 00:14:05,030
+And if these results differed,
+
+287
+00:14:05,030 --> 00:14:07,940
+then they knew one CPU must have
+
+288
+00:14:07,940 --> 00:14:11,600
+calculated the wrong result
+and probably told the pilot,
+
+289
+00:14:11,600 --> 00:14:14,850
+please don't let
+that airplane crash.
+
+290
+00:14:14,950 --> 00:14:17,270
+Not just Boeing is doing
+
+291
+00:14:17,270 --> 00:14:19,355
+interesting things
+with compilers.
+
+292
+00:14:19,355 --> 00:14:22,640
+Also, Airbus in a completely
+different setting
+
+293
+00:14:22,640 --> 00:14:24,260
+is using compilers in
+
+294
+00:14:24,260 --> 00:14:26,420
+a interesting way to get
+
+295
+00:14:26,420 --> 00:14:30,195
+their airplanes up in the
+air and let them not crush.
+
+296
+00:14:30,195 --> 00:14:33,010
+In another example, I
+have friends working
+
+297
+00:14:33,010 --> 00:14:35,350
+at Facebook who work on
+
+298
+00:14:35,350 --> 00:14:37,900
+compilers to make sense
+are they are heaps
+
+299
+00:14:37,900 --> 00:14:41,470
+and heaps and heaps of
+code written in PHP,
+
+300
+00:14:41,470 --> 00:14:42,880
+which is one of the most
+
+301
+00:14:42,880 --> 00:14:44,920
+horrible languages on the planet.
+
+302
+00:14:44,920 --> 00:14:46,630
+The bottom line is,
+
+303
+00:14:46,630 --> 00:14:50,499
+compilers are still very,
+very important nowadays.
+
+304
+00:14:50,499 --> 00:14:52,810
+And for the foreseeable future,
+
+305
+00:14:52,810 --> 00:14:56,150
+compilers still need
+to be developed.
+
+306
+00:14:57,270 --> 00:15:00,235
+When one talks about
+compilers then
+
+307
+00:15:00,235 --> 00:15:01,630
+there is, of course,
+
+308
+00:15:01,630 --> 00:15:05,395
+a magic element involved. They're
+large software systems.
+
+309
+00:15:05,395 --> 00:15:07,750
+And yes, they're supposed to
+
+310
+00:15:07,750 --> 00:15:10,525
+generate a runnable binary for
+
+311
+00:15:10,525 --> 00:15:12,105
+a high-level program,
+
+312
+00:15:12,105 --> 00:15:15,230
+but they are also supposed
+to make our programs better.
+
+313
+00:15:15,230 --> 00:15:18,920
+They optimize our
+programs to run faster.
+
+314
+00:15:18,920 --> 00:15:22,610
+And there's a lot of
+"magic" involved in that.
+
+315
+00:15:22,610 --> 00:15:24,890
+Magic in inverted quotes.
+
+316
+00:15:24,890 --> 00:15:26,480
+I would like to show you one of
+
+317
+00:15:26,480 --> 00:15:29,340
+this magic in one example.
+
+318
+00:15:31,090 --> 00:15:33,110
+I hope you still have
+
+319
+00:15:33,110 --> 00:15:36,485
+fond memories of the
+PEP module last year.
+
+320
+00:15:36,485 --> 00:15:40,280
+And you might remember BF
+language, we had to look at.
+
+321
+00:15:40,280 --> 00:15:43,670
+This BF language contains
+the kind of memory and
+
+322
+00:15:43,670 --> 00:15:45,680
+the memory pointer which can
+
+323
+00:15:45,680 --> 00:15:48,140
+be moved to the left
+or to the right,
+
+324
+00:15:48,140 --> 00:15:50,270
+where an integer in
+
+325
+00:15:50,270 --> 00:15:52,925
+the memory can be either
+increased or decreased.
+
+326
+00:15:52,925 --> 00:15:55,730
+We can print
+out the content of
+
+327
+00:15:55,730 --> 00:15:59,645
+the current cell or input
+something into a cell.
+
+328
+00:15:59,645 --> 00:16:01,850
+And we have loops, and
+
+329
+00:16:01,850 --> 00:16:04,220
+everything else is
+considered a comment.
+
+330
+00:16:04,220 --> 00:16:06,290
+What's good about
+this BF language is that
+
+331
+00:16:06,290 --> 00:16:08,180
+we don't even need a front end.
+
+332
+00:16:08,180 --> 00:16:09,920
+We can immediately start writing
+
+333
+00:16:09,920 --> 00:16:13,529
+an interpreter or a compiler.
+
+334
+00:16:15,850 --> 00:16:18,155
+Okay, I have here
+
+335
+00:16:18,155 --> 00:16:20,600
+a very straightforward
+interpreter for
+
+336
+00:16:20,600 --> 00:16:22,865
+the BF language. You
+might recognize it.
+
+337
+00:16:22,865 --> 00:16:27,120
+And I run it with a
+benchmark program, which
+
+338
+00:16:27,760 --> 00:16:30,960
+might also recognize.
+
+339
+00:16:31,560 --> 00:16:36,835
+And this will now take
+approximately ten minutes.
+
+340
+00:16:36,835 --> 00:16:39,920
+So see you in a bit.
+
+341
+00:16:40,710 --> 00:16:43,660
+Okay, this has finished now.
+
+342
+00:16:43,660 --> 00:16:45,925
+Almost took 11 minutes.
+
+343
+00:16:45,925 --> 00:16:47,410
+The question now is,
+
+344
+00:16:47,410 --> 00:16:51,820
+can we to better?
+
+345
+00:16:51,820 --> 00:16:53,260
+Actually, it is not difficult
+to do better
+
+346
+00:16:53,260 --> 00:16:54,970
+than this simple-minded
+interpreter.
+
+347
+00:16:54,970 --> 00:16:58,690
+It is relatively
+straightforward to take
+
+348
+00:16:58,690 --> 00:17:03,520
+a BF program and generate equivalent C code.
+
+349
+00:17:03,520 --> 00:17:06,520
+This can be easily
+done by somehow
+
+350
+00:17:06,520 --> 00:17:09,490
+representing the BF memory in
+
+351
+00:17:09,490 --> 00:17:12,310
+C. We can do this
+by just an array of
+
+352
+00:17:12,310 --> 00:17:15,380
+characters and a memory pointer,
+
+353
+00:17:15,380 --> 00:17:19,265
+which points, in this case
+in the middle of this array.
+
+354
+00:17:19,265 --> 00:17:21,800
+Then it's very easy to
+
+355
+00:17:21,800 --> 00:17:24,275
+translate the movement
+of the
+
+356
+00:17:24,275 --> 00:17:28,610
+BF memory pointer into increments
+
+357
+00:17:28,610 --> 00:17:31,520
+and decrements of
+the C pointer.
+
+358
+00:17:31,520 --> 00:17:33,050
+Similarly, if you want to increment or
+
+359
+00:17:33,050 --> 00:17:35,915
+decrement an element
+in this array,
+
+360
+00:17:35,915 --> 00:17:38,975
+you just have to look up what
+the pointer is and increment
+
+361
+00:17:38,975 --> 00:17:42,140
+and decrement what's
+under the pointer. Similarly
+
+362
+00:17:42,140 --> 00:17:43,790
+we can print out something from
+
+363
+00:17:43,790 --> 00:17:47,450
+this array and we can put
+something into this array.
+
+364
+00:17:47,450 --> 00:17:49,610
+What is great is that the loops
+
+365
+00:17:49,610 --> 00:17:51,530
+from the BF language directly
+
+366
+00:17:51,530 --> 00:17:55,580
+translate into while
+loops of the C language.
+
+367
+00:17:55,580 --> 00:17:58,100
+We essentially have to check is
+
+368
+00:17:58,100 --> 00:18:02,450
+the memory pointer pointing
+to a field that is 0.
+
+369
+00:18:02,450 --> 00:18:03,950
+Then we exit the loop,
+
+370
+00:18:03,950 --> 00:18:05,719
+and otherwise we continue.
+
+371
+00:18:05,719 --> 00:18:09,575
+And that can be done
+nicely in C, like so.
+
+372
+00:18:09,575 --> 00:18:12,995
+And everything else is
+again, just a comment.
+
+373
+00:18:12,995 --> 00:18:16,445
+So I have implemented
+this translation for you.
+
+374
+00:18:16,445 --> 00:18:19,700
+Remember this was the BF
+program for the Mandelbrot Set.
+
+375
+00:18:19,700 --> 00:18:23,690
+The equivalent C code
+would look like this.
+
+376
+00:18:23,690 --> 00:18:27,110
+So you can see at the beginning
+is this generation of
+
+377
+00:18:27,110 --> 00:18:30,140
+the BF memory represented
+
+378
+00:18:30,140 --> 00:18:32,345
+as an array and the
+memory pointer.
+
+379
+00:18:32,345 --> 00:18:34,550
+And then inside there are lots
+
+380
+00:18:34,550 --> 00:18:36,770
+of increments and decrements
+
+381
+00:18:36,770 --> 00:18:41,915
+of pointers and also
+contents of this array.
+
+382
+00:18:41,915 --> 00:18:45,199
+Now fingers crossed that this
+is the correct translation.
+
+383
+00:18:45,199 --> 00:18:48,125
+And I can also run this
+
+384
+00:18:48,125 --> 00:18:52,805
+and you should see that it runs
+substantially faster.
+
+385
+00:18:52,805 --> 00:18:55,880
+I'm using now my GCC on
+
+386
+00:18:55,880 --> 00:19:01,040
+my computer to generate an
+executable for the C program.
+
+387
+00:19:01,040 --> 00:19:04,295
+And it should run for
+approximately 20 seconds.
+
+388
+00:19:04,295 --> 00:19:07,620
+So let's just wait for that.
+
+389
+00:19:11,430 --> 00:19:14,950
+Okay. What is important to note
+
+390
+00:19:14,950 --> 00:19:19,885
+here is that I'm running GCC,
+
+391
+00:19:19,885 --> 00:19:22,450
+this the option -O0.
+
+392
+00:19:22,450 --> 00:19:25,600
+That means I tell GCC to
+
+393
+00:19:25,600 --> 00:19:28,840
+generate a binary which I
+can run as you can see.
+
+394
+00:19:28,840 --> 00:19:31,810
+But don't apply any optimization.
+
+395
+00:19:31,810 --> 00:19:38,395
+Keep this in mind.
+As a second try,
+
+396
+00:19:38,395 --> 00:19:42,595
+of course, I can try to now
+generate a better C program.
+
+397
+00:19:42,595 --> 00:19:46,060
+And as you'll remember from
+the PEP course, it can,
+
+398
+00:19:46,060 --> 00:19:50,095
+for example, combine
+several steps
+
+399
+00:19:50,095 --> 00:19:51,670
+going to the right of the memory
+
+400
+00:19:51,670 --> 00:19:53,310
+pointer or to the left.
+
+401
+00:19:53,310 --> 00:19:55,280
+We can combine that into
+
+402
+00:19:55,280 --> 00:19:58,760
+one single increment or
+decrement of not just one,
+
+403
+00:19:58,760 --> 00:20:00,050
+but of like n,
+
+404
+00:20:00,050 --> 00:20:02,570
+where n is greater
+than 1. Similarly,
+
+405
+00:20:02,570 --> 00:20:06,980
+if we increment or decrement
+the content of this array,
+
+406
+00:20:06,980 --> 00:20:09,740
+we can do this in one
+big step by incrementing
+
+407
+00:20:09,740 --> 00:20:12,710
+that by not just one
+and increment by one,
+
+408
+00:20:12,710 --> 00:20:15,635
+but increment and decrement
+by bigger numbers.
+
+409
+00:20:15,635 --> 00:20:18,870
+Everything else stays the same.
+
+410
+00:20:20,830 --> 00:20:23,270
+Again, I have implemented that
+
+411
+00:20:23,270 --> 00:20:24,950
+for you and you can see now
+
+412
+00:20:24,950 --> 00:20:26,835
+the C program moves
+
+413
+00:20:26,835 --> 00:20:30,455
+the memory pointer and bigger
+chunks and also increases,
+
+414
+00:20:30,455 --> 00:20:32,810
+for example, here, memory content
+
+415
+00:20:32,810 --> 00:20:35,555
+by 15 than just by 1.
+
+416
+00:20:35,555 --> 00:20:38,520
+Now let's run this program.
+
+417
+00:20:38,530 --> 00:20:40,760
+Again, I use GCC
+
+418
+00:20:40,760 --> 00:20:45,350
+to compile the
+C program and run it.
+
+419
+00:20:45,350 --> 00:20:49,025
+And again, I made sure
+that it only runs this with
+
+420
+00:20:49,025 --> 00:20:51,530
+no optimizations switched on.
+
+421
+00:20:51,530 --> 00:20:54,050
+So there's minus O0.
+
+422
+00:20:54,050 --> 00:20:56,090
+And you can see
+it's now down from
+
+423
+00:20:56,090 --> 00:20:59,990
+20 seconds to just 6 seconds.
+
+424
+00:20:59,990 --> 00:21:06,065
+I show you, the GCC is
+called with -O0.
+
+425
+00:21:06,065 --> 00:21:08,990
+So this reduction
+in time is purely
+
+426
+00:21:08,990 --> 00:21:12,755
+because I produced better C code,
+
+427
+00:21:12,755 --> 00:21:17,220
+which the compiler then just
+transformed into a binary.
+
+428
+00:21:18,910 --> 00:21:22,055
+So far there's
+nothing interesting.
+
+429
+00:21:22,055 --> 00:21:25,385
+We used in the first instance
+
+430
+00:21:25,385 --> 00:21:29,240
+single increments and use GCC with
+
+431
+00:21:29,240 --> 00:21:31,700
+O0 to not introduce
+
+432
+00:21:31,700 --> 00:21:35,255
+any optimizations and run
+essentially 20 seconds.
+
+433
+00:21:35,255 --> 00:21:37,880
+If we then increment
+
+434
+00:21:37,880 --> 00:21:40,895
+in bigger chunks or
+decrement in bigger chunks,
+
+435
+00:21:40,895 --> 00:21:45,380
+use again GCC with -O0,
+
+436
+00:21:45,380 --> 00:21:50,030
+then it runs in approximately
+five to six seconds.
+
+437
+00:21:50,030 --> 00:21:51,950
+Now let me do the following:
+
+438
+00:21:51,950 --> 00:21:55,430
+I take the first program which
+
+439
+00:21:55,430 --> 00:21:58,070
+increments everything
+in a single steps
+
+440
+00:21:58,070 --> 00:22:00,185
+or decremented in single steps.
+
+441
+00:22:00,185 --> 00:22:04,835
+But now I use the full
+capacity of the GCC compiler
+
+442
+00:22:04,835 --> 00:22:06,560
+and I tell it,
+
+443
+00:22:06,560 --> 00:22:11,370
+please do introduce some
+optimizations as you want.
+
+444
+00:22:11,590 --> 00:22:15,799
+And I'm now running
+exactly the same program...
+
+445
+00:22:15,799 --> 00:22:17,870
+just the GCC compiler will
+
+446
+00:22:17,870 --> 00:22:22,325
+now have the optimizations
+switched on.
+
+447
+00:22:22,325 --> 00:22:24,380
+Let's see what happens.
+
+448
+00:22:24,380 --> 00:22:27,320
+One first needs to compile it.
+
+449
+00:22:27,320 --> 00:22:29,795
+And that takes a little while.
+
+450
+00:22:29,795 --> 00:22:32,000
+Okay, this has now
+finished and also
+
+451
+00:22:32,000 --> 00:22:34,115
+the calculation of the
+picture has finished.
+
+452
+00:22:34,115 --> 00:22:35,960
+And as you can see, it took
+
+453
+00:22:35,960 --> 00:22:38,510
+approximately eight
+seconds to calculate.
+
+454
+00:22:38,510 --> 00:22:41,645
+That is down from
+approximately 20 seconds.
+
+455
+00:22:41,645 --> 00:22:46,040
+So the difference from
+switching from -O0 to
+
+456
+00:22:46,040 --> 00:22:51,935
+-O3 meant that now
+the program runs almost as
+
+457
+00:22:51,935 --> 00:22:54,800
+fast as where I by hand
+
+458
+00:22:54,800 --> 00:22:58,610
+combined several steps
+into a big chunk.
+
+459
+00:22:58,610 --> 00:23:00,170
+That is essentially what
+
+460
+00:23:00,170 --> 00:23:03,485
+the GCC compiler
+found out on its own.
+
+461
+00:23:03,485 --> 00:23:05,840
+That rather than jumping
+
+462
+00:23:05,840 --> 00:23:08,465
+in just single increments by one,
+
+463
+00:23:08,465 --> 00:23:11,510
+they can be all combined
+into bigger chunks.
+
+464
+00:23:11,510 --> 00:23:16,595
+It hasn't been as successful
+as if I do this explicitly.
+
+465
+00:23:16,595 --> 00:23:18,620
+But that is the magic that
+
+466
+00:23:18,620 --> 00:23:22,560
+the compiler essentially found
+out, that this can be done.
+
+467
+00:23:22,960 --> 00:23:25,700
+Just a quick recap of what I did.
+
+468
+00:23:25,700 --> 00:23:28,160
+I first run the program with
+
+469
+00:23:28,160 --> 00:23:31,730
+an interpreter and it took
+approximately 11 minutes.
+
+470
+00:23:31,730 --> 00:23:36,559
+Then I had a simple compiler
+generating a C program.
+
+471
+00:23:36,559 --> 00:23:40,880
+And the C compiler then had
+no optimization switched on.
+
+472
+00:23:40,880 --> 00:23:44,645
+In the C program had only
+small single increments and
+
+473
+00:23:44,645 --> 00:23:46,820
+small jumps. Then it took
+
+474
+00:23:46,820 --> 00:23:49,775
+approximately 20 seconds for
+to same program.
+
+475
+00:23:49,775 --> 00:23:52,460
+Then I had a more advanced
+compiler which does
+
+476
+00:23:52,460 --> 00:23:55,730
+big increments and
+also big jumps.
+
+477
+00:23:55,730 --> 00:23:57,470
+But again, the compiler didn't
+
+478
+00:23:57,470 --> 00:23:59,210
+introduce any optimization on its
+
+479
+00:23:59,210 --> 00:24:02,915
+own. Then it took
+approximately five seconds.
+
+480
+00:24:02,915 --> 00:24:05,210
+Then I went back to
+
+481
+00:24:05,210 --> 00:24:08,465
+the simple compiler with
+only small steps.
+
+482
+00:24:08,465 --> 00:24:10,400
+But then told the C compiler
+
+483
+00:24:10,400 --> 00:24:12,950
+to fully optimize this program.
+
+484
+00:24:12,950 --> 00:24:14,690
+And then it took more
+or less the same
+
+485
+00:24:14,690 --> 00:24:17,465
+time as the more advanced compiler.
+
+486
+00:24:17,465 --> 00:24:20,465
+I encourage you to
+look at this yourself.
+
+487
+00:24:20,465 --> 00:24:24,240
+As usual, all the programs
+are uploaded on KEATS.
+
+488
+00:24:25,090 --> 00:24:27,590
+To finish this
+introduction video,
+
+489
+00:24:27,590 --> 00:24:30,170
+let me give you an
+extremely brief overview
+
+490
+00:24:30,170 --> 00:24:32,855
+over the history of compilers.
+
+491
+00:24:32,855 --> 00:24:35,450
+While I argued at the beginning
+
+492
+00:24:35,450 --> 00:24:38,915
+that it's interesting to
+study compilers nowadays,
+
+493
+00:24:38,915 --> 00:24:40,400
+obviously in this field,
+
+494
+00:24:40,400 --> 00:24:43,295
+we are standing on the
+shoulders of giants.
+
+495
+00:24:43,295 --> 00:24:46,520
+The field started with Turing and
+Turing machines,
+
+496
+00:24:46,520 --> 00:24:49,130
+which were introduced in 1936.
+
+497
+00:24:49,130 --> 00:24:52,175
+Turing machines already had
+this concept of memory
+
+498
+00:24:52,175 --> 00:24:55,190
+and also programs
+of Turing machines
+
+499
+00:24:55,190 --> 00:24:58,775
+needed to be translated
+between different formalisms.
+
+500
+00:24:58,775 --> 00:25:01,100
+Regular expressions,
+which will play
+
+501
+00:25:01,100 --> 00:25:03,905
+an important role in our
+lexer of our compiler,
+
+502
+00:25:03,905 --> 00:25:06,800
+were introduced in 1956.
+
+503
+00:25:06,800 --> 00:25:10,370
+Arguably the first compiler
+was introduced in
+
+504
+00:25:10,370 --> 00:25:13,850
+1957 for a language called COBOL.
+
+505
+00:25:13,850 --> 00:25:16,550
+This was done by Grace Hopper.
+
+506
+00:25:16,550 --> 00:25:18,770
+And I recommend that
+you have and look
+
+507
+00:25:18,770 --> 00:25:20,900
+at this interview
+of Grace Hopper,
+
+508
+00:25:20,900 --> 00:25:22,130
+which shows she must have been
+
+509
+00:25:22,130 --> 00:25:24,424
+a very interesting personality.
+
+510
+00:25:24,424 --> 00:25:27,770
+But despite the
+maturity of this field,
+
+511
+00:25:27,770 --> 00:25:29,465
+it might be surprising that
+
+512
+00:25:29,465 --> 00:25:31,505
+research papers are
+still published.
+
+513
+00:25:31,505 --> 00:25:34,835
+And we will find that
+out in the module.
+
+514
+00:25:34,835 --> 00:25:37,760
+And a number of
+problems which we would
+
+515
+00:25:37,760 --> 00:25:41,270
+assume are already solved by now,
+
+516
+00:25:41,270 --> 00:25:43,730
+actually turning out
+that they are not solved.
+
+517
+00:25:43,730 --> 00:25:45,620
+Here in this blog post,
+
+518
+00:25:45,620 --> 00:25:47,825
+for example, Laurie Tratt
+
+519
+00:25:47,825 --> 00:25:49,550
+who is also from this department,
+
+520
+00:25:49,550 --> 00:25:51,740
+argued that parsing, for example,
+
+521
+00:25:51,740 --> 00:25:54,035
+isn't a solved problem yet.
+
+522
+00:25:54,035 --> 00:25:56,300
+I hope you will have as much fun
+
+523
+00:25:56,300 --> 00:25:58,130
+with this module as I have.
+
+524
+00:25:58,130 --> 00:26:00,750
+Thank you very much
+for listening.