videos/01-intro.srt
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 29 Nov 2024 18:58:18 +0000
changeset 975 ae5c03560d4d
parent 833 aad5957eb7e4
permissions -rw-r--r--
update

1
00:00:16,280 --> 00:00:18,330
A warm welcome to

2
00:00:18,330 --> 00:00:20,820
the compilers and formal
languages module?

3
00:00:20,820 --> 00:00:24,390
My name is Christian Urban.
Thank you for coming.

4
00:00:24,390 --> 00:00:27,644
Compilers. I guess
you use compilers

5
00:00:27,644 --> 00:00:31,680
in your daily work, be it
the C or Java compiler.

6
00:00:31,680 --> 00:00:34,245
And you might be curious
in what they do.

7
00:00:34,245 --> 00:00:35,700
But you might also be

8
00:00:35,700 --> 00:00:38,130
intimidated to look
what they do underneath

9
00:00:38,130 --> 00:00:39,900
the bonnet because they are

10
00:00:39,900 --> 00:00:42,520
quite large software systems.

11
00:00:42,520 --> 00:00:46,130
What I like to show you in
this module is that you

12
00:00:46,130 --> 00:00:49,310
do not need to be an
Überhacker to implement your own

13
00:00:49,310 --> 00:00:51,305
compiler for your
own language, say.

14
00:00:51,305 --> 00:00:54,155
So that will be the main
goal of this module.

15
00:00:54,155 --> 00:00:56,210
You will implement
your own compiler,

16
00:00:56,210 --> 00:00:58,310
of course with my help.

17
00:00:58,310 --> 00:01:02,360
What I personally like
about compilers is that

18
00:01:02,360 --> 00:01:04,580
the subject is a
perfect combination

19
00:01:04,580 --> 00:01:06,350
of theory and practice.

20
00:01:06,350 --> 00:01:07,790
I like to hack things,

21
00:01:07,790 --> 00:01:10,595
writing actual code,
but I also like theory.

22
00:01:10,595 --> 00:01:13,040
I want to understand what
my code actually does.

23
00:01:13,040 --> 00:01:17,040
So compilers are the
perfect subject for me.

24
00:01:18,040 --> 00:01:20,809
Let's have a look at the details.

25
00:01:20,809 --> 00:01:23,779
Here's an airplane
view of a compiler.

26
00:01:23,779 --> 00:01:25,850
On the left-hand side you can see

27
00:01:25,850 --> 00:01:28,745
the input program a
developer would write.

28
00:01:28,745 --> 00:01:31,955
And on the right-hand side is
the output of the compiler,

29
00:01:31,955 --> 00:01:36,360
the binary code the developer
wants to actually run.

30
00:01:36,370 --> 00:01:40,340
What makes such a project
actually feasible is that

31
00:01:40,340 --> 00:01:44,165
compilers fall neatly into
separate components.

32
00:01:44,165 --> 00:01:47,210
So you can see the lexer and the

33
00:01:47,210 --> 00:01:48,860
parser, which are often called the

34
00:01:48,860 --> 00:01:50,990
front end of the compiler.

35
00:01:50,990 --> 00:01:53,000
And the code generation is

36
00:01:53,000 --> 00:01:55,700
called the backend
of the compiler.

37
00:01:55,700 --> 00:01:57,620
And it so happens
that we will spend

38
00:01:57,620 --> 00:01:59,930
the first five weeks
on the lexer and

39
00:01:59,930 --> 00:02:04,970
the parser, and the next five
weeks on the code generator.

40
00:02:04,970 --> 00:02:09,575
The first component of the
compiler is the lexer.

41
00:02:09,575 --> 00:02:14,480
The purpose of the lexer
is to scan a flat

42
00:02:14,480 --> 00:02:16,610
string, which the
input program is at

43
00:02:16,610 --> 00:02:19,145
the beginning and separate out

44
00:02:19,145 --> 00:02:21,275
where are the words?

45
00:02:21,275 --> 00:02:23,600
You might think
in Western languages,

46
00:02:23,600 --> 00:02:24,920
that is very easy:

47
00:02:24,920 --> 00:02:26,690
you just try to
find out where are

48
00:02:26,690 --> 00:02:28,580
the whitespaces
and then you know,

49
00:02:28,580 --> 00:02:31,955
where one word stops and
where the next word begins.

50
00:02:31,955 --> 00:02:35,300
But, actually, computer
languages are more

51
00:02:35,300 --> 00:02:38,180
like ancient languages
that you see here.

52
00:02:38,180 --> 00:02:39,860
For example, ancient Greek.

53
00:02:39,860 --> 00:02:42,065
The writer wrote one word

54
00:02:42,065 --> 00:02:44,915
after the next and not
leaving any space.

55
00:02:44,915 --> 00:02:47,930
So for example, in this
very simple string here

56
00:02:47,930 --> 00:02:50,960
read n, there is no space at all.

57
00:02:50,960 --> 00:02:53,540
But the purpose of
the lexer is to

58
00:02:53,540 --> 00:02:56,570
separate it into five
different components.

59
00:02:56,570 --> 00:02:59,450
It has to first say,
well there is a read,

60
00:02:59,450 --> 00:03:01,595
then there is a left parenthesis,

61
00:03:01,595 --> 00:03:04,025
then there's an
identifier called n,

62
00:03:04,025 --> 00:03:07,715
then there's a right parenthesis,
and finally a semicolon.

63
00:03:07,715 --> 00:03:11,345
And as you can see, there
is no space here at all.

64
00:03:11,345 --> 00:03:14,150
And so the task of finding where

65
00:03:14,150 --> 00:03:17,705
are the words is actually
quite involved.

66
00:03:17,705 --> 00:03:19,460
Also the classification is

67
00:03:19,460 --> 00:03:22,055
sometimes not so straightforward.

68
00:03:22,055 --> 00:03:24,350
If, for example, a writer

69
00:03:24,350 --> 00:03:26,360
wrote "if" on its own,

70
00:03:26,360 --> 00:03:29,000
then this should be a
keyword classified as

71
00:03:29,000 --> 00:03:32,615
a keyword because it's
from the if-then-else.

72
00:03:32,615 --> 00:03:36,800
But if the developer wrote
something longer like "iffoo",

73
00:03:36,800 --> 00:03:38,030
then this might just be

74
00:03:38,030 --> 00:03:39,860
a strange variable name
and he needs to be

75
00:03:39,860 --> 00:03:44,250
classified as a variable name.

76
00:03:45,220 --> 00:03:50,720
The output of the lexer 
is a sequence of tokens.

77
00:03:50,720 --> 00:03:53,480
These are essentially the words in

78
00:03:53,480 --> 00:03:56,405
a sentence and their
classification:

79
00:03:56,405 --> 00:03:59,540
they are nouns,
verbs or adjectives.

80
00:03:59,540 --> 00:04:02,885
For us, of course, the
classification would be keywords,

81
00:04:02,885 --> 00:04:06,005
identifiers,
operators, and so on.

82
00:04:06,005 --> 00:04:11,480
And these tokens are the
input for the parser,

83
00:04:11,480 --> 00:04:13,085
the next component in

84
00:04:13,085 --> 00:04:17,240
our compiler. The parser
essentially takes this list

85
00:04:17,240 --> 00:04:23,300
of tokens and transforms it
into a abstract syntax tree.

86
00:04:23,300 --> 00:04:27,230
That means we have now the
sentence, we have the words.

87
00:04:27,230 --> 00:04:30,275
We have to relate
the words to each other

88
00:04:30,275 --> 00:04:33,680
in order to find out what
this sentence actually means.

89
00:04:33,680 --> 00:04:35,405
In this case here,

90
00:04:35,405 --> 00:04:38,595
we have to do a read...which
variable is affected?

91
00:04:38,595 --> 00:04:45,040
The variable n. Once we have
the abstract syntax tree,

92
00:04:45,040 --> 00:04:49,225
it can go to the next component,
to the code generator.

93
00:04:49,225 --> 00:04:52,480
Whilst it doesn't look
like this in this picture,

94
00:04:52,480 --> 00:04:54,100
the code generators is usually the

95
00:04:54,100 --> 00:04:56,080
biggest part in a compiler.

96
00:04:56,080 --> 00:04:58,720
And here we actually
have to cut corners.

97
00:04:58,720 --> 00:05:02,470
Instead of producing binary
code, which can be run

98
00:05:02,470 --> 00:05:06,820
directly on a CPU, like X86 or ARM,

99
00:05:06,820 --> 00:05:11,035
we actually target the
Java Virtual Machine, the JVM.

100
00:05:11,035 --> 00:05:13,600
This is very similar
to the Scala compiler,

101
00:05:13,600 --> 00:05:16,480
for example, which
produces JVM code.

102
00:05:16,480 --> 00:05:18,940
Or, of course, also 
the Java compiler,

103
00:05:18,940 --> 00:05:20,845
which produces JVM code.

104
00:05:20,845 --> 00:05:23,900
So here's a typical
example code which we're

105
00:05:23,900 --> 00:05:27,305
going to produce: Something
like variable 2

106
00:05:27,305 --> 00:05:30,545
gets allocated in the stack.

107
00:05:30,545 --> 00:05:32,390
You subtract ten from it.

108
00:05:32,390 --> 00:05:36,050
You test whether the
variable is 0 and if yes,

109
00:05:36,050 --> 00:05:40,170
you jump somewhere and otherwise
you reload the variable.

110
00:05:41,560 --> 00:05:45,935
Even though we cut corners
in the code generater part

111
00:05:45,935 --> 00:05:48,575
by producing code for the JVM,

112
00:05:48,575 --> 00:05:51,710
we can still obtain quite
impressive results.

113
00:05:51,710 --> 00:05:56,225
Here's a program with
a cubic runtime behaviour.

114
00:05:56,225 --> 00:05:59,525
When it has to
calculate with 400 units,

115
00:05:59,525 --> 00:06:02,165
it might need five
seconds to calculate.

116
00:06:02,165 --> 00:06:05,345
But if it  has to
calculate with 1200 units,

117
00:06:05,345 --> 00:06:08,165
it already needs more
than two minutes.

118
00:06:08,165 --> 00:06:10,940
Now these timings,
the blue timings

119
00:06:10,940 --> 00:06:13,910
are obtained with an
interpreter of that program.

120
00:06:13,910 --> 00:06:16,265
And you can see
just next to it,

121
00:06:16,265 --> 00:06:18,050
the red times.

122
00:06:18,050 --> 00:06:20,359
They are obtained
with the compiler

123
00:06:20,359 --> 00:06:23,135
we're going to write
in this module.

124
00:06:23,135 --> 00:06:25,100
And there you can see, even

125
00:06:25,100 --> 00:06:29,000
with 1200,  the times get
hardly off the x-axis.

126
00:06:29,000 --> 00:06:30,485
So in this instance,

127
00:06:30,485 --> 00:06:32,630
our compiler will
have a speed up from

128
00:06:32,630 --> 00:06:37,589
approximately 1 million in
comparison to the interpreter.

129
00:06:38,350 --> 00:06:42,020
This might be a fun task
for your spare time.

130
00:06:42,020 --> 00:06:44,480
This is a compiler explorer,
which allows you to

131
00:06:44,480 --> 00:06:47,060
write C code on the
left-hand side.

132
00:06:47,060 --> 00:06:49,115
It shows you on the
right-hand side

133
00:06:49,115 --> 00:06:53,270
the machine code an
X86 CPU can run.

134
00:06:53,270 --> 00:06:56,060
For example, it gives
you these color scheme and

135
00:06:56,060 --> 00:06:59,000
says that this addition
in the num + num in

136
00:06:59,000 --> 00:07:01,580
the return statement,
translates to

137
00:07:01,580 --> 00:07:02,960
essentially an addition of

138
00:07:02,960 --> 00:07:05,930
an register in the machine code.

139
00:07:05,930 --> 00:07:08,480
I think this compiler
explorer also works for

140
00:07:08,480 --> 00:07:11,495
the Haskell language and
also produces ARM code,

141
00:07:11,495 --> 00:07:15,245
not just code for the X86 CPUs.

142
00:07:15,245 --> 00:07:18,950
As an aside, I also recommend
to watch the movie of

143
00:07:18,950 --> 00:07:20,870
this guy because that is

144
00:07:20,870 --> 00:07:22,940
very much like how I worked
at the beginning

145
00:07:22,940 --> 00:07:25,355
when implementing our compiler.

146
00:07:25,355 --> 00:07:29,300
I wrote some code which I
knew how it should behave.

147
00:07:29,300 --> 00:07:30,725
And then I just had a look

148
00:07:30,725 --> 00:07:32,900
what another compiler produces.

149
00:07:32,900 --> 00:07:37,190
And I imitated that code
as much as I could.

150
00:07:37,190 --> 00:07:39,380
Such a compiler explorer

151
00:07:39,380 --> 00:07:41,375
also exists for
the Java language.

152
00:07:41,375 --> 00:07:42,935
Here's one where you can write

153
00:07:42,935 --> 00:07:44,915
Java code on the left-hand side,

154
00:07:44,915 --> 00:07:47,930
and on the right-hand
side you get JVM code.

155
00:07:47,930 --> 00:07:50,255
JVM code is a byte code

156
00:07:50,255 --> 00:07:53,405
which cannot be run
directly by the CPU,

157
00:07:53,405 --> 00:07:55,220
but needs the Java
Virtual Machine

158
00:07:55,220 --> 00:07:57,170
to essentially interpret that.

159
00:07:57,170 --> 00:07:58,760
This means it's not

160
00:07:58,760 --> 00:08:01,235
the most efficient way
how to run programs - 

161
00:08:01,235 --> 00:08:02,780
it would be much faster to

162
00:08:02,780 --> 00:08:05,285
generate direct
code for the CPU.

163
00:08:05,285 --> 00:08:07,700
But by producing bytecode,

164
00:08:07,700 --> 00:08:11,435
we still run the
programs quite fast

165
00:08:11,435 --> 00:08:13,520
and it also simplifies

166
00:08:13,520 --> 00:08:16,280
a number of issues
in our compiler.

167
00:08:16,280 --> 00:08:18,980
One issue is about
memory management.

168
00:08:18,980 --> 00:08:22,055
We don't have to be concerned
about register allocation,

169
00:08:22,055 --> 00:08:25,505
which we would need
to do in a real CPU.

170
00:08:25,505 --> 00:08:27,680
This will be done by the JVM.

171
00:08:27,680 --> 00:08:29,750
It's also much easier to produce

172
00:08:29,750 --> 00:08:33,635
this bytecode rather than
actual machine code.

173
00:08:33,635 --> 00:08:37,385
I think it's now a good time
to come to the question,

174
00:08:37,385 --> 00:08:39,950
why on Earth studying compilers.

175
00:08:39,950 --> 00:08:42,650
Compilers are such an
established subject

176
00:08:42,650 --> 00:08:43,985
in computer science.

177
00:08:43,985 --> 00:08:46,100
Compilers do what they do.

178
00:08:46,100 --> 00:08:48,410
Probably forrests have
been killed by

179
00:08:48,410 --> 00:08:52,190
all the books that have been
published on compilers.

180
00:08:52,190 --> 00:08:56,690
Why on Earth studying
compilers in 2020 (and of course in 2021)? 

181
00:08:56,690 --> 00:08:59,659
And even worse: Why implementing
your own compiler?

182
00:08:59,659 --> 00:09:02,720
Well, a slightly humorous take on

183
00:09:02,720 --> 00:09:05,105
that question is
given by John Regehr,

184
00:09:05,105 --> 00:09:08,375
who is a compiler hacker
from the University of Utah.

185
00:09:08,375 --> 00:09:09,770
Essentially what he says,

186
00:09:09,770 --> 00:09:12,110
if you're a good compiler hacker,

187
00:09:12,110 --> 00:09:14,690
you have no problems
of finding a job.

188
00:09:14,690 --> 00:09:17,210
He puts it as: It's effectively

189
00:09:17,210 --> 00:09:22,320
a "Perpetual Employment Act"
for solid compiler hackers.

190
00:09:22,990 --> 00:09:27,380
Regehr gives two justifications
for that statement.

191
00:09:27,380 --> 00:09:29,690
First, he says
hardware is getting

192
00:09:29,690 --> 00:09:32,585
weirder, rather than
getting clocked faster.

193
00:09:32,585 --> 00:09:34,520
And that's definitely true.

194
00:09:34,520 --> 00:09:36,590
My first computer many, many,

195
00:09:36,590 --> 00:09:40,040
many years ago contained
only a single core CPU.

196
00:09:40,040 --> 00:09:44,030
And it was such a simple CPU
that we wrote machine code

197
00:09:44,030 --> 00:09:46,220
directly for that CPU in order to

198
00:09:46,220 --> 00:09:49,740
run our programs as
fast as possible.

199
00:09:50,260 --> 00:09:53,600
In contrast, today, Regehr writes,

200
00:09:53,600 --> 00:09:57,005
almost all processors are
multi-core nowadays.

201
00:09:57,005 --> 00:09:59,870
And it looks like there's
an increasing asymmetry

202
00:09:59,870 --> 00:10:02,015
in resources across cores.

203
00:10:02,015 --> 00:10:04,445
Processors come
with vector units,

204
00:10:04,445 --> 00:10:07,189
crypto accelerators, etc.

205
00:10:07,189 --> 00:10:11,930
We have TSPs, GPUs, ARM
big,little, Xeon Phi,

206
00:10:11,930 --> 00:10:14,630
and this only scratches the surface.

207
00:10:14,630 --> 00:10:17,255
And that is really a
problem for compilers,

208
00:10:17,255 --> 00:10:20,495
because if we now
have multi-core CPUs,

209
00:10:20,495 --> 00:10:23,180
that means our programs need

210
00:10:23,180 --> 00:10:26,075
to be scheduled
over several CPUs.

211
00:10:26,075 --> 00:10:28,220
But the developer, of course,

212
00:10:28,220 --> 00:10:30,545
doesn't want to know
anything about that.

213
00:10:30,545 --> 00:10:34,655
Also, now we have more
CPUs in a computer,

214
00:10:34,655 --> 00:10:37,400
but they seem to also come
with different resources.

215
00:10:37,400 --> 00:10:40,310
So certain tasks in
a program need to

216
00:10:40,310 --> 00:10:43,460
be scheduled on some
cores, but not on others.

217
00:10:43,460 --> 00:10:46,685
We also have for a
long time already GPUs,

218
00:10:46,685 --> 00:10:49,025
which are highly specialized for

219
00:10:49,025 --> 00:10:53,240
very parallel computations
to do with graphics.

220
00:10:53,240 --> 00:10:56,015
They at least in
the past few years,

221
00:10:56,015 --> 00:10:59,360
they could only do very
special computations,

222
00:10:59,360 --> 00:11:02,255
but very, very fast
and highly parallel.

223
00:11:02,255 --> 00:11:05,539
You couldn't use them for
general-purpose calculations,

224
00:11:05,539 --> 00:11:10,205
which you would usually
use CPUs. Similarly, DSPs.

225
00:11:10,205 --> 00:11:14,075
They are needed for all
the signal processing

226
00:11:14,075 --> 00:11:16,040
in mobile phones.

227
00:11:16,040 --> 00:11:20,780
Without them, we just
wouldn't have mobile phones.

228
00:11:20,780 --> 00:11:23,525
The second reason Regehr gives is

229
00:11:23,525 --> 00:11:25,550
that we are getting tired of

230
00:11:25,550 --> 00:11:27,620
low-level languages and

231
00:11:27,620 --> 00:11:30,335
their associated
security disasters.

232
00:11:30,335 --> 00:11:32,435
While at the beginning
we were still

233
00:11:32,435 --> 00:11:34,760
happy to write machine
code directly,

234
00:11:34,760 --> 00:11:37,175
nobody wants to do this nowadays.

235
00:11:37,175 --> 00:11:39,515
He writes: :We want
to write new code

236
00:11:39,515 --> 00:11:44,120
to whatever extent possible in
safer high-level languages.

237
00:11:44,120 --> 00:11:46,130
Compilers are caught right

238
00:11:46,130 --> 00:11:48,365
in the middle of these
opposing trends:

239
00:11:48,365 --> 00:11:50,765
one of their main jobs
is to have bridged

240
00:11:50,765 --> 00:11:53,240
the large and growing gap between

241
00:11:53,240 --> 00:11:55,460
increasingly high-level languages

242
00:11:55,460 --> 00:11:58,565
and increasingly
wacky platforms.

243
00:11:58,565 --> 00:12:00,320
So here you have it:

244
00:12:00,320 --> 00:12:02,750
It's still interesting to study

245
00:12:02,750 --> 00:12:06,244
compilers nowadays,
especially nowadays.

246
00:12:06,244 --> 00:12:09,875
Well, if you want to work
on interesting problems,

247
00:12:09,875 --> 00:12:14,570
then very often you have to
know compilers. Here's one:

248
00:12:14,570 --> 00:12:16,940
In the good old
times when we were

249
00:12:16,940 --> 00:12:19,310
still able to fly on holidays,

250
00:12:19,310 --> 00:12:23,300
I'm sure you flew in
a 777 Boeing airplane.

251
00:12:23,300 --> 00:12:25,850
It's actually already
a quite old airplane.

252
00:12:25,850 --> 00:12:28,295
But they had the
following problem.

253
00:12:28,295 --> 00:12:33,020
What happens if a CPU
calculates the wrong result?

254
00:12:33,020 --> 00:12:36,065
And that's actually not
such a hypothetical problem

255
00:12:36,065 --> 00:12:40,295
because you remember
the Pentium CPU

256
00:12:40,295 --> 00:12:43,655
in around 2000
contained a bug and it cost

257
00:12:43,655 --> 00:12:48,140
a lot of money for Intel
to replace this CPU.

258
00:12:48,140 --> 00:12:51,470
What happens if an CPU calculates

259
00:12:51,470 --> 00:12:56,105
the wrong result in
some navigation data?

260
00:12:56,105 --> 00:12:58,520
Do you just let the
airplane crash?

261
00:12:58,520 --> 00:13:00,650
Well, the engineers at Boeing

262
00:13:00,650 --> 00:13:02,675
came up with the
following solution:

263
00:13:02,675 --> 00:13:05,055
They writing one program that

264
00:13:05,055 --> 00:13:08,690
essentially controls
how their airplane

265
00:13:08,690 --> 00:13:10,295
is supposed to fly.

266
00:13:10,295 --> 00:13:13,040
In a programming language
called Ada that is

267
00:13:13,040 --> 00:13:15,770
slightly obscure
programming language

268
00:13:15,770 --> 00:13:18,650
but is very well-known in

269
00:13:18,650 --> 00:13:22,040
areas where safety
is really paramount.

270
00:13:22,040 --> 00:13:25,010
And what they did is they
took this one Ada program

271
00:13:25,010 --> 00:13:28,010
and they essentially
took three compilers,

272
00:13:28,010 --> 00:13:29,435
independent compilers,

273
00:13:29,435 --> 00:13:32,045
which compiled the program
to different CPUs.

274
00:13:32,045 --> 00:13:33,815
One CPU from Intel,

275
00:13:33,815 --> 00:13:37,235
one CPU for Motorola,
and one from AMD.

276
00:13:37,235 --> 00:13:38,930
And these are quite different

277
00:13:38,930 --> 00:13:40,955
CPUs. Also some of them

278
00:13:40,955 --> 00:13:42,755
have quite different philosophies

279
00:13:42,755 --> 00:13:45,245
on how they do their calculations.

280
00:13:45,245 --> 00:13:47,330
Now what they could do is, they

281
00:13:47,330 --> 00:13:50,690
could put these three computers
on a single board and

282
00:13:50,690 --> 00:13:54,335
could now run all their
calculations in parallel.

283
00:13:54,335 --> 00:13:56,690
One with an Intel
CPU, one with

284
00:13:56,690 --> 00:14:00,245
Motorola, and one with a 
Risc computers say.

285
00:14:00,245 --> 00:14:02,795
And then they could
compare the results.

286
00:14:02,795 --> 00:14:05,030
And if these results differed,

287
00:14:05,030 --> 00:14:07,940
then they knew one CPU must have

288
00:14:07,940 --> 00:14:11,600
calculated the wrong result
and probably told the pilot,

289
00:14:11,600 --> 00:14:14,850
please don't let
that airplane crash.

290
00:14:14,950 --> 00:14:17,270
Not just Boeing is doing

291
00:14:17,270 --> 00:14:19,355
interesting things
with compilers.

292
00:14:19,355 --> 00:14:22,640
Also, Airbus in a completely
different setting

293
00:14:22,640 --> 00:14:24,260
is using compilers in

294
00:14:24,260 --> 00:14:26,420
a interesting way to get

295
00:14:26,420 --> 00:14:30,195
their airplanes up in the
air and let them not crash.

296
00:14:30,195 --> 00:14:33,010
In another example, I
have friends working

297
00:14:33,010 --> 00:14:35,350
at Facebook who work on

298
00:14:35,350 --> 00:14:37,900
compilers to make sense
are they are heaps

299
00:14:37,900 --> 00:14:41,470
and heaps and heaps of
code written in PHP,

300
00:14:41,470 --> 00:14:42,880
which is one of the most

301
00:14:42,880 --> 00:14:44,920
horrible languages on the planet.

302
00:14:44,920 --> 00:14:46,630
The bottom line is,

303
00:14:46,630 --> 00:14:50,499
compilers are still very,
very important nowadays.

304
00:14:50,499 --> 00:14:52,810
And for the foreseeable future,

305
00:14:52,810 --> 00:14:56,150
compilers still need
to be developed.

306
00:14:57,270 --> 00:15:00,235
When one talks about
compilers then

307
00:15:00,235 --> 00:15:01,630
there is, of course,

308
00:15:01,630 --> 00:15:05,395
a magic element involved. They're
large software systems.

309
00:15:05,395 --> 00:15:07,750
And yes, they're supposed to

310
00:15:07,750 --> 00:15:10,525
generate a runnable binary for

311
00:15:10,525 --> 00:15:12,105
a high-level program,

312
00:15:12,105 --> 00:15:15,230
but they are also supposed
to make our programs better.

313
00:15:15,230 --> 00:15:18,920
They optimize our
programs to run faster.

314
00:15:18,920 --> 00:15:22,610
And there's a lot of
"magic" involved in that.

315
00:15:22,610 --> 00:15:24,890
Magic in inverted quotes.

316
00:15:24,890 --> 00:15:26,480
I would like to show you one of

317
00:15:26,480 --> 00:15:29,340
this magic in one example.

318
00:15:31,090 --> 00:15:33,110
I hope you still have

319
00:15:33,110 --> 00:15:36,485
fond memories of the
PEP module last year.

320
00:15:36,485 --> 00:15:40,280
And you might remember BF
language, we had to look at.

321
00:15:40,280 --> 00:15:43,670
This BF language contains
a kind of memory and

322
00:15:43,670 --> 00:15:45,680
a memory pointer which can

323
00:15:45,680 --> 00:15:48,140
be moved to the left
or to the right,

324
00:15:48,140 --> 00:15:50,270
where an integer in

325
00:15:50,270 --> 00:15:52,925
the memory can be either
increased or decreased.

326
00:15:52,925 --> 00:15:55,730
We can print
out the content of

327
00:15:55,730 --> 00:15:59,645
the current cell or input
something into a cell.

328
00:15:59,645 --> 00:16:01,850
And we have loops, and

329
00:16:01,850 --> 00:16:04,220
everything else is
considered a comment.

330
00:16:04,220 --> 00:16:06,290
What's good about
this BF language is that 

331
00:16:06,290 --> 00:16:08,180
we don't even need a front end.

332
00:16:08,180 --> 00:16:09,920
We can immediately start writing

333
00:16:09,920 --> 00:16:13,529
an interpreter or a compiler.

334
00:16:15,850 --> 00:16:18,155
Okay, I have here

335
00:16:18,155 --> 00:16:20,600
a very straightforward
interpreter for

336
00:16:20,600 --> 00:16:22,865
the BF language. You
might recognize it.

337
00:16:22,865 --> 00:16:27,120
And I run it with a
benchmark program, which you 

338
00:16:27,760 --> 00:16:30,960
might also recognize.

339
00:16:31,560 --> 00:16:36,835
And this will now take
approximately ten minutes.

340
00:16:36,835 --> 00:16:39,920
So see you in a bit.

341
00:16:40,710 --> 00:16:43,660
Okay, this has finished now.

342
00:16:43,660 --> 00:16:45,925
Almost took 11 minutes.

343
00:16:45,925 --> 00:16:47,410
The question now is,

344
00:16:47,410 --> 00:16:51,820
can we to better? 

345
00:16:51,820 --> 00:16:53,260
Actually, it is not difficult
to do better

346
00:16:53,260 --> 00:16:54,970
than this simple-minded
interpreter.

347
00:16:54,970 --> 00:16:58,690
It is relatively
straightforward to take

348
00:16:58,690 --> 00:17:03,520
a BF program and generate equivalent C code.

349
00:17:03,520 --> 00:17:06,520
This can be easily
done by somehow

350
00:17:06,520 --> 00:17:09,490
representing the BF memory in C.

351
00:17:09,490 --> 00:17:12,310
We can do this
by just an array of

352
00:17:12,310 --> 00:17:15,380
characters and a memory pointer,

353
00:17:15,380 --> 00:17:19,265
which points, in this case
in the middle of this array.

354
00:17:19,265 --> 00:17:21,800
Then it's very easy to

355
00:17:21,800 --> 00:17:24,275
translate the movement
of the 

356
00:17:24,275 --> 00:17:28,610
BF memory pointer into increments

357
00:17:28,610 --> 00:17:31,520
and decrements of
the C pointer.

358
00:17:31,520 --> 00:17:33,050
Similarly, if you want to increment or

359
00:17:33,050 --> 00:17:35,915
decrement an element
in this array,

360
00:17:35,915 --> 00:17:38,975
you just have to look up what
the pointer is and increment

361
00:17:38,975 --> 00:17:42,140
and decrement what's
under the pointer. Similarly

362
00:17:42,140 --> 00:17:43,790
we can print out something from

363
00:17:43,790 --> 00:17:47,450
this array and we can put
something into this array.

364
00:17:47,450 --> 00:17:49,610
What is great is that the loops

365
00:17:49,610 --> 00:17:51,530
from the BF language directly

366
00:17:51,530 --> 00:17:55,580
translate into while
loops of the C language.

367
00:17:55,580 --> 00:17:58,100
We essentially have to check is

368
00:17:58,100 --> 00:18:02,450
the memory pointer pointing
to a field that is 0.

369
00:18:02,450 --> 00:18:03,950
Then we exit the loop,

370
00:18:03,950 --> 00:18:05,719
and otherwise we continue.

371
00:18:05,719 --> 00:18:09,575
And that can be done
nicely in C, like so.

372
00:18:09,575 --> 00:18:12,995
And everything else is
again, just a comment.

373
00:18:12,995 --> 00:18:16,445
So I have implemented
this translation for you.

374
00:18:16,445 --> 00:18:19,700
Remember this was the BF
program for the Mandelbrot Set.

375
00:18:19,700 --> 00:18:23,690
The equivalent C code
would look like this.

376
00:18:23,690 --> 00:18:27,110
So you can see at the beginning
is this generation of

377
00:18:27,110 --> 00:18:30,140
the BF memory represented

378
00:18:30,140 --> 00:18:32,345
as an array and the
memory pointer.

379
00:18:32,345 --> 00:18:34,550
And then inside there are lots

380
00:18:34,550 --> 00:18:36,770
of increments and decrements

381
00:18:36,770 --> 00:18:41,915
of pointers and also
contents of this array.

382
00:18:41,915 --> 00:18:45,199
Now fingers crossed that this
is the correct translation.

383
00:18:45,199 --> 00:18:48,125
And I can also run this

384
00:18:48,125 --> 00:18:52,805
and you should see that it runs
substantially faster.

385
00:18:52,805 --> 00:18:55,880
I'm using now my GCC on

386
00:18:55,880 --> 00:19:01,040
my computer to generate an
executable for the C program.

387
00:19:01,040 --> 00:19:04,295
And it should run for
approximately 20 seconds.

388
00:19:04,295 --> 00:19:07,620
So let's just wait for that.

389
00:19:11,430 --> 00:19:14,950
Okay. What is important to note

390
00:19:14,950 --> 00:19:19,885
here is that I'm running GCC,

391
00:19:19,885 --> 00:19:22,450
this the option -O0.

392
00:19:22,450 --> 00:19:25,600
That means I tell GCC to

393
00:19:25,600 --> 00:19:28,840
generate a binary which I
can run as you can see.

394
00:19:28,840 --> 00:19:31,810
But don't apply any optimization.

395
00:19:31,810 --> 00:19:38,395
Keep this in mind.


396
00:19:38,395 --> 00:19:42,595
As a second try, of course, I can try to 
generate a better C program.

397
00:19:42,595 --> 00:19:46,060
And as you'll remember from
the PEP course, it can,

398
00:19:46,060 --> 00:19:50,095
for example, combine
several steps

399
00:19:50,095 --> 00:19:51,670
going to the right of the memory

400
00:19:51,670 --> 00:19:53,310
pointer or to the left.

401
00:19:53,310 --> 00:19:55,280
We can combine that into

402
00:19:55,280 --> 00:19:58,760
one single increment or
decrement of not just one,

403
00:19:58,760 --> 00:20:00,050
but of like n,

404
00:20:00,050 --> 00:20:02,570
where n is greater
than 1. Similarly, 

405
00:20:02,570 --> 00:20:06,980
if we increment or decrement
the content of this array,

406
00:20:06,980 --> 00:20:09,740
we can do this in one
big step by incrementing

407
00:20:09,740 --> 00:20:12,710
that by not just one
and increment by one,

408
00:20:12,710 --> 00:20:15,635
but increment and decrement
by bigger numbers.

409
00:20:15,635 --> 00:20:18,870
Everything else stays the same.

410
00:20:20,830 --> 00:20:23,270
Again, I have implemented that

411
00:20:23,270 --> 00:20:24,950
for you and you can see now

412
00:20:24,950 --> 00:20:26,835
the C program moves

413
00:20:26,835 --> 00:20:30,455
the memory pointer and bigger
chunks and also increases,

414
00:20:30,455 --> 00:20:32,810
for example, here, memory content

415
00:20:32,810 --> 00:20:35,555
by 15 than just by 1.

416
00:20:35,555 --> 00:20:38,520
Now let's run this program.

417
00:20:38,530 --> 00:20:40,760
Again, I use GCC

418
00:20:40,760 --> 00:20:45,350
to compile the
C program and run it.

419
00:20:45,350 --> 00:20:49,025
And again, I made sure
that it only runs this with

420
00:20:49,025 --> 00:20:51,530
no optimizations switched on.

421
00:20:51,530 --> 00:20:54,050
So it runs with minus O0.

422
00:20:54,050 --> 00:20:56,090
And you can see
it's now down from

423
00:20:56,090 --> 00:20:59,990
20 seconds to just 6 seconds.

424
00:20:59,990 --> 00:21:06,065
I show you, the GCC is
called with -O0.

425
00:21:06,065 --> 00:21:08,990
So this reduction
in time is purely

426
00:21:08,990 --> 00:21:12,755
because I produced better C code,

427
00:21:12,755 --> 00:21:17,220
which the compiler then just
transformed into a binary.

428
00:21:18,910 --> 00:21:22,055
So far there's
nothing interesting.

429
00:21:22,055 --> 00:21:25,385
We used in the first instance

430
00:21:25,385 --> 00:21:29,240
single increments and use GCC with

431
00:21:29,240 --> 00:21:31,700
O0 to not introduce

432
00:21:31,700 --> 00:21:35,255
any optimizations and it runs
essentially for 20 seconds.

433
00:21:35,255 --> 00:21:37,880
If we then increment

434
00:21:37,880 --> 00:21:40,895
in bigger chunks or
decrement in bigger chunks,

435
00:21:40,895 --> 00:21:45,380
use again GCC with -O0,

436
00:21:45,380 --> 00:21:50,030
then it runs in approximately
five to six seconds.

437
00:21:50,030 --> 00:21:51,950
Now let me do the following:

438
00:21:51,950 --> 00:21:55,430
I take the first program which

439
00:21:55,430 --> 00:21:58,070
increments everything
in single steps

440
00:21:58,070 --> 00:22:00,185
or decrements in single steps.

441
00:22:00,185 --> 00:22:04,835
But now I use the full
capacity of the GCC compiler

442
00:22:04,835 --> 00:22:06,560
and I tell it,

443
00:22:06,560 --> 00:22:11,370
please do introduce some
optimizations as you want.

444
00:22:11,590 --> 00:22:15,799
And I'm now running
exactly the same program...

445
00:22:15,799 --> 00:22:17,870
just the GCC compiler will

446
00:22:17,870 --> 00:22:22,325
now have the optimizations
switched on.

447
00:22:22,325 --> 00:22:24,380
Let's see what happens.

448
00:22:24,380 --> 00:22:27,320
One first needs to compile it.

449
00:22:27,320 --> 00:22:29,795
And that takes a little while.

450
00:22:29,795 --> 00:22:32,000
Okay, this has now
finished and also

451
00:22:32,000 --> 00:22:34,115
the calculation of the
picture has finished.

452
00:22:34,115 --> 00:22:35,960
And as you can see, it took

453
00:22:35,960 --> 00:22:38,510
approximately eight
seconds to calculate.

454
00:22:38,510 --> 00:22:41,645
That is down from
approximately 20 seconds.

455
00:22:41,645 --> 00:22:46,040
So the difference from
switching from -O0 to

456
00:22:46,040 --> 00:22:51,935
-O3 meant that now
the program runs almost as

457
00:22:51,935 --> 00:22:54,800
fast as where I by hand

458
00:22:54,800 --> 00:22:58,610
combined several steps
into a big chunk.

459
00:22:58,610 --> 00:23:00,170
That is essentially what

460
00:23:00,170 --> 00:23:03,485
the GCC compiler
found out on its own.

461
00:23:03,485 --> 00:23:05,840
That rather than jumping

462
00:23:05,840 --> 00:23:08,465
in just single increments by one,

463
00:23:08,465 --> 00:23:11,510
they can be all combined
into bigger chunks.

464
00:23:11,510 --> 00:23:16,595
It hasn't been as successful
as if I do this explicitly.

465
00:23:16,595 --> 00:23:18,620
But that is the magic that

466
00:23:18,620 --> 00:23:22,560
the compiler essentially found
out, that this can be done.

467
00:23:22,960 --> 00:23:25,700
Just a quick recap of what I did.

468
00:23:25,700 --> 00:23:28,160
I first run the Mandelbrot program with

469
00:23:28,160 --> 00:23:31,730
an interpreter and it took
approximately 11 minutes.

470
00:23:31,730 --> 00:23:36,559
Then I had a simple compiler
generating a C program.

471
00:23:36,559 --> 00:23:40,880
And the C compiler then had
no optimization switched on.

472
00:23:40,880 --> 00:23:44,645
In the C program had only
small single increments and

473
00:23:44,645 --> 00:23:46,820
small jumps. Then it took

474
00:23:46,820 --> 00:23:49,775
approximately 20 seconds for
to same program.

475
00:23:49,775 --> 00:23:52,460
Then I had a more advanced
compiler which does

476
00:23:52,460 --> 00:23:55,730
big increments and
also big jumps.

477
00:23:55,730 --> 00:23:57,470
But again, the compiler didn't

478
00:23:57,470 --> 00:23:59,210
introduce any optimization on its

479
00:23:59,210 --> 00:24:02,915
own. Then it took
approximately five seconds.

480
00:24:02,915 --> 00:24:05,210
Then I went back to

481
00:24:05,210 --> 00:24:08,465
the simple compiler with
only small steps.

482
00:24:08,465 --> 00:24:10,400
But then told the C compiler

483
00:24:10,400 --> 00:24:12,950
to fully optimize this program.

484
00:24:12,950 --> 00:24:14,690
And then it took more
or less the same

485
00:24:14,690 --> 00:24:17,465
time as the more advanced compiler.

486
00:24:17,465 --> 00:24:20,465
I encourage you to
look at this yourself.

487
00:24:20,465 --> 00:24:24,240
As usual, all the programs
are uploaded on KEATS.

488
00:24:25,090 --> 00:24:27,590
To finish this
introduction video,

489
00:24:27,590 --> 00:24:30,170
let me give you an
extremely brief overview

490
00:24:30,170 --> 00:24:32,855
over the history of compilers.

491
00:24:32,855 --> 00:24:35,450
While I  argued at the beginning

492
00:24:35,450 --> 00:24:38,915
that it's interesting to
study compilers nowadays,

493
00:24:38,915 --> 00:24:40,400
obviously in this field,

494
00:24:40,400 --> 00:24:43,295
we are standing on the
shoulders of giants.

495
00:24:43,295 --> 00:24:46,520
The field started with Turing and 
Turing machines,

496
00:24:46,520 --> 00:24:49,130
which were introduced in 1936.

497
00:24:49,130 --> 00:24:52,175
Turing machines already had
this concept of memory

498
00:24:52,175 --> 00:24:55,190
and also programs
of Turing machines

499
00:24:55,190 --> 00:24:58,775
needed to be translated
between different formalisms.

500
00:24:58,775 --> 00:25:01,100
Regular expressions,
which will play

501
00:25:01,100 --> 00:25:03,905
an important role in our
lexer of our compiler,

502
00:25:03,905 --> 00:25:06,800
were introduced in 1956.

503
00:25:06,800 --> 00:25:10,370
Arguably the first compiler
was introduced in

504
00:25:10,370 --> 00:25:13,850
1957 for a language called COBOL.

505
00:25:13,850 --> 00:25:16,550
This was done by Grace Hopper.

506
00:25:16,550 --> 00:25:18,770
And I recommend that
you have and look

507
00:25:18,770 --> 00:25:20,900
at this interview
of Grace Hopper,

508
00:25:20,900 --> 00:25:22,130
which shows she must have been

509
00:25:22,130 --> 00:25:24,424
a very interesting personality.

510
00:25:24,424 --> 00:25:27,770
But despite the
maturity of this field,

511
00:25:27,770 --> 00:25:29,465
it might be surprising that

512
00:25:29,465 --> 00:25:31,505
research papers are
still published.

513
00:25:31,505 --> 00:25:34,835
And we will find that
out in the module.

514
00:25:34,835 --> 00:25:37,760
And a number of
problems which we would

515
00:25:37,760 --> 00:25:41,270
assume are already solved by now,

516
00:25:41,270 --> 00:25:43,730
actually turning out
that they are not solved.

517
00:25:43,730 --> 00:25:45,620
Here in this blog post,

518
00:25:45,620 --> 00:25:47,825
for example, Laurie Tratt

519
00:25:47,825 --> 00:25:49,550
who is also from this department,

520
00:25:49,550 --> 00:25:51,740
argued that parsing, for example,

521
00:25:51,740 --> 00:25:54,035
isn't a solved problem yet.

522
00:25:54,035 --> 00:25:56,300
I hope you will have as much fun

523
00:25:56,300 --> 00:25:58,130
with this module as I have.

524
00:25:58,130 --> 00:26:00,750
Thank you very much
for listening.