# HG changeset patch # User Christian Urban # Date 1600857283 -3600 # Node ID 82a1315c128dede5333407288057b67c477de5bb # Parent d41956ea544eea5aa943303728c654f976ff92a7 updated diff -r d41956ea544e -r 82a1315c128d slides/slides01.pdf Binary file slides/slides01.pdf has changed diff -r d41956ea544e -r 82a1315c128d slides/slides01.tex --- a/slides/slides01.tex Mon Sep 21 13:16:02 2020 +0100 +++ b/slides/slides01.tex Wed Sep 23 11:34:43 2020 +0100 @@ -1056,6 +1056,32 @@ \end{frame} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\begin{frame}[c] +\frametitle{The Meaning of a Regex} + + ...all the strings a regular expression can match. + +\begin{center} + \begin{tabular}{rcl} + \bl{$L(\ZERO)$} & \bl{$\dn$} & \bl{$\{\}$}\\ + \bl{$L(\ONE)$} & \bl{$\dn$} & \bl{$\{[]\}$}\\ + \bl{$L(c)$} & \bl{$\dn$} & \bl{$\{[c]\}$}\\ + \bl{$L(r_1 + r_2)$} & \bl{$\dn$} & \bl{$L(r_1) \cup L(r_2)$}\\ + \bl{$L(r_1 \cdot r_2)$} & \bl{$\dn$} & \bl{$L(r_1) \,@\, L(r_2)$}\\ + \bl{$L(r^*)$} & \bl{$\dn$} & \\ + \end{tabular} +\end{center} + +\begin{textblock}{14}(1.5,13.5)\small +\bl{$L$} is a function from regular expressions to +sets of strings (languages):\smallskip\\ +\bl{\quad$L$ : Rexp $\Rightarrow$ Set$[$String$]$} +\end{textblock} + +\end{frame} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{frame}[c] diff -r d41956ea544e -r 82a1315c128d videos/01-evilregexes.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-evilregexes.srt Wed Sep 23 11:34:43 2020 +0100 @@ -0,0 +1,1813 @@ +1 +00:00:06,240 --> 00:00:11,050 +Welcome back. This video +is about regular expressions. + +2 +00:00:11,050 --> 00:00:14,230 +We want to use regular +expressions in our lexer. + +3 +00:00:14,230 --> 00:00:16,165 +And the purpose of the +lexer is to find + +4 +00:00:16,165 --> 00:00:18,070 +out where the words in + +5 +00:00:18,070 --> 00:00:21,070 +our programs are. However +regular expressions + +6 +00:00:21,070 --> 00:00:23,875 +are fundamental tool +in computer science. + +7 +00:00:23,875 --> 00:00:27,910 +And I'm sure you've used them +already on several occasions. + +8 +00:00:27,910 --> 00:00:30,370 +And one would expect that about + +9 +00:00:30,370 --> 00:00:31,750 +regular expressions since they are + +10 +00:00:31,750 --> 00:00:33,850 +so well-known and well studied, + +11 +00:00:33,850 --> 00:00:37,915 +that everything under the +sun is known about them. + +12 +00:00:37,915 --> 00:00:41,080 +But actually there's +still some surprising + +13 +00:00:41,080 --> 00:00:44,465 +and interesting +problems with them. + +14 +00:00:44,465 --> 00:00:47,945 +And I want to show you +them in this video. + +15 +00:00:47,945 --> 00:00:50,720 +I'm sure you've seen +regular expressions + +16 +00:00:50,720 --> 00:00:52,445 +many, many times before. + +17 +00:00:52,445 --> 00:00:55,100 +But just to be on the same page, + +18 +00:00:55,100 --> 00:00:57,110 +let me just recap them. + +19 +00:00:57,110 --> 00:00:59,210 +So here in this line, + +20 +00:00:59,210 --> 00:01:01,790 +there is a regular expression +which is supposed to + +21 +00:01:01,790 --> 00:01:05,285 +recognize some form +of email addresses. + +22 +00:01:05,285 --> 00:01:07,745 +So an e-mail address + +23 +00:01:07,745 --> 00:01:11,000 +has part which is +before the @ symbol, + +24 +00:01:11,000 --> 00:01:13,400 +which is the name of the person. + +25 +00:01:13,400 --> 00:01:16,880 +And that can be +any number between + +26 +00:01:16,880 --> 00:01:20,195 +0 and 9, and letters between a and z. + +27 +00:01:20,195 --> 00:01:24,155 +Let's say we avoiding +here capital letters. + +28 +00:01:24,155 --> 00:01:26,045 +There can be underscores. + +29 +00:01:26,045 --> 00:01:29,405 +There can be a dot and +there can be hyphens. + +30 +00:01:29,405 --> 00:01:35,390 +And after the @ symbol +comes the domain name. + +31 +00:01:35,390 --> 00:01:37,310 +So as you can see here, + +32 +00:01:37,310 --> 00:01:40,640 +we use things like star to + +33 +00:01:40,640 --> 00:01:44,314 +match letters +zero or more times. + +34 +00:01:44,314 --> 00:01:45,985 +Or we have a plus, + +35 +00:01:45,985 --> 00:01:47,420 +which means you have to match + +36 +00:01:47,420 --> 00:01:52,489 +at least once or more +times. Then we have. + +37 +00:01:52,489 --> 00:01:55,790 +question mark, which says you + +38 +00:01:55,790 --> 00:01:59,105 +match either it is there +or it ss not there. + +39 +00:01:59,105 --> 00:02:01,340 +You are also regular +expressions which + +40 +00:02:01,340 --> 00:02:03,755 +match exactly n-times. + +41 +00:02:03,755 --> 00:02:08,720 +Or this is a regular expression +for between n and m times. + +42 +00:02:08,720 --> 00:02:12,065 +You can see in +this email address, + +43 +00:02:12,065 --> 00:02:13,730 +the top-level domain + +44 +00:02:13,730 --> 00:02:16,130 +name can be any letter + +45 +00:02:16,130 --> 00:02:19,265 +between a to z, +and contain dots, + +46 +00:02:19,265 --> 00:02:22,340 +but can only be two +characters long + +47 +00:02:22,340 --> 00:02:25,685 +up till six characters +and not more. + +48 +00:02:25,685 --> 00:02:29,240 +Then you also have +something like ranges. + +49 +00:02:29,240 --> 00:02:31,220 +So you can see, letters between a + +50 +00:02:31,220 --> 00:02:33,635 +and z and 0 to 9 and so on. + +51 +00:02:33,635 --> 00:02:36,545 +Here you also have regular +expression which can + +52 +00:02:36,545 --> 00:02:40,070 +match something which +isn't in this range. + +53 +00:02:40,070 --> 00:02:42,560 +So for example, if +you want for example match, + +54 +00:02:42,560 --> 00:02:44,030 +letters but not numbers, + +55 +00:02:44,030 --> 00:02:45,800 +you would say, well, if + +56 +00:02:45,800 --> 00:02:48,990 +this is a number that +should not match. + +57 +00:02:49,090 --> 00:02:52,804 +Typically you also +have these ranges. + +58 +00:02:52,804 --> 00:02:55,565 +Lowercase letters, +capital letters. + +59 +00:02:55,565 --> 00:02:58,550 +Then you have some +special regular expressions + +60 +00:02:58,550 --> 00:03:02,195 +like this one is only +supposed to match digits. + +61 +00:03:02,195 --> 00:03:05,674 +A dot is supposed to +match any character. + +62 +00:03:05,674 --> 00:03:07,370 +And then they have also something + +63 +00:03:07,370 --> 00:03:09,800 +called groups which +is supposed to be + +64 +00:03:09,800 --> 00:03:12,799 +used when you are +trying to extract + +65 +00:03:12,799 --> 00:03:15,605 +a string you've matched. + +66 +00:03:15,605 --> 00:03:19,925 +Okay, so these are the +typical regular expressions. + +67 +00:03:19,925 --> 00:03:23,075 +And here's a particular one. + +68 +00:03:23,075 --> 00:03:25,820 +Trying to match something + +69 +00:03:25,820 --> 00:03:28,770 +which resembles +an email address. + +70 +00:03:29,590 --> 00:03:33,065 +Clearly that should be all easy. + +71 +00:03:33,065 --> 00:03:36,230 +And our technology should +be on top of that. + +72 +00:03:36,230 --> 00:03:37,865 +That we can take a + +73 +00:03:37,865 --> 00:03:41,015 +regular expressions and +we can take a string, + +74 +00:03:41,015 --> 00:03:43,340 +and we should have programs to + +75 +00:03:43,340 --> 00:03:45,680 +decide whether this +string is matched + +76 +00:03:45,680 --> 00:03:50,330 +by a regular expression or +not and should be easy-peasy, no? + +77 +00:03:50,330 --> 00:03:56,150 +Well, let's have a +look at two examples. + +78 +00:03:56,150 --> 00:04:00,860 +The first regular expression +is a star star b. + +79 +00:04:00,860 --> 00:04:02,990 +And it is supposed +to match strings of + +80 +00:04:02,990 --> 00:04:05,825 +the form 0 or more a's, + +81 +00:04:05,825 --> 00:04:10,385 +followed by a b. The parentheses +you can ignore. + +82 +00:04:10,385 --> 00:04:11,990 +And a star star + +83 +00:04:11,990 --> 00:04:14,120 +also doesn't +make any difference + +84 +00:04:14,120 --> 00:04:16,505 +to what kind of strings +that can be matched. + +85 +00:04:16,505 --> 00:04:21,635 +It can only make 0 more +a's followed by a b. + +86 +00:04:21,635 --> 00:04:23,900 +And the other regular expression + +87 +00:04:23,900 --> 00:04:26,990 +is possibly a character a, + +88 +00:04:26,990 --> 00:04:32,930 +n times, followed by character +a axactly n-times. + +89 +00:04:32,930 --> 00:04:35,570 +And we will try out + +90 +00:04:35,570 --> 00:04:38,360 +these two regular expressions +with strings of the form a, + +91 +00:04:38,360 --> 00:04:39,890 +aa, and so on, + +92 +00:04:39,890 --> 00:04:45,770 +and up to the length of n. And + +93 +00:04:45,770 --> 00:04:49,130 +this regular expression should +actually not match any of + +94 +00:04:49,130 --> 00:04:53,315 +the strings because the +final b is missing. + +95 +00:04:53,315 --> 00:04:56,150 +But that is +okay. For example + +96 +00:04:56,150 --> 00:04:57,425 +if you have a regular expression + +97 +00:04:57,425 --> 00:05:00,110 +that is supposed to +check whether a string is + +98 +00:05:00,110 --> 00:05:01,490 +an email address and the user + +99 +00:05:01,490 --> 00:05:03,380 +gives some random +strings in there, + +100 +00:05:03,380 --> 00:05:06,545 +then this regular expression +should not match that string. + +101 +00:05:06,545 --> 00:05:08,420 +And for this regular expression + +102 +00:05:08,420 --> 00:05:11,195 +you have to scratch a +little bit of your head, + +103 +00:05:11,195 --> 00:05:12,620 +what it can actually match. + +104 +00:05:12,620 --> 00:05:14,720 +But after a little bit +of head scratching, + +105 +00:05:14,720 --> 00:05:18,260 +you find out can match +any string which is of + +106 +00:05:18,260 --> 00:05:22,580 +the length n a's up +to 2n of a's. + +107 +00:05:22,580 --> 00:05:24,290 +So anything in this range, + +108 +00:05:24,290 --> 00:05:27,185 +this regular expression +can actually match. + +109 +00:05:27,185 --> 00:05:30,395 +Okay, let's +take a random tool, + +110 +00:05:30,395 --> 00:05:32,630 +maybe for example Python. + +111 +00:05:32,630 --> 00:05:35,240 +So here's a little +Python program. + +112 +00:05:35,240 --> 00:05:38,690 +It uses the library +function of Python to + +113 +00:05:38,690 --> 00:05:42,935 +match the regular expressions of +a star star b. + +114 +00:05:42,935 --> 00:05:46,805 +And we measure time with longer +and longer strings of a. + +115 +00:05:46,805 --> 00:05:48,770 +And so conveniently we can give + +116 +00:05:48,770 --> 00:05:51,140 +the number of a's here +on the command line. + +117 +00:05:51,140 --> 00:05:56,900 +If I just call +this on the command line, + +118 +00:05:56,900 --> 00:05:59,900 +Let's say we first +start with five a's. + +119 +00:05:59,900 --> 00:06:03,920 +And I get also the times which +in this case is next to nothing. + +120 +00:06:03,920 --> 00:06:05,960 +And here's the string +we just matched. + +121 +00:06:05,960 --> 00:06:07,640 +And obviously the +regular expression + +122 +00:06:07,640 --> 00:06:09,110 +did not match the string. + +123 +00:06:09,110 --> 00:06:11,255 +That's indicated by this none. + +124 +00:06:11,255 --> 00:06:13,925 +Let's take ten a's. + +125 +00:06:13,925 --> 00:06:16,490 +It's also pretty quick. + +126 +00:06:16,490 --> 00:06:20,780 +Fifteen a's, even quicker, + +127 +00:06:20,780 --> 00:06:23,180 +but these times always need to + +128 +00:06:23,180 --> 00:06:25,820 +be taken with a grain of salt. + +129 +00:06:25,820 --> 00:06:28,040 +They are not 100 +percent accurate. + +130 +00:06:28,040 --> 00:06:31,490 +So 15 is also a let's take + +131 +00:06:31,490 --> 00:06:36,965 +28th notes already +double the time. + +132 +00:06:36,965 --> 00:06:42,440 +Twenty-five longer. + +133 +00:06:42,440 --> 00:06:45,680 +Okay, that suddenly +from 02 seconds, + +134 +00:06:45,680 --> 00:06:48,960 +it takes almost four seconds. + +135 +00:06:49,600 --> 00:06:54,890 +Six this + +136 +00:06:54,890 --> 00:07:01,415 +takes six seconds +already Double, okay? + +137 +00:07:01,415 --> 00:07:07,229 +Go to 28. That would be now. + +138 +00:07:08,890 --> 00:07:11,840 +You see the string +isn't very long, + +139 +00:07:11,840 --> 00:07:13,340 +so that could be easily like + +140 +00:07:13,340 --> 00:07:16,070 +just the size of +an email address. + +141 +00:07:16,070 --> 00:07:19,280 +And the regular +expression matching + +142 +00:07:19,280 --> 00:07:22,550 +engine in Python needs +quite a long time + +143 +00:07:22,550 --> 00:07:24,710 +to find out that +this string of 28 + +144 +00:07:24,710 --> 00:07:26,570 +AES is actually not much + +145 +00:07:26,570 --> 00:07:28,490 +by that you see it's +still not finished. + +146 +00:07:28,490 --> 00:07:32,900 +I think it should take +approximately like 20 seconds. + +147 +00:07:32,900 --> 00:07:34,400 +Okay. Already 30. + +148 +00:07:34,400 --> 00:07:36,530 +And if we would try + +149 +00:07:36,530 --> 00:07:40,805 +30 would be already +more than a minute. + +150 +00:07:40,805 --> 00:07:43,940 +And if I could read +something like hundreds, + +151 +00:07:43,940 --> 00:07:46,220 +you remember if a doubling in + +152 +00:07:46,220 --> 00:07:48,770 +each step or the second step, + +153 +00:07:48,770 --> 00:07:50,720 +the story with the chess board, + +154 +00:07:50,720 --> 00:07:53,855 +we probably would sit here +until the next century. + +155 +00:07:53,855 --> 00:07:56,820 +So something strange here. + +156 +00:07:57,580 --> 00:08:01,355 +Okay, that might be just +a problem of Python. + +157 +00:08:01,355 --> 00:08:02,990 +Let's have a look at another + +158 +00:08:02,990 --> 00:08:04,985 +regular expression +matching engine. + +159 +00:08:04,985 --> 00:08:06,890 +This time from JavaScript, + +160 +00:08:06,890 --> 00:08:10,040 +also are pretty well-known +programming language. + +161 +00:08:10,040 --> 00:08:13,610 +So here you can see +it's still a star, + +162 +00:08:13,610 --> 00:08:16,235 +star followed by b, + +163 +00:08:16,235 --> 00:08:18,920 +by direct expression is +supposed to match that from + +164 +00:08:18,920 --> 00:08:21,830 +the beginning of the +string up till the end. + +165 +00:08:21,830 --> 00:08:23,930 +So there's not any difference + +166 +00:08:23,930 --> 00:08:26,150 +in the strings this work +expression matches. + +167 +00:08:26,150 --> 00:08:28,610 +We'll just start at the +beginning of the string + +168 +00:08:28,610 --> 00:08:31,460 +and finish at the +end of the string. + +169 +00:08:31,460 --> 00:08:35,285 +And we again, we just use +repeated A's for that. + +170 +00:08:35,285 --> 00:08:38,195 +And similarly, we can + +171 +00:08:38,195 --> 00:08:41,930 +call it on the command line +and can do some timing. + +172 +00:08:41,930 --> 00:08:44,540 +So ten SBA, good. + +173 +00:08:44,540 --> 00:08:46,340 +Here's the string. + +174 +00:08:46,340 --> 00:08:48,320 +It cannot match that string. + +175 +00:08:48,320 --> 00:08:50,525 +And it's pretty fast. + +176 +00:08:50,525 --> 00:08:54,725 +Friendly. Although pretty fast. + +177 +00:08:54,725 --> 00:08:59,120 +Five, again, + +178 +00:08:59,120 --> 00:09:06,650 +somehow is kind of +threshold that is 25, 26. + +179 +00:09:06,650 --> 00:09:09,485 +Suddenly it takes much longer. + +180 +00:09:09,485 --> 00:09:14,360 +And it has essentially the +same problem as with Python. + +181 +00:09:14,360 --> 00:09:17,165 +So you'll see in now from 26 on, + +182 +00:09:17,165 --> 00:09:19,250 +the Times has always +doubling from + +183 +00:09:19,250 --> 00:09:21,860 +three seconds to seven seconds. + +184 +00:09:21,860 --> 00:09:23,330 +So you can imagine what that + +185 +00:09:23,330 --> 00:09:24,890 +roughly takes when I put your + +186 +00:09:24,890 --> 00:09:30,230 +27 and you see the +string isn't very long. + +187 +00:09:30,230 --> 00:09:32,165 +Let's choose twenties or maize. + +188 +00:09:32,165 --> 00:09:35,419 +Imagine you have to +search a database + +189 +00:09:35,419 --> 00:09:38,720 +with kilobytes of data. + +190 +00:09:38,720 --> 00:09:42,260 +This, these regular +expressions that would years + +191 +00:09:42,260 --> 00:09:48,150 +need years to go through with +these regular expressions. + +192 +00:09:48,630 --> 00:09:51,850 +Okay, maybe the people in + +193 +00:09:51,850 --> 00:09:55,435 +Python and JavaScript, +they're just idiots. + +194 +00:09:55,435 --> 00:09:58,180 +Surely Java must do much better. + +195 +00:09:58,180 --> 00:10:01,045 +So here's a program. + +196 +00:10:01,045 --> 00:10:03,415 +You can see this again + +197 +00:10:03,415 --> 00:10:05,980 +is the reg expression +and we just having + +198 +00:10:05,980 --> 00:10:08,320 +some scaffolding to generate + +199 +00:10:08,320 --> 00:10:11,905 +strings from five up till 28. + +200 +00:10:11,905 --> 00:10:14,305 +And if we run that, + +201 +00:10:14,305 --> 00:10:16,660 +actually does that automatically. + +202 +00:10:16,660 --> 00:10:19,900 +So uphill 19, pretty fast, + +203 +00:10:19,900 --> 00:10:24,925 +but then starting from +23, skidding pretty slow. + +204 +00:10:24,925 --> 00:10:27,445 +So the question is +what's going on? + +205 +00:10:27,445 --> 00:10:29,230 +By the way, I'm not quoting here. + +206 +00:10:29,230 --> 00:10:33,755 +Scala, using internally +the regular expression + +207 +00:10:33,755 --> 00:10:36,665 +matching engine from Java. + +208 +00:10:36,665 --> 00:10:39,065 +So would have exactly +the same problem. + +209 +00:10:39,065 --> 00:10:41,480 +Also, I have been +here very careful, + +210 +00:10:41,480 --> 00:10:43,550 +I'm using here Scala aid, + +211 +00:10:43,550 --> 00:10:46,085 +which nowadays is quite old. + +212 +00:10:46,085 --> 00:10:50,765 +But you will see also +current Java versions. + +213 +00:10:50,765 --> 00:10:55,490 +We will see we can out-compete +them by magnitudes. + +214 +00:10:55,490 --> 00:10:57,605 +So I think I can that. + +215 +00:10:57,605 --> 00:10:59,165 +Now, just finish here. + +216 +00:10:59,165 --> 00:11:04,025 +You see the problem. Just +for completeness sake. + +217 +00:11:04,025 --> 00:11:07,010 +Here is a Ruby program. + +218 +00:11:07,010 --> 00:11:09,935 +This is using the other +regular expression. + +219 +00:11:09,935 --> 00:11:12,935 +In this case the +string should match. + +220 +00:11:12,935 --> 00:11:20,300 +And again it tries out +strings between 130 here. + +221 +00:11:20,300 --> 00:11:23,450 +That's a program actually +a former student produced. + +222 +00:11:23,450 --> 00:11:25,565 +And you can see four a's + +223 +00:11:25,565 --> 00:11:29,780 +of links up till 20 +AES is pretty fast. + +224 +00:11:29,780 --> 00:11:32,495 +But then starting at 26, + +225 +00:11:32,495 --> 00:11:35,285 +it's getting really slow. + +226 +00:11:35,285 --> 00:11:37,100 +So in this case, +remember the string + +227 +00:11:37,100 --> 00:11:38,870 +is actually matched by +the regular expression. + +228 +00:11:38,870 --> 00:11:40,130 +So it has nothing to do + +229 +00:11:40,130 --> 00:11:41,540 +with a regular +expression actually + +230 +00:11:41,540 --> 00:11:45,485 +matches a string or does +not match a string. + +231 +00:11:45,485 --> 00:11:48,260 +I admit though these +regular expressions + +232 +00:11:48,260 --> 00:11:49,610 +are carefully chosen, + +233 +00:11:49,610 --> 00:11:52,250 +as you will see later on. + +234 +00:11:52,250 --> 00:11:55,620 +Hey, I also just stop that here. + +235 +00:11:55,710 --> 00:12:00,985 +Okay, this slight collect +this information about times. + +236 +00:12:00,985 --> 00:12:03,400 +On the right hand side will + +237 +00:12:03,400 --> 00:12:05,860 +be our regular expression mantra, + +238 +00:12:05,860 --> 00:12:08,290 +which we implement next week. + +239 +00:12:08,290 --> 00:12:10,795 +On the left-hand side, +are these times by + +240 +00:12:10,795 --> 00:12:14,260 +barriers than regular +expression matching engines? + +241 +00:12:14,260 --> 00:12:17,809 +On the top is this +regular expression. + +242 +00:12:19,080 --> 00:12:23,335 +Possible a n times a n times. + +243 +00:12:23,335 --> 00:12:26,890 +And on the lowest +is a star, star b. + +244 +00:12:26,890 --> 00:12:30,370 +And the x-axis show here + +245 +00:12:30,370 --> 00:12:35,335 +the length of the +string. How many a's. + +246 +00:12:35,335 --> 00:12:38,925 +And on the y axis is the time. + +247 +00:12:38,925 --> 00:12:41,660 +They need to decide whether + +248 +00:12:41,660 --> 00:12:44,615 +the string is matched by +the rate expression or not. + +249 +00:12:44,615 --> 00:12:46,415 +So you can see here, Python, + +250 +00:12:46,415 --> 00:12:47,945 +Java eight in JavaScript, + +251 +00:12:47,945 --> 00:12:52,250 +they max out approximately +at between 2530. + +252 +00:12:52,250 --> 00:12:53,900 +The kristin, it takes already + +253 +00:12:53,900 --> 00:12:55,160 +a half a minute to + +254 +00:12:55,160 --> 00:12:57,410 +decide whether the string +is matched or not. + +255 +00:12:57,410 --> 00:13:00,815 +And similarly, in +the other example, + +256 +00:13:00,815 --> 00:13:03,830 +Python and derived Ruby max out + +257 +00:13:03,830 --> 00:13:07,220 +at a similar kind of +length of the strings. + +258 +00:13:07,220 --> 00:13:10,400 +Because then they use also +half a minute to decide + +259 +00:13:10,400 --> 00:13:13,940 +whether this rec expression +actually matches the string. + +260 +00:13:13,940 --> 00:13:16,790 +Contrast that with +the reg expression + +261 +00:13:16,790 --> 00:13:19,235 +which we are regular +expression mantra, + +262 +00:13:19,235 --> 00:13:21,470 +which we're going to implement. + +263 +00:13:21,470 --> 00:13:25,040 +This can match +approximately 10 thousand + +264 +00:13:25,040 --> 00:13:30,065 +a's in this example and +needs less than ten seconds. + +265 +00:13:30,065 --> 00:13:32,285 +Actually, there will be +two versions of that. + +266 +00:13:32,285 --> 00:13:34,850 +First version may be +also relatively slow. + +267 +00:13:34,850 --> 00:13:36,410 +But the second version, + +268 +00:13:36,410 --> 00:13:38,240 +in contrast to Python, + +269 +00:13:38,240 --> 00:13:40,295 +Ruby, we'll be blindingly fast. + +270 +00:13:40,295 --> 00:13:42,380 +And in the second example, + +271 +00:13:42,380 --> 00:13:45,740 +you have to be careful +about the x axis because + +272 +00:13:45,740 --> 00:13:49,385 +that means four times +ten to the power six. + +273 +00:13:49,385 --> 00:13:51,695 +It's actually 4 million A's. + +274 +00:13:51,695 --> 00:13:55,100 +So our regular +expression match or need + +275 +00:13:55,100 --> 00:13:57,635 +less than ten seconds to + +276 +00:13:57,635 --> 00:14:00,725 +match a string of length +of 4 million A's. + +277 +00:14:00,725 --> 00:14:04,430 +Contrast that Python, Java eight, + +278 +00:14:04,430 --> 00:14:06,770 +and JavaScript need half a minute + +279 +00:14:06,770 --> 00:14:09,905 +already for a string +of length just 30, + +280 +00:14:09,905 --> 00:14:12,365 +unless you're very +careful with Java eight. + +281 +00:14:12,365 --> 00:14:15,725 +Yes, Java nine and above, + +282 +00:14:15,725 --> 00:14:17,180 +they already have + +283 +00:14:17,180 --> 00:14:19,610 +a much better regular +expression matching engine, + +284 +00:14:19,610 --> 00:14:22,805 +but still we will be running +circles around them. + +285 +00:14:22,805 --> 00:14:27,050 +It's this data. I +call this slide. + +286 +00:14:27,050 --> 00:14:29,675 +Why bother with +regular expressions? + +287 +00:14:29,675 --> 00:14:33,515 +But you can probably +see these are + +288 +00:14:33,515 --> 00:14:34,910 +at least more times by + +289 +00:14:34,910 --> 00:14:38,015 +the existing regular +expression matching engines. + +290 +00:14:38,015 --> 00:14:40,070 +And it's actually +surprising that after + +291 +00:14:40,070 --> 00:14:42,695 +one lecture we can already +do substantially better. + +292 +00:14:42,695 --> 00:14:47,495 +And if you don't believe +in D times, I gave here, + +293 +00:14:47,495 --> 00:14:50,090 +please feel free to +play on your own + +294 +00:14:50,090 --> 00:14:52,865 +with the examples +I uploaded, Keats. + +295 +00:14:52,865 --> 00:14:55,235 +These are exactly the programs + +296 +00:14:55,235 --> 00:14:57,470 +are used here in the examples. + +297 +00:14:57,470 --> 00:14:59,255 +So feel free. + +298 +00:14:59,255 --> 00:15:01,970 +You might however now think, hmm. + +299 +00:15:01,970 --> 00:15:05,449 +These are two very +well chosen examples. + +300 +00:15:05,449 --> 00:15:07,145 +And I admit that's true. + +301 +00:15:07,145 --> 00:15:09,410 +And such problem there never + +302 +00:15:09,410 --> 00:15:12,540 +causing any problems +in real life. + +303 +00:15:13,300 --> 00:15:15,980 +Regular expressions are used very + +304 +00:15:15,980 --> 00:15:19,415 +frequently and they +do cause problems. + +305 +00:15:19,415 --> 00:15:21,410 +So here's my first example from + +306 +00:15:21,410 --> 00:15:23,885 +a company called cloudflare. + +307 +00:15:23,885 --> 00:15:27,560 +This is a huge hosting company + +308 +00:15:27,560 --> 00:15:30,935 +which host very +well-known web pages. + +309 +00:15:30,935 --> 00:15:34,970 +And they really try hard +to have no outage at all. + +310 +00:15:34,970 --> 00:15:37,340 +And they manage +that for six years. + +311 +00:15:37,340 --> 00:15:39,320 +But then a Rekha expression, + +312 +00:15:39,320 --> 00:15:41,180 +actually this one caused +a problem and you + +313 +00:15:41,180 --> 00:15:43,265 +can see they're also +like two stars. + +314 +00:15:43,265 --> 00:15:44,630 +They are at the end. + +315 +00:15:44,630 --> 00:15:46,955 +And because of that string needed + +316 +00:15:46,955 --> 00:15:49,865 +too much time to be matched. + +317 +00:15:49,865 --> 00:15:50,990 +And because of that, + +318 +00:15:50,990 --> 00:15:52,430 +they had some outage for, + +319 +00:15:52,430 --> 00:15:54,125 +I think several hours, + +320 +00:15:54,125 --> 00:15:57,920 +actually in their malware +detection subsystem. + +321 +00:15:57,920 --> 00:16:02,060 +And the second example +comes from 2016, + +322 +00:16:02,060 --> 00:16:04,040 +where Stack Exchange, +I guess you know + +323 +00:16:04,040 --> 00:16:06,650 +this webpage had +also an outage from, + +324 +00:16:06,650 --> 00:16:08,390 +I think at least an hour. + +325 +00:16:08,390 --> 00:16:13,070 +Because a regular expression +then needed to format posts, + +326 +00:16:13,070 --> 00:16:15,575 +needed too much time to + +327 +00:16:15,575 --> 00:16:19,010 +recognize whether this post +should be accepted or not. + +328 +00:16:19,010 --> 00:16:23,390 +And again, there was a +semi kind of problem. + +329 +00:16:23,390 --> 00:16:24,950 +And you can read +the stories behind + +330 +00:16:24,950 --> 00:16:28,080 +that on these two given links. + +331 +00:16:28,720 --> 00:16:31,730 +When I looked at +this the first time, + +332 +00:16:31,730 --> 00:16:34,175 +what surprised me is +that theoretician + +333 +00:16:34,175 --> 00:16:37,520 +who sometimes dedicate their +life to regular expression. + +334 +00:16:37,520 --> 00:16:39,440 +And no really a lot about + +335 +00:16:39,440 --> 00:16:41,690 +them didn't know +anything about this. + +336 +00:16:41,690 --> 00:16:43,610 +But engineers, they +already created + +337 +00:16:43,610 --> 00:16:46,160 +a name for that +regular expression, + +338 +00:16:46,160 --> 00:16:47,975 +denial of service attack. + +339 +00:16:47,975 --> 00:16:49,745 +Because what you can, + +340 +00:16:49,745 --> 00:16:51,230 +what can happen now is that + +341 +00:16:51,230 --> 00:16:54,920 +attackers look for +certain strings. + +342 +00:16:54,920 --> 00:16:56,780 +You make your regular expression + +343 +00:16:56,780 --> 00:16:59,105 +matching engine topple over. + +344 +00:16:59,105 --> 00:17:01,370 +And these kind of expressions, + +345 +00:17:01,370 --> 00:17:04,160 +regular expressions called +Eve of reg expression. + +346 +00:17:04,160 --> 00:17:06,350 +And actually there are +quite a number of them. + +347 +00:17:06,350 --> 00:17:08,495 +So you seen this one, + +348 +00:17:08,495 --> 00:17:11,255 +the first one, and the +second one already. + +349 +00:17:11,255 --> 00:17:13,400 +But there are many, many more. + +350 +00:17:13,400 --> 00:17:15,620 +And you can easily have in + +351 +00:17:15,620 --> 00:17:18,560 +your program one of +these reg expression. + +352 +00:17:18,560 --> 00:17:21,830 +And then you have the +problem that if you do have + +353 +00:17:21,830 --> 00:17:23,240 +this regular expression and + +354 +00:17:23,240 --> 00:17:25,640 +somebody finds the +corresponding string, + +355 +00:17:25,640 --> 00:17:29,945 +which make the records +matching engine topple over, + +356 +00:17:29,945 --> 00:17:31,820 +then you have a problem + +357 +00:17:31,820 --> 00:17:34,295 +because your webpage is +probably not variable. + +358 +00:17:34,295 --> 00:17:36,140 +This is also sometimes called + +359 +00:17:36,140 --> 00:17:39,350 +this phenomenon, +catastrophic backtracking. + +360 +00:17:39,350 --> 00:17:43,595 +In lecture three, we will +look at this more carefully. + +361 +00:17:43,595 --> 00:17:46,910 +And actually why that +is such a problem in + +362 +00:17:46,910 --> 00:17:50,795 +real life is actually +not to do with Lexus. + +363 +00:17:50,795 --> 00:17:53,180 +Yes, regular +expressions are used as + +364 +00:17:53,180 --> 00:17:55,040 +the basic tool for implementing + +365 +00:17:55,040 --> 00:17:57,185 +like source bad reg expressions, + +366 +00:17:57,185 --> 00:18:00,065 +of course, used in +a much wider area. + +367 +00:18:00,065 --> 00:18:03,770 +And they especially used for +network intrusion detection. + +368 +00:18:03,770 --> 00:18:06,590 +Remember, you having to + +369 +00:18:06,590 --> 00:18:10,130 +administer a big network +and you only want to let + +370 +00:18:10,130 --> 00:18:13,640 +in packets which you think are K + +371 +00:18:13,640 --> 00:18:14,930 +and you want to keep out + +372 +00:18:14,930 --> 00:18:17,645 +any package which might +hack into your network. + +373 +00:18:17,645 --> 00:18:22,670 +So what they have is they +have suites of thousands and + +374 +00:18:22,670 --> 00:18:25,745 +sometimes even more +regular expressions which + +375 +00:18:25,745 --> 00:18:27,755 +check whether this package + +376 +00:18:27,755 --> 00:18:30,065 +satisfies some patterns or not. + +377 +00:18:30,065 --> 00:18:31,460 +And in this case it will be left + +378 +00:18:31,460 --> 00:18:34,205 +out or it will be let in. + +379 +00:18:34,205 --> 00:18:36,335 +And with networks, + +380 +00:18:36,335 --> 00:18:39,080 +the problem is that our +hardware is already + +381 +00:18:39,080 --> 00:18:43,190 +so fast that the reg expressions + +382 +00:18:43,190 --> 00:18:45,169 +really become a bottleneck. + +383 +00:18:45,169 --> 00:18:47,060 +Because what do you do if now is + +384 +00:18:47,060 --> 00:18:49,880 +suddenly a reg expression +takes too much time + +385 +00:18:49,880 --> 00:18:52,670 +to just stop the matching + +386 +00:18:52,670 --> 00:18:55,100 +and let the package +in regardless? + +387 +00:18:55,100 --> 00:18:58,190 +Or do you just hold +the network up + +388 +00:18:58,190 --> 00:19:01,715 +and don't let anything in +until you decided that. + +389 +00:19:01,715 --> 00:19:04,895 +So that's actually a +really hard problem. + +390 +00:19:04,895 --> 00:19:06,650 +But the first time I came across + +391 +00:19:06,650 --> 00:19:09,965 +that problem was actually +by this engineer. + +392 +00:19:09,965 --> 00:19:13,820 +And it's always say that +Germans don't have any Yammer. + +393 +00:19:13,820 --> 00:19:16,985 +But I found that +video quite funny. + +394 +00:19:16,985 --> 00:19:19,145 +Maybe you have a +different opinion, + +395 +00:19:19,145 --> 00:19:21,095 +but feel free to +have a look which + +396 +00:19:21,095 --> 00:19:23,705 +explains exactly that problem. + +397 +00:19:23,705 --> 00:19:25,610 +So in the next video, + +398 +00:19:25,610 --> 00:19:28,445 +we will start to +implement this matcher. + +399 +00:19:28,445 --> 00:19:30,870 +So I hope to see you there. diff -r d41956ea544e -r 82a1315c128d videos/01-intro.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-intro.srt Wed Sep 23 11:34:43 2020 +0100 @@ -0,0 +1,2405 @@ +1 +00:00:16,280 --> 00:00:18,330 +A warm welcome to + +2 +00:00:18,330 --> 00:00:20,820 +the compilers and formal +languages module? + +3 +00:00:20,820 --> 00:00:24,390 +My name is Christian Urban. +Thank you for coming. + +4 +00:00:24,390 --> 00:00:27,644 +Compilers. I guess +you use compilers + +5 +00:00:27,644 --> 00:00:31,680 +in your daily work, be it +the C or Java compiler. + +6 +00:00:31,680 --> 00:00:34,245 +And you might be curious +in what they do. + +7 +00:00:34,245 --> 00:00:35,700 +But you might also be + +8 +00:00:35,700 --> 00:00:38,130 +intimidated to look +what they do underneath + +9 +00:00:38,130 --> 00:00:39,900 +the bonnet because they are + +10 +00:00:39,900 --> 00:00:42,520 +quite large software systems. + +11 +00:00:42,520 --> 00:00:46,130 +What I like to show you in +this module is that you + +12 +00:00:46,130 --> 00:00:49,310 +do not need to be an +Über hacker to implement your own + +13 +00:00:49,310 --> 00:00:51,305 +compiler for your +own language, say. + +14 +00:00:51,305 --> 00:00:54,155 +So that will be the main +goal of this module. + +15 +00:00:54,155 --> 00:00:56,210 +You will implement +your own compiler, + +16 +00:00:56,210 --> 00:00:58,310 +of course with my help. + +17 +00:00:58,310 --> 00:01:02,360 +What I personally like +about compilers is that + +18 +00:01:02,360 --> 00:01:04,580 +the subject is a +perfect combination + +19 +00:01:04,580 --> 00:01:06,350 +of theory and practice. + +20 +00:01:06,350 --> 00:01:07,790 +I like to hack things, + +21 +00:01:07,790 --> 00:01:10,595 +writing actual code, +but I also like theory. + +22 +00:01:10,595 --> 00:01:13,040 +I want to understand what +my code actually does. + +23 +00:01:13,040 --> 00:01:17,040 +So compilers are the +perfect subject for me. + +24 +00:01:18,040 --> 00:01:20,809 +Let's have a look at the details. + +25 +00:01:20,809 --> 00:01:23,779 +Here's an airplane +view of a compiler. + +26 +00:01:23,779 --> 00:01:25,850 +On the left-hand side you can see + +27 +00:01:25,850 --> 00:01:28,745 +the input program a +developer would write. + +28 +00:01:28,745 --> 00:01:31,955 +And on the right-hand side is +the output of the compiler, + +29 +00:01:31,955 --> 00:01:36,360 +the binary code the developer +wants to actually run. + +30 +00:01:36,370 --> 00:01:40,340 +What makes such a project +actually feasible is that + +31 +00:01:40,340 --> 00:01:44,165 +compilers fall neatly into +separate components. + +32 +00:01:44,165 --> 00:01:47,210 +So you can see the lexer and the + +33 +00:01:47,210 --> 00:01:48,860 +parser, which are often called the + +34 +00:01:48,860 --> 00:01:50,990 +front end of the compiler. + +35 +00:01:50,990 --> 00:01:53,000 +And the code generation is + +36 +00:01:53,000 --> 00:01:55,700 +called the backend +of the compiler. + +37 +00:01:55,700 --> 00:01:57,620 +And it so happens +that we will spend + +38 +00:01:57,620 --> 00:01:59,930 +the first five weeks +on the lexer and + +39 +00:01:59,930 --> 00:02:04,970 +the parser, and the next five +weeks on the code generator. + +40 +00:02:04,970 --> 00:02:09,575 +The first component of the +compiler is the lexer. + +41 +00:02:09,575 --> 00:02:14,480 +The purpose of the lexer +is to scan a flat + +42 +00:02:14,480 --> 00:02:16,610 +string, which the +input program is at + +43 +00:02:16,610 --> 00:02:19,145 +the beginning and separate out + +44 +00:02:19,145 --> 00:02:21,275 +where are the words? + +45 +00:02:21,275 --> 00:02:23,600 +You might think +in Western languages, + +46 +00:02:23,600 --> 00:02:24,920 +that is very easy: + +47 +00:02:24,920 --> 00:02:26,690 +you just try to +find out where are + +48 +00:02:26,690 --> 00:02:28,580 +the whitespaces +and then you know, + +49 +00:02:28,580 --> 00:02:31,955 +where one word stops and +where the next word begins. + +50 +00:02:31,955 --> 00:02:35,300 +But, actually, computer +languages are more + +51 +00:02:35,300 --> 00:02:38,180 +like ancient languages +that you see here. + +52 +00:02:38,180 --> 00:02:39,860 +For example, ancient Greek. + +53 +00:02:39,860 --> 00:02:42,065 +The writer wrote one word + +54 +00:02:42,065 --> 00:02:44,915 +after the next and not +leaving any space. + +55 +00:02:44,915 --> 00:02:47,930 +So for example, in this +very simple string here + +56 +00:02:47,930 --> 00:02:50,960 +read n, there is no space at all. + +57 +00:02:50,960 --> 00:02:53,540 +But the purpose of +the lexer is to + +58 +00:02:53,540 --> 00:02:56,570 +separate it into five +different components. + +59 +00:02:56,570 --> 00:02:59,450 +It has to first say, +well there is a read, + +60 +00:02:59,450 --> 00:03:01,595 +then there is a left parenthesis, + +61 +00:03:01,595 --> 00:03:04,025 +then there's an +identifier called n, + +62 +00:03:04,025 --> 00:03:07,715 +then there's a right parenthesis, +and finally a semicolon. + +63 +00:03:07,715 --> 00:03:11,345 +And as you can see, there +is no space here at all. + +64 +00:03:11,345 --> 00:03:14,150 +And so the task of finding where + +65 +00:03:14,150 --> 00:03:17,705 +are the words is actually +quite involved. + +66 +00:03:17,705 --> 00:03:19,460 +Also the classification is + +67 +00:03:19,460 --> 00:03:22,055 +sometimes not so straightforward. + +68 +00:03:22,055 --> 00:03:24,350 +If, for example, a writer + +69 +00:03:24,350 --> 00:03:26,360 +wrote "if" on its own, + +70 +00:03:26,360 --> 00:03:29,000 +then this should be a +keyword classified as + +71 +00:03:29,000 --> 00:03:32,615 +a keyword because it's +from the if-then-else. + +72 +00:03:32,615 --> 00:03:36,800 +But if the developer wrote +something longer like "iffoo", + +73 +00:03:36,800 --> 00:03:38,030 +then this might just be + +74 +00:03:38,030 --> 00:03:39,860 +a strange variable name +and he needs to be + +75 +00:03:39,860 --> 00:03:44,250 +classified as a variable name. + +76 +00:03:45,220 --> 00:03:50,720 +The output of the lexer +is a sequence of tokens. + +77 +00:03:50,720 --> 00:03:53,480 +These are essentially the words in + +78 +00:03:53,480 --> 00:03:56,405 +a sentence and their +classification: + +79 +00:03:56,405 --> 00:03:59,540 +they are nouns, +verbs or adjectives. + +80 +00:03:59,540 --> 00:04:02,885 +For us, of course, the +classification would be keywords, + +81 +00:04:02,885 --> 00:04:06,005 +identifiers, +operators, and so on. + +82 +00:04:06,005 --> 00:04:11,480 +And these tokens are the +input for the parser, + +83 +00:04:11,480 --> 00:04:13,085 +the next component in + +84 +00:04:13,085 --> 00:04:17,240 +our compiler. The parser +essentially takes this list + +85 +00:04:17,240 --> 00:04:23,300 +of tokens and transforms it +into a abstract syntax tree. + +86 +00:04:23,300 --> 00:04:27,230 +That means we have now the +sentence, we have the words. + +87 +00:04:27,230 --> 00:04:30,275 +We have to relate +the words to each other + +88 +00:04:30,275 --> 00:04:33,680 +in order to find out what +this sentence actually means. + +89 +00:04:33,680 --> 00:04:35,405 +In this case here, + +90 +00:04:35,405 --> 00:04:38,595 +we have to do a read...which +variable is affected? + +91 +00:04:38,595 --> 00:04:45,040 +The variable n. Once we have +the abstract syntax tree, + +92 +00:04:45,040 --> 00:04:49,225 +it can go to the next component, +to the code generator. + +93 +00:04:49,225 --> 00:04:52,480 +Whilst it doesn't look +like this in this picture, + +94 +00:04:52,480 --> 00:04:54,100 +the code generators is usually the + +95 +00:04:54,100 --> 00:04:56,080 +biggest part in a compiler. + +96 +00:04:56,080 --> 00:04:58,720 +And here we actually +have to cut corners. + +97 +00:04:58,720 --> 00:05:02,470 +Instead of producing binary +code, which can be run + +98 +00:05:02,470 --> 00:05:06,820 +directly on a CPU, like X86 or ARM, + +99 +00:05:06,820 --> 00:05:11,035 +we actually target the +Java Virtual Machine, the JVM. + +100 +00:05:11,035 --> 00:05:13,600 +This is very similar +to the Scala compiler, + +101 +00:05:13,600 --> 00:05:16,480 +for example, which +produces JVM code. + +102 +00:05:16,480 --> 00:05:18,940 +Or, of course, also +the Java compiler, + +103 +00:05:18,940 --> 00:05:20,845 +which produces JVM code. + +104 +00:05:20,845 --> 00:05:23,900 +So here's a typical +example code which we're + +105 +00:05:23,900 --> 00:05:27,305 +going to produce: Something +like variable 2 + +106 +00:05:27,305 --> 00:05:30,545 +gets allocated in the stack. + +107 +00:05:30,545 --> 00:05:32,390 +You subtract ten from it. + +108 +00:05:32,390 --> 00:05:36,050 +You test whether the +variable is 0 and if yes, + +109 +00:05:36,050 --> 00:05:40,170 +you jump somewhere and otherwise +you reload the variable. + +110 +00:05:41,560 --> 00:05:45,935 +Even though we cut corners +into could generate apart. + +111 +00:05:45,935 --> 00:05:48,575 +By producing code for the JVM, + +112 +00:05:48,575 --> 00:05:51,710 +we can still obtain quite +impressive results. + +113 +00:05:51,710 --> 00:05:56,225 +Here's a program with +a cubic runtime behaviour. + +114 +00:05:56,225 --> 00:05:59,525 +When it has to +calculate with 400 units, + +115 +00:05:59,525 --> 00:06:02,165 +it might need five +seconds to calculate. + +116 +00:06:02,165 --> 00:06:05,345 +But if it has to +calculate with 1200 units, + +117 +00:06:05,345 --> 00:06:08,165 +it already needs more +than two minutes. + +118 +00:06:08,165 --> 00:06:10,940 +Now these timings, +the blue timings + +119 +00:06:10,940 --> 00:06:13,910 +are obtained with an +interpreter of that program. + +120 +00:06:13,910 --> 00:06:16,265 +And you can see +just next to it, + +121 +00:06:16,265 --> 00:06:18,050 +the red times. + +122 +00:06:18,050 --> 00:06:20,359 +They are obtained +with the compiler + +123 +00:06:20,359 --> 00:06:23,135 +we're going to write +in this module. + +124 +00:06:23,135 --> 00:06:25,100 +And there you can see, even + +125 +00:06:25,100 --> 00:06:29,000 +with 1200, the times get +hardly off the x-axis. + +126 +00:06:29,000 --> 00:06:30,485 +So in this instance, + +127 +00:06:30,485 --> 00:06:32,630 +our compiler will +have a speed up from + +128 +00:06:32,630 --> 00:06:37,589 +approximately 1 million in +comparison to the interpreter. + +129 +00:06:38,350 --> 00:06:42,020 +This might be a fun task +for your spare time. + +130 +00:06:42,020 --> 00:06:44,480 +This is a compiler explorer, +which allows you to + +131 +00:06:44,480 --> 00:06:47,060 +write C code on the +left-hand side. + +132 +00:06:47,060 --> 00:06:49,115 +It shows you on the +right-hand side + +133 +00:06:49,115 --> 00:06:53,270 +the machine code an +X86 CPU can run. + +134 +00:06:53,270 --> 00:06:56,060 +For example, it gives +you these color scheme and + +135 +00:06:56,060 --> 00:06:59,000 +says that this addition +in the num + num in + +136 +00:06:59,000 --> 00:07:01,580 +the return statement, +translates to + +137 +00:07:01,580 --> 00:07:02,960 +essentially an addition of + +138 +00:07:02,960 --> 00:07:05,930 +an register in the machine code. + +139 +00:07:05,930 --> 00:07:08,480 +I think this compiler +explorer also works for + +140 +00:07:08,480 --> 00:07:11,495 +the Haskell language and +also produces ARM code, + +141 +00:07:11,495 --> 00:07:15,245 +not just code for the X86 CPUs. + +142 +00:07:15,245 --> 00:07:18,950 +As an aside, I also recommend +to watch the movie of + +143 +00:07:18,950 --> 00:07:20,870 +this guy because that is + +144 +00:07:20,870 --> 00:07:22,940 +very much like how I worked +at the beginning + +145 +00:07:22,940 --> 00:07:25,355 +when implementing our compiler. + +146 +00:07:25,355 --> 00:07:29,300 +I wrote some code which I +knew how it should behave. + +147 +00:07:29,300 --> 00:07:30,725 +And then I just had a look + +148 +00:07:30,725 --> 00:07:32,900 +what another compiler produces. + +149 +00:07:32,900 --> 00:07:37,190 +And I imitated that code +as much as I could. + +150 +00:07:37,190 --> 00:07:39,380 +Such a compiler explorer + +151 +00:07:39,380 --> 00:07:41,375 +also exists for +the Java language. + +152 +00:07:41,375 --> 00:07:42,935 +Here's one where you can write + +153 +00:07:42,935 --> 00:07:44,915 +Java code on the left-hand side, + +154 +00:07:44,915 --> 00:07:47,930 +and on the right-hand +side you get JVM code. + +155 +00:07:47,930 --> 00:07:50,255 +JVM code is a byte code + +156 +00:07:50,255 --> 00:07:53,405 +which cannot be run +directly by the CPU, + +157 +00:07:53,405 --> 00:07:55,220 +but needs the Java +Virtual Machine + +158 +00:07:55,220 --> 00:07:57,170 +to essentially interpret that. + +159 +00:07:57,170 --> 00:07:58,760 +This means it's not + +160 +00:07:58,760 --> 00:08:01,235 +the most efficient way +how to run programs - + +161 +00:08:01,235 --> 00:08:02,780 +it would be much faster to + +162 +00:08:02,780 --> 00:08:05,285 +generate direct +code for the CPU. + +163 +00:08:05,285 --> 00:08:07,700 +But by producing bytecode, + +164 +00:08:07,700 --> 00:08:11,435 +we still run the +programs quite fast + +165 +00:08:11,435 --> 00:08:13,520 +and it also simplifies + +166 +00:08:13,520 --> 00:08:16,280 +a number of issues +in our compiler. + +167 +00:08:16,280 --> 00:08:18,980 +One issue is about +memory management. + +168 +00:08:18,980 --> 00:08:22,055 +We don't have to be concerned +about register allocation, + +169 +00:08:22,055 --> 00:08:25,505 +which we would need +to do in a real CPU. + +170 +00:08:25,505 --> 00:08:27,680 +This will be done by the JVM. + +171 +00:08:27,680 --> 00:08:29,750 +It's also much easier to produce + +172 +00:08:29,750 --> 00:08:33,635 +this bytecode rather than +actual machine code. + +173 +00:08:33,635 --> 00:08:37,385 +I think it's now a good time +to come to the question, + +174 +00:08:37,385 --> 00:08:39,950 +why on Earth studying compilers. + +175 +00:08:39,950 --> 00:08:42,650 +Compilers are such an +established subject + +176 +00:08:42,650 --> 00:08:43,985 +in computer science. + +177 +00:08:43,985 --> 00:08:46,100 +Compilers do what they do. + +178 +00:08:46,100 --> 00:08:48,410 +Probably forrests have +been killed by + +179 +00:08:48,410 --> 00:08:52,190 +all the books that have been +published on compilers. + +180 +00:08:52,190 --> 00:08:56,690 +Why on Earth studying +compilers in 2020? And even worse + +181 +00:08:56,690 --> 00:08:59,659 +why implementing +your own compiler? + +182 +00:08:59,659 --> 00:09:02,720 +Well, a slightly humorous take on + +183 +00:09:02,720 --> 00:09:05,105 +that question is +given by John Regehr, + +184 +00:09:05,105 --> 00:09:08,375 +who is a compiler hacker +from the University of Utah. + +185 +00:09:08,375 --> 00:09:09,770 +Essentially what he says, + +186 +00:09:09,770 --> 00:09:12,110 +if you're a good compiler hacker, + +187 +00:09:12,110 --> 00:09:14,690 +you have no problems +of finding a job. + +188 +00:09:14,690 --> 00:09:17,210 +He puts it as: It's effectively + +189 +00:09:17,210 --> 00:09:22,320 +a "Perpetual Employment Act" +for solid compiler hackers. + +190 +00:09:22,990 --> 00:09:27,380 +Regehr gives two justifications +for that statement. + +191 +00:09:27,380 --> 00:09:29,690 +First, he says +hardware is getting + +192 +00:09:29,690 --> 00:09:32,585 +wierder, rather than +getting clocked faster. + +193 +00:09:32,585 --> 00:09:34,520 +And that's definitely true. + +194 +00:09:34,520 --> 00:09:36,590 +My first computer many, many, + +195 +00:09:36,590 --> 00:09:40,040 +many years ago contained +only a single core CPU. + +196 +00:09:40,040 --> 00:09:44,030 +And it was such a simple CPU +that we wrote machine code + +197 +00:09:44,030 --> 00:09:46,220 +directly for that CPU in order to + +198 +00:09:46,220 --> 00:09:49,740 +run our programs as +fast as possible. + +199 +00:09:50,260 --> 00:09:53,600 +In contrast, today, Regehr writes, + +200 +00:09:53,600 --> 00:09:57,005 +almost all processors are +multi-core nowadays. + +201 +00:09:57,005 --> 00:09:59,870 +And it looks like there's +an increasing asymmetry + +202 +00:09:59,870 --> 00:10:02,015 +in resources across cores. + +203 +00:10:02,015 --> 00:10:04,445 +Processors come +with vector units, + +204 +00:10:04,445 --> 00:10:07,189 +crypto accelerators, etc. + +205 +00:10:07,189 --> 00:10:11,930 +We have TSPs, GPUs, ARM +big,little, Xeon Phis, + +206 +00:10:11,930 --> 00:10:14,630 +and this only scratches the surface. + +207 +00:10:14,630 --> 00:10:17,255 +And that is really a +problem for compilers, + +208 +00:10:17,255 --> 00:10:20,495 +because if we now +have multi-core CPUs, + +209 +00:10:20,495 --> 00:10:23,180 +that means our programs need + +210 +00:10:23,180 --> 00:10:26,075 +to be scheduled +over several CPUs. + +211 +00:10:26,075 --> 00:10:28,220 +But the developer, of course, + +212 +00:10:28,220 --> 00:10:30,545 +doesn't want to know +anything about that. + +213 +00:10:30,545 --> 00:10:34,655 +Also, now we have more +CPUs in a computer, + +214 +00:10:34,655 --> 00:10:37,400 +but they seem to also come +with different resources. + +215 +00:10:37,400 --> 00:10:40,310 +So certain tasks in +a program need to + +216 +00:10:40,310 --> 00:10:43,460 +be scheduled on some +cores, but not on others. + +217 +00:10:43,460 --> 00:10:46,685 +We also have for a +long time already GPUs, + +218 +00:10:46,685 --> 00:10:49,025 +which are highly specialized for + +219 +00:10:49,025 --> 00:10:53,240 +very parallel computations +to do with graphics. + +220 +00:10:53,240 --> 00:10:56,015 +They at least in +the past few years, + +221 +00:10:56,015 --> 00:10:59,360 +they could only do very +special computations, + +222 +00:10:59,360 --> 00:11:02,255 +but very, very fast +and highly parallel. + +223 +00:11:02,255 --> 00:11:05,539 +You couldn't use them for +general-purpose calculations, + +224 +00:11:05,539 --> 00:11:10,205 +which you would usually +use CPUs. Similarly, DSPs. + +225 +00:11:10,205 --> 00:11:14,075 +They are needed for all +the signal processing + +226 +00:11:14,075 --> 00:11:16,040 +in mobile phones. + +227 +00:11:16,040 --> 00:11:20,780 +Without them, we just +wouldn't have mobile phones. + +228 +00:11:20,780 --> 00:11:23,525 +The second reason Regehr gives is + +229 +00:11:23,525 --> 00:11:25,550 +that we are getting tired of + +230 +00:11:25,550 --> 00:11:27,620 +low-level languages and + +231 +00:11:27,620 --> 00:11:30,335 +their associated +security disasters. + +232 +00:11:30,335 --> 00:11:32,435 +While at the beginning +we were still + +233 +00:11:32,435 --> 00:11:34,760 +happy to write machine +code directly, + +234 +00:11:34,760 --> 00:11:37,175 +nobody wants to do this nowadays. + +235 +00:11:37,175 --> 00:11:39,515 +He writes: :We want +to write new code + +236 +00:11:39,515 --> 00:11:44,120 +to whatever extent possible in +safer high-level languages. + +237 +00:11:44,120 --> 00:11:46,130 +Compilers are caught right + +238 +00:11:46,130 --> 00:11:48,365 +in the middle of these +opposing trends: + +239 +00:11:48,365 --> 00:11:50,765 +one of their main jobs +is to have bridged + +240 +00:11:50,765 --> 00:11:53,240 +the large and growing gap between + +241 +00:11:53,240 --> 00:11:55,460 +increasingly high-level languages + +242 +00:11:55,460 --> 00:11:58,565 +and increasingly +wacky platforms. + +243 +00:11:58,565 --> 00:12:00,320 +So here you have it: + +244 +00:12:00,320 --> 00:12:02,750 +It's still interesting to study + +245 +00:12:02,750 --> 00:12:06,244 +compilers nowadays, +especially nowadays. + +246 +00:12:06,244 --> 00:12:09,875 +Well, if you want to work +on interesting problems, + +247 +00:12:09,875 --> 00:12:14,570 +then very often you have to +know compilers. Here's one: + +248 +00:12:14,570 --> 00:12:16,940 +In the good old +times when we were + +249 +00:12:16,940 --> 00:12:19,310 +still able to fly on holidays, + +250 +00:12:19,310 --> 00:12:23,300 +I'm sure you flew in +a 777 Boeing airplane. + +251 +00:12:23,300 --> 00:12:25,850 +It's actually already +a quite old airplane. + +252 +00:12:25,850 --> 00:12:28,295 +But they had the +following problem. + +253 +00:12:28,295 --> 00:12:33,020 +What happens if a CPU +calculates the wrong result? + +254 +00:12:33,020 --> 00:12:36,065 +And that's actually not +such a hypothetical problem + +255 +00:12:36,065 --> 00:12:40,295 +because you remember +the Pentium CPU + +256 +00:12:40,295 --> 00:12:43,655 +in around 2000 +contained a bug and it cost + +257 +00:12:43,655 --> 00:12:48,140 +a lot of money for Intel +to replace this CPU. + +258 +00:12:48,140 --> 00:12:51,470 +What happens if an CPU calculates + +259 +00:12:51,470 --> 00:12:56,105 +the wrong result in +some navigation data? + +260 +00:12:56,105 --> 00:12:58,520 +Do you just let the +airplane crash? + +261 +00:12:58,520 --> 00:13:00,650 +Well, the engineers at Boeing + +262 +00:13:00,650 --> 00:13:02,675 +came up with the +following solution: + +263 +00:13:02,675 --> 00:13:05,055 +They writing one program that + +264 +00:13:05,055 --> 00:13:08,690 +essentially controls +how their airplane + +265 +00:13:08,690 --> 00:13:10,295 +is supposed to fly. + +266 +00:13:10,295 --> 00:13:13,040 +In a programming language +called Ada that is + +267 +00:13:13,040 --> 00:13:15,770 +slightly obscure +programming language + +268 +00:13:15,770 --> 00:13:18,650 +bat is very well-known in + +269 +00:13:18,650 --> 00:13:22,040 +areas where safety +is really paramount. + +270 +00:13:22,040 --> 00:13:25,010 +And what they did is they +took this one Ada program + +271 +00:13:25,010 --> 00:13:28,010 +and they essentially +took three compilers, + +272 +00:13:28,010 --> 00:13:29,435 +independent compilers, + +273 +00:13:29,435 --> 00:13:32,045 +which compiled the program +to different CPUs. + +274 +00:13:32,045 --> 00:13:33,815 +One CPU from Intel, + +275 +00:13:33,815 --> 00:13:37,235 +one CPU for Motorola, +and one from AMD. + +276 +00:13:37,235 --> 00:13:38,930 +And these are quite different + +277 +00:13:38,930 --> 00:13:40,955 +CPUs. Also some of them + +278 +00:13:40,955 --> 00:13:42,755 +have quite different philosophies + +279 +00:13:42,755 --> 00:13:45,245 +on how they do their calculations. + +280 +00:13:45,245 --> 00:13:47,330 +Now what they could do is, they + +281 +00:13:47,330 --> 00:13:50,690 +could put these three computers +on a single board and + +282 +00:13:50,690 --> 00:13:54,335 +could now run all their +calculations in parallel. + +283 +00:13:54,335 --> 00:13:56,690 +One with an Intel +CPU, one with + +284 +00:13:56,690 --> 00:14:00,245 +Motorola, and one with a +Risc computers say. + +285 +00:14:00,245 --> 00:14:02,795 +And then they could +compare the results. + +286 +00:14:02,795 --> 00:14:05,030 +And if these results differed, + +287 +00:14:05,030 --> 00:14:07,940 +then they knew one CPU must have + +288 +00:14:07,940 --> 00:14:11,600 +calculated the wrong result +and probably told the pilot, + +289 +00:14:11,600 --> 00:14:14,850 +please don't let +that airplane crash. + +290 +00:14:14,950 --> 00:14:17,270 +Not just Boeing is doing + +291 +00:14:17,270 --> 00:14:19,355 +interesting things +with compilers. + +292 +00:14:19,355 --> 00:14:22,640 +Also, Airbus in a completely +different setting + +293 +00:14:22,640 --> 00:14:24,260 +is using compilers in + +294 +00:14:24,260 --> 00:14:26,420 +a interesting way to get + +295 +00:14:26,420 --> 00:14:30,195 +their airplanes up in the +air and let them not crush. + +296 +00:14:30,195 --> 00:14:33,010 +In another example, I +have friends working + +297 +00:14:33,010 --> 00:14:35,350 +at Facebook who work on + +298 +00:14:35,350 --> 00:14:37,900 +compilers to make sense +are they are heaps + +299 +00:14:37,900 --> 00:14:41,470 +and heaps and heaps of +code written in PHP, + +300 +00:14:41,470 --> 00:14:42,880 +which is one of the most + +301 +00:14:42,880 --> 00:14:44,920 +horrible languages on the planet. + +302 +00:14:44,920 --> 00:14:46,630 +The bottom line is, + +303 +00:14:46,630 --> 00:14:50,499 +compilers are still very, +very important nowadays. + +304 +00:14:50,499 --> 00:14:52,810 +And for the foreseeable future, + +305 +00:14:52,810 --> 00:14:56,150 +compilers still need +to be developed. + +306 +00:14:57,270 --> 00:15:00,235 +When one talks about +compilers then + +307 +00:15:00,235 --> 00:15:01,630 +there is, of course, + +308 +00:15:01,630 --> 00:15:05,395 +a magic element involved. They're +large software systems. + +309 +00:15:05,395 --> 00:15:07,750 +And yes, they're supposed to + +310 +00:15:07,750 --> 00:15:10,525 +generate a runnable binary for + +311 +00:15:10,525 --> 00:15:12,105 +a high-level program, + +312 +00:15:12,105 --> 00:15:15,230 +but they are also supposed +to make our programs better. + +313 +00:15:15,230 --> 00:15:18,920 +They optimize our +programs to run faster. + +314 +00:15:18,920 --> 00:15:22,610 +And there's a lot of +"magic" involved in that. + +315 +00:15:22,610 --> 00:15:24,890 +Magic in inverted quotes. + +316 +00:15:24,890 --> 00:15:26,480 +I would like to show you one of + +317 +00:15:26,480 --> 00:15:29,340 +this magic in one example. + +318 +00:15:31,090 --> 00:15:33,110 +I hope you still have + +319 +00:15:33,110 --> 00:15:36,485 +fond memories of the +PEP module last year. + +320 +00:15:36,485 --> 00:15:40,280 +And you might remember BF +language, we had to look at. + +321 +00:15:40,280 --> 00:15:43,670 +This BF language contains +the kind of memory and + +322 +00:15:43,670 --> 00:15:45,680 +the memory pointer which can + +323 +00:15:45,680 --> 00:15:48,140 +be moved to the left +or to the right, + +324 +00:15:48,140 --> 00:15:50,270 +where an integer in + +325 +00:15:50,270 --> 00:15:52,925 +the memory can be either +increased or decreased. + +326 +00:15:52,925 --> 00:15:55,730 +We can print +out the content of + +327 +00:15:55,730 --> 00:15:59,645 +the current cell or input +something into a cell. + +328 +00:15:59,645 --> 00:16:01,850 +And we have loops, and + +329 +00:16:01,850 --> 00:16:04,220 +everything else is +considered a comment. + +330 +00:16:04,220 --> 00:16:06,290 +What's good about +this BF language is that + +331 +00:16:06,290 --> 00:16:08,180 +we don't even need a front end. + +332 +00:16:08,180 --> 00:16:09,920 +We can immediately start writing + +333 +00:16:09,920 --> 00:16:13,529 +an interpreter or a compiler. + +334 +00:16:15,850 --> 00:16:18,155 +Okay, I have here + +335 +00:16:18,155 --> 00:16:20,600 +a very straightforward +interpreter for + +336 +00:16:20,600 --> 00:16:22,865 +the BF language. You +might recognize it. + +337 +00:16:22,865 --> 00:16:27,120 +And I run it with a +benchmark program, which + +338 +00:16:27,760 --> 00:16:30,960 +might also recognize. + +339 +00:16:31,560 --> 00:16:36,835 +And this will now take +approximately ten minutes. + +340 +00:16:36,835 --> 00:16:39,920 +So see you in a bit. + +341 +00:16:40,710 --> 00:16:43,660 +Okay, this has finished now. + +342 +00:16:43,660 --> 00:16:45,925 +Almost took 11 minutes. + +343 +00:16:45,925 --> 00:16:47,410 +The question now is, + +344 +00:16:47,410 --> 00:16:51,820 +can we to better? + +345 +00:16:51,820 --> 00:16:53,260 +Actually, it is not difficult +to do better + +346 +00:16:53,260 --> 00:16:54,970 +than this simple-minded +interpreter. + +347 +00:16:54,970 --> 00:16:58,690 +It is relatively +straightforward to take + +348 +00:16:58,690 --> 00:17:03,520 +a BF program and generate equivalent C code. + +349 +00:17:03,520 --> 00:17:06,520 +This can be easily +done by somehow + +350 +00:17:06,520 --> 00:17:09,490 +representing the BF memory in + +351 +00:17:09,490 --> 00:17:12,310 +C. We can do this +by just an array of + +352 +00:17:12,310 --> 00:17:15,380 +characters and a memory pointer, + +353 +00:17:15,380 --> 00:17:19,265 +which points, in this case +in the middle of this array. + +354 +00:17:19,265 --> 00:17:21,800 +Then it's very easy to + +355 +00:17:21,800 --> 00:17:24,275 +translate the movement +of the + +356 +00:17:24,275 --> 00:17:28,610 +BF memory pointer into increments + +357 +00:17:28,610 --> 00:17:31,520 +and decrements of +the C pointer. + +358 +00:17:31,520 --> 00:17:33,050 +Similarly, if you want to increment or + +359 +00:17:33,050 --> 00:17:35,915 +decrement an element +in this array, + +360 +00:17:35,915 --> 00:17:38,975 +you just have to look up what +the pointer is and increment + +361 +00:17:38,975 --> 00:17:42,140 +and decrement what's +under the pointer. Similarly + +362 +00:17:42,140 --> 00:17:43,790 +we can print out something from + +363 +00:17:43,790 --> 00:17:47,450 +this array and we can put +something into this array. + +364 +00:17:47,450 --> 00:17:49,610 +What is great is that the loops + +365 +00:17:49,610 --> 00:17:51,530 +from the BF language directly + +366 +00:17:51,530 --> 00:17:55,580 +translate into while +loops of the C language. + +367 +00:17:55,580 --> 00:17:58,100 +We essentially have to check is + +368 +00:17:58,100 --> 00:18:02,450 +the memory pointer pointing +to a field that is 0. + +369 +00:18:02,450 --> 00:18:03,950 +Then we exit the loop, + +370 +00:18:03,950 --> 00:18:05,719 +and otherwise we continue. + +371 +00:18:05,719 --> 00:18:09,575 +And that can be done +nicely in C, like so. + +372 +00:18:09,575 --> 00:18:12,995 +And everything else is +again, just a comment. + +373 +00:18:12,995 --> 00:18:16,445 +So I have implemented +this translation for you. + +374 +00:18:16,445 --> 00:18:19,700 +Remember this was the BF +program for the Mandelbrot Set. + +375 +00:18:19,700 --> 00:18:23,690 +The equivalent C code +would look like this. + +376 +00:18:23,690 --> 00:18:27,110 +So you can see at the beginning +is this generation of + +377 +00:18:27,110 --> 00:18:30,140 +the BF memory represented + +378 +00:18:30,140 --> 00:18:32,345 +as an array and the +memory pointer. + +379 +00:18:32,345 --> 00:18:34,550 +And then inside there are lots + +380 +00:18:34,550 --> 00:18:36,770 +of increments and decrements + +381 +00:18:36,770 --> 00:18:41,915 +of pointers and also +contents of this array. + +382 +00:18:41,915 --> 00:18:45,199 +Now fingers crossed that this +is the correct translation. + +383 +00:18:45,199 --> 00:18:48,125 +And I can also run this + +384 +00:18:48,125 --> 00:18:52,805 +and you should see that it runs +substantially faster. + +385 +00:18:52,805 --> 00:18:55,880 +I'm using now my GCC on + +386 +00:18:55,880 --> 00:19:01,040 +my computer to generate an +executable for the C program. + +387 +00:19:01,040 --> 00:19:04,295 +And it should run for +approximately 20 seconds. + +388 +00:19:04,295 --> 00:19:07,620 +So let's just wait for that. + +389 +00:19:11,430 --> 00:19:14,950 +Okay. What is important to note + +390 +00:19:14,950 --> 00:19:19,885 +here is that I'm running GCC, + +391 +00:19:19,885 --> 00:19:22,450 +this the option -O0. + +392 +00:19:22,450 --> 00:19:25,600 +That means I tell GCC to + +393 +00:19:25,600 --> 00:19:28,840 +generate a binary which I +can run as you can see. + +394 +00:19:28,840 --> 00:19:31,810 +But don't apply any optimization. + +395 +00:19:31,810 --> 00:19:38,395 +Keep this in mind. +As a second try, + +396 +00:19:38,395 --> 00:19:42,595 +of course, I can try to now +generate a better C program. + +397 +00:19:42,595 --> 00:19:46,060 +And as you'll remember from +the PEP course, it can, + +398 +00:19:46,060 --> 00:19:50,095 +for example, combine +several steps + +399 +00:19:50,095 --> 00:19:51,670 +going to the right of the memory + +400 +00:19:51,670 --> 00:19:53,310 +pointer or to the left. + +401 +00:19:53,310 --> 00:19:55,280 +We can combine that into + +402 +00:19:55,280 --> 00:19:58,760 +one single increment or +decrement of not just one, + +403 +00:19:58,760 --> 00:20:00,050 +but of like n, + +404 +00:20:00,050 --> 00:20:02,570 +where n is greater +than 1. Similarly, + +405 +00:20:02,570 --> 00:20:06,980 +if we increment or decrement +the content of this array, + +406 +00:20:06,980 --> 00:20:09,740 +we can do this in one +big step by incrementing + +407 +00:20:09,740 --> 00:20:12,710 +that by not just one +and increment by one, + +408 +00:20:12,710 --> 00:20:15,635 +but increment and decrement +by bigger numbers. + +409 +00:20:15,635 --> 00:20:18,870 +Everything else stays the same. + +410 +00:20:20,830 --> 00:20:23,270 +Again, I have implemented that + +411 +00:20:23,270 --> 00:20:24,950 +for you and you can see now + +412 +00:20:24,950 --> 00:20:26,835 +the C program moves + +413 +00:20:26,835 --> 00:20:30,455 +the memory pointer and bigger +chunks and also increases, + +414 +00:20:30,455 --> 00:20:32,810 +for example, here, memory content + +415 +00:20:32,810 --> 00:20:35,555 +by 15 than just by 1. + +416 +00:20:35,555 --> 00:20:38,520 +Now let's run this program. + +417 +00:20:38,530 --> 00:20:40,760 +Again, I use GCC + +418 +00:20:40,760 --> 00:20:45,350 +to compile the +C program and run it. + +419 +00:20:45,350 --> 00:20:49,025 +And again, I made sure +that it only runs this with + +420 +00:20:49,025 --> 00:20:51,530 +no optimizations switched on. + +421 +00:20:51,530 --> 00:20:54,050 +So there's minus O0. + +422 +00:20:54,050 --> 00:20:56,090 +And you can see +it's now down from + +423 +00:20:56,090 --> 00:20:59,990 +20 seconds to just 6 seconds. + +424 +00:20:59,990 --> 00:21:06,065 +I show you, the GCC is +called with -O0. + +425 +00:21:06,065 --> 00:21:08,990 +So this reduction +in time is purely + +426 +00:21:08,990 --> 00:21:12,755 +because I produced better C code, + +427 +00:21:12,755 --> 00:21:17,220 +which the compiler then just +transformed into a binary. + +428 +00:21:18,910 --> 00:21:22,055 +So far there's +nothing interesting. + +429 +00:21:22,055 --> 00:21:25,385 +We used in the first instance + +430 +00:21:25,385 --> 00:21:29,240 +single increments and use GCC with + +431 +00:21:29,240 --> 00:21:31,700 +O0 to not introduce + +432 +00:21:31,700 --> 00:21:35,255 +any optimizations and run +essentially 20 seconds. + +433 +00:21:35,255 --> 00:21:37,880 +If we then increment + +434 +00:21:37,880 --> 00:21:40,895 +in bigger chunks or +decrement in bigger chunks, + +435 +00:21:40,895 --> 00:21:45,380 +use again GCC with -O0, + +436 +00:21:45,380 --> 00:21:50,030 +then it runs in approximately +five to six seconds. + +437 +00:21:50,030 --> 00:21:51,950 +Now let me do the following: + +438 +00:21:51,950 --> 00:21:55,430 +I take the first program which + +439 +00:21:55,430 --> 00:21:58,070 +increments everything +in a single steps + +440 +00:21:58,070 --> 00:22:00,185 +or decremented in single steps. + +441 +00:22:00,185 --> 00:22:04,835 +But now I use the full +capacity of the GCC compiler + +442 +00:22:04,835 --> 00:22:06,560 +and I tell it, + +443 +00:22:06,560 --> 00:22:11,370 +please do introduce some +optimizations as you want. + +444 +00:22:11,590 --> 00:22:15,799 +And I'm now running +exactly the same program... + +445 +00:22:15,799 --> 00:22:17,870 +just the GCC compiler will + +446 +00:22:17,870 --> 00:22:22,325 +now have the optimizations +switched on. + +447 +00:22:22,325 --> 00:22:24,380 +Let's see what happens. + +448 +00:22:24,380 --> 00:22:27,320 +One first needs to compile it. + +449 +00:22:27,320 --> 00:22:29,795 +And that takes a little while. + +450 +00:22:29,795 --> 00:22:32,000 +Okay, this has now +finished and also + +451 +00:22:32,000 --> 00:22:34,115 +the calculation of the +picture has finished. + +452 +00:22:34,115 --> 00:22:35,960 +And as you can see, it took + +453 +00:22:35,960 --> 00:22:38,510 +approximately eight +seconds to calculate. + +454 +00:22:38,510 --> 00:22:41,645 +That is down from +approximately 20 seconds. + +455 +00:22:41,645 --> 00:22:46,040 +So the difference from +switching from -O0 to + +456 +00:22:46,040 --> 00:22:51,935 +-O3 meant that now +the program runs almost as + +457 +00:22:51,935 --> 00:22:54,800 +fast as where I by hand + +458 +00:22:54,800 --> 00:22:58,610 +combined several steps +into a big chunk. + +459 +00:22:58,610 --> 00:23:00,170 +That is essentially what + +460 +00:23:00,170 --> 00:23:03,485 +the GCC compiler +found out on its own. + +461 +00:23:03,485 --> 00:23:05,840 +That rather than jumping + +462 +00:23:05,840 --> 00:23:08,465 +in just single increments by one, + +463 +00:23:08,465 --> 00:23:11,510 +they can be all combined +into bigger chunks. + +464 +00:23:11,510 --> 00:23:16,595 +It hasn't been as successful +as if I do this explicitly. + +465 +00:23:16,595 --> 00:23:18,620 +But that is the magic that + +466 +00:23:18,620 --> 00:23:22,560 +the compiler essentially found +out, that this can be done. + +467 +00:23:22,960 --> 00:23:25,700 +Just a quick recap of what I did. + +468 +00:23:25,700 --> 00:23:28,160 +I first run the program with + +469 +00:23:28,160 --> 00:23:31,730 +an interpreter and it took +approximately 11 minutes. + +470 +00:23:31,730 --> 00:23:36,559 +Then I had a simple compiler +generating a C program. + +471 +00:23:36,559 --> 00:23:40,880 +And the C compiler then had +no optimization switched on. + +472 +00:23:40,880 --> 00:23:44,645 +In the C program had only +small single increments and + +473 +00:23:44,645 --> 00:23:46,820 +small jumps. Then it took + +474 +00:23:46,820 --> 00:23:49,775 +approximately 20 seconds for +to same program. + +475 +00:23:49,775 --> 00:23:52,460 +Then I had a more advanced +compiler which does + +476 +00:23:52,460 --> 00:23:55,730 +big increments and +also big jumps. + +477 +00:23:55,730 --> 00:23:57,470 +But again, the compiler didn't + +478 +00:23:57,470 --> 00:23:59,210 +introduce any optimization on its + +479 +00:23:59,210 --> 00:24:02,915 +own. Then it took +approximately five seconds. + +480 +00:24:02,915 --> 00:24:05,210 +Then I went back to + +481 +00:24:05,210 --> 00:24:08,465 +the simple compiler with +only small steps. + +482 +00:24:08,465 --> 00:24:10,400 +But then told the C compiler + +483 +00:24:10,400 --> 00:24:12,950 +to fully optimize this program. + +484 +00:24:12,950 --> 00:24:14,690 +And then it took more +or less the same + +485 +00:24:14,690 --> 00:24:17,465 +time as the more advanced compiler. + +486 +00:24:17,465 --> 00:24:20,465 +I encourage you to +look at this yourself. + +487 +00:24:20,465 --> 00:24:24,240 +As usual, all the programs +are uploaded on KEATS. + +488 +00:24:25,090 --> 00:24:27,590 +To finish this +introduction video, + +489 +00:24:27,590 --> 00:24:30,170 +let me give you an +extremely brief overview + +490 +00:24:30,170 --> 00:24:32,855 +over the history of compilers. + +491 +00:24:32,855 --> 00:24:35,450 +While I argued at the beginning + +492 +00:24:35,450 --> 00:24:38,915 +that it's interesting to +study compilers nowadays, + +493 +00:24:38,915 --> 00:24:40,400 +obviously in this field, + +494 +00:24:40,400 --> 00:24:43,295 +we are standing on the +shoulders of giants. + +495 +00:24:43,295 --> 00:24:46,520 +The field started with Turing and +Turing machines, + +496 +00:24:46,520 --> 00:24:49,130 +which were introduced in 1936. + +497 +00:24:49,130 --> 00:24:52,175 +Turing machines already had +this concept of memory + +498 +00:24:52,175 --> 00:24:55,190 +and also programs +of Turing machines + +499 +00:24:55,190 --> 00:24:58,775 +needed to be translated +between different formalisms. + +500 +00:24:58,775 --> 00:25:01,100 +Regular expressions, +which will play + +501 +00:25:01,100 --> 00:25:03,905 +an important role in our +lexer of our compiler, + +502 +00:25:03,905 --> 00:25:06,800 +were introduced in 1956. + +503 +00:25:06,800 --> 00:25:10,370 +Arguably the first compiler +was introduced in + +504 +00:25:10,370 --> 00:25:13,850 +1957 for a language called COBOL. + +505 +00:25:13,850 --> 00:25:16,550 +This was done by Grace Hopper. + +506 +00:25:16,550 --> 00:25:18,770 +And I recommend that +you have and look + +507 +00:25:18,770 --> 00:25:20,900 +at this interview +of Grace Hopper, + +508 +00:25:20,900 --> 00:25:22,130 +which shows she must have been + +509 +00:25:22,130 --> 00:25:24,424 +a very interesting personality. + +510 +00:25:24,424 --> 00:25:27,770 +But despite the +maturity of this field, + +511 +00:25:27,770 --> 00:25:29,465 +it might be surprising that + +512 +00:25:29,465 --> 00:25:31,505 +research papers are +still published. + +513 +00:25:31,505 --> 00:25:34,835 +And we will find that +out in the module. + +514 +00:25:34,835 --> 00:25:37,760 +And a number of +problems which we would + +515 +00:25:37,760 --> 00:25:41,270 +assume are already solved by now, + +516 +00:25:41,270 --> 00:25:43,730 +actually turning out +that they are not solved. + +517 +00:25:43,730 --> 00:25:45,620 +Here in this blog post, + +518 +00:25:45,620 --> 00:25:47,825 +for example, Laurie Tratt + +519 +00:25:47,825 --> 00:25:49,550 +who is also from this department, + +520 +00:25:49,550 --> 00:25:51,740 +argued that parsing, for example, + +521 +00:25:51,740 --> 00:25:54,035 +isn't a solved problem yet. + +522 +00:25:54,035 --> 00:25:56,300 +I hope you will have as much fun + +523 +00:25:56,300 --> 00:25:58,130 +with this module as I have. + +524 +00:25:58,130 --> 00:26:00,750 +Thank you very much +for listening.