# HG changeset patch # User Christian Urban # Date 1601160340 -3600 # Node ID 4e628958c01a25a9166b3afd59c1e1efd5ef35f1 # Parent e70df76926c0efd7b3bb1acb8938a34abf8026b2 updated diff -r e70df76926c0 -r 4e628958c01a slides/slides02.pdf Binary file slides/slides02.pdf has changed diff -r e70df76926c0 -r 4e628958c01a slides/slides02.tex --- a/slides/slides02.tex Fri Sep 25 10:17:11 2020 +0100 +++ b/slides/slides02.tex Sat Sep 26 23:45:40 2020 +0100 @@ -178,7 +178,8 @@ \begin{bubble}[10cm] \large - Two regular expressions \bl{$r_1$} and \bl{$r_2$} are equivalent + Two regular expressions \bl{$r_1$} and \bl{$r_2$} are + \alert{\bf{}equivalent} provided:\medskip \begin{center} \bl{$r_1 \equiv r_2 \;\;\dn\;\; L(r_1) = L(r_2)$} @@ -243,29 +244,6 @@ \end{frame} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\begin{frame}[c] -\frametitle{The Specification for Matching} - -\begin{bubble}[10cm] -\large -A regular expression \bl{$r$} matches a string~\bl{$s$} -provided: -\begin{center} -\bl{$s \in L(r)$} -\end{center}\medskip -\end{bubble} - -\bigskip\bigskip - -\ldots and the point of the this lecture is to decide this problem as -fast as possible (unlike Python, Ruby, Java etc) - -\end{frame} -%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{frame}[t] @@ -301,6 +279,27 @@ \end{frame} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\begin{frame}[c] +\frametitle{The Specification for Matching} + +\begin{bubble}[10cm] +\large +A regular expression \bl{$r$} matches a string~\bl{$s$} +provided: +\begin{center} +\bl{$s \in L(r)$} +\end{center}\medskip +\end{bubble} + +\bigskip\bigskip + +\ldots and the point of the this lecture is to decide this problem as +fast as possible (unlike Python, Ruby, Java etc) + +\end{frame} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \begin{frame}[c] diff -r e70df76926c0 -r 4e628958c01a videos/01-basics1.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-basics1.srt Sat Sep 26 23:45:40 2020 +0100 @@ -0,0 +1,1093 @@ +1 +00:00:06,710 --> 00:00:09,225 +Thanks for tuning in again. + +2 +00:00:09,225 --> 00:00:11,640 +In this video, we want to specify + +3 +00:00:11,640 --> 00:00:14,370 +what problem our regular +expression matcher + +4 +00:00:14,370 --> 00:00:16,155 +is actually supposed to solve. + +5 +00:00:16,155 --> 00:00:18,900 +The reason is that +we know that some of + +6 +00:00:18,900 --> 00:00:21,585 +the existing regular +expression matching engines + +7 +00:00:21,585 --> 00:00:25,200 +are not just abysmally +slow in some examples, + +8 +00:00:25,200 --> 00:00:27,105 +as you've seen in the +previous video, + +9 +00:00:27,105 --> 00:00:30,570 +but also produce sometimes +incorrect results. + +10 +00:00:30,570 --> 00:00:33,330 +In order to avoid +this with our matcher, + +11 +00:00:33,330 --> 00:00:35,325 +we need to somehow explain + +12 +00:00:35,325 --> 00:00:39,255 +precisely what is the problem +our algorithm solves. + +13 +00:00:39,255 --> 00:00:41,935 +This will require +a bit of theory, but + +14 +00:00:41,935 --> 00:00:45,335 +I hope it is nevertheless +a bit of fun. + +15 +00:00:45,335 --> 00:00:47,915 +First, we have to specify + +16 +00:00:47,915 --> 00:00:50,585 +what we mean by a +regular expression. + +17 +00:00:50,585 --> 00:00:53,210 +You've seen earlier some +examples. They were + +18 +00:00:53,210 --> 00:00:56,060 +actually taken or +inspired by what + +19 +00:00:56,060 --> 00:00:58,850 +is available in standard +regular expression matching + +20 +00:00:58,850 --> 00:01:02,330 +engines, like star, plus and n-times. + +21 +00:01:02,330 --> 00:01:05,690 +But for many tasks, +for our algorithm, + +22 +00:01:05,690 --> 00:01:10,174 +we will focus only what I call +basic regular expressions. + +23 +00:01:10,174 --> 00:01:11,840 +Since I'm lazy, I will call + +24 +00:01:11,840 --> 00:01:13,550 +these basic regular expressions + +25 +00:01:13,550 --> 00:01:15,485 +just as regular expressions. + +26 +00:01:15,485 --> 00:01:17,405 +And the ones you've seen earlier + +27 +00:01:17,405 --> 00:01:19,400 +as extended regular expressions. + +28 +00:01:19,400 --> 00:01:22,940 +So the basic regulare expressions, +or just regular expressions, + +29 +00:01:22,940 --> 00:01:25,280 +they will have characters. + +30 +00:01:25,280 --> 00:01:27,170 +So you can match any character, + +31 +00:01:27,170 --> 00:01:31,370 +a,b,c to z or 0 to 9. + +32 +00:01:31,370 --> 00:01:35,525 +Any Ascii character. 'c' here +is just a representative. + +33 +00:01:35,525 --> 00:01:38,825 +So we can match +single characters. + +34 +00:01:38,825 --> 00:01:42,440 +Then we can match alternatives. + +35 +00:01:42,440 --> 00:01:44,930 +That means a string +is either matched + +36 +00:01:44,930 --> 00:01:46,730 +by the regular expression r1 + +37 +00:01:46,730 --> 00:01:49,324 +or by the regular expression r2. + +38 +00:01:49,324 --> 00:01:52,790 +And for the +alternative we write +. + +39 +00:01:52,790 --> 00:01:55,175 +Then we also have sequence. + +40 +00:01:55,175 --> 00:01:57,410 +This sequence regular +expression essentially + +41 +00:01:57,410 --> 00:01:59,915 +says that a string needs to be matched + +42 +00:01:59,915 --> 00:02:02,210 +the first part by +the regular expression r1 + +43 +00:02:02,210 --> 00:02:06,275 +and then the second +part by the r2. + +44 +00:02:06,275 --> 00:02:10,190 +And then we have also the +star regular expression, + +45 +00:02:10,190 --> 00:02:12,980 +which says the regular +expression needs to match + +46 +00:02:12,980 --> 00:02:16,520 +the string with zero +or more copies. + +47 +00:02:16,520 --> 00:02:18,140 +And then we also have some + +48 +00:02:18,140 --> 00:02:20,060 +slightly strange +regular expressions. + +49 +00:02:20,060 --> 00:02:22,505 +We have the regular expression 1, + +50 +00:02:22,505 --> 00:02:25,910 +which can only match +the empty string. + +51 +00:02:25,910 --> 00:02:29,075 +I'm using here the +notation 1 for that + +52 +00:02:29,075 --> 00:02:31,340 +and in my writing I will always + +53 +00:02:31,340 --> 00:02:33,440 +make sure that for the +regular expression + +54 +00:02:33,440 --> 00:02:35,765 +I will write the +1 in a bold font. + +55 +00:02:35,765 --> 00:02:38,510 +So whenever you see +a 1 in bold font, + +56 +00:02:38,510 --> 00:02:40,395 +this is not the 1, but + +57 +00:02:40,395 --> 00:02:44,300 +the regular expression which +can match the empty string. + +58 +00:02:44,300 --> 00:02:48,050 +And we also have the +regular expression 0, + +59 +00:02:48,050 --> 00:02:50,315 +which cannot match +anything at all. + +60 +00:02:50,315 --> 00:02:51,695 +You might think, well, + +61 +00:02:51,695 --> 00:02:54,635 +that's not much use if it cannot +match anything at all, + +62 +00:02:54,635 --> 00:02:58,130 +but you will see why that +one is important later on. + +63 +00:02:58,130 --> 00:03:00,785 +So our basic regular expressions, + +64 +00:03:00,785 --> 00:03:02,375 +they will be 0, + +65 +00:03:02,375 --> 00:03:08,390 +1, characters, alternatives, +sequences and stars. + +66 +00:03:08,390 --> 00:03:12,170 +And these are all the +basic regular expressions. + +67 +00:03:12,170 --> 00:03:16,280 +If this definition is a +bit too abstract for you, + +68 +00:03:16,280 --> 00:03:18,560 +we can also look at +the concrete code, + +69 +00:03:18,560 --> 00:03:23,060 +how that would pan out when +actually writing some Scala. + +70 +00:03:23,060 --> 00:03:28,040 +I promised you, I show +you always my code in Scala. + +71 +00:03:28,040 --> 00:03:29,480 +So here you would have + +72 +00:03:29,480 --> 00:03:32,885 +first an abstract class +for regular expressions. + +73 +00:03:32,885 --> 00:03:37,580 +Then you have one regular +expression for 0, + +74 +00:03:37,580 --> 00:03:41,540 +one regular expression for 1, + +75 +00:03:41,540 --> 00:03:42,875 +one regular expression, which +takes an argument, + +76 +00:03:42,875 --> 00:03:45,050 +the character you want to match, + +77 +00:03:45,050 --> 00:03:47,915 +the characters a,b, c and so on. + +78 +00:03:47,915 --> 00:03:50,945 +Then we have an alternative +regular expression, + +79 +00:03:50,945 --> 00:03:53,480 +which takes the first +alternative and + +80 +00:03:53,480 --> 00:03:56,435 +the second alternative +as arguments. + +81 +00:03:56,435 --> 00:03:59,690 +And we have a sequence +regular expression. Again, + +82 +00:03:59,690 --> 00:04:01,850 +which takes the +first component and + +83 +00:04:01,850 --> 00:04:04,730 +the second component +as two arguments. + +84 +00:04:04,730 --> 00:04:07,249 +And we have the star +regular expression, + +85 +00:04:07,249 --> 00:04:10,880 +which just take one regular +expression as argument. + +86 +00:04:10,880 --> 00:04:16,115 +And all these reg expressions +extend our abstract class. + +87 +00:04:16,115 --> 00:04:20,300 +For whatever I do in +this module here I have + +88 +00:04:20,300 --> 00:04:23,300 +the convention that all +the regular expressions + +89 +00:04:23,300 --> 00:04:25,550 +are written with capital letters. + +90 +00:04:25,550 --> 00:04:26,885 +As you can see that here, + +91 +00:04:26,885 --> 00:04:31,685 +O, 1, character, these will be +always regular expressions. + +92 +00:04:31,685 --> 00:04:34,370 +They have all capital letters. + +93 +00:04:34,370 --> 00:04:36,484 +Let's for a moment, + +94 +00:04:36,484 --> 00:04:38,720 +play around with this definition. + +95 +00:04:38,720 --> 00:04:41,945 +I'm using here the +Ammonite REPL. + +96 +00:04:41,945 --> 00:04:46,950 +And I can evaluate +this definition. + +97 +00:04:53,430 --> 00:04:55,810 +And now I can start to + +98 +00:04:55,810 --> 00:04:58,570 +define particular +regular expressions. + +99 +00:04:58,570 --> 00:05:00,340 +For example, if I need +a regular expression + +100 +00:05:00,340 --> 00:05:02,860 +which can recognise +the character a, + +101 +00:05:02,860 --> 00:05:06,025 +then I would write +something like this. + +102 +00:05:06,025 --> 00:05:08,710 +So this regular expression +takes an argument, + +103 +00:05:08,710 --> 00:05:13,615 +the character 'a' to specify +which character to match. + +104 +00:05:13,615 --> 00:05:16,945 +We do this obviously also with 'b'. + +105 +00:05:16,945 --> 00:05:19,405 +And I can do that with + +106 +00:05:19,405 --> 00:05:22,975 +'c'. So now we have three +regular expressions. + +107 +00:05:22,975 --> 00:05:25,570 +If you look very carefully +at this definition, + +108 +00:05:25,570 --> 00:05:27,070 +you can actually see + +109 +00:05:27,070 --> 00:05:29,940 +these regular +expressions are trees. + +110 +00:05:29,940 --> 00:05:33,365 +So no matter what we +write down on paper, + +111 +00:05:33,365 --> 00:05:36,755 +they are behind the +scenes always trees. + +112 +00:05:36,755 --> 00:05:40,010 +And you can see that +actually in this definition. + +113 +00:05:40,010 --> 00:05:44,330 +If you define two regular +expressions r1 and r2. + +114 +00:05:44,330 --> 00:05:49,310 +They are essentially +the alternative of a, b and c. + +115 +00:05:49,310 --> 00:05:52,760 +Then this regular expression + +116 +00:05:52,760 --> 00:05:54,710 +can match either the character + +117 +00:05:54,710 --> 00:05:57,980 +a or the character b +or the character c. + +118 +00:05:57,980 --> 00:06:01,640 +And the same for the +regular expression r2. + +119 +00:06:01,640 --> 00:06:03,875 +So let me just evaluate that. + +120 +00:06:03,875 --> 00:06:05,690 +And even though these are + +121 +00:06:05,690 --> 00:06:07,175 +two regular expressions +which can match + +122 +00:06:07,175 --> 00:06:11,750 +exactly the same things, +they a different trees. + +123 +00:06:11,750 --> 00:06:14,195 +So if I ask Scala, + +124 +00:06:14,195 --> 00:06:16,460 +are these trees different? + +125 +00:06:16,460 --> 00:06:19,250 +Or ask if they're + +126 +00:06:19,250 --> 00:06:21,865 +the same, then Scala will say No, + +127 +00:06:21,865 --> 00:06:25,440 +they actually different trees. + +128 +00:06:25,450 --> 00:06:28,459 +Let's come back to +this definition. + +129 +00:06:28,459 --> 00:06:31,760 +If we want to write down +regular expressions on paper, + +130 +00:06:31,760 --> 00:06:33,620 +then we want to be sloppy as + +131 +00:06:33,620 --> 00:06:35,750 +mathematicians rather than as + +132 +00:06:35,750 --> 00:06:37,745 +precise as computer scientists. + +133 +00:06:37,745 --> 00:06:40,490 +So when we want to write down +a regular expression which can + +134 +00:06:40,490 --> 00:06:43,955 +either match the character +a or the character b, + +135 +00:06:43,955 --> 00:06:49,130 +then we would write down +something like this, a plus b. + +136 +00:06:49,130 --> 00:06:51,170 +And if you want to have +the regular expression + +137 +00:06:51,170 --> 00:06:52,625 +which can either match + +138 +00:06:52,625 --> 00:06:55,925 +the character a or b or c, + +139 +00:06:55,925 --> 00:06:58,340 +we will write +something like this. + +140 +00:06:58,340 --> 00:07:01,370 +But of course behind the +scenes, these are trees. + +141 +00:07:01,370 --> 00:07:04,460 +So we should have written +them with parentheses. + +142 +00:07:04,460 --> 00:07:06,440 +And you can see +actually, there are two + +143 +00:07:06,440 --> 00:07:08,990 +regular expressions I +could have written down. + +144 +00:07:08,990 --> 00:07:11,270 +They're different. + +145 +00:07:11,270 --> 00:07:12,710 +Just by convention, + +146 +00:07:12,710 --> 00:07:15,575 +we on't write these parentheses. + +147 +00:07:15,575 --> 00:07:18,740 +And that is similar with sequences. + +148 +00:07:18,740 --> 00:07:20,000 +If I want to write down + +149 +00:07:20,000 --> 00:07:22,955 +the regular expression which +can match first an 'a', + +150 +00:07:22,955 --> 00:07:25,010 +then a 'b', and then a 'c', + +151 +00:07:25,010 --> 00:07:28,160 +then I would write down +something like this. + +152 +00:07:28,160 --> 00:07:32,120 +Just, there are again + +153 +00:07:32,120 --> 00:07:35,735 +two regular expressions I +could have written down. + +154 +00:07:35,735 --> 00:07:38,480 +Again by convention we don't + +155 +00:07:38,480 --> 00:07:40,670 +write these parentheses though. + +156 +00:07:40,670 --> 00:07:42,350 +However, sometimes we have to be + +157 +00:07:42,350 --> 00:07:43,940 +very careful with parentheses, + +158 +00:07:43,940 --> 00:07:47,195 +especially with star. + +159 +00:07:47,195 --> 00:07:50,525 +Because this regular expression +is definitely not + +160 +00:07:50,525 --> 00:07:54,900 +the same as this regular expression. + +161 +00:07:56,100 --> 00:07:59,410 +The first one here can match + +162 +00:07:59,410 --> 00:08:03,610 +any strings containing a or b's. + +163 +00:08:03,610 --> 00:08:05,860 +While this regular expression can + +164 +00:08:05,860 --> 00:08:07,945 +only match the single character + +165 +00:08:07,945 --> 00:08:13,300 +a or any string +containing only b's. + +166 +00:08:13,300 --> 00:08:15,265 +So to make the difference clear, + +167 +00:08:15,265 --> 00:08:20,065 +in this example, we would have +to use the parentheses. + +168 +00:08:20,065 --> 00:08:23,140 +There's one more issue +with this definition. + +169 +00:08:23,140 --> 00:08:26,635 +Why do we focus on these +basic regular expressions? + +170 +00:08:26,635 --> 00:08:28,660 +Why don't we also include + +171 +00:08:28,660 --> 00:08:31,285 +the ones from the +extended regular expressions. + +172 +00:08:31,285 --> 00:08:33,055 +The answers very easy. + +173 +00:08:33,055 --> 00:08:35,680 +These basic regular +expressions can be used + +174 +00:08:35,680 --> 00:08:38,370 +to represent also +the extended ones. + +175 +00:08:38,370 --> 00:08:40,220 +Let me give you some examples. + +176 +00:08:40,220 --> 00:08:44,225 +If I have a regular +expression r+, for example, + +177 +00:08:44,225 --> 00:08:46,280 +then the meaning +was I have to use + +178 +00:08:46,280 --> 00:08:49,115 +at least one or more copies + +179 +00:08:49,115 --> 00:08:51,200 +of this r to +match a string. + +180 +00:08:51,200 --> 00:08:53,810 +Well, one or more copies +can be represented by + +181 +00:08:53,810 --> 00:08:58,385 +the basic ones as just +r followed by r*. + +182 +00:08:58,385 --> 00:09:01,760 +Meaning I have to use one +copy of r, followed by + +183 +00:09:01,760 --> 00:09:05,150 +0 or more copies of r. + +184 +00:09:05,150 --> 00:09:07,895 +Similarly, if I have the optional +regular expression, + +185 +00:09:07,895 --> 00:09:10,715 +which is supposed to +match a string + +186 +00:09:10,715 --> 00:09:13,865 +by using r, or match +the empty string. + +187 +00:09:13,865 --> 00:09:19,295 +Then this can be obviously +defined as r + 1. + +188 +00:09:19,295 --> 00:09:23,945 +So here is the bold +regular expression 1, + +189 +00:09:23,945 --> 00:09:26,180 +which means it either can + +190 +00:09:26,180 --> 00:09:28,205 +recognize whatever +r can recognize, + +191 +00:09:28,205 --> 00:09:30,470 +or it can recognize +the empty string. + +192 +00:09:30,470 --> 00:09:35,150 +And if I have ranges, like a +to z, then I can define + +193 +00:09:35,150 --> 00:09:41,135 +that as a + b + c + ... +and so on until z. + +194 +00:09:41,135 --> 00:09:45,920 +Maybe this definition is not +good in terms of runtime, + +195 +00:09:45,920 --> 00:09:47,960 +but in terms of just being able + +196 +00:09:47,960 --> 00:09:50,780 +to recognize strings +or match strings, + +197 +00:09:50,780 --> 00:09:54,680 +the basic regular expressions +will be just sufficient. + +198 +00:09:54,680 --> 00:09:56,690 +Unfortunately, we +also need to have + +199 +00:09:56,690 --> 00:09:58,850 +a quick chat about strings. + +200 +00:09:58,850 --> 00:10:02,255 +In Scala, it's crystal +clear what a string is. + +201 +00:10:02,255 --> 00:10:05,480 +There's a separate datatype +which is called string. + +202 +00:10:05,480 --> 00:10:07,895 +So here, for example, +is a string. + +203 +00:10:07,895 --> 00:10:09,200 +And as you can see, + +204 +00:10:09,200 --> 00:10:11,105 +it is of the type string. + +205 +00:10:11,105 --> 00:10:13,985 +And the empty string +will be just that. + +206 +00:10:13,985 --> 00:10:16,160 +However, when we write things down on + +207 +00:10:16,160 --> 00:10:18,320 +paper and think +about our algorithm, + +208 +00:10:18,320 --> 00:10:22,790 +we want to think of strings +as lists of characters. + +209 +00:10:22,790 --> 00:10:26,070 +So more something like this. + +210 +00:10:27,070 --> 00:10:31,745 +You can see here, this is actually +a list of characters. + +211 +00:10:31,745 --> 00:10:35,150 +And the two operations +we need are taking + +212 +00:10:35,150 --> 00:10:37,280 +the head of this list and + +213 +00:10:37,280 --> 00:10:39,770 +the rest of the list +or tail of the list. + +214 +00:10:39,770 --> 00:10:41,720 +That's why we want +to regard them as + +215 +00:10:41,720 --> 00:10:45,260 +lists rather than strings. + +216 +00:10:45,260 --> 00:10:48,200 +So if I'm using a +string like this, + +217 +00:10:48,200 --> 00:10:51,935 +then on paper I always will +write something like that. + +218 +00:10:51,935 --> 00:10:54,575 +Or since I'm lazy, just that. + +219 +00:10:54,575 --> 00:10:56,675 +And for the empty string, + +220 +00:10:56,675 --> 00:10:59,210 +I will write either +the empty list, with + +221 +00:10:59,210 --> 00:11:03,920 +two brackets or, +being lazy, just that. + +222 +00:11:03,920 --> 00:11:06,620 +Actually there is one +more operation we need on + +223 +00:11:06,620 --> 00:11:09,410 +strings and that +is concatenation. + +224 +00:11:09,410 --> 00:11:11,255 +If you have a string s1, + +225 +00:11:11,255 --> 00:11:14,510 +string s2, and put an +at symbol in between, + +226 +00:11:14,510 --> 00:11:18,050 +that means we want to +concatenate both strings. + +227 +00:11:18,050 --> 00:11:22,625 +So foo concatenated with +bar, would be foobar. + +228 +00:11:22,625 --> 00:11:25,085 +And any string concatenated with + +229 +00:11:25,085 --> 00:11:27,950 +the empty string +is left untouched. + +230 +00:11:27,950 --> 00:11:31,310 +So baz concatenated with + +231 +00:11:31,310 --> 00:11:33,545 +the empty string, is just baz. + +232 +00:11:33,545 --> 00:11:37,295 +So that's like if we have +strings as lists of characters, + +233 +00:11:37,295 --> 00:11:39,755 +that will be just list append. + +234 +00:11:39,755 --> 00:11:41,480 +In the next video, + +235 +00:11:41,480 --> 00:11:43,160 +we will use these definitions + +236 +00:11:43,160 --> 00:11:45,050 +and introduce the notion of what + +237 +00:11:45,050 --> 00:11:46,850 +a language is and + +238 +00:11:46,850 --> 00:11:49,920 +what the meaning of a +regular expression is. diff -r e70df76926c0 -r 4e628958c01a videos/01-housekeeping.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-housekeeping.srt Sat Sep 26 23:45:40 2020 +0100 @@ -0,0 +1,1218 @@ +1 +00:00:05,750 --> 00:00:10,395 +They come back! Before we can +start with any serious work, + +2 +00:00:10,395 --> 00:00:12,255 +let's do some housekeeping. + +3 +00:00:12,255 --> 00:00:17,380 +As you know, this year is a tad special. +While the broad direction is clear, + +4 +00:00:17,380 --> 00:00:19,730 +there are some organizational +details that + +5 +00:00:19,730 --> 00:00:22,370 +still need to be worked +out as we go along. + +6 +00:00:22,370 --> 00:00:23,930 +The main hub for + +7 +00:00:23,930 --> 00:00:27,215 +all the information is the +KEATS page of this module. + +8 +00:00:27,215 --> 00:00:29,300 +Let me say a few words on how + +9 +00:00:29,300 --> 00:00:35,269 +it is organized. The module is +organized into weekly topics. + +10 +00:00:35,269 --> 00:00:39,785 +So there are entries for weeks +one up till week ten. + +11 +00:00:39,785 --> 00:00:44,390 +Let's also proceed in +approximately weekly steps. + +12 +00:00:44,390 --> 00:00:46,790 +Because if you +already asked me in + +13 +00:00:46,790 --> 00:00:49,130 +week two something +about week ten, + +14 +00:00:49,130 --> 00:00:51,275 +then you might confuse me + +15 +00:00:51,275 --> 00:00:53,720 +and also your fellow students. + +16 +00:00:53,720 --> 00:00:55,970 +All the communication in + +17 +00:00:55,970 --> 00:00:59,945 +this module can be done in +the course discussion area. + +18 +00:00:59,945 --> 00:01:02,300 +You can ask any questions. + +19 +00:01:02,300 --> 00:01:05,465 +For example, I haven't +understood this point, + +20 +00:01:05,465 --> 00:01:07,910 +this video isn't clear, + +21 +00:01:07,910 --> 00:01:10,160 +I cannot run this code, + +22 +00:01:10,160 --> 00:01:13,910 +I have problems with +accessing the coursework. + +23 +00:01:13,910 --> 00:01:16,295 +This can be all discussed here. + +24 +00:01:16,295 --> 00:01:18,590 +If you have any private matter, + +25 +00:01:18,590 --> 00:01:20,900 +like, I can't understand why + +26 +00:01:20,900 --> 00:01:23,465 +did I get such a good +mark, for example? + +27 +00:01:23,465 --> 00:01:27,245 +Or there are some private +reasons why you can't keep up. + +28 +00:01:27,245 --> 00:01:31,430 +Please feel free to use +my private email address. + +29 +00:01:31,430 --> 00:01:35,630 +But for any other matter +related to the module, + +30 +00:01:35,630 --> 00:01:38,480 +try to use the course +discussion area. + +31 +00:01:38,480 --> 00:01:41,510 +And I will try to stay on top + +32 +00:01:41,510 --> 00:01:45,450 +of all the discussions +as much as I can. + +33 +00:01:45,580 --> 00:01:48,500 +Each week is organized into + +34 +00:01:48,500 --> 00:01:52,550 +a mandatory section and +an optional section. + +35 +00:01:52,550 --> 00:01:55,835 +The mandatory section +contains videos, + +36 +00:01:55,835 --> 00:01:58,190 +a handout, and the slides, + +37 +00:01:58,190 --> 00:02:00,725 +which I used to +produce the videos. + +38 +00:02:00,725 --> 00:02:04,324 +I recommend that your +first read the handouts, + +39 +00:02:04,324 --> 00:02:07,430 +then watch the videos and then + +40 +00:02:07,430 --> 00:02:11,030 +read the handout +again for each week. + +41 +00:02:11,030 --> 00:02:14,795 +Apologies, some of these +handouts are quite lengthy. + +42 +00:02:14,795 --> 00:02:17,615 +I think the longest +one is 20 pages. + +43 +00:02:17,615 --> 00:02:19,235 +On the good side, + +44 +00:02:19,235 --> 00:02:21,650 +these handouts are +written in such a way + +45 +00:02:21,650 --> 00:02:24,935 +that you can read them just +before you fall asleep + +46 +00:02:24,935 --> 00:02:26,720 +and you should still be able + +47 +00:02:26,720 --> 00:02:28,835 +to understand what is going on. + +48 +00:02:28,835 --> 00:02:32,105 +Also, they contain lots +of pictures and code. + +49 +00:02:32,105 --> 00:02:35,330 +So 20 pages is +a bit misleading. + +50 +00:02:35,330 --> 00:02:38,870 +You can see the second week +is organized similarly. + +51 +00:02:38,870 --> 00:02:41,030 +You will have some videos. + +52 +00:02:41,030 --> 00:02:42,260 +You will have a handout, + +53 +00:02:42,260 --> 00:02:43,624 +you have the slides, + +54 +00:02:43,624 --> 00:02:45,770 +and some optional resources. + +55 +00:02:45,770 --> 00:02:51,410 +I have also put up a general +section, with general + +56 +00:02:51,410 --> 00:02:57,965 +material, where I put +a PDF about my notation. + +57 +00:02:57,965 --> 00:03:01,385 +This might be very +useful because + +58 +00:03:01,385 --> 00:03:04,610 +remember this topic +is already many, + +59 +00:03:04,610 --> 00:03:07,445 +many decades old, +at least 50 years, + +60 +00:03:07,445 --> 00:03:09,185 +and over the time, + +61 +00:03:09,185 --> 00:03:11,510 +many authors and many lecturers + +62 +00:03:11,510 --> 00:03:13,745 +have introduced +their own notation. + +63 +00:03:13,745 --> 00:03:16,565 +So if you read anything +on the web or in books, + +64 +00:03:16,565 --> 00:03:19,640 +you might be confused about that + +65 +00:03:19,640 --> 00:03:24,035 +they call something, which +I call X, they call Y. + +66 +00:03:24,035 --> 00:03:26,150 +So in this PDF, + +67 +00:03:26,150 --> 00:03:30,320 +I have collected all the +information on what kind of + +68 +00:03:30,320 --> 00:03:34,940 +notation I am using and where +I introduce some shortcuts. + +69 +00:03:34,940 --> 00:03:38,240 +The problem is, you always +want to be precise, + +70 +00:03:38,240 --> 00:03:40,070 +but we also lazy. + +71 +00:03:40,070 --> 00:03:43,745 +We don't want to write +everything into the last detail. + +72 +00:03:43,745 --> 00:03:47,375 +So here I say something +about my notation. + +73 +00:03:47,375 --> 00:03:50,180 +I have also put up +a document about + +74 +00:03:50,180 --> 00:03:53,510 +the Scala I'm going to +use in this module. + +75 +00:03:53,510 --> 00:03:56,090 +This is important +for the coursework. + +76 +00:03:56,090 --> 00:03:58,175 +In the coursework which +will come in a moment, + +77 +00:03:58,175 --> 00:04:00,440 +you can use any +programming language you like + +78 +00:04:00,440 --> 00:04:03,665 +to solve the questions and +implement your code. + +79 +00:04:03,665 --> 00:04:05,615 +But, I'm sorry, + +80 +00:04:05,615 --> 00:04:09,830 +all the code I'm going to +show you will be in Scala. + +81 +00:04:09,830 --> 00:04:12,155 +And the main difference is that + +82 +00:04:12,155 --> 00:04:15,095 +between PEP course from last year + +83 +00:04:15,095 --> 00:04:17,675 +is that I'm going to +use the Ammonite + +84 +00:04:17,675 --> 00:04:21,170 +REPL instead of +the Scala REPL. + +85 +00:04:21,170 --> 00:04:23,615 +They work very similarly. + +86 +00:04:23,615 --> 00:04:26,070 +For example, + +87 +00:04:28,120 --> 00:04:33,710 +I can now evaluate code +and get a feedback of what + +88 +00:04:33,710 --> 00:04:37,100 +it calculates, just like the original one. + +89 +00:04:37,100 --> 00:04:40,220 +The difference is that +Ammonite is + +90 +00:04:40,220 --> 00:04:42,440 +an extension and +provides some features + +91 +00:04:42,440 --> 00:04:44,975 +which will be very useful for us. + +92 +00:04:44,975 --> 00:04:47,600 +And I recorded some +of the features + +93 +00:04:47,600 --> 00:04:50,970 +we are going to use +in this document. + +94 +00:04:52,330 --> 00:04:55,970 +The last point to note +on the KEATS webpage + +95 +00:04:55,970 --> 00:04:59,405 +is that in the optional section, + +96 +00:04:59,405 --> 00:05:02,270 +I will always give +the source code of + +97 +00:05:02,270 --> 00:05:05,845 +the programs I discussed +in the videos and in the handouts. + +98 +00:05:05,845 --> 00:05:08,420 +And I really, really +encourage you + +99 +00:05:08,420 --> 00:05:11,060 +that you play around with this code + +100 +00:05:11,060 --> 00:05:14,420 +to make sure you understand +what was discussed. + +101 +00:05:14,420 --> 00:05:17,540 +At the also have sometimes +some additional pointers + +102 +00:05:17,540 --> 00:05:19,490 +to interesting topics, which + +103 +00:05:19,490 --> 00:05:22,680 +you can read up in your own time. + +104 +00:05:22,690 --> 00:05:26,240 +Exams, I'm sorry, +there will be exams. + +105 +00:05:26,240 --> 00:05:31,760 +There will be a final exam in +January, counting for 30%. + +106 +00:05:31,760 --> 00:05:34,190 +There will be a +midterm shortly after + +107 +00:05:34,190 --> 00:05:37,565 +Reading Week, accounting for 10%. + +108 +00:05:37,565 --> 00:05:40,625 +This will be a normal exams, + +109 +00:05:40,625 --> 00:05:44,780 +just that they will be online +in some form or shape. + +110 +00:05:44,780 --> 00:05:49,939 +There will be also a weekly +engagement component. + +111 +00:05:49,939 --> 00:05:52,370 +So 1% in each week, + +112 +00:05:52,370 --> 00:05:56,360 +where I make sure that +you are engaged with + +113 +00:05:56,360 --> 00:06:01,625 +the material and you will +get 1% each week for that. + +114 +00:06:01,625 --> 00:06:06,020 +How that is going to be +working out in practice, + +115 +00:06:06,020 --> 00:06:08,000 +unfortunately, I do not know + +116 +00:06:08,000 --> 00:06:11,285 +yet what tools we use +or what it means, + +117 +00:06:11,285 --> 00:06:13,685 +but I will let you know about it. + +118 +00:06:13,685 --> 00:06:19,295 +There's also an optional +weekly homework. + +119 +00:06:19,295 --> 00:06:22,999 +The homework is +uploaded on KEATS. + +120 +00:06:22,999 --> 00:06:27,230 +You should send your answers +via email to me. + +121 +00:06:27,230 --> 00:06:31,955 +I will respond individually +and give some feedback. + +122 +00:06:31,955 --> 00:06:33,920 +Normally, if everything is okay + +123 +00:06:33,920 --> 00:06:36,605 +with a question, I will +just respond "OK"? + +124 +00:06:36,605 --> 00:06:38,630 +If there is +something wrong, + +125 +00:06:38,630 --> 00:06:41,060 +I will sometimes answer with a longer + +126 +00:06:41,060 --> 00:06:44,480 +email. All the questions in + +127 +00:06:44,480 --> 00:06:49,415 +the exam and in the midterm +will be from this homework. + +128 +00:06:49,415 --> 00:06:53,330 +So I do not give out the +solutions to the homework. + +129 +00:06:53,330 --> 00:06:55,415 +You have to email me first + +130 +00:06:55,415 --> 00:06:57,290 +what do you think +the solution is... + +131 +00:06:57,290 --> 00:07:01,280 +and I will give answers +individually back. + +132 +00:07:01,280 --> 00:07:05,270 +And I will then ensure +that all the questions in + +133 +00:07:05,270 --> 00:07:09,560 +the exam and the midterm coming +only from this homework. + +134 +00:07:09,560 --> 00:07:11,825 +The simple reason for that is + +135 +00:07:11,825 --> 00:07:14,645 +that I hate if students ask me + +136 +00:07:14,645 --> 00:07:17,705 +if something is relevant +for the exam or not. + +137 +00:07:17,705 --> 00:07:21,620 +I really like this material +and whatever I tell you, + +138 +00:07:21,620 --> 00:07:23,840 +I feel like you +should know about it. + +139 +00:07:23,840 --> 00:07:26,645 +So, to prevent that question, + +140 +00:07:26,645 --> 00:07:28,415 +if it's relevant to the exams, + +141 +00:07:28,415 --> 00:07:32,165 +you can just look at +what's in the homework. + +142 +00:07:32,165 --> 00:07:34,370 +And if this question +is in the homework, + +143 +00:07:34,370 --> 00:07:36,335 +then yes, it will be relevant. + +144 +00:07:36,335 --> 00:07:40,410 +And if not, it will +not. Thank you. + +145 +00:07:42,280 --> 00:07:46,610 +Can you please make one more +point about the homework? + +146 +00:07:46,610 --> 00:07:50,959 +Please submit your +answers only as PDF. + +147 +00:07:50,959 --> 00:07:53,780 +That just makes it easy +for me to look at. + +148 +00:07:53,780 --> 00:07:55,160 +Remember, I might get + +149 +00:07:55,160 --> 00:07:58,820 +60 or more emails about + +150 +00:07:58,820 --> 00:08:03,110 +that and it's just much +easier for me to read PDFs. + +151 +00:08:03,110 --> 00:08:07,880 +Also, please in your answer +copy the question. + +152 +00:08:07,880 --> 00:08:11,480 +I have an extremely limited memory + +153 +00:08:11,480 --> 00:08:15,290 +and have no idea what I actually +asked in this homework until + +154 +00:08:15,290 --> 00:08:16,910 +I actually look it up. + +155 +00:08:16,910 --> 00:08:18,530 +Just to prevent that, + +156 +00:08:18,530 --> 00:08:20,255 +that I have to look it up myself, + +157 +00:08:20,255 --> 00:08:23,765 +please send in +always the question + +158 +00:08:23,765 --> 00:08:26,330 +and then give your answer. + +159 +00:08:26,330 --> 00:08:29,915 +And finally, there are some + +160 +00:08:29,915 --> 00:08:32,540 +optional homework. They are just to + +161 +00:08:32,540 --> 00:08:34,130 +reminders actually that + +162 +00:08:34,130 --> 00:08:36,140 +you really should +have a look at that. + +163 +00:08:36,140 --> 00:08:39,830 +But they're optional. The ones + +164 +00:08:39,830 --> 00:08:43,520 +that you are supposed to +solve, they look like this. + +165 +00:08:43,520 --> 00:08:45,890 +And as I said, + +166 +00:08:45,890 --> 00:08:50,645 +please copy the question +and then give your answer. + +167 +00:08:50,645 --> 00:08:53,780 +And at the end, +package it into a PDF. + +168 +00:08:53,780 --> 00:08:58,505 +Thank you. A very +important component + +169 +00:08:58,505 --> 00:09:02,240 +for assessment in this +module is the coursework. + +170 +00:09:02,240 --> 00:09:04,160 +There will be five of them. + +171 +00:09:04,160 --> 00:09:06,590 +We will first implement a matcher. + +172 +00:09:06,590 --> 00:09:09,770 +Then a lexer building +on top of the matcher. + +173 +00:09:09,770 --> 00:09:13,655 +Then build a parser +and an interpreter. + +174 +00:09:13,655 --> 00:09:16,730 +And then a +compiler for the JVM + +175 +00:09:16,730 --> 00:09:20,420 +and a compiler +targetting the LLVM. + +176 +00:09:20,420 --> 00:09:24,155 +So you can see this +accounts for 45%. + +177 +00:09:24,155 --> 00:09:27,200 +You will submit your +code and you also have + +178 +00:09:27,200 --> 00:09:30,260 +to give some explanation +about your code. + +179 +00:09:30,260 --> 00:09:31,520 +This will be always + +180 +00:09:31,520 --> 00:09:34,655 +explained in the +coursework description. + +181 +00:09:34,655 --> 00:09:36,785 +For the coursework, you can use + +182 +00:09:36,785 --> 00:09:38,945 +any programming +language you like. + +183 +00:09:38,945 --> 00:09:42,860 +In the past, people have +used Haskell and Rust. + +184 +00:09:42,860 --> 00:09:45,905 +Obviously many go for Scala. + +185 +00:09:45,905 --> 00:09:49,730 +You can use any code +I showed you in + +186 +00:09:49,730 --> 00:09:54,230 +the videos or in handouts +or uploaded on KEATS. + +187 +00:09:54,230 --> 00:09:56,390 +But nothing else. + +188 +00:09:56,390 --> 00:09:59,450 +Remember, unlike +in the PEP course, + +189 +00:09:59,450 --> 00:10:02,030 +here, in the Compiler course, + +190 +00:10:02,030 --> 00:10:04,130 +I actually will look with + +191 +00:10:04,130 --> 00:10:07,910 +my own eyes at the submissions +and make judgments. + +192 +00:10:07,910 --> 00:10:13,775 +And I'll see if people have +copied from each other. + +193 +00:10:13,775 --> 00:10:15,545 +Please don't do that! + +194 +00:10:15,545 --> 00:10:20,160 +That's very annoying +for everybody involved. + +195 +00:10:20,440 --> 00:10:23,570 +To conclude this +housekeeping video, + +196 +00:10:23,570 --> 00:10:25,580 +let me tell you a +bit more about how + +197 +00:10:25,580 --> 00:10:28,025 +this module is organized. + +198 +00:10:28,025 --> 00:10:32,600 +The first five weeks we will +spend on lexing and parsing. + +199 +00:10:32,600 --> 00:10:37,370 +This is essentially the +frontend to our compiler. + +200 +00:10:37,370 --> 00:10:39,500 +The first part will +be about lexing. + +201 +00:10:39,500 --> 00:10:42,995 +Actually, we will emphasize quite a +bit on the lexing part, + +202 +00:10:42,995 --> 00:10:45,005 +because that's about my research. + +203 +00:10:45,005 --> 00:10:47,855 +And sorry, I can talk +about this for hours. + +204 +00:10:47,855 --> 00:10:49,400 +But I also like to show you that + +205 +00:10:49,400 --> 00:10:50,750 +there's actually something really + +206 +00:10:50,750 --> 00:10:52,340 +interesting going on and +I want to + +207 +00:10:52,340 --> 00:10:54,529 +show you a number of techniques. + +208 +00:10:54,529 --> 00:10:57,435 +Then obviously we have +to do parsing because + +209 +00:10:57,435 --> 00:11:01,669 +lexing essentially only +finds the words in our programs. + +210 +00:11:01,669 --> 00:11:03,860 +We then have to put +them into sentences to + +211 +00:11:03,860 --> 00:11:07,350 +understand what our +programs actually do. + +212 +00:11:09,820 --> 00:11:12,200 +The next five weeks, + +213 +00:11:12,200 --> 00:11:15,330 +actually that overlaps +with the parsing part, + +214 +00:11:15,330 --> 00:11:18,850 +we will first start with +writing an interpreter. + +215 +00:11:18,850 --> 00:11:20,470 +This is essentially to get + +216 +00:11:20,470 --> 00:11:22,285 +the baseline for our compiler. + +217 +00:11:22,285 --> 00:11:24,325 +Also, you will see compilers are + +218 +00:11:24,325 --> 00:11:27,370 +sometimes fiendishly +difficult to get correct. + +219 +00:11:27,370 --> 00:11:30,880 +And to actually know what +the results should be, + +220 +00:11:30,880 --> 00:11:32,320 +by having it calculated with + +221 +00:11:32,320 --> 00:11:35,515 +an interpreter, is +really important. + +222 +00:11:35,515 --> 00:11:40,030 +And then we really +spent time on a compiler. + +223 +00:11:40,030 --> 00:11:45,310 +We will first write JVM code +for a little While language. + +224 +00:11:45,310 --> 00:11:46,780 +That is an extremely, + +225 +00:11:46,780 --> 00:11:50,335 +extremely simple C-like +language, if you want. + +226 +00:11:50,335 --> 00:11:53,770 +And also JVM Code for +a functional language. + +227 +00:11:53,770 --> 00:11:55,615 +Again, something very small. + +228 +00:11:55,615 --> 00:11:57,815 +And then for that +functional language + +229 +00:11:57,815 --> 00:12:03,080 +also generate LLVM-IR code +that can be + +230 +00:12:03,080 --> 00:12:09,270 +then used to generate directly +CPU code for X86 or ARM. + +231 +00:12:09,460 --> 00:12:12,095 +Just to whet your appetite, + +232 +00:12:12,095 --> 00:12:14,780 +here's the first compiler for + +233 +00:12:14,780 --> 00:12:20,240 +the While-language calculating +the Mandelbrot program. + +234 +00:12:20,240 --> 00:12:22,160 +It generates code for + +235 +00:12:22,160 --> 00:12:25,460 +the JVM, for the Java +virtual machine. + +236 +00:12:25,460 --> 00:12:27,350 +So what it first does, like in + +237 +00:12:27,350 --> 00:12:29,855 +the example I showed you earlier, + +238 +00:12:29,855 --> 00:12:33,215 +it takes the Mandelbrot +program from the BF language, + +239 +00:12:33,215 --> 00:12:35,480 +generates a While program + +240 +00:12:35,480 --> 00:12:38,520 +in our language we are +going to implement... + +241 +00:12:38,530 --> 00:12:40,940 +this has produced a program which is + +242 +00:12:40,940 --> 00:12:44,795 +approximately 70k of source code. + +243 +00:12:44,795 --> 00:12:47,585 +Then our compiler produced + +244 +00:12:47,585 --> 00:12:53,015 +a Java Virtual Machine +byte code file. + +245 +00:12:53,015 --> 00:12:54,890 +This is the readable version of + +246 +00:12:54,890 --> 00:12:57,500 +the Java Virtual Machine bytecode. + +247 +00:12:57,500 --> 00:12:59,480 +And we use then an assembler + +248 +00:12:59,480 --> 00:13:01,535 +to actually produce +the class file. + +249 +00:13:01,535 --> 00:13:04,460 +And then this compiler +just runs this class file. + +250 +00:13:04,460 --> 00:13:06,830 +It takes a little while and +unfortunately this compiler + +251 +00:13:06,830 --> 00:13:09,365 +doesn't show it incrementally. +Only at the end + +252 +00:13:09,365 --> 00:13:13,205 +it shows us it needed +approximately 40 seconds. + +253 +00:13:13,205 --> 00:13:16,745 +And you have to +remember the GCC with -O0 + +254 +00:13:16,745 --> 00:13:21,485 +took a little +bit more than 20 seconds. + +255 +00:13:21,485 --> 00:13:24,440 +And so this is quite +remarkable result + +256 +00:13:24,440 --> 00:13:27,575 +considering we are +running it on the JVM. + +257 +00:13:27,575 --> 00:13:29,345 +Just to assure you, + +258 +00:13:29,345 --> 00:13:33,245 +the Mandelbrot program will not be the +only program we can compile. + +259 +00:13:33,245 --> 00:13:36,380 +But we will focus on integers + +260 +00:13:36,380 --> 00:13:39,605 +and strings in our +programming language. + +261 +00:13:39,605 --> 00:13:43,280 +Because for more complicated +data structures + +262 +00:13:43,280 --> 00:13:46,250 +we don't have enough time +to fit into this module. + +263 +00:13:46,250 --> 00:13:49,040 +So let's start now +with the serious work. + +264 +00:13:49,040 --> 00:13:52,020 +I hope I see you in a bit.