diff -r e70df76926c0 -r 4e628958c01a videos/01-basics1.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-basics1.srt Sat Sep 26 23:45:40 2020 +0100 @@ -0,0 +1,1093 @@ +1 +00:00:06,710 --> 00:00:09,225 +Thanks for tuning in again. + +2 +00:00:09,225 --> 00:00:11,640 +In this video, we want to specify + +3 +00:00:11,640 --> 00:00:14,370 +what problem our regular +expression matcher + +4 +00:00:14,370 --> 00:00:16,155 +is actually supposed to solve. + +5 +00:00:16,155 --> 00:00:18,900 +The reason is that +we know that some of + +6 +00:00:18,900 --> 00:00:21,585 +the existing regular +expression matching engines + +7 +00:00:21,585 --> 00:00:25,200 +are not just abysmally +slow in some examples, + +8 +00:00:25,200 --> 00:00:27,105 +as you've seen in the +previous video, + +9 +00:00:27,105 --> 00:00:30,570 +but also produce sometimes +incorrect results. + +10 +00:00:30,570 --> 00:00:33,330 +In order to avoid +this with our matcher, + +11 +00:00:33,330 --> 00:00:35,325 +we need to somehow explain + +12 +00:00:35,325 --> 00:00:39,255 +precisely what is the problem +our algorithm solves. + +13 +00:00:39,255 --> 00:00:41,935 +This will require +a bit of theory, but + +14 +00:00:41,935 --> 00:00:45,335 +I hope it is nevertheless +a bit of fun. + +15 +00:00:45,335 --> 00:00:47,915 +First, we have to specify + +16 +00:00:47,915 --> 00:00:50,585 +what we mean by a +regular expression. + +17 +00:00:50,585 --> 00:00:53,210 +You've seen earlier some +examples. They were + +18 +00:00:53,210 --> 00:00:56,060 +actually taken or +inspired by what + +19 +00:00:56,060 --> 00:00:58,850 +is available in standard +regular expression matching + +20 +00:00:58,850 --> 00:01:02,330 +engines, like star, plus and n-times. + +21 +00:01:02,330 --> 00:01:05,690 +But for many tasks, +for our algorithm, + +22 +00:01:05,690 --> 00:01:10,174 +we will focus only what I call +basic regular expressions. + +23 +00:01:10,174 --> 00:01:11,840 +Since I'm lazy, I will call + +24 +00:01:11,840 --> 00:01:13,550 +these basic regular expressions + +25 +00:01:13,550 --> 00:01:15,485 +just as regular expressions. + +26 +00:01:15,485 --> 00:01:17,405 +And the ones you've seen earlier + +27 +00:01:17,405 --> 00:01:19,400 +as extended regular expressions. + +28 +00:01:19,400 --> 00:01:22,940 +So the basic regulare expressions, +or just regular expressions, + +29 +00:01:22,940 --> 00:01:25,280 +they will have characters. + +30 +00:01:25,280 --> 00:01:27,170 +So you can match any character, + +31 +00:01:27,170 --> 00:01:31,370 +a,b,c to z or 0 to 9. + +32 +00:01:31,370 --> 00:01:35,525 +Any Ascii character. 'c' here +is just a representative. + +33 +00:01:35,525 --> 00:01:38,825 +So we can match +single characters. + +34 +00:01:38,825 --> 00:01:42,440 +Then we can match alternatives. + +35 +00:01:42,440 --> 00:01:44,930 +That means a string +is either matched + +36 +00:01:44,930 --> 00:01:46,730 +by the regular expression r1 + +37 +00:01:46,730 --> 00:01:49,324 +or by the regular expression r2. + +38 +00:01:49,324 --> 00:01:52,790 +And for the +alternative we write +. + +39 +00:01:52,790 --> 00:01:55,175 +Then we also have sequence. + +40 +00:01:55,175 --> 00:01:57,410 +This sequence regular +expression essentially + +41 +00:01:57,410 --> 00:01:59,915 +says that a string needs to be matched + +42 +00:01:59,915 --> 00:02:02,210 +the first part by +the regular expression r1 + +43 +00:02:02,210 --> 00:02:06,275 +and then the second +part by the r2. + +44 +00:02:06,275 --> 00:02:10,190 +And then we have also the +star regular expression, + +45 +00:02:10,190 --> 00:02:12,980 +which says the regular +expression needs to match + +46 +00:02:12,980 --> 00:02:16,520 +the string with zero +or more copies. + +47 +00:02:16,520 --> 00:02:18,140 +And then we also have some + +48 +00:02:18,140 --> 00:02:20,060 +slightly strange +regular expressions. + +49 +00:02:20,060 --> 00:02:22,505 +We have the regular expression 1, + +50 +00:02:22,505 --> 00:02:25,910 +which can only match +the empty string. + +51 +00:02:25,910 --> 00:02:29,075 +I'm using here the +notation 1 for that + +52 +00:02:29,075 --> 00:02:31,340 +and in my writing I will always + +53 +00:02:31,340 --> 00:02:33,440 +make sure that for the +regular expression + +54 +00:02:33,440 --> 00:02:35,765 +I will write the +1 in a bold font. + +55 +00:02:35,765 --> 00:02:38,510 +So whenever you see +a 1 in bold font, + +56 +00:02:38,510 --> 00:02:40,395 +this is not the 1, but + +57 +00:02:40,395 --> 00:02:44,300 +the regular expression which +can match the empty string. + +58 +00:02:44,300 --> 00:02:48,050 +And we also have the +regular expression 0, + +59 +00:02:48,050 --> 00:02:50,315 +which cannot match +anything at all. + +60 +00:02:50,315 --> 00:02:51,695 +You might think, well, + +61 +00:02:51,695 --> 00:02:54,635 +that's not much use if it cannot +match anything at all, + +62 +00:02:54,635 --> 00:02:58,130 +but you will see why that +one is important later on. + +63 +00:02:58,130 --> 00:03:00,785 +So our basic regular expressions, + +64 +00:03:00,785 --> 00:03:02,375 +they will be 0, + +65 +00:03:02,375 --> 00:03:08,390 +1, characters, alternatives, +sequences and stars. + +66 +00:03:08,390 --> 00:03:12,170 +And these are all the +basic regular expressions. + +67 +00:03:12,170 --> 00:03:16,280 +If this definition is a +bit too abstract for you, + +68 +00:03:16,280 --> 00:03:18,560 +we can also look at +the concrete code, + +69 +00:03:18,560 --> 00:03:23,060 +how that would pan out when +actually writing some Scala. + +70 +00:03:23,060 --> 00:03:28,040 +I promised you, I show +you always my code in Scala. + +71 +00:03:28,040 --> 00:03:29,480 +So here you would have + +72 +00:03:29,480 --> 00:03:32,885 +first an abstract class +for regular expressions. + +73 +00:03:32,885 --> 00:03:37,580 +Then you have one regular +expression for 0, + +74 +00:03:37,580 --> 00:03:41,540 +one regular expression for 1, + +75 +00:03:41,540 --> 00:03:42,875 +one regular expression, which +takes an argument, + +76 +00:03:42,875 --> 00:03:45,050 +the character you want to match, + +77 +00:03:45,050 --> 00:03:47,915 +the characters a,b, c and so on. + +78 +00:03:47,915 --> 00:03:50,945 +Then we have an alternative +regular expression, + +79 +00:03:50,945 --> 00:03:53,480 +which takes the first +alternative and + +80 +00:03:53,480 --> 00:03:56,435 +the second alternative +as arguments. + +81 +00:03:56,435 --> 00:03:59,690 +And we have a sequence +regular expression. Again, + +82 +00:03:59,690 --> 00:04:01,850 +which takes the +first component and + +83 +00:04:01,850 --> 00:04:04,730 +the second component +as two arguments. + +84 +00:04:04,730 --> 00:04:07,249 +And we have the star +regular expression, + +85 +00:04:07,249 --> 00:04:10,880 +which just take one regular +expression as argument. + +86 +00:04:10,880 --> 00:04:16,115 +And all these reg expressions +extend our abstract class. + +87 +00:04:16,115 --> 00:04:20,300 +For whatever I do in +this module here I have + +88 +00:04:20,300 --> 00:04:23,300 +the convention that all +the regular expressions + +89 +00:04:23,300 --> 00:04:25,550 +are written with capital letters. + +90 +00:04:25,550 --> 00:04:26,885 +As you can see that here, + +91 +00:04:26,885 --> 00:04:31,685 +O, 1, character, these will be +always regular expressions. + +92 +00:04:31,685 --> 00:04:34,370 +They have all capital letters. + +93 +00:04:34,370 --> 00:04:36,484 +Let's for a moment, + +94 +00:04:36,484 --> 00:04:38,720 +play around with this definition. + +95 +00:04:38,720 --> 00:04:41,945 +I'm using here the +Ammonite REPL. + +96 +00:04:41,945 --> 00:04:46,950 +And I can evaluate +this definition. + +97 +00:04:53,430 --> 00:04:55,810 +And now I can start to + +98 +00:04:55,810 --> 00:04:58,570 +define particular +regular expressions. + +99 +00:04:58,570 --> 00:05:00,340 +For example, if I need +a regular expression + +100 +00:05:00,340 --> 00:05:02,860 +which can recognise +the character a, + +101 +00:05:02,860 --> 00:05:06,025 +then I would write +something like this. + +102 +00:05:06,025 --> 00:05:08,710 +So this regular expression +takes an argument, + +103 +00:05:08,710 --> 00:05:13,615 +the character 'a' to specify +which character to match. + +104 +00:05:13,615 --> 00:05:16,945 +We do this obviously also with 'b'. + +105 +00:05:16,945 --> 00:05:19,405 +And I can do that with + +106 +00:05:19,405 --> 00:05:22,975 +'c'. So now we have three +regular expressions. + +107 +00:05:22,975 --> 00:05:25,570 +If you look very carefully +at this definition, + +108 +00:05:25,570 --> 00:05:27,070 +you can actually see + +109 +00:05:27,070 --> 00:05:29,940 +these regular +expressions are trees. + +110 +00:05:29,940 --> 00:05:33,365 +So no matter what we +write down on paper, + +111 +00:05:33,365 --> 00:05:36,755 +they are behind the +scenes always trees. + +112 +00:05:36,755 --> 00:05:40,010 +And you can see that +actually in this definition. + +113 +00:05:40,010 --> 00:05:44,330 +If you define two regular +expressions r1 and r2. + +114 +00:05:44,330 --> 00:05:49,310 +They are essentially +the alternative of a, b and c. + +115 +00:05:49,310 --> 00:05:52,760 +Then this regular expression + +116 +00:05:52,760 --> 00:05:54,710 +can match either the character + +117 +00:05:54,710 --> 00:05:57,980 +a or the character b +or the character c. + +118 +00:05:57,980 --> 00:06:01,640 +And the same for the +regular expression r2. + +119 +00:06:01,640 --> 00:06:03,875 +So let me just evaluate that. + +120 +00:06:03,875 --> 00:06:05,690 +And even though these are + +121 +00:06:05,690 --> 00:06:07,175 +two regular expressions +which can match + +122 +00:06:07,175 --> 00:06:11,750 +exactly the same things, +they a different trees. + +123 +00:06:11,750 --> 00:06:14,195 +So if I ask Scala, + +124 +00:06:14,195 --> 00:06:16,460 +are these trees different? + +125 +00:06:16,460 --> 00:06:19,250 +Or ask if they're + +126 +00:06:19,250 --> 00:06:21,865 +the same, then Scala will say No, + +127 +00:06:21,865 --> 00:06:25,440 +they actually different trees. + +128 +00:06:25,450 --> 00:06:28,459 +Let's come back to +this definition. + +129 +00:06:28,459 --> 00:06:31,760 +If we want to write down +regular expressions on paper, + +130 +00:06:31,760 --> 00:06:33,620 +then we want to be sloppy as + +131 +00:06:33,620 --> 00:06:35,750 +mathematicians rather than as + +132 +00:06:35,750 --> 00:06:37,745 +precise as computer scientists. + +133 +00:06:37,745 --> 00:06:40,490 +So when we want to write down +a regular expression which can + +134 +00:06:40,490 --> 00:06:43,955 +either match the character +a or the character b, + +135 +00:06:43,955 --> 00:06:49,130 +then we would write down +something like this, a plus b. + +136 +00:06:49,130 --> 00:06:51,170 +And if you want to have +the regular expression + +137 +00:06:51,170 --> 00:06:52,625 +which can either match + +138 +00:06:52,625 --> 00:06:55,925 +the character a or b or c, + +139 +00:06:55,925 --> 00:06:58,340 +we will write +something like this. + +140 +00:06:58,340 --> 00:07:01,370 +But of course behind the +scenes, these are trees. + +141 +00:07:01,370 --> 00:07:04,460 +So we should have written +them with parentheses. + +142 +00:07:04,460 --> 00:07:06,440 +And you can see +actually, there are two + +143 +00:07:06,440 --> 00:07:08,990 +regular expressions I +could have written down. + +144 +00:07:08,990 --> 00:07:11,270 +They're different. + +145 +00:07:11,270 --> 00:07:12,710 +Just by convention, + +146 +00:07:12,710 --> 00:07:15,575 +we on't write these parentheses. + +147 +00:07:15,575 --> 00:07:18,740 +And that is similar with sequences. + +148 +00:07:18,740 --> 00:07:20,000 +If I want to write down + +149 +00:07:20,000 --> 00:07:22,955 +the regular expression which +can match first an 'a', + +150 +00:07:22,955 --> 00:07:25,010 +then a 'b', and then a 'c', + +151 +00:07:25,010 --> 00:07:28,160 +then I would write down +something like this. + +152 +00:07:28,160 --> 00:07:32,120 +Just, there are again + +153 +00:07:32,120 --> 00:07:35,735 +two regular expressions I +could have written down. + +154 +00:07:35,735 --> 00:07:38,480 +Again by convention we don't + +155 +00:07:38,480 --> 00:07:40,670 +write these parentheses though. + +156 +00:07:40,670 --> 00:07:42,350 +However, sometimes we have to be + +157 +00:07:42,350 --> 00:07:43,940 +very careful with parentheses, + +158 +00:07:43,940 --> 00:07:47,195 +especially with star. + +159 +00:07:47,195 --> 00:07:50,525 +Because this regular expression +is definitely not + +160 +00:07:50,525 --> 00:07:54,900 +the same as this regular expression. + +161 +00:07:56,100 --> 00:07:59,410 +The first one here can match + +162 +00:07:59,410 --> 00:08:03,610 +any strings containing a or b's. + +163 +00:08:03,610 --> 00:08:05,860 +While this regular expression can + +164 +00:08:05,860 --> 00:08:07,945 +only match the single character + +165 +00:08:07,945 --> 00:08:13,300 +a or any string +containing only b's. + +166 +00:08:13,300 --> 00:08:15,265 +So to make the difference clear, + +167 +00:08:15,265 --> 00:08:20,065 +in this example, we would have +to use the parentheses. + +168 +00:08:20,065 --> 00:08:23,140 +There's one more issue +with this definition. + +169 +00:08:23,140 --> 00:08:26,635 +Why do we focus on these +basic regular expressions? + +170 +00:08:26,635 --> 00:08:28,660 +Why don't we also include + +171 +00:08:28,660 --> 00:08:31,285 +the ones from the +extended regular expressions. + +172 +00:08:31,285 --> 00:08:33,055 +The answers very easy. + +173 +00:08:33,055 --> 00:08:35,680 +These basic regular +expressions can be used + +174 +00:08:35,680 --> 00:08:38,370 +to represent also +the extended ones. + +175 +00:08:38,370 --> 00:08:40,220 +Let me give you some examples. + +176 +00:08:40,220 --> 00:08:44,225 +If I have a regular +expression r+, for example, + +177 +00:08:44,225 --> 00:08:46,280 +then the meaning +was I have to use + +178 +00:08:46,280 --> 00:08:49,115 +at least one or more copies + +179 +00:08:49,115 --> 00:08:51,200 +of this r to +match a string. + +180 +00:08:51,200 --> 00:08:53,810 +Well, one or more copies +can be represented by + +181 +00:08:53,810 --> 00:08:58,385 +the basic ones as just +r followed by r*. + +182 +00:08:58,385 --> 00:09:01,760 +Meaning I have to use one +copy of r, followed by + +183 +00:09:01,760 --> 00:09:05,150 +0 or more copies of r. + +184 +00:09:05,150 --> 00:09:07,895 +Similarly, if I have the optional +regular expression, + +185 +00:09:07,895 --> 00:09:10,715 +which is supposed to +match a string + +186 +00:09:10,715 --> 00:09:13,865 +by using r, or match +the empty string. + +187 +00:09:13,865 --> 00:09:19,295 +Then this can be obviously +defined as r + 1. + +188 +00:09:19,295 --> 00:09:23,945 +So here is the bold +regular expression 1, + +189 +00:09:23,945 --> 00:09:26,180 +which means it either can + +190 +00:09:26,180 --> 00:09:28,205 +recognize whatever +r can recognize, + +191 +00:09:28,205 --> 00:09:30,470 +or it can recognize +the empty string. + +192 +00:09:30,470 --> 00:09:35,150 +And if I have ranges, like a +to z, then I can define + +193 +00:09:35,150 --> 00:09:41,135 +that as a + b + c + ... +and so on until z. + +194 +00:09:41,135 --> 00:09:45,920 +Maybe this definition is not +good in terms of runtime, + +195 +00:09:45,920 --> 00:09:47,960 +but in terms of just being able + +196 +00:09:47,960 --> 00:09:50,780 +to recognize strings +or match strings, + +197 +00:09:50,780 --> 00:09:54,680 +the basic regular expressions +will be just sufficient. + +198 +00:09:54,680 --> 00:09:56,690 +Unfortunately, we +also need to have + +199 +00:09:56,690 --> 00:09:58,850 +a quick chat about strings. + +200 +00:09:58,850 --> 00:10:02,255 +In Scala, it's crystal +clear what a string is. + +201 +00:10:02,255 --> 00:10:05,480 +There's a separate datatype +which is called string. + +202 +00:10:05,480 --> 00:10:07,895 +So here, for example, +is a string. + +203 +00:10:07,895 --> 00:10:09,200 +And as you can see, + +204 +00:10:09,200 --> 00:10:11,105 +it is of the type string. + +205 +00:10:11,105 --> 00:10:13,985 +And the empty string +will be just that. + +206 +00:10:13,985 --> 00:10:16,160 +However, when we write things down on + +207 +00:10:16,160 --> 00:10:18,320 +paper and think +about our algorithm, + +208 +00:10:18,320 --> 00:10:22,790 +we want to think of strings +as lists of characters. + +209 +00:10:22,790 --> 00:10:26,070 +So more something like this. + +210 +00:10:27,070 --> 00:10:31,745 +You can see here, this is actually +a list of characters. + +211 +00:10:31,745 --> 00:10:35,150 +And the two operations +we need are taking + +212 +00:10:35,150 --> 00:10:37,280 +the head of this list and + +213 +00:10:37,280 --> 00:10:39,770 +the rest of the list +or tail of the list. + +214 +00:10:39,770 --> 00:10:41,720 +That's why we want +to regard them as + +215 +00:10:41,720 --> 00:10:45,260 +lists rather than strings. + +216 +00:10:45,260 --> 00:10:48,200 +So if I'm using a +string like this, + +217 +00:10:48,200 --> 00:10:51,935 +then on paper I always will +write something like that. + +218 +00:10:51,935 --> 00:10:54,575 +Or since I'm lazy, just that. + +219 +00:10:54,575 --> 00:10:56,675 +And for the empty string, + +220 +00:10:56,675 --> 00:10:59,210 +I will write either +the empty list, with + +221 +00:10:59,210 --> 00:11:03,920 +two brackets or, +being lazy, just that. + +222 +00:11:03,920 --> 00:11:06,620 +Actually there is one +more operation we need on + +223 +00:11:06,620 --> 00:11:09,410 +strings and that +is concatenation. + +224 +00:11:09,410 --> 00:11:11,255 +If you have a string s1, + +225 +00:11:11,255 --> 00:11:14,510 +string s2, and put an +at symbol in between, + +226 +00:11:14,510 --> 00:11:18,050 +that means we want to +concatenate both strings. + +227 +00:11:18,050 --> 00:11:22,625 +So foo concatenated with +bar, would be foobar. + +228 +00:11:22,625 --> 00:11:25,085 +And any string concatenated with + +229 +00:11:25,085 --> 00:11:27,950 +the empty string +is left untouched. + +230 +00:11:27,950 --> 00:11:31,310 +So baz concatenated with + +231 +00:11:31,310 --> 00:11:33,545 +the empty string, is just baz. + +232 +00:11:33,545 --> 00:11:37,295 +So that's like if we have +strings as lists of characters, + +233 +00:11:37,295 --> 00:11:39,755 +that will be just list append. + +234 +00:11:39,755 --> 00:11:41,480 +In the next video, + +235 +00:11:41,480 --> 00:11:43,160 +we will use these definitions + +236 +00:11:43,160 --> 00:11:45,050 +and introduce the notion of what + +237 +00:11:45,050 --> 00:11:46,850 +a language is and + +238 +00:11:46,850 --> 00:11:49,920 +what the meaning of a +regular expression is.