diff -r b294cfbb5c01 -r e8402d8ec8e6 videos/02-prelims.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/02-prelims.srt Tue Sep 29 12:52:07 2020 +0100 @@ -0,0 +1,1217 @@ +1 +00:00:09,990 --> 00:00:13,465 +Welcome back to this +week's lecture. + +2 +00:00:13,465 --> 00:00:16,450 +The task this week is to actually + +3 +00:00:16,450 --> 00:00:20,320 +implement a regular +expression matcher. + +4 +00:00:20,320 --> 00:00:22,510 +And we want to be a bit + +5 +00:00:22,510 --> 00:00:25,315 +faster than the regular +expression matcher + +6 +00:00:25,315 --> 00:00:29,380 +in Python, Ruby, +Javascript, and Java. + +7 +00:00:29,380 --> 00:00:31,960 +Remember this regular expression + +8 +00:00:31,960 --> 00:00:35,785 +and strings which are +just a number of a's, + +9 +00:00:35,785 --> 00:00:39,670 +these regular expression matchers +where abysmally slow. + +10 +00:00:39,670 --> 00:00:43,170 +They could only match +approximately 30 a's in + +11 +00:00:43,170 --> 00:00:45,665 +something like half a minute. + +12 +00:00:45,665 --> 00:00:49,460 +What we like to do with +our regular expression matcher. + +13 +00:00:49,460 --> 00:00:51,890 +We will try a few times, + +14 +00:00:51,890 --> 00:00:55,505 +but our second trial will already +be much, much better. + +15 +00:00:55,505 --> 00:00:58,400 +It will probably +match around maybe + +16 +00:00:58,400 --> 00:01:02,030 +thousand a's in something +in half a minute. + +17 +00:01:02,030 --> 00:01:03,920 +But our third version in + +18 +00:01:03,920 --> 00:01:06,230 +comparison will be +blindingly fast. + +19 +00:01:06,230 --> 00:01:08,180 +And we'll be able to match + +20 +00:01:08,180 --> 00:01:10,145 +strings of length 10,000 a's + +21 +00:01:10,145 --> 00:01:12,230 +and will not require + +22 +00:01:12,230 --> 00:01:14,975 +more than five seconds. +So let's go ahead with that. + +23 +00:01:14,975 --> 00:01:18,095 +We will first implement + +24 +00:01:18,095 --> 00:01:19,880 +our regular expression +matcher only + +25 +00:01:19,880 --> 00:01:22,069 +for the basic +regular expressions. + +26 +00:01:22,069 --> 00:01:25,430 +So we will have only six +cases to think about. + +27 +00:01:25,430 --> 00:01:27,620 +This will simplify matters at first. + +28 +00:01:27,620 --> 00:01:30,350 +Later we can +think about how to + +29 +00:01:30,350 --> 00:01:34,100 +extend that to the extended +regular expressions. + +30 +00:01:34,100 --> 00:01:37,625 +Unfortunately, before we can +start our implementation, + +31 +00:01:37,625 --> 00:01:39,290 +we have to discuss + +32 +00:01:39,290 --> 00:01:42,470 +when two regular +expressions are equivalent. + +33 +00:01:42,470 --> 00:01:46,595 +And we use here this notation +of the triple equals. + +34 +00:01:46,595 --> 00:01:48,215 +We say a regular expression + +35 +00:01:48,215 --> 00:01:50,180 +r1 and r2 are + +36 +00:01:50,180 --> 00:01:56,059 +equivalent if and only +if the language of r1 is + +37 +00:01:56,059 --> 00:01:59,360 +equal to the language of r2. + +38 +00:01:59,360 --> 00:02:02,915 +Sounds complicated, +but essentially means + +39 +00:02:02,915 --> 00:02:04,700 +if the two regular expressions can + +40 +00:02:04,700 --> 00:02:06,950 +match exactly the same strings, + +41 +00:02:06,950 --> 00:02:09,800 +then we want to regard +them as equivalent. + +42 +00:02:09,800 --> 00:02:14,300 +This equivalence justifies +why we can be sloppy + +43 +00:02:14,300 --> 00:02:16,040 +with parentheses. + +44 +00:02:16,040 --> 00:02:23,870 +For example, if we have +(r1 + r2) + r3, + +45 +00:02:23,870 --> 00:02:32,270 +then this will be equivalent +to r1 + (r2 + r3). + +46 +00:02:32,270 --> 00:02:34,910 +Remember, regular +expressions are trees, + +47 +00:02:34,910 --> 00:02:37,940 +so these are two different +regular expressions, + +48 +00:02:37,940 --> 00:02:41,540 +but they can match +the same strings. + +49 +00:02:41,540 --> 00:02:45,695 +Because, well, if we take +here the meaning of that, + +50 +00:02:45,695 --> 00:02:54,350 +that would be just L(r1 + r2 + r3), + + +51 +00:02:54,350 --> 00:03:04,100 +which is equal to L(r1 + r2) u L(r3). + + +52 +00:03:04,100 --> 00:03:10,595 +And that is equal to of + +53 +00:03:10,595 --> 00:03:15,710 +L(r1) u L(r2) u L(r3). + + +54 +00:03:15,710 --> 00:03:17,930 +And if you do the +same calculation + +55 +00:03:17,930 --> 00:03:19,445 +for that regular expression, + +56 +00:03:19,445 --> 00:03:22,940 +you will find out the +two sets are the same. + +57 +00:03:22,940 --> 00:03:26,195 +So both regular expressions +can match the same strings. + +58 +00:03:26,195 --> 00:03:28,805 +So even if they're different +regular expressions, + +59 +00:03:28,805 --> 00:03:31,220 +we can regard them as eqivalent. + +60 +00:03:31,220 --> 00:03:33,290 +And so that's why we can forget + +61 +00:03:33,290 --> 00:03:35,270 +about writing these parentheses. + +62 +00:03:35,270 --> 00:03:40,205 +Let's have a look +at some more examples. + +63 +00:03:40,205 --> 00:03:41,870 +So the first one here, + +64 +00:03:41,870 --> 00:03:43,205 +that is now clear. + +65 +00:03:43,205 --> 00:03:45,200 +We did this calculation already + +66 +00:03:45,200 --> 00:03:47,480 +for arbitrary regular expressions. + +67 +00:03:47,480 --> 00:03:49,520 +Here it is for +concrete ones a, b and c. + +68 +00:03:49,520 --> 00:03:52,690 +Here on the left-hand side, + +69 +00:03:52,690 --> 00:03:54,895 +the regular expression +has the same meaning + +70 +00:03:54,895 --> 00:03:56,620 +as on the right-hand side. + +71 +00:03:56,620 --> 00:03:58,420 +So they are equivalent. + +72 +00:03:58,420 --> 00:04:01,375 +We can ignore the parentheses. + +73 +00:04:01,375 --> 00:04:03,220 +If I have a choice, + +74 +00:04:03,220 --> 00:04:06,610 +but the choice is +the same a or a, + +75 +00:04:06,610 --> 00:04:09,265 +then this is +equivalent to just a. + +76 +00:04:09,265 --> 00:04:13,075 +So the same choice doesn't +introduce anything more. + +77 +00:04:13,075 --> 00:04:15,370 +In the next case, if I have + +78 +00:04:15,370 --> 00:04:19,450 +a regular expression +which can match a or b, + +79 +00:04:19,450 --> 00:04:23,860 +that can match the same +strings as b or a. + +80 +00:04:23,860 --> 00:04:27,325 +So you have a kind of +commutativity for the plus, + +81 +00:04:27,325 --> 00:04:29,485 +which is like on natural numbers. + +82 +00:04:29,485 --> 00:04:32,880 +But you would not have a +communitivity for the sequence: + +83 +00:04:32,880 --> 00:04:37,070 +if you have +first a and then b, + +84 +00:04:37,070 --> 00:04:40,850 +that's not the same as +matching first b and then a. + +85 +00:04:40,850 --> 00:04:42,800 +So for the sequence ... + +86 +00:04:42,800 --> 00:04:44,615 +they are not equivalent. + +87 +00:04:44,615 --> 00:04:49,475 +However, for the sequence I +can, like for the plus, ignore + +88 +00:04:49,475 --> 00:04:51,245 +prarentheses. + +89 +00:04:51,245 --> 00:04:55,070 +If I have the parentheses +and this way, + +90 +00:04:55,070 --> 00:04:57,785 +or I have it in this way. + +91 +00:04:57,785 --> 00:05:00,605 +These are two different +regular expressions, + +92 +00:05:00,605 --> 00:05:02,120 +but they have the same meaning. + +93 +00:05:02,120 --> 00:05:03,815 +So they are equivalent. + +94 +00:05:03,815 --> 00:05:05,780 +And so I can leave out parentheses + +95 +00:05:05,780 --> 00:05:09,170 +for times as well. + +96 +00:05:09,170 --> 00:05:12,185 +The next one is a slightly +more interesting one. + +97 +00:05:12,185 --> 00:05:15,020 +If I have a regular +expression which can match + +98 +00:05:15,020 --> 00:05:19,655 +c first followed by (a + b), + +99 +00:05:19,655 --> 00:05:25,355 +then this is the same as +first c followed by a, + +100 +00:05:25,355 --> 00:05:28,640 +or c followed by b. + +101 +00:05:28,640 --> 00:05:31,265 +So that's the kind of +distributivity law + +102 +00:05:31,265 --> 00:05:33,545 +on regular expressions. + +103 +00:05:33,545 --> 00:05:36,350 +But they're also regular expressions +which are not equivalent. + +104 +00:05:36,350 --> 00:05:38,990 +If I have the regular +expression which can + +105 +00:05:38,990 --> 00:05:42,800 +match the string containing +exactly two a's. + +106 +00:05:42,800 --> 00:05:44,240 +That is not equivalent + +107 +00:05:44,240 --> 00:05:46,730 +to the regular +expression which can just match + +108 +00:05:46,730 --> 00:05:49,250 +a single a. And similarly + +109 +00:05:49,250 --> 00:05:51,680 +in this case, if I can match + +110 +00:05:51,680 --> 00:05:56,075 +a or the string b followed by c, + +111 +00:05:56,075 --> 00:05:59,825 +that is not the same as a or b, + +112 +00:05:59,825 --> 00:06:02,000 +followed by a or c. + +113 +00:06:02,000 --> 00:06:05,990 +I'll let you think about +why is that not equivalent. + +114 +00:06:05,990 --> 00:06:08,060 +Essentially you +have to find out what's + +115 +00:06:08,060 --> 00:06:10,475 +the meaning of both +regular expressions. + +116 +00:06:10,475 --> 00:06:14,090 +And are they the +same sets or not? + +117 +00:06:14,090 --> 00:06:17,435 +There are again some +interesting corner cases. + +118 +00:06:17,435 --> 00:06:20,540 +If I have a regular expression +that can match a, + +119 +00:06:20,540 --> 00:06:23,450 +but followed by the regular +expression which cannot match + +120 +00:06:23,450 --> 00:06:25,670 +anything, that is not equivalent + +121 +00:06:25,670 --> 00:06:29,255 +to the regular +expression which can match a. + +122 +00:06:29,255 --> 00:06:31,340 +Again here you have +to do the calculation + +123 +00:06:31,340 --> 00:06:32,915 +to convince you of that. + +124 +00:06:32,915 --> 00:06:35,840 +Similarly, if I have a regular +expression which can + +125 +00:06:35,840 --> 00:06:38,750 +match a or the empty string, + +126 +00:06:38,750 --> 00:06:40,640 +this is not equivalent to + +127 +00:06:40,640 --> 00:06:43,355 +the regular expression +which can just match a. + +128 +00:06:43,355 --> 00:06:46,925 +Here are some interesting +ones with zeros and ones. + +129 +00:06:46,925 --> 00:06:48,860 +So if I have the regular expression + +130 +00:06:48,860 --> 00:06:50,975 +that can match the empty string, + +131 +00:06:50,975 --> 00:06:53,540 +this will be actually equivalent to + +132 +00:06:53,540 --> 00:06:56,435 +the regular expression +which can match nothing, + +133 +00:06:56,435 --> 00:06:59,405 +but star of that. + +134 +00:06:59,405 --> 00:07:01,970 +Remember the star +always introduces, + +135 +00:07:01,970 --> 00:07:04,370 +no matter what, the empty string. + +136 +00:07:04,370 --> 00:07:06,170 +So this regular expression can match + +137 +00:07:06,170 --> 00:07:08,930 +the empty string, +just like the 1. + +138 +00:07:08,930 --> 00:07:12,125 +So these two expressions +are not the same, + +139 +00:07:12,125 --> 00:07:14,210 +but they are equivalent. + +140 +00:07:14,210 --> 00:07:16,700 +And it doesn't matter +whether you take + +141 +00:07:16,700 --> 00:07:20,090 +the empty string to the star, + +142 +00:07:20,090 --> 00:07:23,855 +because if I can match 0 or +more times the empty string, + +143 +00:07:23,855 --> 00:07:27,450 +that will be equivalent to +just the empty string. + +144 +00:07:27,520 --> 00:07:32,510 +Slightly similar to the +third case is this one. + +145 +00:07:32,510 --> 00:07:35,720 +Zero star is not equivalent to 0 + +146 +00:07:35,720 --> 00:07:40,025 +because that can match +exactly the empty string. + +147 +00:07:40,025 --> 00:07:42,740 +This cannot match anything. + +148 +00:07:42,740 --> 00:07:44,839 +While the previous + +149 +00:07:44,839 --> 00:07:47,540 +equivalences are all very +instructive to look at, + +150 +00:07:47,540 --> 00:07:49,910 +these are the +equivalences we need + +151 +00:07:49,910 --> 00:07:52,685 +in our regular expression matchers. + +152 +00:07:52,685 --> 00:07:54,350 +And they are defined for + +153 +00:07:54,350 --> 00:07:58,160 +all regular expressions r. +So r plus 0, + +154 +00:07:58,160 --> 00:08:00,470 +no matter what r looks +like is equivalent + +155 +00:08:00,470 --> 00:08:02,945 +to just r. Similarly + +156 +00:08:02,945 --> 00:08:05,930 +0 plus r is also +equivalent to r. + +157 +00:08:05,930 --> 00:08:08,525 +The order of this +choice doesn't matter; + +158 +00:08:08,525 --> 00:08:11,495 +r followed by 1, + +159 +00:08:11,495 --> 00:08:14,030 +that's the sequence +regular expression, is + +160 +00:08:14,030 --> 00:08:16,760 +equivalent to r. The +sam, r followed by + +161 +00:08:16,760 --> 00:08:19,010 +r is equivalent to r. + +162 +00:08:19,010 --> 00:08:20,990 +And r followed by + +163 +00:08:20,990 --> 00:08:23,390 +the regular expression which +cannot match anything, + +164 +00:08:23,390 --> 00:08:27,455 +is equivalent to just 0. + +165 +00:08:27,455 --> 00:08:30,740 +And 0 followed by r is also equivalent to 0. + +166 +00:08:30,740 --> 00:08:33,615 +And if you have the +choice of r plus r, + +167 +00:08:33,615 --> 00:08:37,210 +that is also +equivalent to just r. + +168 +00:08:37,210 --> 00:08:40,270 +What we're going to do +with these equivalences is + +169 +00:08:40,270 --> 00:08:42,820 +that we regard them as +simplification rules. + +170 +00:08:42,820 --> 00:08:43,930 +So whenever we see + +171 +00:08:43,930 --> 00:08:46,465 +this regular expression +in our algorithm, + +172 +00:08:46,465 --> 00:08:48,580 +we will replace it by +the right-hand side. + +173 +00:08:48,580 --> 00:08:51,715 +So we will orient +these equivalences. + +174 +00:08:51,715 --> 00:08:53,650 +If we see, this regular expression, + +175 +00:08:53,650 --> 00:08:55,225 +we will replace it by that one. + +176 +00:08:55,225 --> 00:08:58,945 +And it will not matter +because the left-hand sides + +177 +00:08:58,945 --> 00:09:01,255 +can match exactly +the same strings + +178 +00:09:01,255 --> 00:09:03,475 +as the right-hand sides. + +179 +00:09:03,475 --> 00:09:06,370 +Here two quick examples. + +180 +00:09:06,370 --> 00:09:08,680 +The first one, let's +assume you have + +181 +00:09:08,680 --> 00:09:10,270 +a regular expression r and + +182 +00:09:10,270 --> 00:09:11,905 +there is something +in front of it. + +183 +00:09:11,905 --> 00:09:13,720 +This "something in front of it" + +184 +00:09:13,720 --> 00:09:15,870 +can be simplified as follows. + +185 +00:09:15,870 --> 00:09:20,000 +This 1 times b +can be simplified to b. + +186 +00:09:20,000 --> 00:09:23,555 +Then this b plus 0 can +be simplified to b. + +187 +00:09:23,555 --> 00:09:25,490 +And assuming that nothing can + +188 +00:09:25,490 --> 00:09:27,470 +be simplified inside this r, + +189 +00:09:27,470 --> 00:09:28,700 +then this would be + +190 +00:09:28,700 --> 00:09:33,050 +the simplified version +of this regular expression. + +191 +00:09:33,050 --> 00:09:34,820 +And because the simplification + +192 +00:09:34,820 --> 00:09:36,965 +rules preserve what can be matched, + +193 +00:09:36,965 --> 00:09:39,080 +we can be sure that +this regular expression + +194 +00:09:39,080 --> 00:09:41,255 +can match exactly the strings + +195 +00:09:41,255 --> 00:09:43,940 +this regular expression can match. + +196 +00:09:43,940 --> 00:09:45,740 +Here is the other example. + +197 +00:09:45,740 --> 00:09:49,550 +This has just a tiny change +that this becomes here as 0. + +198 +00:09:49,550 --> 00:09:54,485 +Then 0 times b can be +replaced by just 0. + +199 +00:09:54,485 --> 00:09:56,705 +Then they are actually two +rules which could apply + +200 +00:09:56,705 --> 00:09:59,014 +either that we have +the same choice + +201 +00:09:59,014 --> 00:10:01,160 +or we can just say something plus + +202 +00:10:01,160 --> 00:10:04,100 +0 will always go to something. + +203 +00:10:04,100 --> 00:10:06,245 +So we can simplify it to that. + +204 +00:10:06,245 --> 00:10:08,510 +And then we have +0 times r again, + +205 +00:10:08,510 --> 00:10:10,460 +and that can be simplified to 0. + +206 +00:10:10,460 --> 00:10:12,080 +So actually what we find out with + +207 +00:10:12,080 --> 00:10:14,645 +this calculation is that +this regular expression, + +208 +00:10:14,645 --> 00:10:16,835 +even if it looks +quite complicated, + +209 +00:10:16,835 --> 00:10:19,295 +actually doesn't +match any string. + +210 +00:10:19,295 --> 00:10:21,290 +Because it matches exactly + +211 +00:10:21,290 --> 00:10:23,420 +those string this regular +expression can match. + +212 +00:10:23,420 --> 00:10:26,285 +And this reg expression +cannot match any. + +213 +00:10:26,285 --> 00:10:29,930 +We need one more +operation on languages. + +214 +00:10:29,930 --> 00:10:31,700 +I call this operation + +215 +00:10:31,700 --> 00:10:35,165 +the semantic derivative +of a language. + +216 +00:10:35,165 --> 00:10:37,775 +This operation takes +two arguments. + +217 +00:10:37,775 --> 00:10:40,670 +It takes a character +which we take + +218 +00:10:40,670 --> 00:10:43,925 +the semantic derivative +according to, + +219 +00:10:43,925 --> 00:10:45,980 +and it takes a language. + +220 +00:10:45,980 --> 00:10:49,579 +And what it does is it +looks into this language + +221 +00:10:49,579 --> 00:10:51,365 +and looks for all the strings + +222 +00:10:51,365 --> 00:10:53,735 +that start with this character, + +223 +00:10:53,735 --> 00:10:56,405 +let's say c, and then + +224 +00:10:56,405 --> 00:11:00,920 +just strips off that c. +So here's an example. + +225 +00:11:00,920 --> 00:11:02,975 +Suppose you have a language A, + +226 +00:11:02,975 --> 00:11:04,460 +with the strings + +227 +00:11:04,460 --> 00:11:06,965 +foo, bar and frak. + +228 +00:11:06,965 --> 00:11:10,835 +And you take the semantic +derivative according to f, + +229 +00:11:10,835 --> 00:11:14,450 +then we discard all the +strings which do not + +230 +00:11:14,450 --> 00:11:18,320 +start with an f. So +bar will be discarded. + +231 +00:11:18,320 --> 00:11:22,850 +Foo and frac start with +an f. So we keep them + +232 +00:11:22,850 --> 00:11:25,265 +but we strip off the first f. + +233 +00:11:25,265 --> 00:11:29,045 +So here we have only oo and rak. + +234 +00:11:29,045 --> 00:11:31,609 +If you take the +semantic derivative + +235 +00:11:31,609 --> 00:11:33,830 +of that language according to b, + +236 +00:11:33,830 --> 00:11:37,190 +then we will discard foo and +frak because they don't + +237 +00:11:37,190 --> 00:11:40,940 +start with b, and just keep bar. + +238 +00:11:40,940 --> 00:11:44,585 +But again, we have to +strip off this first b. + +239 +00:11:44,585 --> 00:11:49,305 +And if you take the semantic +derivative according to a, + +240 +00:11:49,305 --> 00:11:52,585 +then none of these +strings start with a. + +241 +00:11:52,585 --> 00:11:56,800 +So this will be defined +as just the empty set. + +242 +00:11:56,800 --> 00:11:59,695 +You can slightly +generalized this + +243 +00:11:59,695 --> 00:12:02,560 +semantic derivative +to also strings. + +244 +00:12:02,560 --> 00:12:05,170 +So imagine you have +a language A and you + +245 +00:12:05,170 --> 00:12:08,274 +build the semantic derivative +according to a string s. + +246 +00:12:08,274 --> 00:12:10,990 +Then you look in +this language and + +247 +00:12:10,990 --> 00:12:13,420 +you look for all the +strings that start with + +248 +00:12:13,420 --> 00:12:19,555 +this s. But you strip +off that s. Ok? + +249 +00:12:19,555 --> 00:12:23,830 +So if you have a string +starting with the s, + +250 +00:12:23,830 --> 00:12:26,065 +then you keep that string, + +251 +00:12:26,065 --> 00:12:27,490 +but you keep only the rest... + +252 +00:12:27,490 --> 00:12:28,810 +what's coming after this s. + +253 +00:12:28,810 --> 00:12:32,010 +That is here indicated +with this s'. + +254 +00:12:32,010 --> 00:12:35,330 +I also have this convention, +this personal convention + +255 +00:12:35,330 --> 00:12:40,055 +I have to say, that everything +which works on lists, + +256 +00:12:40,055 --> 00:12:42,185 +remember strings are +lists of characters. + +257 +00:12:42,185 --> 00:12:46,655 +I just put there an 's'. So +here's the one for characters. + +258 +00:12:46,655 --> 00:12:48,680 +I just call it Der with a capital + +259 +00:12:48,680 --> 00:12:51,740 +D. And I call that Ders, + +260 +00:12:51,740 --> 00:12:54,350 +because that works over strings. + +261 +00:12:54,350 --> 00:12:55,865 +And you can see how it would + +262 +00:12:55,865 --> 00:12:59,750 +be defined if you only take this +as a primitive operation. + +263 +00:12:59,750 --> 00:13:01,340 +We would just need to iterate + +264 +00:13:01,340 --> 00:13:04,050 +that essentially for this Ders. + +265 +00:13:04,060 --> 00:13:07,970 +Ok, we now have all +the theory in place. + +266 +00:13:07,970 --> 00:13:11,345 +And we can finally start +implementing our algorithm. + +267 +00:13:11,345 --> 00:13:14,580 +That's when we'll do +in the next video.