100:00:09,990 --> 00:00:13,465Welcome back to thisweek's lecture.200:00:13,465 --> 00:00:16,450The task this week is to actually300:00:16,450 --> 00:00:20,320implement a regularexpression matcher.400:00:20,320 --> 00:00:22,510And we want to be a bit500:00:22,510 --> 00:00:25,315faster than the regularexpression matcher600:00:25,315 --> 00:00:29,380in Python, Ruby,Javascript, and Java.700:00:29,380 --> 00:00:31,960Remember this regular expression800:00:31,960 --> 00:00:35,785and strings which arejust a number of a's,900:00:35,785 --> 00:00:39,670these regular expression matcherswhere abysmally slow.1000:00:39,670 --> 00:00:43,170They could only matchapproximately 30 a's in1100:00:43,170 --> 00:00:45,665something like half a minute.1200:00:45,665 --> 00:00:49,460What we like to do withour regular expression matcher.1300:00:49,460 --> 00:00:51,890We will try a few times,1400:00:51,890 --> 00:00:55,505but our second trial will alreadybe much, much better.1500:00:55,505 --> 00:00:58,400It will probablymatch around maybe1600:00:58,400 --> 00:01:02,030thousand a's in somethingin half a minute.1700:01:02,030 --> 00:01:03,920But our third version in1800:01:03,920 --> 00:01:06,230comparison will beblindingly fast.1900:01:06,230 --> 00:01:08,180And we'll be able to match2000:01:08,180 --> 00:01:10,145strings of length 10,000 a's2100:01:10,145 --> 00:01:12,230and will not require2200:01:12,230 --> 00:01:14,975more than five seconds.So let's go ahead with that.2300:01:14,975 --> 00:01:18,095We will first implement2400:01:18,095 --> 00:01:19,880our regular expressionmatcher only2500:01:19,880 --> 00:01:22,069for the basicregular expressions.2600:01:22,069 --> 00:01:25,430So we will have only sixcases to think about.2700:01:25,430 --> 00:01:27,620This will simplify matters at first.2800:01:27,620 --> 00:01:30,350Later we canthink about how to2900:01:30,350 --> 00:01:34,100extend that to the extendedregular expressions.3000:01:34,100 --> 00:01:37,625Unfortunately, before we canstart our implementation,3100:01:37,625 --> 00:01:39,290we have to discuss3200:01:39,290 --> 00:01:42,470when two regularexpressions are equivalent.3300:01:42,470 --> 00:01:46,595And we use here this notationof the triple equals.3400:01:46,595 --> 00:01:48,215We say a regular expression3500:01:48,215 --> 00:01:50,180r1 and r2 are3600:01:50,180 --> 00:01:56,059equivalent if and onlyif the language of r1 is3700:01:56,059 --> 00:01:59,360equal to the language of r2.3800:01:59,360 --> 00:02:02,915Sounds complicated,but essentially means3900:02:02,915 --> 00:02:04,700if the two regular expressions can4000:02:04,700 --> 00:02:06,950match exactly the same strings,4100:02:06,950 --> 00:02:09,800then we want to regardthem as equivalent.4200:02:09,800 --> 00:02:14,300This equivalence justifieswhy we can be sloppy4300:02:14,300 --> 00:02:16,040with parentheses.4400:02:16,040 --> 00:02:23,870For example, if we have(r1 + r2) + r3,4500:02:23,870 --> 00:02:32,270then this will be equivalentto r1 + (r2 + r3).4600:02:32,270 --> 00:02:34,910Remember, regularexpressions are trees,4700:02:34,910 --> 00:02:37,940so these are two differentregular expressions,4800:02:37,940 --> 00:02:41,540but they can matchthe same strings.4900:02:41,540 --> 00:02:45,695Because, well, if we takehere the meaning of that,5000:02:45,695 --> 00:02:54,350that would be just L(r1 + r2 + r3),5100:02:54,350 --> 00:03:04,100which is equal to L(r1 + r2) u L(r3).5200:03:04,100 --> 00:03:10,595And that is equal to of 5300:03:10,595 --> 00:03:15,710L(r1) u L(r2) u L(r3).5400:03:15,710 --> 00:03:17,930And if you do thesame calculation5500:03:17,930 --> 00:03:19,445for that regular expression,5600:03:19,445 --> 00:03:22,940you will find out thetwo sets are the same.5700:03:22,940 --> 00:03:26,195So both regular expressionscan match the same strings.5800:03:26,195 --> 00:03:28,805So even if they're differentregular expressions,5900:03:28,805 --> 00:03:31,220we can regard them as eqivalent.6000:03:31,220 --> 00:03:33,290And so that's why we can forget6100:03:33,290 --> 00:03:35,270about writing these parentheses.6200:03:35,270 --> 00:03:40,205Let's have a lookat some more examples.6300:03:40,205 --> 00:03:41,870So the first one here,6400:03:41,870 --> 00:03:43,205that is now clear.6500:03:43,205 --> 00:03:45,200We did this calculation already6600:03:45,200 --> 00:03:47,480for arbitrary regular expressions.6700:03:47,480 --> 00:03:49,520Here it is forconcrete ones a, b and c.6800:03:49,520 --> 00:03:52,690Here on the left-hand side,6900:03:52,690 --> 00:03:54,895the regular expressionhas the same meaning7000:03:54,895 --> 00:03:56,620as on the right-hand side.7100:03:56,620 --> 00:03:58,420So they are equivalent.7200:03:58,420 --> 00:04:01,375We can ignore the parentheses.7300:04:01,375 --> 00:04:03,220If I have a choice,7400:04:03,220 --> 00:04:06,610but the choice isthe same a or a,7500:04:06,610 --> 00:04:09,265then this isequivalent to just a.7600:04:09,265 --> 00:04:13,075So the same choice doesn'tintroduce anything more.7700:04:13,075 --> 00:04:15,370In the next case, if I have7800:04:15,370 --> 00:04:19,450a regular expressionwhich can match a or b,7900:04:19,450 --> 00:04:23,860that can match the samestrings as b or a.8000:04:23,860 --> 00:04:27,325So you have a kind ofcommutativity for the plus,8100:04:27,325 --> 00:04:29,485which is like on natural numbers.8200:04:29,485 --> 00:04:32,880But you would not have acommunitivity for the sequence:8300:04:32,880 --> 00:04:37,070if you havefirst a and then b,8400:04:37,070 --> 00:04:40,850that's not the same asmatching first b and then a.8500:04:40,850 --> 00:04:42,800So for the sequence ...8600:04:42,800 --> 00:04:44,615they are not equivalent.8700:04:44,615 --> 00:04:49,475However, for the sequence Ican, like for the plus, ignore8800:04:49,475 --> 00:04:51,245prarentheses.8900:04:51,245 --> 00:04:55,070If I have the parenthesesand this way,9000:04:55,070 --> 00:04:57,785or I have it in this way.9100:04:57,785 --> 00:05:00,605These are two differentregular expressions,9200:05:00,605 --> 00:05:02,120but they have the same meaning.9300:05:02,120 --> 00:05:03,815So they are equivalent.9400:05:03,815 --> 00:05:05,780And so I can leave out parentheses9500:05:05,780 --> 00:05:09,170for times as well.9600:05:09,170 --> 00:05:12,185The next one is a slightlymore interesting one.9700:05:12,185 --> 00:05:15,020If I have a regularexpression which can match9800:05:15,020 --> 00:05:19,655c first followed by (a + b),9900:05:19,655 --> 00:05:25,355then this is the same asfirst c followed by a,10000:05:25,355 --> 00:05:28,640or c followed by b.10100:05:28,640 --> 00:05:31,265So that's the kind ofdistributivity law10200:05:31,265 --> 00:05:33,545on regular expressions.10300:05:33,545 --> 00:05:36,350But they're also regular expressionswhich are not equivalent.10400:05:36,350 --> 00:05:38,990If I have the regularexpression which can10500:05:38,990 --> 00:05:42,800match the string containingexactly two a's.10600:05:42,800 --> 00:05:44,240That is not equivalent10700:05:44,240 --> 00:05:46,730to the regularexpression which can just match10800:05:46,730 --> 00:05:49,250a single a. And similarly10900:05:49,250 --> 00:05:51,680in this case, if I can match11000:05:51,680 --> 00:05:56,075a or the string b followed by c,11100:05:56,075 --> 00:05:59,825that is not the same as a or b,11200:05:59,825 --> 00:06:02,000followed by a or c.11300:06:02,000 --> 00:06:05,990I'll let you think aboutwhy is that not equivalent.11400:06:05,990 --> 00:06:08,060Essentially youhave to find out what's11500:06:08,060 --> 00:06:10,475the meaning of bothregular expressions.11600:06:10,475 --> 00:06:14,090And are they thesame sets or not?11700:06:14,090 --> 00:06:17,435There are again someinteresting corner cases.11800:06:17,435 --> 00:06:20,540If I have a regular expressionthat can match a,11900:06:20,540 --> 00:06:23,450but followed by the regularexpression which cannot match12000:06:23,450 --> 00:06:25,670anything, that is not equivalent12100:06:25,670 --> 00:06:29,255to the regularexpression which can match a.12200:06:29,255 --> 00:06:31,340Again here you haveto do the calculation12300:06:31,340 --> 00:06:32,915to convince you of that.12400:06:32,915 --> 00:06:35,840Similarly, if I have a regularexpression which can12500:06:35,840 --> 00:06:38,750match a or the empty string,12600:06:38,750 --> 00:06:40,640this is not equivalent to12700:06:40,640 --> 00:06:43,355the regular expressionwhich can just match a.12800:06:43,355 --> 00:06:46,925Here are some interestingones with zeros and ones.12900:06:46,925 --> 00:06:48,860So if I have the regular expression13000:06:48,860 --> 00:06:50,975that can match the empty string,13100:06:50,975 --> 00:06:53,540this will be actually equivalent to13200:06:53,540 --> 00:06:56,435the regular expressionwhich can match nothing,13300:06:56,435 --> 00:06:59,405but star of that.13400:06:59,405 --> 00:07:01,970Remember the staralways introduces,13500:07:01,970 --> 00:07:04,370no matter what, the empty string.13600:07:04,370 --> 00:07:06,170So this regular expression can match13700:07:06,170 --> 00:07:08,930the empty string,just like the 1.13800:07:08,930 --> 00:07:12,125So these two expressionsare not the same,13900:07:12,125 --> 00:07:14,210but they are equivalent.14000:07:14,210 --> 00:07:16,700And it doesn't matterwhether you take14100:07:16,700 --> 00:07:20,090the empty string to the star,14200:07:20,090 --> 00:07:23,855because if I can match 0 ormore times the empty string,14300:07:23,855 --> 00:07:27,450that will be equivalent to just the empty string.14400:07:27,520 --> 00:07:32,510Slightly similar to thethird case is this one.14500:07:32,510 --> 00:07:35,720Zero star is not equivalent to 014600:07:35,720 --> 00:07:40,025because that can matchexactly the empty string.14700:07:40,025 --> 00:07:42,740This cannot match anything.14800:07:42,740 --> 00:07:44,839While the previous14900:07:44,839 --> 00:07:47,540equivalences are all veryinstructive to look at,15000:07:47,540 --> 00:07:49,910these are theequivalences we need15100:07:49,910 --> 00:07:52,685in our regular expression matchers.15200:07:52,685 --> 00:07:54,350And they are defined for15300:07:54,350 --> 00:07:58,160all regular expressions r.So r plus 0,15400:07:58,160 --> 00:08:00,470no matter what r lookslike is equivalent15500:08:00,470 --> 00:08:02,945to just r. Similarly15600:08:02,945 --> 00:08:05,9300 plus r is alsoequivalent to r.15700:08:05,930 --> 00:08:08,525The order of thischoice doesn't matter;15800:08:08,525 --> 00:08:11,495r followed by 1,15900:08:11,495 --> 00:08:14,030that's the sequenceregular expression, is16000:08:14,030 --> 00:08:16,760equivalent to r. Thesam, r followed by16100:08:16,760 --> 00:08:19,010r is equivalent to r.16200:08:19,010 --> 00:08:20,990And r followed by16300:08:20,990 --> 00:08:23,390the regular expression whichcannot match anything,16400:08:23,390 --> 00:08:27,455is equivalent to just 0.16500:08:27,455 --> 00:08:30,740And 0 followed by r is also equivalent to 0.16600:08:30,740 --> 00:08:33,615And if you have thechoice of r plus r,16700:08:33,615 --> 00:08:37,210that is alsoequivalent to just r.16800:08:37,210 --> 00:08:40,270What we're going to dowith these equivalences is16900:08:40,270 --> 00:08:42,820that we regard them assimplification rules.17000:08:42,820 --> 00:08:43,930So whenever we see17100:08:43,930 --> 00:08:46,465this regular expressionin our algorithm,17200:08:46,465 --> 00:08:48,580we will replace it bythe right-hand side.17300:08:48,580 --> 00:08:51,715So we will orientthese equivalences.17400:08:51,715 --> 00:08:53,650If we see, this regular expression,17500:08:53,650 --> 00:08:55,225we will replace it by that one.17600:08:55,225 --> 00:08:58,945And it will not matterbecause the left-hand sides17700:08:58,945 --> 00:09:01,255can match exactlythe same strings17800:09:01,255 --> 00:09:03,475as the right-hand sides.17900:09:03,475 --> 00:09:06,370Here two quick examples.18000:09:06,370 --> 00:09:08,680The first one, let'sassume you have18100:09:08,680 --> 00:09:10,270a regular expression r and18200:09:10,270 --> 00:09:11,905there is somethingin front of it.18300:09:11,905 --> 00:09:13,720This "something in front of it"18400:09:13,720 --> 00:09:15,870can be simplified as follows.18500:09:15,870 --> 00:09:20,000This 1 times bcan be simplified to b.18600:09:20,000 --> 00:09:23,555Then this b plus 0 canbe simplified to b.18700:09:23,555 --> 00:09:25,490And assuming that nothing can18800:09:25,490 --> 00:09:27,470be simplified inside this r,18900:09:27,470 --> 00:09:28,700then this would be19000:09:28,700 --> 00:09:33,050the simplified versionof this regular expression.19100:09:33,050 --> 00:09:34,820And because the simplification19200:09:34,820 --> 00:09:36,965rules preserve what can be matched,19300:09:36,965 --> 00:09:39,080we can be sure thatthis regular expression19400:09:39,080 --> 00:09:41,255can match exactly the strings19500:09:41,255 --> 00:09:43,940this regular expression can match.19600:09:43,940 --> 00:09:45,740Here is the other example.19700:09:45,740 --> 00:09:49,550This has just a tiny changethat this becomes here as 0.19800:09:49,550 --> 00:09:54,485Then 0 times b can bereplaced by just 0.19900:09:54,485 --> 00:09:56,705Then they are actually tworules which could apply20000:09:56,705 --> 00:09:59,014either that we havethe same choice20100:09:59,014 --> 00:10:01,160or we can just say something plus20200:10:01,160 --> 00:10:04,1000 will always go to something.20300:10:04,100 --> 00:10:06,245So we can simplify it to that.20400:10:06,245 --> 00:10:08,510And then we have0 times r again,20500:10:08,510 --> 00:10:10,460and that can be simplified to 0.20600:10:10,460 --> 00:10:12,080So actually what we find out with20700:10:12,080 --> 00:10:14,645this calculation is thatthis regular expression,20800:10:14,645 --> 00:10:16,835even if it looksquite complicated,20900:10:16,835 --> 00:10:19,295actually doesn'tmatch any string.21000:10:19,295 --> 00:10:21,290Because it matches exactly21100:10:21,290 --> 00:10:23,420those string this regularexpression can match.21200:10:23,420 --> 00:10:26,285And this reg expressioncannot match any.21300:10:26,285 --> 00:10:29,930We need one moreoperation on languages.21400:10:29,930 --> 00:10:31,700I call this operation21500:10:31,700 --> 00:10:35,165the semantic derivativeof a language.21600:10:35,165 --> 00:10:37,775This operation takestwo arguments.21700:10:37,775 --> 00:10:40,670It takes a characterwhich we take21800:10:40,670 --> 00:10:43,925the semantic derivativeaccording to,21900:10:43,925 --> 00:10:45,980and it takes a language.22000:10:45,980 --> 00:10:49,579And what it does is itlooks into this language22100:10:49,579 --> 00:10:51,365and looks for all the strings22200:10:51,365 --> 00:10:53,735that start with this character,22300:10:53,735 --> 00:10:56,405let's say c, and then22400:10:56,405 --> 00:11:00,920just strips off that c.So here's an example.22500:11:00,920 --> 00:11:02,975Suppose you have a language A,22600:11:02,975 --> 00:11:04,460with the strings22700:11:04,460 --> 00:11:06,965foo, bar and frak.22800:11:06,965 --> 00:11:10,835And you take the semanticderivative according to f,22900:11:10,835 --> 00:11:14,450then we discard all thestrings which do not23000:11:14,450 --> 00:11:18,320start with an f. Sobar will be discarded.23100:11:18,320 --> 00:11:22,850Foo and frac start withan f. So we keep them23200:11:22,850 --> 00:11:25,265but we strip off the first f.23300:11:25,265 --> 00:11:29,045So here we have only oo and rak.23400:11:29,045 --> 00:11:31,609If you take thesemantic derivative23500:11:31,609 --> 00:11:33,830of that language according to b,23600:11:33,830 --> 00:11:37,190then we will discard foo andfrak because they don't23700:11:37,190 --> 00:11:40,940start with b, and just keep bar.23800:11:40,940 --> 00:11:44,585But again, we have tostrip off this first b.23900:11:44,585 --> 00:11:49,305And if you take the semanticderivative according to a,24000:11:49,305 --> 00:11:52,585then none of thesestrings start with a.24100:11:52,585 --> 00:11:56,800So this will be definedas just the empty set.24200:11:56,800 --> 00:11:59,695You can slightly generalized this24300:11:59,695 --> 00:12:02,560semantic derivativeto also strings.24400:12:02,560 --> 00:12:05,170So imagine you havea language A and you24500:12:05,170 --> 00:12:08,274build the semantic derivativeaccording to a string s.24600:12:08,274 --> 00:12:10,990Then you look inthis language and24700:12:10,990 --> 00:12:13,420you look for all thestrings that start with24800:12:13,420 --> 00:12:19,555this s. But you stripoff that s. Ok?24900:12:19,555 --> 00:12:23,830So if you have a stringstarting with the s,25000:12:23,830 --> 00:12:26,065then you keep that string,25100:12:26,065 --> 00:12:27,490but you keep only the rest...25200:12:27,490 --> 00:12:28,810what's coming after this s.25300:12:28,810 --> 00:12:32,010That is here indicatedwith this s'.25400:12:32,010 --> 00:12:35,330I also have this convention,this personal convention25500:12:35,330 --> 00:12:40,055I have to say, that everythingwhich works on lists,25600:12:40,055 --> 00:12:42,185remember strings arelists of characters.25700:12:42,185 --> 00:12:46,655I just put there an 's'. Sohere's the one for characters.25800:12:46,655 --> 00:12:48,680I just call it Der with a capital25900:12:48,680 --> 00:12:51,740D. And I call that Ders,26000:12:51,740 --> 00:12:54,350because that works over strings.26100:12:54,350 --> 00:12:55,865And you can see how it would26200:12:55,865 --> 00:12:59,750be defined if you only take thisas a primitive operation.26300:12:59,750 --> 00:13:01,340We would just need to iterate26400:13:01,340 --> 00:13:04,050that essentially for this Ders.26500:13:04,060 --> 00:13:07,970Ok, we now have allthe theory in place.26600:13:07,970 --> 00:13:11,345And we can finally startimplementing our algorithm.26700:13:11,345 --> 00:13:14,580That's when we'll doin the next video.