100:00:05,880 --> 00:00:09,700Welcome back.Remember this slide.200:00:09,700 --> 00:00:11,500This slide said what is300:00:11,500 --> 00:00:14,500our regular expression matcheractually supposed to do?400:00:14,500 --> 00:00:16,570It will take two arguments,500:00:16,570 --> 00:00:18,670a regular expression r and a string s.600:00:18,670 --> 00:00:21,580And it's supposed to say yes,700:00:21,580 --> 00:00:23,440the regular expression matches800:00:23,440 --> 00:00:26,920the string if and onlyif the string is in900:00:26,920 --> 00:00:28,855the language of r.1000:00:28,855 --> 00:00:32,410And if the string is notin the language of r,1100:00:32,410 --> 00:00:35,515then our algorithm has to say no.1200:00:35,515 --> 00:00:37,210And we can't use1300:00:37,210 --> 00:00:39,565this specificationdirectly because remember,1400:00:39,565 --> 00:00:43,305this L sometimesproduces infinite sets.1500:00:43,305 --> 00:00:47,585And we can't test whether astring is in infinite set,1600:00:47,585 --> 00:00:50,090at least not easily.1700:00:50,090 --> 00:00:52,340And so what we have to do1800:00:52,340 --> 00:00:54,260instead is we haveto be a bit more1900:00:54,260 --> 00:00:57,050clever and implementsome operations2000:00:57,050 --> 00:00:59,284on regular expressions instead,2100:00:59,284 --> 00:01:03,275because regular expressionsare always finite trees.2200:01:03,275 --> 00:01:05,870I should say theperson who has been2300:01:05,870 --> 00:01:08,150clever is called Brzozowski.2400:01:08,150 --> 00:01:11,435It's his algorithmI'm introducing here.2500:01:11,435 --> 00:01:15,515And his algorithm consistsof two functions.2600:01:15,515 --> 00:01:17,840One is callednullable and it takes2700:01:17,840 --> 00:01:20,104a regular expression as argument.2800:01:20,104 --> 00:01:24,155And the idea is thatthis function nullable is2900:01:24,155 --> 00:01:26,450testing whetherthe regular expression3000:01:26,450 --> 00:01:27,950can match the empty string.3100:01:27,950 --> 00:01:30,305So 0 cannot match any string.3200:01:30,305 --> 00:01:33,275So it cannot match theempty string either.3300:01:33,275 --> 00:01:35,465So that's defined as false.3400:01:35,465 --> 00:01:37,775This regular expression,the whole purpose3500:01:37,775 --> 00:01:39,680is that it can matchthe empty string.3600:01:39,680 --> 00:01:41,645So it's defined as true.3700:01:41,645 --> 00:01:44,599If a regular expressioncan match a character,3800:01:44,599 --> 00:01:47,045then it cannot matchthe empty string.3900:01:47,045 --> 00:01:49,445So that is defined as false.4000:01:49,445 --> 00:01:53,180If an alternative canmatch the empty string,4100:01:53,180 --> 00:01:56,960then either r1 canmatch the empty string,4200:01:56,960 --> 00:01:59,720or r2 can match the empty string.4300:01:59,720 --> 00:02:03,110So either nullableof r1 has to hold,4400:02:03,110 --> 00:02:06,945or nullable of r2 has to hold.4500:02:06,945 --> 00:02:09,925In this sequence, it'sthe other way around.4600:02:09,925 --> 00:02:12,790If this regular expression canmatch the empty string,4700:02:12,790 --> 00:02:16,615then r1 has to be able tomatch the empty string,4800:02:16,615 --> 00:02:20,290and r2 has to be able tomatch the empty string.4900:02:20,290 --> 00:02:22,555So here it's just the opposite.5000:02:22,555 --> 00:02:25,660In one case it is or andthe other case it is and.5100:02:25,660 --> 00:02:27,970The star regularexpression can match5200:02:27,970 --> 00:02:30,445always the empty string.So that is true.5300:02:30,445 --> 00:02:33,340So this is a simplerecursive function5400:02:33,340 --> 00:02:37,179and it should not be toodifficult to implement.5500:02:37,179 --> 00:02:42,025Okay, this nullable functionthat is easy-peasy.5600:02:42,025 --> 00:02:44,604The second function, however,5700:02:44,604 --> 00:02:49,155is a bit more involved andthat's just to be expected.5800:02:49,155 --> 00:02:53,075Remember people working inthis area already for decades.5900:02:53,075 --> 00:02:57,305If they have some problemswith runtime, for example,6000:02:57,305 --> 00:02:58,940we can't expect that 6100:02:58,940 --> 00:03:03,260a simple fix will solve allthe problems in the world.6200:03:03,260 --> 00:03:06,530So I admit the secondfunction is a bit more6300:03:06,530 --> 00:03:10,085complicated and make surethat you understand it.6400:03:10,085 --> 00:03:12,140And it also justshows this Brzozowski6500:03:12,140 --> 00:03:14,345is a very clever guy.6600:03:14,345 --> 00:03:15,800Actually, I have to say he was6700:03:15,800 --> 00:03:17,720a clever guy becauseI think he either6800:03:17,720 --> 00:03:21,650died last year orthe year before.6900:03:21,650 --> 00:03:25,505And he came up with thisalgorithm already in 1964.7000:03:25,505 --> 00:03:27,260It somehow got lost,7100:03:27,260 --> 00:03:30,305but has been rediscoveredin the last ten years.7200:03:30,305 --> 00:03:34,685So the idea of the secondfunction is the following.7300:03:34,685 --> 00:03:38,120Imagine you have aregular expression and it can7400:03:38,120 --> 00:03:41,930match a string of theform c followed by s.7500:03:41,930 --> 00:03:44,810So the c is the firstcharacter of that string.7600:03:44,810 --> 00:03:48,605So I imagine that r canmatch this kind of string.7700:03:48,605 --> 00:03:50,330The question is now,7800:03:50,330 --> 00:03:52,400what would a regularexpression look7900:03:52,400 --> 00:03:54,695like that can match just8000:03:54,695 --> 00:03:57,140s. You probably remember8100:03:57,140 --> 00:03:59,300that from the semanticderivative,8200:03:59,300 --> 00:04:00,860there was alsosomething of chopping8300:04:00,860 --> 00:04:02,210off the first character.8400:04:02,210 --> 00:04:04,940Here it's the same.If a regular expression8500:04:04,940 --> 00:04:07,835can match a stringstarting with a c,8600:04:07,835 --> 00:04:11,090we're looking for a regularexpression which can match8700:04:11,090 --> 00:04:15,215the rest of the string wherethe c has been chopped off.8800:04:15,215 --> 00:04:17,690And this operation will be called8900:04:17,690 --> 00:04:21,049the derivative of aregular expression.9000:04:21,049 --> 00:04:22,205And it will also take9100:04:22,205 --> 00:04:25,460a character as argumentand a regular expression.9200:04:25,460 --> 00:04:28,730And in contrast tothe semantic 9300:04:28,730 --> 00:04:31,310derivative, which works9400:04:31,310 --> 00:04:34,430over languages orsets of strings,9500:04:34,430 --> 00:04:39,620this derivative worksover regular expressions.9600:04:39,620 --> 00:04:41,330Okay, here's this function.9700:04:41,330 --> 00:04:43,970It's defined recursively over9800:04:43,970 --> 00:04:46,370the structure of regular expressions.9900:04:46,370 --> 00:04:48,814The first argumentis the character,10000:04:48,814 --> 00:04:52,700and the second one isa regular expression.10100:04:52,700 --> 00:04:56,510And remember the ideais we're looking for10200:04:56,510 --> 00:05:00,680a regular expression thatcan match everything10300:05:00,680 --> 00:05:03,125this regular expressionas argument can match10400:05:03,125 --> 00:05:07,040except for the c. So now let'slook at this first case.10500:05:07,040 --> 00:05:10,550So this reg expressioncannot match any string.10600:05:10,550 --> 00:05:14,270So it certainly cannotmatch any string starting10700:05:14,270 --> 00:05:16,910a c. So we have to look10800:05:16,910 --> 00:05:20,090for a regular expression whichcan not match anything.10900:05:20,090 --> 00:05:22,700Well, that's the purposeof this regular expression,11000:05:22,700 --> 00:05:24,815so we define it as 0.11100:05:24,815 --> 00:05:28,130In the next case,this regular expression11200:05:28,130 --> 00:05:30,440can match the empty string,11300:05:30,440 --> 00:05:33,440but it cannot matchany string that starts11400:05:33,440 --> 00:05:36,350with c. So also in this case,11500:05:36,350 --> 00:05:39,560we just define it asthe regular expression,11600:05:39,560 --> 00:05:41,615which cannot match anything.11700:05:41,615 --> 00:05:45,170In the next case, thisc is the argument to11800:05:45,170 --> 00:05:48,335the derivative and this dis the regular expression.11900:05:48,335 --> 00:05:50,225So now there are two cases.12000:05:50,225 --> 00:05:53,525If this c and this dis actually equal,12100:05:53,525 --> 00:05:55,970Thaat means this regularexpression can match12200:05:55,970 --> 00:05:59,240a string which just contains c.12300:05:59,240 --> 00:06:01,505So if we strip that off,12400:06:01,505 --> 00:06:04,790what remains isthe empty string.12500:06:04,790 --> 00:06:06,890What is a regular expression12600:06:06,890 --> 00:06:09,170look like that canmatch the empty string.12700:06:09,170 --> 00:06:11,915Well, that's thepurpose of the 1.12800:06:11,915 --> 00:06:15,440And if c is not equal to d,12900:06:15,440 --> 00:06:17,630then this regular expression13000:06:17,630 --> 00:06:19,220cannot match anythingwhich starts with13100:06:19,220 --> 00:06:23,780a c. So again it willbe defined as just 0.13200:06:23,780 --> 00:06:29,390Now, the alternative case,13300:06:29,390 --> 00:06:31,970if this regular expression can13400:06:31,970 --> 00:06:36,050match stringsstarting with c,13500:06:36,050 --> 00:06:40,820then it can either bematched by the r113600:06:40,820 --> 00:06:44,495or it can be matched by the r2.13700:06:44,495 --> 00:06:50,090If the r1 can match cand then following a string,13800:06:50,090 --> 00:06:53,975then we just have to recursivelycall the derivative for13900:06:53,975 --> 00:06:56,570r1 to get a regular expression14000:06:56,570 --> 00:06:59,135that can match therest of the string.14100:06:59,135 --> 00:07:02,690And the same we can do with r2.14200:07:02,690 --> 00:07:06,110You can find a regular expressionwhich can match everything14300:07:06,110 --> 00:07:07,850this r2 can match,14400:07:07,850 --> 00:07:09,710starting with c, but14500:07:09,710 --> 00:07:12,590we are chopping off this c. 14600:07:12,590 --> 00:07:16,370So now if you have to findthe regular expression,14700:07:16,370 --> 00:07:20,030which can match all the stringswhere c is chopped off,14800:07:20,030 --> 00:07:22,295then we just have totake the alternative14900:07:22,295 --> 00:07:24,965of these two derivatives.15000:07:24,965 --> 00:07:29,390Now to sequence case.15100:07:29,390 --> 00:07:33,005This sequence case is themost complicated one.15200:07:33,005 --> 00:07:35,180And let's look first at15300:07:35,180 --> 00:07:39,335the second case, in theelse branch.15400:07:39,335 --> 00:07:42,830So if thisregular expression15500:07:42,830 --> 00:07:46,145can match a stringstarting with c,15600:07:46,145 --> 00:07:48,155then the followingmust have happened.15700:07:48,155 --> 00:07:51,905First, the r1 must have matcheda string starting with c15800:07:51,905 --> 00:07:56,420and then anything comingafterwards with r2.15900:07:56,420 --> 00:07:59,660So in this case I only16000:07:59,660 --> 00:08:03,260have to call thisderivative for this r1,16100:08:03,260 --> 00:08:06,830and find the regular expressionwhich can match everything16200:08:06,830 --> 00:08:11,555this r1 can match exceptfor this c chopped off.16300:08:11,555 --> 00:08:15,830And I have to build thesequence derivative of that.16400:08:15,830 --> 00:08:18,260So that's what the else branch says:16500:08:18,260 --> 00:08:21,860I take thederivative of r1 and I16600:08:21,860 --> 00:08:23,480put the r2 on16700:08:23,480 --> 00:08:25,865the back because that'sthe rest of the string.16800:08:25,865 --> 00:08:29,240So that's the onlycase we have to consider16900:08:29,240 --> 00:08:32,750this sequence caseexcept if not the17000:08:32,750 --> 00:08:37,895r1 can match thec and something else.17100:08:37,895 --> 00:08:42,965But if r1 is matchingactually the empty string.17200:08:42,965 --> 00:08:48,890In this case actually the r2is in charge of matching17300:08:48,890 --> 00:08:51,590the string starting with c. So in17400:08:51,590 --> 00:08:55,490this case we have to callthe derivative for r2.17500:08:55,490 --> 00:08:57,875So that's why we havethese two cases.17600:08:57,875 --> 00:09:00,455So if r1 is nullable,17700:09:00,455 --> 00:09:03,245then it can matchthe empty string.17800:09:03,245 --> 00:09:05,330And we have toconsider the case that17900:09:05,330 --> 00:09:08,045this r2 is matching a string18000:09:08,045 --> 00:09:10,700starting with c. And so we have18100:09:10,700 --> 00:09:14,210to call the derivativefor this r2 in this case.18200:09:14,210 --> 00:09:18,680Otherwise, the r1 willbe in charge of matching18300:09:18,680 --> 00:09:20,840a part of thatstring starting with18400:09:20,840 --> 00:09:24,695c. And I have to call thederivative on this r1.18500:09:24,695 --> 00:09:30,670I hope this makes sense.18600:09:30,670 --> 00:09:34,150Go over that andalso the handouts18700:09:34,150 --> 00:09:37,465again. That caseis really subtle.18800:09:37,465 --> 00:09:40,945And how do I remember this case?18900:09:40,945 --> 00:09:42,430Well, I know it's19000:09:42,430 --> 00:09:45,310an if condition and thecondition is nullable,19100:09:45,310 --> 00:09:48,160then I will always remember19200:09:48,160 --> 00:09:53,275the else branch where it pushesthe derivative over the r1.19300:09:53,275 --> 00:09:55,780So I usually writedown the else branch first,19400:09:55,780 --> 00:09:59,650and then constructthe then branch by saying,19500:09:59,650 --> 00:10:01,525well, I just repeat this.19600:10:01,525 --> 00:10:03,760And I have to rememberin that case I19700:10:03,760 --> 00:10:06,580have to build alsoderivative of r2.19800:10:06,580 --> 00:10:12,695Finally, the star case.19900:10:12,695 --> 00:10:15,665So here againwe're looking for20000:10:15,665 --> 00:10:17,300a regular expression which20100:10:17,300 --> 00:10:19,745can match the stringstarting with c,20200:10:19,745 --> 00:10:22,355except that the c hasbeen chopped off.20300:10:22,355 --> 00:10:28,640So if r* has to matcha string starting with c,20400:10:28,640 --> 00:10:32,735then at least we need onecopy of this r. 20500:10:32,735 --> 00:10:34,310So at least one copy of20600:10:34,310 --> 00:10:37,010this r has matchedsomething which starts with20700:10:37,010 --> 00:10:38,870a c and then afterwards20800:10:38,870 --> 00:10:41,570come 0 more copiesof this r*.20900:10:41,570 --> 00:10:45,530What we do thereis we are trying to find21000:10:45,530 --> 00:10:47,960the regular expressionwhich can match21100:10:47,960 --> 00:10:50,915this first part of the string.21200:10:50,915 --> 00:10:53,255However, where thec is chopped off.21300:10:53,255 --> 00:10:55,130And then we just say, well,21400:10:55,130 --> 00:10:57,050the rest has to bematched again with21500:10:57,050 --> 00:10:59,0300 or more copies of21600:10:59,030 --> 00:11:02,600this r. So that's whyit's defined like this.21700:11:02,600 --> 00:11:09,050Okay? ...As said, please take carewith this definition.21800:11:09,050 --> 00:11:11,435That's not so simple.21900:11:11,435 --> 00:11:13,250Once you get a hang of it,22000:11:13,250 --> 00:11:15,170however, it makes perfect sense.22100:11:15,170 --> 00:11:17,210So let me explain it in22200:11:17,210 --> 00:11:20,825different ways inthe next slides.22300:11:20,825 --> 00:11:24,695Let's lookfirst at some examples.22400:11:24,695 --> 00:11:27,140So here is a regular expression,22500:11:27,140 --> 00:11:29,390r. And let's have22600:11:29,390 --> 00:11:32,450a look at these threederivatives according to a,22700:11:32,450 --> 00:11:38,405b, and c. And we shall dothe one for a. 22800:11:38,405 --> 00:11:42,379So here is our regular expression22900:11:42,379 --> 00:11:45,334and I was very generouswith parentheses.23000:11:45,334 --> 00:11:48,140And the outermost is a star.23100:11:48,140 --> 00:11:52,550So we now build thederivative according to a,23200:11:52,550 --> 00:11:55,474the character a, ofthat regular expression.23300:11:55,474 --> 00:11:57,380So the first thing we23400:11:57,380 --> 00:11:59,555have to analyse is the case star.23500:11:59,555 --> 00:12:04,370So here's the regular expression,which we are looking at.23600:12:04,370 --> 00:12:09,170This r and the outermostconstructor is this star.23700:12:09,170 --> 00:12:11,510If you go back to the definition,23800:12:11,510 --> 00:12:13,625I hope you have it next to you,23900:12:13,625 --> 00:12:16,340then this star-case is defined24000:12:16,340 --> 00:12:20,000as you're taking just theinside of the star24100:12:20,000 --> 00:12:23,030and apply this derivative and24200:12:23,030 --> 00:12:26,765leave the r on theoutside at the end.24300:12:26,765 --> 00:12:29,990Ok. So that's the first step.24400:12:29,990 --> 00:12:32,030Now we have to analyze24500:12:32,030 --> 00:12:36,035the derivative according toa of this regular expression,24600:12:36,035 --> 00:12:38,000which is an alternative.24700:12:38,000 --> 00:12:39,665So the outermost is a plus.24800:12:39,665 --> 00:12:41,375Ok, that's very easy again,24900:12:41,375 --> 00:12:45,185we just have to push thederivative into each component,25000:12:45,185 --> 00:12:47,705into the a followed by b.25100:12:47,705 --> 00:12:49,145And into the second25200:12:49,145 --> 00:12:51,185alternative, into b. 25300:12:51,185 --> 00:12:56,030We take the derivativeof each according to a.25400:12:56,030 --> 00:13:00,635Now this one is a sequenceregular expression.25500:13:00,635 --> 00:13:02,210This was the complicated case.25600:13:02,210 --> 00:13:04,160First of allwe have to test is25700:13:04,160 --> 00:13:07,910the first componentnullable of this sequence?25800:13:07,910 --> 00:13:09,200Well, that is a,25900:13:09,200 --> 00:13:12,740in this case, a on itsown is not nullable.26000:13:12,740 --> 00:13:14,210So we are the easy case,26100:13:14,210 --> 00:13:17,000we only have a single derivative26200:13:17,000 --> 00:13:19,370pushed in the first component.26300:13:19,370 --> 00:13:25,160So we have the derivativeof a with the character a.26400:13:25,160 --> 00:13:27,920Okay, that's nowthe character case.26500:13:27,920 --> 00:13:29,720And in this case the character26600:13:29,720 --> 00:13:31,715and the regular expression agree.26700:13:31,715 --> 00:13:33,890So it's defined as one.26800:13:33,890 --> 00:13:37,550Ok? In the other case,26900:13:37,550 --> 00:13:39,710the regular expression is b,27000:13:39,710 --> 00:13:41,675but the characters a.27100:13:41,675 --> 00:13:46,385So in this caseit's defined as 0.27200:13:46,385 --> 00:13:50,630Okay? So that's what thederivative would be.27300:13:50,630 --> 00:13:52,160This r is there27400:13:52,160 --> 00:13:55,280because originally westarted with a star.27500:13:55,280 --> 00:13:58,295This regular expression is that star.27600:13:58,295 --> 00:14:02,780So the derivativeaccording to a27700:14:02,780 --> 00:14:07,610of that regular expressionis this expression.27800:14:07,610 --> 00:14:10,970We just have tosubstitute this r back in.27900:14:10,970 --> 00:14:13,745Just coming back to this slide.28000:14:13,745 --> 00:14:16,160So far, we only analyzes28100:14:16,160 --> 00:14:19,505the derivative accordingto a single character.28200:14:19,505 --> 00:14:23,960But we can also very easilyextend that to whole strings.28300:14:23,960 --> 00:14:26,360So if you build thederivative according28400:14:26,360 --> 00:14:27,905to the empty string,28500:14:27,905 --> 00:14:30,875we just return theregular expression.28600:14:30,875 --> 00:14:35,585If we have a stringstarting with character c,28700:14:35,585 --> 00:14:37,850remember that canbe any character,28800:14:37,850 --> 00:14:42,170then we build first the simplederivative according to28900:14:42,170 --> 00:14:44,360that first character and29000:14:44,360 --> 00:14:46,925continue with therest of the string.29100:14:46,925 --> 00:14:50,615So here you see again,my personal convention:29200:14:50,615 --> 00:14:54,365everything which works onlists has this s at the end.29300:14:54,365 --> 00:14:57,125So this function isfor single characters.29400:14:57,125 --> 00:14:59,179This one is for strings,29500:14:59,179 --> 00:15:02,450but it uses the onefor the character.29600:15:02,450 --> 00:15:04,025Essentially what it does is29700:15:04,025 --> 00:15:06,185it chops off the first character,29800:15:06,185 --> 00:15:09,800builds the derivative, thenchops off the next character,29900:15:09,800 --> 00:15:13,760builds the derivative ofthe result, and so on.30000:15:13,760 --> 00:15:17,000Having this function,we can actually now30100:15:17,000 --> 00:15:20,600state what the algorithmis, the complete algorithm.30200:15:20,600 --> 00:15:23,465So the Brzozowski matcher30300:15:23,465 --> 00:15:24,860takes a regular expression as30400:15:24,860 --> 00:15:26,915argument and astring as argument.30500:15:26,915 --> 00:15:30,920And is supposed to say yes ifthe regular expression matches30600:15:30,920 --> 00:15:33,560the string or no30700:15:33,560 --> 00:15:36,065if the regular expression doesnot match the string.30800:15:36,065 --> 00:15:37,715And how does it do that?30900:15:37,715 --> 00:15:42,560Well, it takes this strings and this regular expression,31000:15:42,560 --> 00:15:43,925and it first builds31100:15:43,925 --> 00:15:48,845successive derivatives untilthat string is exhausted.31200:15:48,845 --> 00:15:52,115Then you havea final derivative,31300:15:52,115 --> 00:15:53,839a final regular expression31400:15:53,839 --> 00:15:55,370and you test whether31500:15:55,370 --> 00:15:57,920this regular expression canmatch the empty string.31600:15:57,920 --> 00:16:01,370If yes, then theoriginal regular expression,31700:16:01,370 --> 00:16:03,245this r, can match the string.31800:16:03,245 --> 00:16:05,210If no, if it cannot match31900:16:05,210 --> 00:16:08,030the final derivativewith the empty string,32000:16:08,030 --> 00:16:10,280then no thisregular expression r32100:16:10,280 --> 00:16:12,905cannot match that string.32200:16:12,905 --> 00:16:16,010I know it looksvery anticlimactic,32300:16:16,010 --> 00:16:19,625but that's actually thebeauty of this algorithm,32400:16:19,625 --> 00:16:22,760that it's not that complicated.32500:16:22,760 --> 00:16:25,340So how does the algorithm work32600:16:25,340 --> 00:16:27,634in a concrete example?32700:16:27,634 --> 00:16:31,580Imagine you have a string, abc32800:16:31,580 --> 00:16:34,370and you have a regularexpression, say r1.32900:16:34,370 --> 00:16:37,520And you want to find outwhether this r1 can match33000:16:37,520 --> 00:16:41,300that string abc or not.How do you do that?33100:16:41,300 --> 00:16:45,140Well, you would firsttake this r1 and you33200:16:45,140 --> 00:16:47,150build the derivative according33300:16:47,150 --> 00:16:49,880to the first character, the a.33400:16:49,880 --> 00:16:53,015Okay? You get the derivative,33500:16:53,015 --> 00:16:55,294which I call here r2.33600:16:55,294 --> 00:16:58,640Then you take the nextcharacter, the b.33700:16:58,640 --> 00:17:04,535You now build the derivativeaccording to b of this r2.33800:17:04,535 --> 00:17:07,460Okay? So you take theresult of the first step,33900:17:07,460 --> 00:17:09,530you feed it into the second step,34000:17:09,530 --> 00:17:11,810and you take thesecond character.34100:17:11,810 --> 00:17:17,075Then you do this also with c.So you get a derivative r3,34200:17:17,075 --> 00:17:22,460and you build the derivativeof r3 according to c,34300:17:22,460 --> 00:17:24,185you get an r4.34400:17:24,185 --> 00:17:26,300Okay, so that's thefinal derivative.34500:17:26,300 --> 00:17:27,530The string is exhausted.34600:17:27,530 --> 00:17:29,570We build derivativesaccording to a, b,34700:17:29,570 --> 00:17:34,610and c. Now we just test whetherthis r4 is nullable.34800:17:34,610 --> 00:17:37,175If it says yes,34900:17:37,175 --> 00:17:41,510then the regular expression matcherwill just say true, yes,35000:17:41,510 --> 00:17:43,340this original regular expression,35100:17:43,340 --> 00:17:47,270the r1, will be able tomatch that string abc.35200:17:47,270 --> 00:17:50,585And if this test returns false,35300:17:50,585 --> 00:17:53,015then the algorithm says false.35400:17:53,015 --> 00:17:56,975This regular expression willnot match the string.35500:17:56,975 --> 00:18:00,260Ok, you might ask:Why on earth does35600:18:00,260 --> 00:18:02,960that algorithmactually work?35700:18:02,960 --> 00:18:06,515Here's anather explanationfor why it works.35800:18:06,515 --> 00:18:10,190Imagine you have a regularexpression r1, okay?35900:18:10,190 --> 00:18:13,220And you have a string abc,36000:18:13,220 --> 00:18:14,270and you want to find out36100:18:14,270 --> 00:18:17,180whether r1 canmatch that string.36200:18:17,180 --> 00:18:18,799And for the moment,36300:18:18,799 --> 00:18:22,610let's assume that itcan match that string.36400:18:22,610 --> 00:18:26,315Ok? So the language L of36500:18:26,315 --> 00:18:30,185r1 will actuallycontain that string,36600:18:30,185 --> 00:18:31,805otherwise it wouldn't match that.36700:18:31,805 --> 00:18:36,710Okay? So abc is inthis language, okay?36800:18:36,710 --> 00:18:39,785If I now take thesemantic derivative,36900:18:39,785 --> 00:18:43,145that means I look at allthe strings in this L(r1)37000:18:43,145 --> 00:18:46,820and filter out37100:18:46,820 --> 00:18:48,740all the ones whichdo not start with37200:18:48,740 --> 00:18:51,260an a, I discharge them.37300:18:51,260 --> 00:18:54,545And I only look at the oneswhich start with an a.37400:18:54,545 --> 00:18:56,465And of those strings,37500:18:56,465 --> 00:18:58,475I chop off this a.37600:18:58,475 --> 00:19:01,025So after thissemantic derivative,37700:19:01,025 --> 00:19:05,735this set of strings willcontain just bc.37800:19:05,735 --> 00:19:12,830Ok. Now if I build the nextsemantic derivative of that,37900:19:12,830 --> 00:19:14,345then I would look at38000:19:14,345 --> 00:19:16,850all the strings whichstart with a b,38100:19:16,850 --> 00:19:21,350and forget about everythingelse. Of the remaining ones38200:19:21,350 --> 00:19:27,905I know they start with b.I just chop of the b. Ok.38300:19:27,905 --> 00:19:31,655So in this whole set here,38400:19:31,655 --> 00:19:33,785in this whole set here,38500:19:33,785 --> 00:19:39,030there will be now a stringwhich is just c. Okay?38600:19:39,190 --> 00:19:44,420Then I built the thirdsemantic derivative38700:19:44,420 --> 00:19:47,300because I want to find outwhether abc is matched.38800:19:47,300 --> 00:19:50,540Okay? So now I lookat all the strings in38900:19:50,540 --> 00:19:52,820here and look at39000:19:52,820 --> 00:19:55,340them whether they startwith a c. If yes,39100:19:55,340 --> 00:19:56,885I chop off the c.39200:19:56,885 --> 00:19:59,120And put in what is remaining.39300:19:59,120 --> 00:20:00,425So in this case,39400:20:00,425 --> 00:20:02,510if I have the string c39500:20:02,510 --> 00:20:04,550in this language andI chop off this c,39600:20:04,550 --> 00:20:07,700what is remainingis the empty string.39700:20:07,700 --> 00:20:09,695So we have to check of39800:20:09,695 --> 00:20:14,510that language whether itcontains the empty string.39900:20:14,510 --> 00:20:18,800If yes, then theoriginal r1 can match40000:20:18,800 --> 00:20:21,050this abc because this abc40100:20:21,050 --> 00:20:24,119must have been in this language.40200:20:24,130 --> 00:20:28,565And if in the end there wasn'tthe empty string, then40300:20:28,565 --> 00:20:33,575this abs was not inthis language of r1.40400:20:33,575 --> 00:20:36,260And so the lexer must have,40500:20:36,260 --> 00:20:38,880or the matcher must have failed.40600:20:39,040 --> 00:20:42,530The clever bit is that here40700:20:42,530 --> 00:20:45,530the explanation is for languages.40800:20:45,530 --> 00:20:49,835Remember, thissemantic derivative40900:20:49,835 --> 00:20:53,450works over languages and theysometimes can be in finite.41000:20:53,450 --> 00:20:55,730So that's not reallyan algorithm.41100:20:55,730 --> 00:20:58,880That's justexplaining the idea. 41200:20:58,880 --> 00:21:02,525What Brzozowskiachieved was that he41300:21:02,525 --> 00:21:06,440now works with this derivativeregular expressions and41400:21:06,440 --> 00:21:10,715somehow imitates whathappens on these languages.41500:21:10,715 --> 00:21:14,135Because remember if youhave a regular expression,41600:21:14,135 --> 00:21:17,405and you want to testwhether it can match abc,41700:21:17,405 --> 00:21:22,550then you take firstderivative according to a.41800:21:22,550 --> 00:21:25,760So you will get a regularexpression which can match b41900:21:25,760 --> 00:21:29,464and c, if r could match abc.42000:21:29,464 --> 00:21:31,430So after the first derivative,42100:21:31,430 --> 00:21:33,620you will get a regular expressionwhich can match b and42200:21:33,620 --> 00:21:37,070c. If you take thesecond derivative,42300:21:37,070 --> 00:21:41,225you will get a regular expressionwhich can match c alone.42400:21:41,225 --> 00:21:44,180And if you take thefinal derivative,42500:21:44,180 --> 00:21:46,070then you will get42600:21:46,070 --> 00:21:48,260a regular expression which hopefully42700:21:48,260 --> 00:21:49,715can match the empty string.42800:21:49,715 --> 00:21:53,780If it does, then thisr can match the abc.42900:21:53,780 --> 00:21:55,655And if it doesn't, then43000:21:55,655 --> 00:21:58,680abc couldn't bematched by this r.43100:21:58,900 --> 00:22:02,990Okay, let's have a lookhow this pans out in code.43200:22:02,990 --> 00:22:06,050Here's the file re1.sc.43300:22:06,050 --> 00:22:07,940It's also uploaded on KEATS,43400:22:07,940 --> 00:22:10,625so you can see exactlywhat I'm doing.43500:22:10,625 --> 00:22:13,970And actually you already sawthat file because I showed you43600:22:13,970 --> 00:22:15,710how my regular expressions are43700:22:15,710 --> 00:22:17,960defined, with theabstract classes here.43800:22:17,960 --> 00:22:21,155And here, the six casesfor 0, 1, character,43900:22:21,155 --> 00:22:23,540alternative, sequence and star.44000:22:23,540 --> 00:22:26,705Ok. So the firstfunction nullable,44100:22:26,705 --> 00:22:28,760the simple one, takes44200:22:28,760 --> 00:22:32,120a regular expression asargument and returns a boolean.44300:22:32,120 --> 00:22:34,280And then with thispattern matching,44400:22:34,280 --> 00:22:37,040we just go throughall these six cases44500:22:37,040 --> 00:22:38,9000 is defined as false.44600:22:38,900 --> 00:22:43,234One is defined as true.Character, for any character,44700:22:43,234 --> 00:22:45,455this nullable will return false.44800:22:45,455 --> 00:22:47,540The alternative is defined here,44900:22:47,540 --> 00:22:50,000so that's the or in Scala.45000:22:50,000 --> 00:22:52,700And for the sequence,that's the and.45100:22:52,700 --> 00:22:56,180And this star, no matterwhat the regular expression is,45200:22:56,180 --> 00:22:59,540it will always match theempty string, so true.45300:22:59,540 --> 00:23:02,225So nullable is very easy.45400:23:02,225 --> 00:23:07,430The derivative is also notso much more complicated.45500:23:07,430 --> 00:23:08,974It takes two arguments,45600:23:08,974 --> 00:23:11,810a character and theregular expression,45700:23:11,810 --> 00:23:14,405and returns a regular expression.45800:23:14,405 --> 00:23:16,340So now we just have to analyze45900:23:16,340 --> 00:23:18,890what's that regularexpression looks like.46000:23:18,890 --> 00:23:23,870If it's 0, we return0; if it's 1 we return 0.46100:23:23,870 --> 00:23:26,990If the character... If theregular expressions character46200:23:26,990 --> 00:23:30,260d, we test whether it'sequal to this character46300:23:30,260 --> 00:23:32,225we want to take thederivative off.46400:23:32,225 --> 00:23:36,185If yes, return 1, otherwise 0.46500:23:36,185 --> 00:23:38,600The alternative which has pushed46600:23:38,600 --> 00:23:39,860the derivative under46700:23:39,860 --> 00:23:42,710this alternative,that's the easy one.46800:23:42,710 --> 00:23:44,690Here's the sequence case where we46900:23:44,690 --> 00:23:47,015first have to testif it's nullable,47000:23:47,015 --> 00:23:49,040If it's not the have to push47100:23:49,040 --> 00:23:52,160the derivative overthe first, the r1.47200:23:52,160 --> 00:23:56,135And otherwise if it is nullable,we have two cases.47300:23:56,135 --> 00:24:00,605Either you have to pushit over the r1 or r2.47400:24:00,605 --> 00:24:03,860And a star case, this one.47500:24:03,860 --> 00:24:07,160We just build the sequenceof the derivative of47600:24:07,160 --> 00:24:11,600r1 and append ther1 as a sequence,47700:24:11,600 --> 00:24:14,195put the star at the end.47800:24:14,195 --> 00:24:19,430Okay, so here's thefunction for strings.47900:24:19,430 --> 00:24:21,740So a list of characters.48000:24:21,740 --> 00:24:23,870This function takes an s,48100:24:23,870 --> 00:24:26,480a list of characters as argumentand a regular expression48200:24:26,480 --> 00:24:29,360as argument and returnsa regular expression.48300:24:29,360 --> 00:24:31,460And we just have topattern match what48400:24:31,460 --> 00:24:34,055that string looks likeor this list looks like.48500:24:34,055 --> 00:24:35,360If it's the empty list,48600:24:35,360 --> 00:24:38,030it just immediatelyreturn the r. If48700:24:38,030 --> 00:24:42,020it's something of theform c followed by s,48800:24:42,020 --> 00:24:45,050then we build first thederivative according48900:24:45,050 --> 00:24:48,345to c. And thenthe result of that,49000:24:48,345 --> 00:24:50,090we recursively call49100:24:50,090 --> 00:24:53,675the derivativeaccording to s. Okay?49200:24:53,675 --> 00:24:56,060And now the main matcher,49300:24:56,060 --> 00:24:59,360it takes a regularexpression and takes49400:24:59,360 --> 00:25:02,870a string and returns aboolean, true or false.49500:25:02,870 --> 00:25:05,705And it first buildsthe derivative.49600:25:05,705 --> 00:25:07,430The only thing I have to do here49700:25:07,430 --> 00:25:09,410because I'm implementingit in Scala,49800:25:09,410 --> 00:25:15,064I have to coerce between stringsand lists of characters.49900:25:15,064 --> 00:25:16,580So he, I take first50000:25:16,580 --> 00:25:20,810the toList of the s. Thatgives me a list of characters.50100:25:20,810 --> 00:25:23,135Then I build this derivative50200:25:23,135 --> 00:25:26,045with the s, because I'mlooking at strings.50300:25:26,045 --> 00:25:28,160And then at the end,50400:25:28,160 --> 00:25:33,050built the nullable ofthe final derivative.50500:25:33,050 --> 00:25:37,310The beauty of all thisis that in Scala,50600:25:37,310 --> 00:25:40,085these definition whichI had on the slides50700:25:40,085 --> 00:25:43,835go almost one-to-one into code.50800:25:43,835 --> 00:25:45,605And that's of course,50900:25:45,605 --> 00:25:47,480makes it very easyto implement in51000:25:47,480 --> 00:25:49,730a functional language like Scala.51100:25:49,730 --> 00:25:52,865We can also now tryout some examples.51200:25:52,865 --> 00:25:56,960This was the exampleI had on the slide.51300:25:56,960 --> 00:25:58,370And we could now calculate51400:25:58,370 --> 00:26:00,305what's the derivativeaccording to a,51500:26:00,305 --> 00:26:02,720b, and c of thisregular expression.51600:26:02,720 --> 00:26:07,040And I hope you didn'tmake any mistake.51700:26:07,040 --> 00:26:09,260Now of course we wantto see whether we51800:26:09,260 --> 00:26:11,390do any better withthis algorithm...51900:26:11,390 --> 00:26:13,715than Python and Ruby.52000:26:13,715 --> 00:26:16,070And let's first have alook at this example.52100:26:16,070 --> 00:26:18,079So this regular expression.52200:26:18,079 --> 00:26:19,880The problem is that in52300:26:19,880 --> 00:26:22,070our regular expressionmatcher so far we have52400:26:22,070 --> 00:26:24,530the sequence rregularexpression, but we52500:26:24,530 --> 00:26:27,200don't have this optionalregular expression.52600:26:27,200 --> 00:26:30,800And we don't have this ntimes regular expression,52700:26:30,800 --> 00:26:36,605but we can quickly implementthat in our implementation.52800:26:36,605 --> 00:26:40,549So if you want to build theoptional regular expression,52900:26:40,549 --> 00:26:41,870which is defined here,53000:26:41,870 --> 00:26:44,570a little function whichtakes a reg expression and53100:26:44,570 --> 00:26:47,360returns the alternative of r and 1.53200:26:47,360 --> 00:26:49,624So it essentiallytakes the definition53300:26:49,624 --> 00:26:53,240of optional r andreplaces it by that.53400:26:53,240 --> 00:26:56,150The n-times what weessentially do there is53500:26:56,150 --> 00:27:01,535we built n copies of this r. Ok?53600:27:01,535 --> 00:27:04,745So if this n-times was 0,53700:27:04,745 --> 00:27:06,245we just return one.53800:27:06,245 --> 00:27:11,570If it's one, then we returnjust the regular expression.53900:27:11,570 --> 00:27:15,575And if it's somethingbigger than one,54000:27:15,575 --> 00:27:18,560then we just built thesequence of one of54100:27:18,560 --> 00:27:20,270these copies and call54200:27:20,270 --> 00:27:22,925this function againwith n - 1.54300:27:22,925 --> 00:27:26,330So we just build now n-copies of that.54400:27:26,330 --> 00:27:30,710Okay, so we can lookat our first example.54500:27:30,710 --> 00:27:32,629This is the regular expression,54600:27:32,629 --> 00:27:36,035and I call that hereevil regular expression1.54700:27:36,035 --> 00:27:37,670It doesn't look as beautiful54800:27:37,670 --> 00:27:39,140as what we write down on paper.54900:27:39,140 --> 00:27:41,240We will actually lookat this later on55000:27:41,240 --> 00:27:43,640if this can be improved.But here it is.55100:27:43,640 --> 00:27:45,724Here's the regular expression,55200:27:45,724 --> 00:27:49,520and here's a little functionwhich measures the time.55300:27:49,520 --> 00:27:53,180And we can now start testing55400:27:53,180 --> 00:27:57,845this regular expression withstrings just containing a's.55500:27:57,845 --> 00:28:00,020And we are first a bit cautious,55600:28:00,020 --> 00:28:04,985we test it between 0 and 20,and we count by two.55700:28:04,985 --> 00:28:08,330And I actually will notstart that within Scala,55800:28:08,330 --> 00:28:12,965but actually use Ammonite.55900:28:12,965 --> 00:28:15,305And that's out.56000:28:15,305 --> 00:28:17,269And that calculates56100:28:17,269 --> 00:28:20,675for 18 a's. It's pretty fast.56200:28:20,675 --> 00:28:24,815But for 20 it alreadyneeds 2 seconds.56300:28:24,815 --> 00:28:28,265And you can seeactually this jump from56400:28:28,265 --> 00:28:32,82518 to 20 in the runtimeis pretty bad too.56500:28:32,825 --> 00:28:37,460And if we actually measureit more accurately,56600:28:37,460 --> 00:28:39,770then we get this picture.56700:28:39,770 --> 00:28:41,255And what turns out,56800:28:41,255 --> 00:28:45,830we are actually worse than Pythonand Ruby in this example.56900:28:45,830 --> 00:28:49,230So I guess that's a fail.