100:00:06,240 --> 00:00:11,050Welcome back. This videois about regular expressions.200:00:11,050 --> 00:00:14,230We want to use regularexpressions in our lexer.300:00:14,230 --> 00:00:16,165And the purpose of thelexer is to find400:00:16,165 --> 00:00:18,070out where the words in500:00:18,070 --> 00:00:21,070our programs are. Howeverregular expressions600:00:21,070 --> 00:00:23,875are fundamental toolin computer science.700:00:23,875 --> 00:00:27,910And I'm sure you've used themalready on several occasions.800:00:27,910 --> 00:00:30,370And one would expect that about900:00:30,370 --> 00:00:31,750regular expressions since they are1000:00:31,750 --> 00:00:33,850so well-known and well studied,1100:00:33,850 --> 00:00:37,915that everything under thesun is known about them.1200:00:37,915 --> 00:00:41,080But actually there'sstill some surprising1300:00:41,080 --> 00:00:44,465and interestingproblems with them.1400:00:44,465 --> 00:00:47,945And I want to show youthem in this video.1500:00:47,945 --> 00:00:50,720I'm sure you've seenregular expressions1600:00:50,720 --> 00:00:52,445many, many times before.1700:00:52,445 --> 00:00:55,100But just to be on the same page,1800:00:55,100 --> 00:00:57,110let me just recap them.1900:00:57,110 --> 00:00:59,210So here in this line,2000:00:59,210 --> 00:01:01,790there is a regular expressionwhich is supposed to2100:01:01,790 --> 00:01:05,285recognize some formof email addresses.2200:01:05,285 --> 00:01:07,745So an e-mail address2300:01:07,745 --> 00:01:11,000has part which isbefore the @ symbol,2400:01:11,000 --> 00:01:13,400which is the name of the person.2500:01:13,400 --> 00:01:16,880And that can beany number between2600:01:16,880 --> 00:01:20,1950 and 9, and letters between a and z.2700:01:20,195 --> 00:01:24,155Let's say we avoidinghere capital letters.2800:01:24,155 --> 00:01:26,045There can be underscores.2900:01:26,045 --> 00:01:29,405There can be a dot andthere can be hyphens.3000:01:29,405 --> 00:01:35,390And after the @ symbolcomes the domain name.3100:01:35,390 --> 00:01:37,310So as you can see here,3200:01:37,310 --> 00:01:40,640we use things like star to3300:01:40,640 --> 00:01:44,314match letterszero or more times.3400:01:44,314 --> 00:01:45,985Or we have a plus,3500:01:45,985 --> 00:01:47,420which means you have to match3600:01:47,420 --> 00:01:52,489at least once or moretimes. Then we have.3700:01:52,489 --> 00:01:55,790question mark, which says you3800:01:55,790 --> 00:01:59,105match either it is thereor it is not there.3900:01:59,105 --> 00:02:01,340You are also regularexpressions which4000:02:01,340 --> 00:02:03,755match exactly n-times.4100:02:03,755 --> 00:02:08,720Or this is a regular expressionfor between n and m times.4200:02:08,720 --> 00:02:12,065You can see inthis email address,4300:02:12,065 --> 00:02:13,730the top-level domain4400:02:13,730 --> 00:02:16,130name can be any letter 4500:02:16,130 --> 00:02:19,265between a to z,and contain dots,4600:02:19,265 --> 00:02:22,340but can only be twocharacters long4700:02:22,340 --> 00:02:25,685up till six charactersand not more.4800:02:25,685 --> 00:02:29,240Then you also havesomething like ranges.4900:02:29,240 --> 00:02:31,220So you can see, letters between a5000:02:31,220 --> 00:02:33,635and z and 0 to 9 and so on.5100:02:33,635 --> 00:02:36,545Here you also have regularexpressions which can5200:02:36,545 --> 00:02:40,070match something whichisn't in this range.5300:02:40,070 --> 00:02:42,560So for example, ifyou want for example match,5400:02:42,560 --> 00:02:44,030letters but not numbers,5500:02:44,030 --> 00:02:45,800you would say, well, if5600:02:45,800 --> 00:02:48,990this is a number thatshould not match.5700:02:49,090 --> 00:02:52,804Typically you alsohave these ranges.5800:02:52,804 --> 00:02:55,565Lowercase letters,capital letters.5900:02:55,565 --> 00:02:58,550Then you have somespecial regular expressions6000:02:58,550 --> 00:03:02,195like this one is onlysupposed to match digits.6100:03:02,195 --> 00:03:05,674A dot is supposed tomatch any character.6200:03:05,674 --> 00:03:07,370And then they have also something6300:03:07,370 --> 00:03:09,800called groups whichis supposed to be6400:03:09,800 --> 00:03:12,799used when you aretrying to extract6500:03:12,799 --> 00:03:15,605a string you've matched.6600:03:15,605 --> 00:03:19,925Okay, so these are thetypical regular expressions.6700:03:19,925 --> 00:03:23,075And here's a particular one6800:03:23,075 --> 00:03:25,820trying to match something6900:03:25,820 --> 00:03:28,770which resemblesan email address.7000:03:29,590 --> 00:03:33,065Clearly that should be all easy.7100:03:33,065 --> 00:03:36,230And our technology shouldbe on top of that.7200:03:36,230 --> 00:03:37,865That we can take a7300:03:37,865 --> 00:03:41,015regular expressions andwe can take a string,7400:03:41,015 --> 00:03:43,340and we should have programs to7500:03:43,340 --> 00:03:45,680decide whether thisstring is matched7600:03:45,680 --> 00:03:50,330by a regular expression ornot and should be easy-peasy, no?7700:03:50,330 --> 00:03:56,150Well, let's have alook at two examples.7800:03:56,150 --> 00:04:00,860The first regular expressionis a star star b.7900:04:00,860 --> 00:04:02,990And it is supposedto match strings of8000:04:02,990 --> 00:04:05,825the form 0 or more a's,8100:04:05,825 --> 00:04:10,385followed by a b. The parenthesesyou can ignore.8200:04:10,385 --> 00:04:11,990And a star star8300:04:11,990 --> 00:04:14,120also doesn'tmake any difference8400:04:14,120 --> 00:04:16,505to what kind of stringsthat can be matched.8500:04:16,505 --> 00:04:21,635It can only make 0 morea's followed by a b.8600:04:21,635 --> 00:04:23,900And the other regular expression8700:04:23,900 --> 00:04:26,990is possibly a character a,8800:04:26,990 --> 00:04:32,930n times, followed by charactera axactly n-times.8900:04:32,930 --> 00:04:35,570And we will try out9000:04:35,570 --> 00:04:38,360these two regular expressionswith strings of the form a,9100:04:38,360 --> 00:04:39,890aa, and so on,9200:04:39,890 --> 00:04:45,770and up to the length of n. And9300:04:45,770 --> 00:04:49,130this regular expression shouldactually not match any of9400:04:49,130 --> 00:04:53,315the strings because thefinal b is missing.9500:04:53,315 --> 00:04:56,150But that isokay. For example9600:04:56,150 --> 00:04:57,425if you have a regular expression9700:04:57,425 --> 00:05:00,110that is supposed tocheck whether a string is9800:05:00,110 --> 00:05:01,490an email address and the user9900:05:01,490 --> 00:05:03,380gives some randomstrings in there,10000:05:03,380 --> 00:05:06,545then this regular expressionshould not match that string.10100:05:06,545 --> 00:05:08,420And for this regular expression10200:05:08,420 --> 00:05:11,195you have to scratch alittle bit of your head,10300:05:11,195 --> 00:05:12,620what it can actually match.10400:05:12,620 --> 00:05:14,720But after a little bitof head scratching,10500:05:14,720 --> 00:05:18,260you find out can matchany string which is of10600:05:18,260 --> 00:05:22,580the length n a's upto 2n of a's.10700:05:22,580 --> 00:05:24,290So anything in this range,10800:05:24,290 --> 00:05:27,185this regular expressioncan actually match.10900:05:27,185 --> 00:05:30,395Okay, let'stake a random tool,11000:05:30,395 --> 00:05:32,630maybe for example Python.11100:05:32,630 --> 00:05:35,240So here's a littlePython program.11200:05:35,240 --> 00:05:38,690It uses the libraryfunction of Python to11300:05:38,690 --> 00:05:42,935match the regular expressions ofa star star b.11400:05:42,935 --> 00:05:46,805And we measure the time with longerand longer strings of a's.11500:05:46,805 --> 00:05:48,770And so conveniently we can give11600:05:48,770 --> 00:05:51,140the number of a's hereon the command line.11700:05:51,140 --> 00:05:56,900If I just callthis on the command line,11800:05:56,900 --> 00:05:59,900Let's say we firststart with five a's.11900:05:59,900 --> 00:06:03,920And I get also the times whichin this case is next to nothing.12000:06:03,920 --> 00:06:05,960And here's the stringwe just matched.12100:06:05,960 --> 00:06:07,640And obviously theregular expression12200:06:07,640 --> 00:06:09,110did not match the string.12300:06:09,110 --> 00:06:11,255That's indicated by this None.12400:06:11,255 --> 00:06:13,925Let's take ten a's.12500:06:13,925 --> 00:06:16,490It's also pretty quick.12600:06:16,490 --> 00:06:20,780Fifteen a's, even quicker,12700:06:20,780 --> 00:06:23,180but these times always need to12800:06:23,180 --> 00:06:25,820be taken with a grain of salt.12900:06:25,820 --> 00:06:28,040They are not 100percent accurate.13000:06:28,040 --> 00:06:31,490So 15 is also OK.Let's take 20.13100:06:31,490 --> 00:06:36,965Hmmm this already takesdouble the time.13200:06:36,965 --> 00:06:42,440Twenty-five. Then even longer.13300:06:42,440 --> 00:06:45,680Okay, then suddenlyfrom 0.2 seconds,13400:06:45,680 --> 00:06:48,960it now takes almost four seconds.13500:06:49,600 --> 00:06:54,890Twenty-Six, this13600:06:54,890 --> 00:07:01,415takes six seconds...already double. 13700:07:01,415 --> 00:07:07,229Let's go to 28. That would be...hmmm....hmmm13800:07:08,890 --> 00:07:11,840You see the stringisn't very long,13900:07:11,840 --> 00:07:13,340so that could be easily like14000:07:13,340 --> 00:07:16,070just the size ofan email address.14100:07:16,070 --> 00:07:19,280And the regularexpression matching14200:07:19,280 --> 00:07:22,550engine in Python needsquite a long time14300:07:22,550 --> 00:07:24,710to find out thatthis string of 2814400:07:24,710 --> 00:07:26,570a's is actually not matched14500:07:26,570 --> 00:07:28,490by that. You see it'sstill not finished.14600:07:28,490 --> 00:07:32,900I think it should takeapproximately like 20 seconds.14700:07:32,900 --> 00:07:34,400Okay. Already 30.14800:07:34,400 --> 00:07:36,530And if we would try14900:07:36,530 --> 00:07:40,80530, we would be already herefor more than a minute.15000:07:40,805 --> 00:07:43,940And if I could usesomething like 100,15100:07:43,940 --> 00:07:46,220you remember if a doubling in15200:07:46,220 --> 00:07:48,770each step or the second step,15300:07:48,770 --> 00:07:50,720the story with the chess board,15400:07:50,720 --> 00:07:53,855we probably would sit hereuntil the next century.15500:07:53,855 --> 00:07:56,820So something strange here.15600:07:57,580 --> 00:08:01,355Okay, that might be justa problem of Python.15700:08:01,355 --> 00:08:02,990Let's have a look at another15800:08:02,990 --> 00:08:04,985regular expressionmatching engine.15900:08:04,985 --> 00:08:06,890This time from JavaScript,16000:08:06,890 --> 00:08:10,040also are pretty well-knownprogramming language.16100:08:10,040 --> 00:08:13,610So here you can seeit's still a star,16200:08:13,610 --> 00:08:16,235star followed by b,16300:08:16,235 --> 00:08:18,920by direct expression issupposed to match that from16400:08:18,920 --> 00:08:21,830the beginning of thestring up till the end.16500:08:21,830 --> 00:08:23,930So there's not any difference16600:08:23,930 --> 00:08:26,150in the strings this regularexpression matches.16700:08:26,150 --> 00:08:28,610We'll just start at thebeginning of the string16800:08:28,610 --> 00:08:31,460and finish at theend of the string.16900:08:31,460 --> 00:08:35,285And we again, we just userepeated a's for that.17000:08:35,285 --> 00:08:38,195And similarly, we can17100:08:38,195 --> 00:08:41,930call it on the command lineand can do some timing.17200:08:41,930 --> 00:08:44,540So ten a's is very good.17300:08:44,540 --> 00:08:46,340Here's the string.17400:08:46,340 --> 00:08:48,320It cannot match that string.17500:08:48,320 --> 00:08:50,525And it's pretty fast.17600:08:50,525 --> 00:08:54,725Twenty...also pretty fast.17700:08:54,725 --> 00:08:59,120Twenty-five... Again,17800:08:59,120 --> 00:09:06,650somehow is a kind ofthreshold that is 25, 26.17900:09:06,650 --> 00:09:09,485Suddenly it takes much longer.18000:09:09,485 --> 00:09:14,360And it has essentially thesame problem as with Python.18100:09:14,360 --> 00:09:17,165So you'll see in now from 26 on,18200:09:17,165 --> 00:09:19,250the times alwaysdouble from18300:09:19,250 --> 00:09:21,860three seconds to seven seconds.18400:09:21,860 --> 00:09:23,330So you can imagine what that18500:09:23,330 --> 00:09:24,890roughly takes when I put your18600:09:24,890 --> 00:09:30,23027 and you see thestring isn't very long.18700:09:30,230 --> 00:09:32,165It is just twenty-or-something a's.18800:09:32,165 --> 00:09:35,419Imagine you have tosearch a database18900:09:35,419 --> 00:09:38,720with Gigabytes of data19000:09:38,720 --> 00:09:42,260with these regularexpressions that would 19100:09:42,260 --> 00:09:48,150need years to go through withthese regular expressions.19200:09:48,630 --> 00:09:51,850Okay, maybe the people in19300:09:51,850 --> 00:09:55,435Python and JavaScript,they're just idiots.19400:09:55,435 --> 00:09:58,180Surely Java must do much better.19500:09:58,180 --> 00:10:01,045So here's a program.19600:10:01,045 --> 00:10:03,415You can see this again19700:10:03,415 --> 00:10:05,980is the regular expressionand we just having19800:10:05,980 --> 00:10:08,320some scaffolding to generate19900:10:08,320 --> 00:10:11,905strings from 5 up till 28.20000:10:11,905 --> 00:10:14,305And if we run that,20100:10:14,305 --> 00:10:16,660actually does that automatically.20200:10:16,660 --> 00:10:19,900So uphill 19, pretty fast,20300:10:19,900 --> 00:10:24,925but then starting from23, it is getting pretty slow.20400:10:24,925 --> 00:10:27,445So the question iswhat's going on?20500:10:27,445 --> 00:10:29,230By the way, I'm not gloating here.20600:10:29,230 --> 00:10:33,755Scala uses internallythe regular expression20700:10:33,755 --> 00:10:36,665matching engine from Java.20800:10:36,665 --> 00:10:39,065So would have exactlythe same problem.20900:10:39,065 --> 00:10:41,480Also, I have beenhere very careful,21000:10:41,480 --> 00:10:43,550I'm using here Java 8,21100:10:43,550 --> 00:10:46,085which nowadays is quite old.21200:10:46,085 --> 00:10:50,765But you will see alsocurrent Java versions.21300:10:50,765 --> 00:10:55,490We will see we can out-competethem by magnitudes.21400:10:55,490 --> 00:10:57,605So I think I can 21500:10:57,605 --> 00:10:59,165now, just finish this here.21600:10:59,165 --> 00:11:04,025You see the problem.Just for completeness sake.21700:11:04,025 --> 00:11:07,010Here is a Ruby program.21800:11:07,010 --> 00:11:09,935This is using the otherregular expression.21900:11:09,935 --> 00:11:12,935In this case thestring should match.22000:11:12,935 --> 00:11:20,300And again it tries outstrings between 1 and 30 here.22100:11:20,300 --> 00:11:23,450That's a program actuallya former student produced.22200:11:23,450 --> 00:11:25,565And you can see four a's22300:11:25,565 --> 00:11:29,780of links up till 20a's is pretty fast.22400:11:29,780 --> 00:11:32,495But then starting at 26,22500:11:32,495 --> 00:11:35,285it's getting really slow.22600:11:35,285 --> 00:11:37,100So in this case,remember the string22700:11:37,100 --> 00:11:38,870is actually matched bythe regular expression.22800:11:38,870 --> 00:11:40,130So it has nothing to do22900:11:40,130 --> 00:11:41,540with a regularexpression actually23000:11:41,540 --> 00:11:45,485matches a string or doesnot match a string.23100:11:45,485 --> 00:11:48,260I admit though theseregular expressions23200:11:48,260 --> 00:11:49,610are carefully chosen,23300:11:49,610 --> 00:11:52,250as you will see later on.23400:11:52,250 --> 00:11:55,620I also just stop that here.23500:11:55,710 --> 00:12:00,985Okay, this slide collectsthe information about times.23600:12:00,985 --> 00:12:03,400On the right-hand side will23700:12:03,400 --> 00:12:05,860be our regular expression matcher,23800:12:05,860 --> 00:12:08,290which we implement next week.23900:12:08,290 --> 00:12:10,795On the left-hand side,are these times by24000:12:10,795 --> 00:12:14,260various other regularexpression matching engines?24100:12:14,260 --> 00:12:17,809On the top is thisregular expression.24200:12:19,080 --> 00:12:23,335Possible a n-times a n-times.24300:12:23,335 --> 00:12:26,890And on the loweris (a*)* b.24400:12:26,890 --> 00:12:30,370And the x-axis show here24500:12:30,370 --> 00:12:35,335the length of thestring. How many a's.24600:12:35,335 --> 00:12:38,925And on the y-axis is the time24700:12:38,925 --> 00:12:41,660they need to decide whether24800:12:41,660 --> 00:12:44,615the string is matched bythe regular expression or not.24900:12:44,615 --> 00:12:46,415So you can see here, Python,25000:12:46,415 --> 00:12:47,945Java 8 and JavaScript,25100:12:47,945 --> 00:12:52,250they max out approximatelyat between 25 and 30.25200:12:52,250 --> 00:12:53,900Because then it takes already25300:12:53,900 --> 00:12:55,160a half a minute to25400:12:55,160 --> 00:12:57,410decide whether the stringis matched or not.25500:12:57,410 --> 00:13:00,815And similarly, inthe other example,25600:13:00,815 --> 00:13:03,830Python and Ruby max out25700:13:03,830 --> 00:13:07,220at a similar kind oflength of the strings.25800:13:07,220 --> 00:13:10,400Because then they use alsohalf a minute to decide25900:13:10,400 --> 00:13:13,940whether this regular expressionactually matches the string.26000:13:13,940 --> 00:13:16,790Contrast that with26100:13:16,790 --> 00:13:19,235the regular expression matcher26200:13:19,235 --> 00:13:21,470which we're going to implement.26300:13:21,470 --> 00:13:25,040This can matchapproximately 10 thousand26400:13:25,040 --> 00:13:30,065a's in this example andneeds less than ten seconds.26500:13:30,065 --> 00:13:32,285Actually, there will betwo versions of that.26600:13:32,285 --> 00:13:34,850The first version will bealso relatively slow.26700:13:34,850 --> 00:13:36,410But the second version,26800:13:36,410 --> 00:13:38,240in contrast to Python,26900:13:38,240 --> 00:13:40,295Ruby, we'll be blindingly fast.27000:13:40,295 --> 00:13:42,380And in the second example,27100:13:42,380 --> 00:13:45,740you have to be carefulabout the x-axis because27200:13:45,740 --> 00:13:49,385that means four timesten to the power six.27300:13:49,385 --> 00:13:51,695It's actually 4 million a's.27400:13:51,695 --> 00:13:55,100So our regularexpression matcher needs27500:13:55,100 --> 00:13:57,635less than ten seconds to27600:13:57,635 --> 00:14:00,725match a string of lengthof 4 million a's.27700:14:00,725 --> 00:14:04,430Contrast that Python, Java 8,27800:14:04,430 --> 00:14:06,770and JavaScript need half a minute27900:14:06,770 --> 00:14:09,905already for a stringof length just 30.28000:14:09,905 --> 00:14:12,365I was very careful with Java 8.28100:14:12,365 --> 00:14:15,725Yes, Java 9 and above,28200:14:15,725 --> 00:14:17,180they already have28300:14:17,180 --> 00:14:19,610a much better regularexpression matching engine,28400:14:19,610 --> 00:14:22,805but still we will be runningcircles around them.28500:14:22,805 --> 00:14:27,050with this data.I call this slide:28600:14:27,050 --> 00:14:29,675Why bother withregular expressions?28700:14:29,675 --> 00:14:33,515But you can probablysee these are28800:14:33,515 --> 00:14:34,910abysmal times by28900:14:34,910 --> 00:14:38,015the existing regularexpression matching engines.29000:14:38,015 --> 00:14:40,070And it's actuallysurprising that after29100:14:40,070 --> 00:14:42,695one lecture we can alreadydo substantially better.29200:14:42,695 --> 00:14:47,495And if you don't believein the times, I gave here,29300:14:47,495 --> 00:14:50,090please feel free toplay on your own29400:14:50,090 --> 00:14:52,865with the examplesI uploaded on KEATS.29500:14:52,865 --> 00:14:55,235These are exactly the programs29600:14:55,235 --> 00:14:57,470I used here in the examples.29700:14:57,470 --> 00:14:59,255So feel free.29800:14:59,255 --> 00:15:01,970You might however now think, hmm.29900:15:01,970 --> 00:15:05,449These are two verywell chosen examples,30000:15:05,449 --> 00:15:07,145and I admit that's true,30100:15:07,145 --> 00:15:09,410and such problems never30200:15:09,410 --> 00:15:12,540cause any problemsin real life.30300:15:13,300 --> 00:15:15,980Regular expressions are used very30400:15:15,980 --> 00:15:19,415frequently and theydo cause problems.30500:15:19,415 --> 00:15:21,410So here's my first example from30600:15:21,410 --> 00:15:23,885a company called Cloudflare.30700:15:23,885 --> 00:15:27,560This is a huge hosting company30800:15:27,560 --> 00:15:30,935which hosts verywell-known web pages.30900:15:30,935 --> 00:15:34,970And they really try hardto have no outage at all.31000:15:34,970 --> 00:15:37,340And they managethat for six years.31100:15:37,340 --> 00:15:39,320But then a regular expression,31200:15:39,320 --> 00:15:41,180actually this one, causeda problem and you31300:15:41,180 --> 00:15:43,265can see they're alsotwo stars31400:15:43,265 --> 00:15:44,630at the end.31500:15:44,630 --> 00:15:46,955And because of that string needed31600:15:46,955 --> 00:15:49,865too much time to be matched.31700:15:49,865 --> 00:15:50,990And because of that,31800:15:50,990 --> 00:15:52,430they had some outage for,31900:15:52,430 --> 00:15:54,125I think several hours,32000:15:54,125 --> 00:15:57,920actually in their malwaredetection subsystem.32100:15:57,920 --> 00:16:02,060And the second examplecomes from 2016,32200:16:02,060 --> 00:16:04,040where Stack Exchange,I guess you know32300:16:04,040 --> 00:16:06,650this webpage, hadalso an outage for32400:16:06,650 --> 00:16:08,390I think at least an hour.32500:16:08,390 --> 00:16:13,070Because a regular expression,needed to format posts,32600:16:13,070 --> 00:16:15,575needed too much time to32700:16:15,575 --> 00:16:19,010recognize whether this postshould be accepted or not.32800:16:19,010 --> 00:16:23,390And again, there was asimilar kind of problem.32900:16:23,390 --> 00:16:24,950And you can readthe stories behind33000:16:24,950 --> 00:16:28,080that on these two given links.33100:16:28,720 --> 00:16:31,730When I looked atthis the first time,33200:16:31,730 --> 00:16:34,175what surprised me isthat theoreticians,33300:16:34,175 --> 00:16:37,520who sometimes dedicate theirlife to regular expressions33400:16:37,520 --> 00:16:39,440and know really a lot about33500:16:39,440 --> 00:16:41,690them, didn't knowanything about this.33600:16:41,690 --> 00:16:43,610But engineers, theyalready created33700:16:43,610 --> 00:16:46,160a name for that:Regular Expression33800:16:46,160 --> 00:16:47,975Denial of Service Attack.33900:16:47,975 --> 00:16:49,745Because what you can,34000:16:49,745 --> 00:16:51,230what can happen now is that34100:16:51,230 --> 00:16:54,920attackers look forcertain strings34200:16:54,920 --> 00:16:56,780that make your regular expression34300:16:56,780 --> 00:16:59,105matching engine topple over.34400:16:59,105 --> 00:17:01,370And these kind of 34500:17:01,370 --> 00:17:04,160regular expressions are calledEvil Regular Expressions.34600:17:04,160 --> 00:17:06,350And actually there arequite a number of them.34700:17:06,350 --> 00:17:08,495So you seen this one,34800:17:08,495 --> 00:17:11,255the first one, and thesecond one already.34900:17:11,255 --> 00:17:13,400But there are many, many more.35000:17:13,400 --> 00:17:15,620And you can easily have in35100:17:15,620 --> 00:17:18,560your program one ofthese regular expressions.35200:17:18,560 --> 00:17:21,830And then you have theproblem that if you do have35300:17:21,830 --> 00:17:23,240this regular expression and35400:17:23,240 --> 00:17:25,640somebody finds thecorresponding string,35500:17:25,640 --> 00:17:29,945which make the regularmatching engine topple over,35600:17:29,945 --> 00:17:31,820then you have a problem35700:17:31,820 --> 00:17:34,295because your webpage isprobably not available.35800:17:34,295 --> 00:17:36,140This phenomenon is also sometimes 35900:17:36,140 --> 00:17:39,350calledcatastrophic backtracking.36000:17:39,350 --> 00:17:43,595In lecture three, we willlook at this more carefully.36100:17:43,595 --> 00:17:46,910And actually why thatis such a problem in36200:17:46,910 --> 00:17:50,795real life is actuallynot to do with lexers.36300:17:50,795 --> 00:17:53,180Yes, regularexpressions are used as36400:17:53,180 --> 00:17:55,040the basic tool for implementing36500:17:55,040 --> 00:17:57,185lexers. But regular expressions,36600:17:57,185 --> 00:18:00,065of course, used ina much wider area.36700:18:00,065 --> 00:18:03,770And they especially used fornetwork intrusion detection.36800:18:03,770 --> 00:18:06,590Remember, say you're having to36900:18:06,590 --> 00:18:10,130administer a big networkand you only want to let37000:18:10,130 --> 00:18:13,640in packets which you think are OK37100:18:13,640 --> 00:18:14,930and you want to keep out37200:18:14,930 --> 00:18:17,645any package which mighthack into your network.37300:18:17,645 --> 00:18:22,670So what they have is theyhave suites of thousands and37400:18:22,670 --> 00:18:25,745sometimes even moreregular expressions which37500:18:25,745 --> 00:18:27,755check whether this package37600:18:27,755 --> 00:18:30,065satisfies some patterns or not.37700:18:30,065 --> 00:18:31,460And in this case it will be left37800:18:31,460 --> 00:18:34,205out or it will be let in.37900:18:34,205 --> 00:18:36,335And with networks,38000:18:36,335 --> 00:18:39,080the problem is that ourhardware is already38100:18:39,080 --> 00:18:43,190so fast that the regularexpressions38200:18:43,190 --> 00:18:45,169really become a bottleneck.38300:18:45,169 --> 00:18:47,060Because what do you do if now is38400:18:47,060 --> 00:18:49,880suddenly a regular expressiontakes too much time?38500:18:49,880 --> 00:18:52,670Do you just stop the matching38600:18:52,670 --> 00:18:55,100and let the packagein regardless?38700:18:55,100 --> 00:18:58,190Or do you just holdthe network up38800:18:58,190 --> 00:19:01,715and don't let anything inuntil you decided that.38900:19:01,715 --> 00:19:04,895So that's actually areally hard problem.39000:19:04,895 --> 00:19:06,650But the first time I came across39100:19:06,650 --> 00:19:09,965that problem was actuallyby this engineer.39200:19:09,965 --> 00:19:13,820And it's always say thatGermans don't have any humor.39300:19:13,820 --> 00:19:16,985But I found thatvideo quite funny.39400:19:16,985 --> 00:19:19,145Maybe you have adifferent opinion,39500:19:19,145 --> 00:19:21,095but feel free tohave a look. 39600:19:21,095 --> 00:19:23,705It explains exactly that problem.39700:19:23,705 --> 00:19:25,610So in the next video,39800:19:25,610 --> 00:19:28,445we will start toimplement this matcher.39900:19:28,445 --> 00:19:30,870So I hope to see you there.