diff -r d41956ea544e -r 82a1315c128d videos/01-evilregexes.srt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/videos/01-evilregexes.srt Wed Sep 23 11:34:43 2020 +0100 @@ -0,0 +1,1813 @@ +1 +00:00:06,240 --> 00:00:11,050 +Welcome back. This video +is about regular expressions. + +2 +00:00:11,050 --> 00:00:14,230 +We want to use regular +expressions in our lexer. + +3 +00:00:14,230 --> 00:00:16,165 +And the purpose of the +lexer is to find + +4 +00:00:16,165 --> 00:00:18,070 +out where the words in + +5 +00:00:18,070 --> 00:00:21,070 +our programs are. However +regular expressions + +6 +00:00:21,070 --> 00:00:23,875 +are fundamental tool +in computer science. + +7 +00:00:23,875 --> 00:00:27,910 +And I'm sure you've used them +already on several occasions. + +8 +00:00:27,910 --> 00:00:30,370 +And one would expect that about + +9 +00:00:30,370 --> 00:00:31,750 +regular expressions since they are + +10 +00:00:31,750 --> 00:00:33,850 +so well-known and well studied, + +11 +00:00:33,850 --> 00:00:37,915 +that everything under the +sun is known about them. + +12 +00:00:37,915 --> 00:00:41,080 +But actually there's +still some surprising + +13 +00:00:41,080 --> 00:00:44,465 +and interesting +problems with them. + +14 +00:00:44,465 --> 00:00:47,945 +And I want to show you +them in this video. + +15 +00:00:47,945 --> 00:00:50,720 +I'm sure you've seen +regular expressions + +16 +00:00:50,720 --> 00:00:52,445 +many, many times before. + +17 +00:00:52,445 --> 00:00:55,100 +But just to be on the same page, + +18 +00:00:55,100 --> 00:00:57,110 +let me just recap them. + +19 +00:00:57,110 --> 00:00:59,210 +So here in this line, + +20 +00:00:59,210 --> 00:01:01,790 +there is a regular expression +which is supposed to + +21 +00:01:01,790 --> 00:01:05,285 +recognize some form +of email addresses. + +22 +00:01:05,285 --> 00:01:07,745 +So an e-mail address + +23 +00:01:07,745 --> 00:01:11,000 +has part which is +before the @ symbol, + +24 +00:01:11,000 --> 00:01:13,400 +which is the name of the person. + +25 +00:01:13,400 --> 00:01:16,880 +And that can be +any number between + +26 +00:01:16,880 --> 00:01:20,195 +0 and 9, and letters between a and z. + +27 +00:01:20,195 --> 00:01:24,155 +Let's say we avoiding +here capital letters. + +28 +00:01:24,155 --> 00:01:26,045 +There can be underscores. + +29 +00:01:26,045 --> 00:01:29,405 +There can be a dot and +there can be hyphens. + +30 +00:01:29,405 --> 00:01:35,390 +And after the @ symbol +comes the domain name. + +31 +00:01:35,390 --> 00:01:37,310 +So as you can see here, + +32 +00:01:37,310 --> 00:01:40,640 +we use things like star to + +33 +00:01:40,640 --> 00:01:44,314 +match letters +zero or more times. + +34 +00:01:44,314 --> 00:01:45,985 +Or we have a plus, + +35 +00:01:45,985 --> 00:01:47,420 +which means you have to match + +36 +00:01:47,420 --> 00:01:52,489 +at least once or more +times. Then we have. + +37 +00:01:52,489 --> 00:01:55,790 +question mark, which says you + +38 +00:01:55,790 --> 00:01:59,105 +match either it is there +or it ss not there. + +39 +00:01:59,105 --> 00:02:01,340 +You are also regular +expressions which + +40 +00:02:01,340 --> 00:02:03,755 +match exactly n-times. + +41 +00:02:03,755 --> 00:02:08,720 +Or this is a regular expression +for between n and m times. + +42 +00:02:08,720 --> 00:02:12,065 +You can see in +this email address, + +43 +00:02:12,065 --> 00:02:13,730 +the top-level domain + +44 +00:02:13,730 --> 00:02:16,130 +name can be any letter + +45 +00:02:16,130 --> 00:02:19,265 +between a to z, +and contain dots, + +46 +00:02:19,265 --> 00:02:22,340 +but can only be two +characters long + +47 +00:02:22,340 --> 00:02:25,685 +up till six characters +and not more. + +48 +00:02:25,685 --> 00:02:29,240 +Then you also have +something like ranges. + +49 +00:02:29,240 --> 00:02:31,220 +So you can see, letters between a + +50 +00:02:31,220 --> 00:02:33,635 +and z and 0 to 9 and so on. + +51 +00:02:33,635 --> 00:02:36,545 +Here you also have regular +expression which can + +52 +00:02:36,545 --> 00:02:40,070 +match something which +isn't in this range. + +53 +00:02:40,070 --> 00:02:42,560 +So for example, if +you want for example match, + +54 +00:02:42,560 --> 00:02:44,030 +letters but not numbers, + +55 +00:02:44,030 --> 00:02:45,800 +you would say, well, if + +56 +00:02:45,800 --> 00:02:48,990 +this is a number that +should not match. + +57 +00:02:49,090 --> 00:02:52,804 +Typically you also +have these ranges. + +58 +00:02:52,804 --> 00:02:55,565 +Lowercase letters, +capital letters. + +59 +00:02:55,565 --> 00:02:58,550 +Then you have some +special regular expressions + +60 +00:02:58,550 --> 00:03:02,195 +like this one is only +supposed to match digits. + +61 +00:03:02,195 --> 00:03:05,674 +A dot is supposed to +match any character. + +62 +00:03:05,674 --> 00:03:07,370 +And then they have also something + +63 +00:03:07,370 --> 00:03:09,800 +called groups which +is supposed to be + +64 +00:03:09,800 --> 00:03:12,799 +used when you are +trying to extract + +65 +00:03:12,799 --> 00:03:15,605 +a string you've matched. + +66 +00:03:15,605 --> 00:03:19,925 +Okay, so these are the +typical regular expressions. + +67 +00:03:19,925 --> 00:03:23,075 +And here's a particular one. + +68 +00:03:23,075 --> 00:03:25,820 +Trying to match something + +69 +00:03:25,820 --> 00:03:28,770 +which resembles +an email address. + +70 +00:03:29,590 --> 00:03:33,065 +Clearly that should be all easy. + +71 +00:03:33,065 --> 00:03:36,230 +And our technology should +be on top of that. + +72 +00:03:36,230 --> 00:03:37,865 +That we can take a + +73 +00:03:37,865 --> 00:03:41,015 +regular expressions and +we can take a string, + +74 +00:03:41,015 --> 00:03:43,340 +and we should have programs to + +75 +00:03:43,340 --> 00:03:45,680 +decide whether this +string is matched + +76 +00:03:45,680 --> 00:03:50,330 +by a regular expression or +not and should be easy-peasy, no? + +77 +00:03:50,330 --> 00:03:56,150 +Well, let's have a +look at two examples. + +78 +00:03:56,150 --> 00:04:00,860 +The first regular expression +is a star star b. + +79 +00:04:00,860 --> 00:04:02,990 +And it is supposed +to match strings of + +80 +00:04:02,990 --> 00:04:05,825 +the form 0 or more a's, + +81 +00:04:05,825 --> 00:04:10,385 +followed by a b. The parentheses +you can ignore. + +82 +00:04:10,385 --> 00:04:11,990 +And a star star + +83 +00:04:11,990 --> 00:04:14,120 +also doesn't +make any difference + +84 +00:04:14,120 --> 00:04:16,505 +to what kind of strings +that can be matched. + +85 +00:04:16,505 --> 00:04:21,635 +It can only make 0 more +a's followed by a b. + +86 +00:04:21,635 --> 00:04:23,900 +And the other regular expression + +87 +00:04:23,900 --> 00:04:26,990 +is possibly a character a, + +88 +00:04:26,990 --> 00:04:32,930 +n times, followed by character +a axactly n-times. + +89 +00:04:32,930 --> 00:04:35,570 +And we will try out + +90 +00:04:35,570 --> 00:04:38,360 +these two regular expressions +with strings of the form a, + +91 +00:04:38,360 --> 00:04:39,890 +aa, and so on, + +92 +00:04:39,890 --> 00:04:45,770 +and up to the length of n. And + +93 +00:04:45,770 --> 00:04:49,130 +this regular expression should +actually not match any of + +94 +00:04:49,130 --> 00:04:53,315 +the strings because the +final b is missing. + +95 +00:04:53,315 --> 00:04:56,150 +But that is +okay. For example + +96 +00:04:56,150 --> 00:04:57,425 +if you have a regular expression + +97 +00:04:57,425 --> 00:05:00,110 +that is supposed to +check whether a string is + +98 +00:05:00,110 --> 00:05:01,490 +an email address and the user + +99 +00:05:01,490 --> 00:05:03,380 +gives some random +strings in there, + +100 +00:05:03,380 --> 00:05:06,545 +then this regular expression +should not match that string. + +101 +00:05:06,545 --> 00:05:08,420 +And for this regular expression + +102 +00:05:08,420 --> 00:05:11,195 +you have to scratch a +little bit of your head, + +103 +00:05:11,195 --> 00:05:12,620 +what it can actually match. + +104 +00:05:12,620 --> 00:05:14,720 +But after a little bit +of head scratching, + +105 +00:05:14,720 --> 00:05:18,260 +you find out can match +any string which is of + +106 +00:05:18,260 --> 00:05:22,580 +the length n a's up +to 2n of a's. + +107 +00:05:22,580 --> 00:05:24,290 +So anything in this range, + +108 +00:05:24,290 --> 00:05:27,185 +this regular expression +can actually match. + +109 +00:05:27,185 --> 00:05:30,395 +Okay, let's +take a random tool, + +110 +00:05:30,395 --> 00:05:32,630 +maybe for example Python. + +111 +00:05:32,630 --> 00:05:35,240 +So here's a little +Python program. + +112 +00:05:35,240 --> 00:05:38,690 +It uses the library +function of Python to + +113 +00:05:38,690 --> 00:05:42,935 +match the regular expressions of +a star star b. + +114 +00:05:42,935 --> 00:05:46,805 +And we measure time with longer +and longer strings of a. + +115 +00:05:46,805 --> 00:05:48,770 +And so conveniently we can give + +116 +00:05:48,770 --> 00:05:51,140 +the number of a's here +on the command line. + +117 +00:05:51,140 --> 00:05:56,900 +If I just call +this on the command line, + +118 +00:05:56,900 --> 00:05:59,900 +Let's say we first +start with five a's. + +119 +00:05:59,900 --> 00:06:03,920 +And I get also the times which +in this case is next to nothing. + +120 +00:06:03,920 --> 00:06:05,960 +And here's the string +we just matched. + +121 +00:06:05,960 --> 00:06:07,640 +And obviously the +regular expression + +122 +00:06:07,640 --> 00:06:09,110 +did not match the string. + +123 +00:06:09,110 --> 00:06:11,255 +That's indicated by this none. + +124 +00:06:11,255 --> 00:06:13,925 +Let's take ten a's. + +125 +00:06:13,925 --> 00:06:16,490 +It's also pretty quick. + +126 +00:06:16,490 --> 00:06:20,780 +Fifteen a's, even quicker, + +127 +00:06:20,780 --> 00:06:23,180 +but these times always need to + +128 +00:06:23,180 --> 00:06:25,820 +be taken with a grain of salt. + +129 +00:06:25,820 --> 00:06:28,040 +They are not 100 +percent accurate. + +130 +00:06:28,040 --> 00:06:31,490 +So 15 is also a let's take + +131 +00:06:31,490 --> 00:06:36,965 +28th notes already +double the time. + +132 +00:06:36,965 --> 00:06:42,440 +Twenty-five longer. + +133 +00:06:42,440 --> 00:06:45,680 +Okay, that suddenly +from 02 seconds, + +134 +00:06:45,680 --> 00:06:48,960 +it takes almost four seconds. + +135 +00:06:49,600 --> 00:06:54,890 +Six this + +136 +00:06:54,890 --> 00:07:01,415 +takes six seconds +already Double, okay? + +137 +00:07:01,415 --> 00:07:07,229 +Go to 28. That would be now. + +138 +00:07:08,890 --> 00:07:11,840 +You see the string +isn't very long, + +139 +00:07:11,840 --> 00:07:13,340 +so that could be easily like + +140 +00:07:13,340 --> 00:07:16,070 +just the size of +an email address. + +141 +00:07:16,070 --> 00:07:19,280 +And the regular +expression matching + +142 +00:07:19,280 --> 00:07:22,550 +engine in Python needs +quite a long time + +143 +00:07:22,550 --> 00:07:24,710 +to find out that +this string of 28 + +144 +00:07:24,710 --> 00:07:26,570 +AES is actually not much + +145 +00:07:26,570 --> 00:07:28,490 +by that you see it's +still not finished. + +146 +00:07:28,490 --> 00:07:32,900 +I think it should take +approximately like 20 seconds. + +147 +00:07:32,900 --> 00:07:34,400 +Okay. Already 30. + +148 +00:07:34,400 --> 00:07:36,530 +And if we would try + +149 +00:07:36,530 --> 00:07:40,805 +30 would be already +more than a minute. + +150 +00:07:40,805 --> 00:07:43,940 +And if I could read +something like hundreds, + +151 +00:07:43,940 --> 00:07:46,220 +you remember if a doubling in + +152 +00:07:46,220 --> 00:07:48,770 +each step or the second step, + +153 +00:07:48,770 --> 00:07:50,720 +the story with the chess board, + +154 +00:07:50,720 --> 00:07:53,855 +we probably would sit here +until the next century. + +155 +00:07:53,855 --> 00:07:56,820 +So something strange here. + +156 +00:07:57,580 --> 00:08:01,355 +Okay, that might be just +a problem of Python. + +157 +00:08:01,355 --> 00:08:02,990 +Let's have a look at another + +158 +00:08:02,990 --> 00:08:04,985 +regular expression +matching engine. + +159 +00:08:04,985 --> 00:08:06,890 +This time from JavaScript, + +160 +00:08:06,890 --> 00:08:10,040 +also are pretty well-known +programming language. + +161 +00:08:10,040 --> 00:08:13,610 +So here you can see +it's still a star, + +162 +00:08:13,610 --> 00:08:16,235 +star followed by b, + +163 +00:08:16,235 --> 00:08:18,920 +by direct expression is +supposed to match that from + +164 +00:08:18,920 --> 00:08:21,830 +the beginning of the +string up till the end. + +165 +00:08:21,830 --> 00:08:23,930 +So there's not any difference + +166 +00:08:23,930 --> 00:08:26,150 +in the strings this work +expression matches. + +167 +00:08:26,150 --> 00:08:28,610 +We'll just start at the +beginning of the string + +168 +00:08:28,610 --> 00:08:31,460 +and finish at the +end of the string. + +169 +00:08:31,460 --> 00:08:35,285 +And we again, we just use +repeated A's for that. + +170 +00:08:35,285 --> 00:08:38,195 +And similarly, we can + +171 +00:08:38,195 --> 00:08:41,930 +call it on the command line +and can do some timing. + +172 +00:08:41,930 --> 00:08:44,540 +So ten SBA, good. + +173 +00:08:44,540 --> 00:08:46,340 +Here's the string. + +174 +00:08:46,340 --> 00:08:48,320 +It cannot match that string. + +175 +00:08:48,320 --> 00:08:50,525 +And it's pretty fast. + +176 +00:08:50,525 --> 00:08:54,725 +Friendly. Although pretty fast. + +177 +00:08:54,725 --> 00:08:59,120 +Five, again, + +178 +00:08:59,120 --> 00:09:06,650 +somehow is kind of +threshold that is 25, 26. + +179 +00:09:06,650 --> 00:09:09,485 +Suddenly it takes much longer. + +180 +00:09:09,485 --> 00:09:14,360 +And it has essentially the +same problem as with Python. + +181 +00:09:14,360 --> 00:09:17,165 +So you'll see in now from 26 on, + +182 +00:09:17,165 --> 00:09:19,250 +the Times has always +doubling from + +183 +00:09:19,250 --> 00:09:21,860 +three seconds to seven seconds. + +184 +00:09:21,860 --> 00:09:23,330 +So you can imagine what that + +185 +00:09:23,330 --> 00:09:24,890 +roughly takes when I put your + +186 +00:09:24,890 --> 00:09:30,230 +27 and you see the +string isn't very long. + +187 +00:09:30,230 --> 00:09:32,165 +Let's choose twenties or maize. + +188 +00:09:32,165 --> 00:09:35,419 +Imagine you have to +search a database + +189 +00:09:35,419 --> 00:09:38,720 +with kilobytes of data. + +190 +00:09:38,720 --> 00:09:42,260 +This, these regular +expressions that would years + +191 +00:09:42,260 --> 00:09:48,150 +need years to go through with +these regular expressions. + +192 +00:09:48,630 --> 00:09:51,850 +Okay, maybe the people in + +193 +00:09:51,850 --> 00:09:55,435 +Python and JavaScript, +they're just idiots. + +194 +00:09:55,435 --> 00:09:58,180 +Surely Java must do much better. + +195 +00:09:58,180 --> 00:10:01,045 +So here's a program. + +196 +00:10:01,045 --> 00:10:03,415 +You can see this again + +197 +00:10:03,415 --> 00:10:05,980 +is the reg expression +and we just having + +198 +00:10:05,980 --> 00:10:08,320 +some scaffolding to generate + +199 +00:10:08,320 --> 00:10:11,905 +strings from five up till 28. + +200 +00:10:11,905 --> 00:10:14,305 +And if we run that, + +201 +00:10:14,305 --> 00:10:16,660 +actually does that automatically. + +202 +00:10:16,660 --> 00:10:19,900 +So uphill 19, pretty fast, + +203 +00:10:19,900 --> 00:10:24,925 +but then starting from +23, skidding pretty slow. + +204 +00:10:24,925 --> 00:10:27,445 +So the question is +what's going on? + +205 +00:10:27,445 --> 00:10:29,230 +By the way, I'm not quoting here. + +206 +00:10:29,230 --> 00:10:33,755 +Scala, using internally +the regular expression + +207 +00:10:33,755 --> 00:10:36,665 +matching engine from Java. + +208 +00:10:36,665 --> 00:10:39,065 +So would have exactly +the same problem. + +209 +00:10:39,065 --> 00:10:41,480 +Also, I have been +here very careful, + +210 +00:10:41,480 --> 00:10:43,550 +I'm using here Scala aid, + +211 +00:10:43,550 --> 00:10:46,085 +which nowadays is quite old. + +212 +00:10:46,085 --> 00:10:50,765 +But you will see also +current Java versions. + +213 +00:10:50,765 --> 00:10:55,490 +We will see we can out-compete +them by magnitudes. + +214 +00:10:55,490 --> 00:10:57,605 +So I think I can that. + +215 +00:10:57,605 --> 00:10:59,165 +Now, just finish here. + +216 +00:10:59,165 --> 00:11:04,025 +You see the problem. Just +for completeness sake. + +217 +00:11:04,025 --> 00:11:07,010 +Here is a Ruby program. + +218 +00:11:07,010 --> 00:11:09,935 +This is using the other +regular expression. + +219 +00:11:09,935 --> 00:11:12,935 +In this case the +string should match. + +220 +00:11:12,935 --> 00:11:20,300 +And again it tries out +strings between 130 here. + +221 +00:11:20,300 --> 00:11:23,450 +That's a program actually +a former student produced. + +222 +00:11:23,450 --> 00:11:25,565 +And you can see four a's + +223 +00:11:25,565 --> 00:11:29,780 +of links up till 20 +AES is pretty fast. + +224 +00:11:29,780 --> 00:11:32,495 +But then starting at 26, + +225 +00:11:32,495 --> 00:11:35,285 +it's getting really slow. + +226 +00:11:35,285 --> 00:11:37,100 +So in this case, +remember the string + +227 +00:11:37,100 --> 00:11:38,870 +is actually matched by +the regular expression. + +228 +00:11:38,870 --> 00:11:40,130 +So it has nothing to do + +229 +00:11:40,130 --> 00:11:41,540 +with a regular +expression actually + +230 +00:11:41,540 --> 00:11:45,485 +matches a string or does +not match a string. + +231 +00:11:45,485 --> 00:11:48,260 +I admit though these +regular expressions + +232 +00:11:48,260 --> 00:11:49,610 +are carefully chosen, + +233 +00:11:49,610 --> 00:11:52,250 +as you will see later on. + +234 +00:11:52,250 --> 00:11:55,620 +Hey, I also just stop that here. + +235 +00:11:55,710 --> 00:12:00,985 +Okay, this slight collect +this information about times. + +236 +00:12:00,985 --> 00:12:03,400 +On the right hand side will + +237 +00:12:03,400 --> 00:12:05,860 +be our regular expression mantra, + +238 +00:12:05,860 --> 00:12:08,290 +which we implement next week. + +239 +00:12:08,290 --> 00:12:10,795 +On the left-hand side, +are these times by + +240 +00:12:10,795 --> 00:12:14,260 +barriers than regular +expression matching engines? + +241 +00:12:14,260 --> 00:12:17,809 +On the top is this +regular expression. + +242 +00:12:19,080 --> 00:12:23,335 +Possible a n times a n times. + +243 +00:12:23,335 --> 00:12:26,890 +And on the lowest +is a star, star b. + +244 +00:12:26,890 --> 00:12:30,370 +And the x-axis show here + +245 +00:12:30,370 --> 00:12:35,335 +the length of the +string. How many a's. + +246 +00:12:35,335 --> 00:12:38,925 +And on the y axis is the time. + +247 +00:12:38,925 --> 00:12:41,660 +They need to decide whether + +248 +00:12:41,660 --> 00:12:44,615 +the string is matched by +the rate expression or not. + +249 +00:12:44,615 --> 00:12:46,415 +So you can see here, Python, + +250 +00:12:46,415 --> 00:12:47,945 +Java eight in JavaScript, + +251 +00:12:47,945 --> 00:12:52,250 +they max out approximately +at between 2530. + +252 +00:12:52,250 --> 00:12:53,900 +The kristin, it takes already + +253 +00:12:53,900 --> 00:12:55,160 +a half a minute to + +254 +00:12:55,160 --> 00:12:57,410 +decide whether the string +is matched or not. + +255 +00:12:57,410 --> 00:13:00,815 +And similarly, in +the other example, + +256 +00:13:00,815 --> 00:13:03,830 +Python and derived Ruby max out + +257 +00:13:03,830 --> 00:13:07,220 +at a similar kind of +length of the strings. + +258 +00:13:07,220 --> 00:13:10,400 +Because then they use also +half a minute to decide + +259 +00:13:10,400 --> 00:13:13,940 +whether this rec expression +actually matches the string. + +260 +00:13:13,940 --> 00:13:16,790 +Contrast that with +the reg expression + +261 +00:13:16,790 --> 00:13:19,235 +which we are regular +expression mantra, + +262 +00:13:19,235 --> 00:13:21,470 +which we're going to implement. + +263 +00:13:21,470 --> 00:13:25,040 +This can match +approximately 10 thousand + +264 +00:13:25,040 --> 00:13:30,065 +a's in this example and +needs less than ten seconds. + +265 +00:13:30,065 --> 00:13:32,285 +Actually, there will be +two versions of that. + +266 +00:13:32,285 --> 00:13:34,850 +First version may be +also relatively slow. + +267 +00:13:34,850 --> 00:13:36,410 +But the second version, + +268 +00:13:36,410 --> 00:13:38,240 +in contrast to Python, + +269 +00:13:38,240 --> 00:13:40,295 +Ruby, we'll be blindingly fast. + +270 +00:13:40,295 --> 00:13:42,380 +And in the second example, + +271 +00:13:42,380 --> 00:13:45,740 +you have to be careful +about the x axis because + +272 +00:13:45,740 --> 00:13:49,385 +that means four times +ten to the power six. + +273 +00:13:49,385 --> 00:13:51,695 +It's actually 4 million A's. + +274 +00:13:51,695 --> 00:13:55,100 +So our regular +expression match or need + +275 +00:13:55,100 --> 00:13:57,635 +less than ten seconds to + +276 +00:13:57,635 --> 00:14:00,725 +match a string of length +of 4 million A's. + +277 +00:14:00,725 --> 00:14:04,430 +Contrast that Python, Java eight, + +278 +00:14:04,430 --> 00:14:06,770 +and JavaScript need half a minute + +279 +00:14:06,770 --> 00:14:09,905 +already for a string +of length just 30, + +280 +00:14:09,905 --> 00:14:12,365 +unless you're very +careful with Java eight. + +281 +00:14:12,365 --> 00:14:15,725 +Yes, Java nine and above, + +282 +00:14:15,725 --> 00:14:17,180 +they already have + +283 +00:14:17,180 --> 00:14:19,610 +a much better regular +expression matching engine, + +284 +00:14:19,610 --> 00:14:22,805 +but still we will be running +circles around them. + +285 +00:14:22,805 --> 00:14:27,050 +It's this data. I +call this slide. + +286 +00:14:27,050 --> 00:14:29,675 +Why bother with +regular expressions? + +287 +00:14:29,675 --> 00:14:33,515 +But you can probably +see these are + +288 +00:14:33,515 --> 00:14:34,910 +at least more times by + +289 +00:14:34,910 --> 00:14:38,015 +the existing regular +expression matching engines. + +290 +00:14:38,015 --> 00:14:40,070 +And it's actually +surprising that after + +291 +00:14:40,070 --> 00:14:42,695 +one lecture we can already +do substantially better. + +292 +00:14:42,695 --> 00:14:47,495 +And if you don't believe +in D times, I gave here, + +293 +00:14:47,495 --> 00:14:50,090 +please feel free to +play on your own + +294 +00:14:50,090 --> 00:14:52,865 +with the examples +I uploaded, Keats. + +295 +00:14:52,865 --> 00:14:55,235 +These are exactly the programs + +296 +00:14:55,235 --> 00:14:57,470 +are used here in the examples. + +297 +00:14:57,470 --> 00:14:59,255 +So feel free. + +298 +00:14:59,255 --> 00:15:01,970 +You might however now think, hmm. + +299 +00:15:01,970 --> 00:15:05,449 +These are two very +well chosen examples. + +300 +00:15:05,449 --> 00:15:07,145 +And I admit that's true. + +301 +00:15:07,145 --> 00:15:09,410 +And such problem there never + +302 +00:15:09,410 --> 00:15:12,540 +causing any problems +in real life. + +303 +00:15:13,300 --> 00:15:15,980 +Regular expressions are used very + +304 +00:15:15,980 --> 00:15:19,415 +frequently and they +do cause problems. + +305 +00:15:19,415 --> 00:15:21,410 +So here's my first example from + +306 +00:15:21,410 --> 00:15:23,885 +a company called cloudflare. + +307 +00:15:23,885 --> 00:15:27,560 +This is a huge hosting company + +308 +00:15:27,560 --> 00:15:30,935 +which host very +well-known web pages. + +309 +00:15:30,935 --> 00:15:34,970 +And they really try hard +to have no outage at all. + +310 +00:15:34,970 --> 00:15:37,340 +And they manage +that for six years. + +311 +00:15:37,340 --> 00:15:39,320 +But then a Rekha expression, + +312 +00:15:39,320 --> 00:15:41,180 +actually this one caused +a problem and you + +313 +00:15:41,180 --> 00:15:43,265 +can see they're also +like two stars. + +314 +00:15:43,265 --> 00:15:44,630 +They are at the end. + +315 +00:15:44,630 --> 00:15:46,955 +And because of that string needed + +316 +00:15:46,955 --> 00:15:49,865 +too much time to be matched. + +317 +00:15:49,865 --> 00:15:50,990 +And because of that, + +318 +00:15:50,990 --> 00:15:52,430 +they had some outage for, + +319 +00:15:52,430 --> 00:15:54,125 +I think several hours, + +320 +00:15:54,125 --> 00:15:57,920 +actually in their malware +detection subsystem. + +321 +00:15:57,920 --> 00:16:02,060 +And the second example +comes from 2016, + +322 +00:16:02,060 --> 00:16:04,040 +where Stack Exchange, +I guess you know + +323 +00:16:04,040 --> 00:16:06,650 +this webpage had +also an outage from, + +324 +00:16:06,650 --> 00:16:08,390 +I think at least an hour. + +325 +00:16:08,390 --> 00:16:13,070 +Because a regular expression +then needed to format posts, + +326 +00:16:13,070 --> 00:16:15,575 +needed too much time to + +327 +00:16:15,575 --> 00:16:19,010 +recognize whether this post +should be accepted or not. + +328 +00:16:19,010 --> 00:16:23,390 +And again, there was a +semi kind of problem. + +329 +00:16:23,390 --> 00:16:24,950 +And you can read +the stories behind + +330 +00:16:24,950 --> 00:16:28,080 +that on these two given links. + +331 +00:16:28,720 --> 00:16:31,730 +When I looked at +this the first time, + +332 +00:16:31,730 --> 00:16:34,175 +what surprised me is +that theoretician + +333 +00:16:34,175 --> 00:16:37,520 +who sometimes dedicate their +life to regular expression. + +334 +00:16:37,520 --> 00:16:39,440 +And no really a lot about + +335 +00:16:39,440 --> 00:16:41,690 +them didn't know +anything about this. + +336 +00:16:41,690 --> 00:16:43,610 +But engineers, they +already created + +337 +00:16:43,610 --> 00:16:46,160 +a name for that +regular expression, + +338 +00:16:46,160 --> 00:16:47,975 +denial of service attack. + +339 +00:16:47,975 --> 00:16:49,745 +Because what you can, + +340 +00:16:49,745 --> 00:16:51,230 +what can happen now is that + +341 +00:16:51,230 --> 00:16:54,920 +attackers look for +certain strings. + +342 +00:16:54,920 --> 00:16:56,780 +You make your regular expression + +343 +00:16:56,780 --> 00:16:59,105 +matching engine topple over. + +344 +00:16:59,105 --> 00:17:01,370 +And these kind of expressions, + +345 +00:17:01,370 --> 00:17:04,160 +regular expressions called +Eve of reg expression. + +346 +00:17:04,160 --> 00:17:06,350 +And actually there are +quite a number of them. + +347 +00:17:06,350 --> 00:17:08,495 +So you seen this one, + +348 +00:17:08,495 --> 00:17:11,255 +the first one, and the +second one already. + +349 +00:17:11,255 --> 00:17:13,400 +But there are many, many more. + +350 +00:17:13,400 --> 00:17:15,620 +And you can easily have in + +351 +00:17:15,620 --> 00:17:18,560 +your program one of +these reg expression. + +352 +00:17:18,560 --> 00:17:21,830 +And then you have the +problem that if you do have + +353 +00:17:21,830 --> 00:17:23,240 +this regular expression and + +354 +00:17:23,240 --> 00:17:25,640 +somebody finds the +corresponding string, + +355 +00:17:25,640 --> 00:17:29,945 +which make the records +matching engine topple over, + +356 +00:17:29,945 --> 00:17:31,820 +then you have a problem + +357 +00:17:31,820 --> 00:17:34,295 +because your webpage is +probably not variable. + +358 +00:17:34,295 --> 00:17:36,140 +This is also sometimes called + +359 +00:17:36,140 --> 00:17:39,350 +this phenomenon, +catastrophic backtracking. + +360 +00:17:39,350 --> 00:17:43,595 +In lecture three, we will +look at this more carefully. + +361 +00:17:43,595 --> 00:17:46,910 +And actually why that +is such a problem in + +362 +00:17:46,910 --> 00:17:50,795 +real life is actually +not to do with Lexus. + +363 +00:17:50,795 --> 00:17:53,180 +Yes, regular +expressions are used as + +364 +00:17:53,180 --> 00:17:55,040 +the basic tool for implementing + +365 +00:17:55,040 --> 00:17:57,185 +like source bad reg expressions, + +366 +00:17:57,185 --> 00:18:00,065 +of course, used in +a much wider area. + +367 +00:18:00,065 --> 00:18:03,770 +And they especially used for +network intrusion detection. + +368 +00:18:03,770 --> 00:18:06,590 +Remember, you having to + +369 +00:18:06,590 --> 00:18:10,130 +administer a big network +and you only want to let + +370 +00:18:10,130 --> 00:18:13,640 +in packets which you think are K + +371 +00:18:13,640 --> 00:18:14,930 +and you want to keep out + +372 +00:18:14,930 --> 00:18:17,645 +any package which might +hack into your network. + +373 +00:18:17,645 --> 00:18:22,670 +So what they have is they +have suites of thousands and + +374 +00:18:22,670 --> 00:18:25,745 +sometimes even more +regular expressions which + +375 +00:18:25,745 --> 00:18:27,755 +check whether this package + +376 +00:18:27,755 --> 00:18:30,065 +satisfies some patterns or not. + +377 +00:18:30,065 --> 00:18:31,460 +And in this case it will be left + +378 +00:18:31,460 --> 00:18:34,205 +out or it will be let in. + +379 +00:18:34,205 --> 00:18:36,335 +And with networks, + +380 +00:18:36,335 --> 00:18:39,080 +the problem is that our +hardware is already + +381 +00:18:39,080 --> 00:18:43,190 +so fast that the reg expressions + +382 +00:18:43,190 --> 00:18:45,169 +really become a bottleneck. + +383 +00:18:45,169 --> 00:18:47,060 +Because what do you do if now is + +384 +00:18:47,060 --> 00:18:49,880 +suddenly a reg expression +takes too much time + +385 +00:18:49,880 --> 00:18:52,670 +to just stop the matching + +386 +00:18:52,670 --> 00:18:55,100 +and let the package +in regardless? + +387 +00:18:55,100 --> 00:18:58,190 +Or do you just hold +the network up + +388 +00:18:58,190 --> 00:19:01,715 +and don't let anything in +until you decided that. + +389 +00:19:01,715 --> 00:19:04,895 +So that's actually a +really hard problem. + +390 +00:19:04,895 --> 00:19:06,650 +But the first time I came across + +391 +00:19:06,650 --> 00:19:09,965 +that problem was actually +by this engineer. + +392 +00:19:09,965 --> 00:19:13,820 +And it's always say that +Germans don't have any Yammer. + +393 +00:19:13,820 --> 00:19:16,985 +But I found that +video quite funny. + +394 +00:19:16,985 --> 00:19:19,145 +Maybe you have a +different opinion, + +395 +00:19:19,145 --> 00:19:21,095 +but feel free to +have a look which + +396 +00:19:21,095 --> 00:19:23,705 +explains exactly that problem. + +397 +00:19:23,705 --> 00:19:25,610 +So in the next video, + +398 +00:19:25,610 --> 00:19:28,445 +we will start to +implement this matcher. + +399 +00:19:28,445 --> 00:19:30,870 +So I hope to see you there.