|
1 1 |
|
2 00:00:06,240 --> 00:00:11,050 |
|
3 Welcome back. This video |
|
4 is about regular expressions. |
|
5 |
|
6 2 |
|
7 00:00:11,050 --> 00:00:14,230 |
|
8 We want to use regular |
|
9 expressions in our lexer. |
|
10 |
|
11 3 |
|
12 00:00:14,230 --> 00:00:16,165 |
|
13 And the purpose of the |
|
14 lexer is to find |
|
15 |
|
16 4 |
|
17 00:00:16,165 --> 00:00:18,070 |
|
18 out where the words in |
|
19 |
|
20 5 |
|
21 00:00:18,070 --> 00:00:21,070 |
|
22 our programs are. However |
|
23 regular expressions |
|
24 |
|
25 6 |
|
26 00:00:21,070 --> 00:00:23,875 |
|
27 are fundamental tool |
|
28 in computer science. |
|
29 |
|
30 7 |
|
31 00:00:23,875 --> 00:00:27,910 |
|
32 And I'm sure you've used them |
|
33 already on several occasions. |
|
34 |
|
35 8 |
|
36 00:00:27,910 --> 00:00:30,370 |
|
37 And one would expect that about |
|
38 |
|
39 9 |
|
40 00:00:30,370 --> 00:00:31,750 |
|
41 regular expressions since they are |
|
42 |
|
43 10 |
|
44 00:00:31,750 --> 00:00:33,850 |
|
45 so well-known and well studied, |
|
46 |
|
47 11 |
|
48 00:00:33,850 --> 00:00:37,915 |
|
49 that everything under the |
|
50 sun is known about them. |
|
51 |
|
52 12 |
|
53 00:00:37,915 --> 00:00:41,080 |
|
54 But actually there's |
|
55 still some surprising |
|
56 |
|
57 13 |
|
58 00:00:41,080 --> 00:00:44,465 |
|
59 and interesting |
|
60 problems with them. |
|
61 |
|
62 14 |
|
63 00:00:44,465 --> 00:00:47,945 |
|
64 And I want to show you |
|
65 them in this video. |
|
66 |
|
67 15 |
|
68 00:00:47,945 --> 00:00:50,720 |
|
69 I'm sure you've seen |
|
70 regular expressions |
|
71 |
|
72 16 |
|
73 00:00:50,720 --> 00:00:52,445 |
|
74 many, many times before. |
|
75 |
|
76 17 |
|
77 00:00:52,445 --> 00:00:55,100 |
|
78 But just to be on the same page, |
|
79 |
|
80 18 |
|
81 00:00:55,100 --> 00:00:57,110 |
|
82 let me just recap them. |
|
83 |
|
84 19 |
|
85 00:00:57,110 --> 00:00:59,210 |
|
86 So here in this line, |
|
87 |
|
88 20 |
|
89 00:00:59,210 --> 00:01:01,790 |
|
90 there is a regular expression |
|
91 which is supposed to |
|
92 |
|
93 21 |
|
94 00:01:01,790 --> 00:01:05,285 |
|
95 recognize some form |
|
96 of email addresses. |
|
97 |
|
98 22 |
|
99 00:01:05,285 --> 00:01:07,745 |
|
100 So an e-mail address |
|
101 |
|
102 23 |
|
103 00:01:07,745 --> 00:01:11,000 |
|
104 has part which is |
|
105 before the @ symbol, |
|
106 |
|
107 24 |
|
108 00:01:11,000 --> 00:01:13,400 |
|
109 which is the name of the person. |
|
110 |
|
111 25 |
|
112 00:01:13,400 --> 00:01:16,880 |
|
113 And that can be |
|
114 any number between |
|
115 |
|
116 26 |
|
117 00:01:16,880 --> 00:01:20,195 |
|
118 0 and 9, and letters between a and z. |
|
119 |
|
120 27 |
|
121 00:01:20,195 --> 00:01:24,155 |
|
122 Let's say we avoiding |
|
123 here capital letters. |
|
124 |
|
125 28 |
|
126 00:01:24,155 --> 00:01:26,045 |
|
127 There can be underscores. |
|
128 |
|
129 29 |
|
130 00:01:26,045 --> 00:01:29,405 |
|
131 There can be a dot and |
|
132 there can be hyphens. |
|
133 |
|
134 30 |
|
135 00:01:29,405 --> 00:01:35,390 |
|
136 And after the @ symbol |
|
137 comes the domain name. |
|
138 |
|
139 31 |
|
140 00:01:35,390 --> 00:01:37,310 |
|
141 So as you can see here, |
|
142 |
|
143 32 |
|
144 00:01:37,310 --> 00:01:40,640 |
|
145 we use things like star to |
|
146 |
|
147 33 |
|
148 00:01:40,640 --> 00:01:44,314 |
|
149 match letters |
|
150 zero or more times. |
|
151 |
|
152 34 |
|
153 00:01:44,314 --> 00:01:45,985 |
|
154 Or we have a plus, |
|
155 |
|
156 35 |
|
157 00:01:45,985 --> 00:01:47,420 |
|
158 which means you have to match |
|
159 |
|
160 36 |
|
161 00:01:47,420 --> 00:01:52,489 |
|
162 at least once or more |
|
163 times. Then we have. |
|
164 |
|
165 37 |
|
166 00:01:52,489 --> 00:01:55,790 |
|
167 question mark, which says you |
|
168 |
|
169 38 |
|
170 00:01:55,790 --> 00:01:59,105 |
|
171 match either it is there |
|
172 or it ss not there. |
|
173 |
|
174 39 |
|
175 00:01:59,105 --> 00:02:01,340 |
|
176 You are also regular |
|
177 expressions which |
|
178 |
|
179 40 |
|
180 00:02:01,340 --> 00:02:03,755 |
|
181 match exactly n-times. |
|
182 |
|
183 41 |
|
184 00:02:03,755 --> 00:02:08,720 |
|
185 Or this is a regular expression |
|
186 for between n and m times. |
|
187 |
|
188 42 |
|
189 00:02:08,720 --> 00:02:12,065 |
|
190 You can see in |
|
191 this email address, |
|
192 |
|
193 43 |
|
194 00:02:12,065 --> 00:02:13,730 |
|
195 the top-level domain |
|
196 |
|
197 44 |
|
198 00:02:13,730 --> 00:02:16,130 |
|
199 name can be any letter |
|
200 |
|
201 45 |
|
202 00:02:16,130 --> 00:02:19,265 |
|
203 between a to z, |
|
204 and contain dots, |
|
205 |
|
206 46 |
|
207 00:02:19,265 --> 00:02:22,340 |
|
208 but can only be two |
|
209 characters long |
|
210 |
|
211 47 |
|
212 00:02:22,340 --> 00:02:25,685 |
|
213 up till six characters |
|
214 and not more. |
|
215 |
|
216 48 |
|
217 00:02:25,685 --> 00:02:29,240 |
|
218 Then you also have |
|
219 something like ranges. |
|
220 |
|
221 49 |
|
222 00:02:29,240 --> 00:02:31,220 |
|
223 So you can see, letters between a |
|
224 |
|
225 50 |
|
226 00:02:31,220 --> 00:02:33,635 |
|
227 and z and 0 to 9 and so on. |
|
228 |
|
229 51 |
|
230 00:02:33,635 --> 00:02:36,545 |
|
231 Here you also have regular |
|
232 expression which can |
|
233 |
|
234 52 |
|
235 00:02:36,545 --> 00:02:40,070 |
|
236 match something which |
|
237 isn't in this range. |
|
238 |
|
239 53 |
|
240 00:02:40,070 --> 00:02:42,560 |
|
241 So for example, if |
|
242 you want for example match, |
|
243 |
|
244 54 |
|
245 00:02:42,560 --> 00:02:44,030 |
|
246 letters but not numbers, |
|
247 |
|
248 55 |
|
249 00:02:44,030 --> 00:02:45,800 |
|
250 you would say, well, if |
|
251 |
|
252 56 |
|
253 00:02:45,800 --> 00:02:48,990 |
|
254 this is a number that |
|
255 should not match. |
|
256 |
|
257 57 |
|
258 00:02:49,090 --> 00:02:52,804 |
|
259 Typically you also |
|
260 have these ranges. |
|
261 |
|
262 58 |
|
263 00:02:52,804 --> 00:02:55,565 |
|
264 Lowercase letters, |
|
265 capital letters. |
|
266 |
|
267 59 |
|
268 00:02:55,565 --> 00:02:58,550 |
|
269 Then you have some |
|
270 special regular expressions |
|
271 |
|
272 60 |
|
273 00:02:58,550 --> 00:03:02,195 |
|
274 like this one is only |
|
275 supposed to match digits. |
|
276 |
|
277 61 |
|
278 00:03:02,195 --> 00:03:05,674 |
|
279 A dot is supposed to |
|
280 match any character. |
|
281 |
|
282 62 |
|
283 00:03:05,674 --> 00:03:07,370 |
|
284 And then they have also something |
|
285 |
|
286 63 |
|
287 00:03:07,370 --> 00:03:09,800 |
|
288 called groups which |
|
289 is supposed to be |
|
290 |
|
291 64 |
|
292 00:03:09,800 --> 00:03:12,799 |
|
293 used when you are |
|
294 trying to extract |
|
295 |
|
296 65 |
|
297 00:03:12,799 --> 00:03:15,605 |
|
298 a string you've matched. |
|
299 |
|
300 66 |
|
301 00:03:15,605 --> 00:03:19,925 |
|
302 Okay, so these are the |
|
303 typical regular expressions. |
|
304 |
|
305 67 |
|
306 00:03:19,925 --> 00:03:23,075 |
|
307 And here's a particular one. |
|
308 |
|
309 68 |
|
310 00:03:23,075 --> 00:03:25,820 |
|
311 Trying to match something |
|
312 |
|
313 69 |
|
314 00:03:25,820 --> 00:03:28,770 |
|
315 which resembles |
|
316 an email address. |
|
317 |
|
318 70 |
|
319 00:03:29,590 --> 00:03:33,065 |
|
320 Clearly that should be all easy. |
|
321 |
|
322 71 |
|
323 00:03:33,065 --> 00:03:36,230 |
|
324 And our technology should |
|
325 be on top of that. |
|
326 |
|
327 72 |
|
328 00:03:36,230 --> 00:03:37,865 |
|
329 That we can take a |
|
330 |
|
331 73 |
|
332 00:03:37,865 --> 00:03:41,015 |
|
333 regular expressions and |
|
334 we can take a string, |
|
335 |
|
336 74 |
|
337 00:03:41,015 --> 00:03:43,340 |
|
338 and we should have programs to |
|
339 |
|
340 75 |
|
341 00:03:43,340 --> 00:03:45,680 |
|
342 decide whether this |
|
343 string is matched |
|
344 |
|
345 76 |
|
346 00:03:45,680 --> 00:03:50,330 |
|
347 by a regular expression or |
|
348 not and should be easy-peasy, no? |
|
349 |
|
350 77 |
|
351 00:03:50,330 --> 00:03:56,150 |
|
352 Well, let's have a |
|
353 look at two examples. |
|
354 |
|
355 78 |
|
356 00:03:56,150 --> 00:04:00,860 |
|
357 The first regular expression |
|
358 is a star star b. |
|
359 |
|
360 79 |
|
361 00:04:00,860 --> 00:04:02,990 |
|
362 And it is supposed |
|
363 to match strings of |
|
364 |
|
365 80 |
|
366 00:04:02,990 --> 00:04:05,825 |
|
367 the form 0 or more a's, |
|
368 |
|
369 81 |
|
370 00:04:05,825 --> 00:04:10,385 |
|
371 followed by a b. The parentheses |
|
372 you can ignore. |
|
373 |
|
374 82 |
|
375 00:04:10,385 --> 00:04:11,990 |
|
376 And a star star |
|
377 |
|
378 83 |
|
379 00:04:11,990 --> 00:04:14,120 |
|
380 also doesn't |
|
381 make any difference |
|
382 |
|
383 84 |
|
384 00:04:14,120 --> 00:04:16,505 |
|
385 to what kind of strings |
|
386 that can be matched. |
|
387 |
|
388 85 |
|
389 00:04:16,505 --> 00:04:21,635 |
|
390 It can only make 0 more |
|
391 a's followed by a b. |
|
392 |
|
393 86 |
|
394 00:04:21,635 --> 00:04:23,900 |
|
395 And the other regular expression |
|
396 |
|
397 87 |
|
398 00:04:23,900 --> 00:04:26,990 |
|
399 is possibly a character a, |
|
400 |
|
401 88 |
|
402 00:04:26,990 --> 00:04:32,930 |
|
403 n times, followed by character |
|
404 a axactly n-times. |
|
405 |
|
406 89 |
|
407 00:04:32,930 --> 00:04:35,570 |
|
408 And we will try out |
|
409 |
|
410 90 |
|
411 00:04:35,570 --> 00:04:38,360 |
|
412 these two regular expressions |
|
413 with strings of the form a, |
|
414 |
|
415 91 |
|
416 00:04:38,360 --> 00:04:39,890 |
|
417 aa, and so on, |
|
418 |
|
419 92 |
|
420 00:04:39,890 --> 00:04:45,770 |
|
421 and up to the length of n. And |
|
422 |
|
423 93 |
|
424 00:04:45,770 --> 00:04:49,130 |
|
425 this regular expression should |
|
426 actually not match any of |
|
427 |
|
428 94 |
|
429 00:04:49,130 --> 00:04:53,315 |
|
430 the strings because the |
|
431 final b is missing. |
|
432 |
|
433 95 |
|
434 00:04:53,315 --> 00:04:56,150 |
|
435 But that is |
|
436 okay. For example |
|
437 |
|
438 96 |
|
439 00:04:56,150 --> 00:04:57,425 |
|
440 if you have a regular expression |
|
441 |
|
442 97 |
|
443 00:04:57,425 --> 00:05:00,110 |
|
444 that is supposed to |
|
445 check whether a string is |
|
446 |
|
447 98 |
|
448 00:05:00,110 --> 00:05:01,490 |
|
449 an email address and the user |
|
450 |
|
451 99 |
|
452 00:05:01,490 --> 00:05:03,380 |
|
453 gives some random |
|
454 strings in there, |
|
455 |
|
456 100 |
|
457 00:05:03,380 --> 00:05:06,545 |
|
458 then this regular expression |
|
459 should not match that string. |
|
460 |
|
461 101 |
|
462 00:05:06,545 --> 00:05:08,420 |
|
463 And for this regular expression |
|
464 |
|
465 102 |
|
466 00:05:08,420 --> 00:05:11,195 |
|
467 you have to scratch a |
|
468 little bit of your head, |
|
469 |
|
470 103 |
|
471 00:05:11,195 --> 00:05:12,620 |
|
472 what it can actually match. |
|
473 |
|
474 104 |
|
475 00:05:12,620 --> 00:05:14,720 |
|
476 But after a little bit |
|
477 of head scratching, |
|
478 |
|
479 105 |
|
480 00:05:14,720 --> 00:05:18,260 |
|
481 you find out can match |
|
482 any string which is of |
|
483 |
|
484 106 |
|
485 00:05:18,260 --> 00:05:22,580 |
|
486 the length n a's up |
|
487 to 2n of a's. |
|
488 |
|
489 107 |
|
490 00:05:22,580 --> 00:05:24,290 |
|
491 So anything in this range, |
|
492 |
|
493 108 |
|
494 00:05:24,290 --> 00:05:27,185 |
|
495 this regular expression |
|
496 can actually match. |
|
497 |
|
498 109 |
|
499 00:05:27,185 --> 00:05:30,395 |
|
500 Okay, let's |
|
501 take a random tool, |
|
502 |
|
503 110 |
|
504 00:05:30,395 --> 00:05:32,630 |
|
505 maybe for example Python. |
|
506 |
|
507 111 |
|
508 00:05:32,630 --> 00:05:35,240 |
|
509 So here's a little |
|
510 Python program. |
|
511 |
|
512 112 |
|
513 00:05:35,240 --> 00:05:38,690 |
|
514 It uses the library |
|
515 function of Python to |
|
516 |
|
517 113 |
|
518 00:05:38,690 --> 00:05:42,935 |
|
519 match the regular expressions of |
|
520 a star star b. |
|
521 |
|
522 114 |
|
523 00:05:42,935 --> 00:05:46,805 |
|
524 And we measure time with longer |
|
525 and longer strings of a. |
|
526 |
|
527 115 |
|
528 00:05:46,805 --> 00:05:48,770 |
|
529 And so conveniently we can give |
|
530 |
|
531 116 |
|
532 00:05:48,770 --> 00:05:51,140 |
|
533 the number of a's here |
|
534 on the command line. |
|
535 |
|
536 117 |
|
537 00:05:51,140 --> 00:05:56,900 |
|
538 If I just call |
|
539 this on the command line, |
|
540 |
|
541 118 |
|
542 00:05:56,900 --> 00:05:59,900 |
|
543 Let's say we first |
|
544 start with five a's. |
|
545 |
|
546 119 |
|
547 00:05:59,900 --> 00:06:03,920 |
|
548 And I get also the times which |
|
549 in this case is next to nothing. |
|
550 |
|
551 120 |
|
552 00:06:03,920 --> 00:06:05,960 |
|
553 And here's the string |
|
554 we just matched. |
|
555 |
|
556 121 |
|
557 00:06:05,960 --> 00:06:07,640 |
|
558 And obviously the |
|
559 regular expression |
|
560 |
|
561 122 |
|
562 00:06:07,640 --> 00:06:09,110 |
|
563 did not match the string. |
|
564 |
|
565 123 |
|
566 00:06:09,110 --> 00:06:11,255 |
|
567 That's indicated by this none. |
|
568 |
|
569 124 |
|
570 00:06:11,255 --> 00:06:13,925 |
|
571 Let's take ten a's. |
|
572 |
|
573 125 |
|
574 00:06:13,925 --> 00:06:16,490 |
|
575 It's also pretty quick. |
|
576 |
|
577 126 |
|
578 00:06:16,490 --> 00:06:20,780 |
|
579 Fifteen a's, even quicker, |
|
580 |
|
581 127 |
|
582 00:06:20,780 --> 00:06:23,180 |
|
583 but these times always need to |
|
584 |
|
585 128 |
|
586 00:06:23,180 --> 00:06:25,820 |
|
587 be taken with a grain of salt. |
|
588 |
|
589 129 |
|
590 00:06:25,820 --> 00:06:28,040 |
|
591 They are not 100 |
|
592 percent accurate. |
|
593 |
|
594 130 |
|
595 00:06:28,040 --> 00:06:31,490 |
|
596 So 15 is also a let's take |
|
597 |
|
598 131 |
|
599 00:06:31,490 --> 00:06:36,965 |
|
600 28th notes already |
|
601 double the time. |
|
602 |
|
603 132 |
|
604 00:06:36,965 --> 00:06:42,440 |
|
605 Twenty-five longer. |
|
606 |
|
607 133 |
|
608 00:06:42,440 --> 00:06:45,680 |
|
609 Okay, that suddenly |
|
610 from 02 seconds, |
|
611 |
|
612 134 |
|
613 00:06:45,680 --> 00:06:48,960 |
|
614 it takes almost four seconds. |
|
615 |
|
616 135 |
|
617 00:06:49,600 --> 00:06:54,890 |
|
618 Six this |
|
619 |
|
620 136 |
|
621 00:06:54,890 --> 00:07:01,415 |
|
622 takes six seconds |
|
623 already Double, okay? |
|
624 |
|
625 137 |
|
626 00:07:01,415 --> 00:07:07,229 |
|
627 Go to 28. That would be now. |
|
628 |
|
629 138 |
|
630 00:07:08,890 --> 00:07:11,840 |
|
631 You see the string |
|
632 isn't very long, |
|
633 |
|
634 139 |
|
635 00:07:11,840 --> 00:07:13,340 |
|
636 so that could be easily like |
|
637 |
|
638 140 |
|
639 00:07:13,340 --> 00:07:16,070 |
|
640 just the size of |
|
641 an email address. |
|
642 |
|
643 141 |
|
644 00:07:16,070 --> 00:07:19,280 |
|
645 And the regular |
|
646 expression matching |
|
647 |
|
648 142 |
|
649 00:07:19,280 --> 00:07:22,550 |
|
650 engine in Python needs |
|
651 quite a long time |
|
652 |
|
653 143 |
|
654 00:07:22,550 --> 00:07:24,710 |
|
655 to find out that |
|
656 this string of 28 |
|
657 |
|
658 144 |
|
659 00:07:24,710 --> 00:07:26,570 |
|
660 AES is actually not much |
|
661 |
|
662 145 |
|
663 00:07:26,570 --> 00:07:28,490 |
|
664 by that you see it's |
|
665 still not finished. |
|
666 |
|
667 146 |
|
668 00:07:28,490 --> 00:07:32,900 |
|
669 I think it should take |
|
670 approximately like 20 seconds. |
|
671 |
|
672 147 |
|
673 00:07:32,900 --> 00:07:34,400 |
|
674 Okay. Already 30. |
|
675 |
|
676 148 |
|
677 00:07:34,400 --> 00:07:36,530 |
|
678 And if we would try |
|
679 |
|
680 149 |
|
681 00:07:36,530 --> 00:07:40,805 |
|
682 30 would be already |
|
683 more than a minute. |
|
684 |
|
685 150 |
|
686 00:07:40,805 --> 00:07:43,940 |
|
687 And if I could read |
|
688 something like hundreds, |
|
689 |
|
690 151 |
|
691 00:07:43,940 --> 00:07:46,220 |
|
692 you remember if a doubling in |
|
693 |
|
694 152 |
|
695 00:07:46,220 --> 00:07:48,770 |
|
696 each step or the second step, |
|
697 |
|
698 153 |
|
699 00:07:48,770 --> 00:07:50,720 |
|
700 the story with the chess board, |
|
701 |
|
702 154 |
|
703 00:07:50,720 --> 00:07:53,855 |
|
704 we probably would sit here |
|
705 until the next century. |
|
706 |
|
707 155 |
|
708 00:07:53,855 --> 00:07:56,820 |
|
709 So something strange here. |
|
710 |
|
711 156 |
|
712 00:07:57,580 --> 00:08:01,355 |
|
713 Okay, that might be just |
|
714 a problem of Python. |
|
715 |
|
716 157 |
|
717 00:08:01,355 --> 00:08:02,990 |
|
718 Let's have a look at another |
|
719 |
|
720 158 |
|
721 00:08:02,990 --> 00:08:04,985 |
|
722 regular expression |
|
723 matching engine. |
|
724 |
|
725 159 |
|
726 00:08:04,985 --> 00:08:06,890 |
|
727 This time from JavaScript, |
|
728 |
|
729 160 |
|
730 00:08:06,890 --> 00:08:10,040 |
|
731 also are pretty well-known |
|
732 programming language. |
|
733 |
|
734 161 |
|
735 00:08:10,040 --> 00:08:13,610 |
|
736 So here you can see |
|
737 it's still a star, |
|
738 |
|
739 162 |
|
740 00:08:13,610 --> 00:08:16,235 |
|
741 star followed by b, |
|
742 |
|
743 163 |
|
744 00:08:16,235 --> 00:08:18,920 |
|
745 by direct expression is |
|
746 supposed to match that from |
|
747 |
|
748 164 |
|
749 00:08:18,920 --> 00:08:21,830 |
|
750 the beginning of the |
|
751 string up till the end. |
|
752 |
|
753 165 |
|
754 00:08:21,830 --> 00:08:23,930 |
|
755 So there's not any difference |
|
756 |
|
757 166 |
|
758 00:08:23,930 --> 00:08:26,150 |
|
759 in the strings this work |
|
760 expression matches. |
|
761 |
|
762 167 |
|
763 00:08:26,150 --> 00:08:28,610 |
|
764 We'll just start at the |
|
765 beginning of the string |
|
766 |
|
767 168 |
|
768 00:08:28,610 --> 00:08:31,460 |
|
769 and finish at the |
|
770 end of the string. |
|
771 |
|
772 169 |
|
773 00:08:31,460 --> 00:08:35,285 |
|
774 And we again, we just use |
|
775 repeated A's for that. |
|
776 |
|
777 170 |
|
778 00:08:35,285 --> 00:08:38,195 |
|
779 And similarly, we can |
|
780 |
|
781 171 |
|
782 00:08:38,195 --> 00:08:41,930 |
|
783 call it on the command line |
|
784 and can do some timing. |
|
785 |
|
786 172 |
|
787 00:08:41,930 --> 00:08:44,540 |
|
788 So ten SBA, good. |
|
789 |
|
790 173 |
|
791 00:08:44,540 --> 00:08:46,340 |
|
792 Here's the string. |
|
793 |
|
794 174 |
|
795 00:08:46,340 --> 00:08:48,320 |
|
796 It cannot match that string. |
|
797 |
|
798 175 |
|
799 00:08:48,320 --> 00:08:50,525 |
|
800 And it's pretty fast. |
|
801 |
|
802 176 |
|
803 00:08:50,525 --> 00:08:54,725 |
|
804 Friendly. Although pretty fast. |
|
805 |
|
806 177 |
|
807 00:08:54,725 --> 00:08:59,120 |
|
808 Five, again, |
|
809 |
|
810 178 |
|
811 00:08:59,120 --> 00:09:06,650 |
|
812 somehow is kind of |
|
813 threshold that is 25, 26. |
|
814 |
|
815 179 |
|
816 00:09:06,650 --> 00:09:09,485 |
|
817 Suddenly it takes much longer. |
|
818 |
|
819 180 |
|
820 00:09:09,485 --> 00:09:14,360 |
|
821 And it has essentially the |
|
822 same problem as with Python. |
|
823 |
|
824 181 |
|
825 00:09:14,360 --> 00:09:17,165 |
|
826 So you'll see in now from 26 on, |
|
827 |
|
828 182 |
|
829 00:09:17,165 --> 00:09:19,250 |
|
830 the Times has always |
|
831 doubling from |
|
832 |
|
833 183 |
|
834 00:09:19,250 --> 00:09:21,860 |
|
835 three seconds to seven seconds. |
|
836 |
|
837 184 |
|
838 00:09:21,860 --> 00:09:23,330 |
|
839 So you can imagine what that |
|
840 |
|
841 185 |
|
842 00:09:23,330 --> 00:09:24,890 |
|
843 roughly takes when I put your |
|
844 |
|
845 186 |
|
846 00:09:24,890 --> 00:09:30,230 |
|
847 27 and you see the |
|
848 string isn't very long. |
|
849 |
|
850 187 |
|
851 00:09:30,230 --> 00:09:32,165 |
|
852 Let's choose twenties or maize. |
|
853 |
|
854 188 |
|
855 00:09:32,165 --> 00:09:35,419 |
|
856 Imagine you have to |
|
857 search a database |
|
858 |
|
859 189 |
|
860 00:09:35,419 --> 00:09:38,720 |
|
861 with kilobytes of data. |
|
862 |
|
863 190 |
|
864 00:09:38,720 --> 00:09:42,260 |
|
865 This, these regular |
|
866 expressions that would years |
|
867 |
|
868 191 |
|
869 00:09:42,260 --> 00:09:48,150 |
|
870 need years to go through with |
|
871 these regular expressions. |
|
872 |
|
873 192 |
|
874 00:09:48,630 --> 00:09:51,850 |
|
875 Okay, maybe the people in |
|
876 |
|
877 193 |
|
878 00:09:51,850 --> 00:09:55,435 |
|
879 Python and JavaScript, |
|
880 they're just idiots. |
|
881 |
|
882 194 |
|
883 00:09:55,435 --> 00:09:58,180 |
|
884 Surely Java must do much better. |
|
885 |
|
886 195 |
|
887 00:09:58,180 --> 00:10:01,045 |
|
888 So here's a program. |
|
889 |
|
890 196 |
|
891 00:10:01,045 --> 00:10:03,415 |
|
892 You can see this again |
|
893 |
|
894 197 |
|
895 00:10:03,415 --> 00:10:05,980 |
|
896 is the reg expression |
|
897 and we just having |
|
898 |
|
899 198 |
|
900 00:10:05,980 --> 00:10:08,320 |
|
901 some scaffolding to generate |
|
902 |
|
903 199 |
|
904 00:10:08,320 --> 00:10:11,905 |
|
905 strings from five up till 28. |
|
906 |
|
907 200 |
|
908 00:10:11,905 --> 00:10:14,305 |
|
909 And if we run that, |
|
910 |
|
911 201 |
|
912 00:10:14,305 --> 00:10:16,660 |
|
913 actually does that automatically. |
|
914 |
|
915 202 |
|
916 00:10:16,660 --> 00:10:19,900 |
|
917 So uphill 19, pretty fast, |
|
918 |
|
919 203 |
|
920 00:10:19,900 --> 00:10:24,925 |
|
921 but then starting from |
|
922 23, skidding pretty slow. |
|
923 |
|
924 204 |
|
925 00:10:24,925 --> 00:10:27,445 |
|
926 So the question is |
|
927 what's going on? |
|
928 |
|
929 205 |
|
930 00:10:27,445 --> 00:10:29,230 |
|
931 By the way, I'm not quoting here. |
|
932 |
|
933 206 |
|
934 00:10:29,230 --> 00:10:33,755 |
|
935 Scala, using internally |
|
936 the regular expression |
|
937 |
|
938 207 |
|
939 00:10:33,755 --> 00:10:36,665 |
|
940 matching engine from Java. |
|
941 |
|
942 208 |
|
943 00:10:36,665 --> 00:10:39,065 |
|
944 So would have exactly |
|
945 the same problem. |
|
946 |
|
947 209 |
|
948 00:10:39,065 --> 00:10:41,480 |
|
949 Also, I have been |
|
950 here very careful, |
|
951 |
|
952 210 |
|
953 00:10:41,480 --> 00:10:43,550 |
|
954 I'm using here Scala aid, |
|
955 |
|
956 211 |
|
957 00:10:43,550 --> 00:10:46,085 |
|
958 which nowadays is quite old. |
|
959 |
|
960 212 |
|
961 00:10:46,085 --> 00:10:50,765 |
|
962 But you will see also |
|
963 current Java versions. |
|
964 |
|
965 213 |
|
966 00:10:50,765 --> 00:10:55,490 |
|
967 We will see we can out-compete |
|
968 them by magnitudes. |
|
969 |
|
970 214 |
|
971 00:10:55,490 --> 00:10:57,605 |
|
972 So I think I can that. |
|
973 |
|
974 215 |
|
975 00:10:57,605 --> 00:10:59,165 |
|
976 Now, just finish here. |
|
977 |
|
978 216 |
|
979 00:10:59,165 --> 00:11:04,025 |
|
980 You see the problem. Just |
|
981 for completeness sake. |
|
982 |
|
983 217 |
|
984 00:11:04,025 --> 00:11:07,010 |
|
985 Here is a Ruby program. |
|
986 |
|
987 218 |
|
988 00:11:07,010 --> 00:11:09,935 |
|
989 This is using the other |
|
990 regular expression. |
|
991 |
|
992 219 |
|
993 00:11:09,935 --> 00:11:12,935 |
|
994 In this case the |
|
995 string should match. |
|
996 |
|
997 220 |
|
998 00:11:12,935 --> 00:11:20,300 |
|
999 And again it tries out |
|
1000 strings between 130 here. |
|
1001 |
|
1002 221 |
|
1003 00:11:20,300 --> 00:11:23,450 |
|
1004 That's a program actually |
|
1005 a former student produced. |
|
1006 |
|
1007 222 |
|
1008 00:11:23,450 --> 00:11:25,565 |
|
1009 And you can see four a's |
|
1010 |
|
1011 223 |
|
1012 00:11:25,565 --> 00:11:29,780 |
|
1013 of links up till 20 |
|
1014 AES is pretty fast. |
|
1015 |
|
1016 224 |
|
1017 00:11:29,780 --> 00:11:32,495 |
|
1018 But then starting at 26, |
|
1019 |
|
1020 225 |
|
1021 00:11:32,495 --> 00:11:35,285 |
|
1022 it's getting really slow. |
|
1023 |
|
1024 226 |
|
1025 00:11:35,285 --> 00:11:37,100 |
|
1026 So in this case, |
|
1027 remember the string |
|
1028 |
|
1029 227 |
|
1030 00:11:37,100 --> 00:11:38,870 |
|
1031 is actually matched by |
|
1032 the regular expression. |
|
1033 |
|
1034 228 |
|
1035 00:11:38,870 --> 00:11:40,130 |
|
1036 So it has nothing to do |
|
1037 |
|
1038 229 |
|
1039 00:11:40,130 --> 00:11:41,540 |
|
1040 with a regular |
|
1041 expression actually |
|
1042 |
|
1043 230 |
|
1044 00:11:41,540 --> 00:11:45,485 |
|
1045 matches a string or does |
|
1046 not match a string. |
|
1047 |
|
1048 231 |
|
1049 00:11:45,485 --> 00:11:48,260 |
|
1050 I admit though these |
|
1051 regular expressions |
|
1052 |
|
1053 232 |
|
1054 00:11:48,260 --> 00:11:49,610 |
|
1055 are carefully chosen, |
|
1056 |
|
1057 233 |
|
1058 00:11:49,610 --> 00:11:52,250 |
|
1059 as you will see later on. |
|
1060 |
|
1061 234 |
|
1062 00:11:52,250 --> 00:11:55,620 |
|
1063 Hey, I also just stop that here. |
|
1064 |
|
1065 235 |
|
1066 00:11:55,710 --> 00:12:00,985 |
|
1067 Okay, this slight collect |
|
1068 this information about times. |
|
1069 |
|
1070 236 |
|
1071 00:12:00,985 --> 00:12:03,400 |
|
1072 On the right hand side will |
|
1073 |
|
1074 237 |
|
1075 00:12:03,400 --> 00:12:05,860 |
|
1076 be our regular expression mantra, |
|
1077 |
|
1078 238 |
|
1079 00:12:05,860 --> 00:12:08,290 |
|
1080 which we implement next week. |
|
1081 |
|
1082 239 |
|
1083 00:12:08,290 --> 00:12:10,795 |
|
1084 On the left-hand side, |
|
1085 are these times by |
|
1086 |
|
1087 240 |
|
1088 00:12:10,795 --> 00:12:14,260 |
|
1089 barriers than regular |
|
1090 expression matching engines? |
|
1091 |
|
1092 241 |
|
1093 00:12:14,260 --> 00:12:17,809 |
|
1094 On the top is this |
|
1095 regular expression. |
|
1096 |
|
1097 242 |
|
1098 00:12:19,080 --> 00:12:23,335 |
|
1099 Possible a n times a n times. |
|
1100 |
|
1101 243 |
|
1102 00:12:23,335 --> 00:12:26,890 |
|
1103 And on the lowest |
|
1104 is a star, star b. |
|
1105 |
|
1106 244 |
|
1107 00:12:26,890 --> 00:12:30,370 |
|
1108 And the x-axis show here |
|
1109 |
|
1110 245 |
|
1111 00:12:30,370 --> 00:12:35,335 |
|
1112 the length of the |
|
1113 string. How many a's. |
|
1114 |
|
1115 246 |
|
1116 00:12:35,335 --> 00:12:38,925 |
|
1117 And on the y axis is the time. |
|
1118 |
|
1119 247 |
|
1120 00:12:38,925 --> 00:12:41,660 |
|
1121 They need to decide whether |
|
1122 |
|
1123 248 |
|
1124 00:12:41,660 --> 00:12:44,615 |
|
1125 the string is matched by |
|
1126 the rate expression or not. |
|
1127 |
|
1128 249 |
|
1129 00:12:44,615 --> 00:12:46,415 |
|
1130 So you can see here, Python, |
|
1131 |
|
1132 250 |
|
1133 00:12:46,415 --> 00:12:47,945 |
|
1134 Java eight in JavaScript, |
|
1135 |
|
1136 251 |
|
1137 00:12:47,945 --> 00:12:52,250 |
|
1138 they max out approximately |
|
1139 at between 2530. |
|
1140 |
|
1141 252 |
|
1142 00:12:52,250 --> 00:12:53,900 |
|
1143 The kristin, it takes already |
|
1144 |
|
1145 253 |
|
1146 00:12:53,900 --> 00:12:55,160 |
|
1147 a half a minute to |
|
1148 |
|
1149 254 |
|
1150 00:12:55,160 --> 00:12:57,410 |
|
1151 decide whether the string |
|
1152 is matched or not. |
|
1153 |
|
1154 255 |
|
1155 00:12:57,410 --> 00:13:00,815 |
|
1156 And similarly, in |
|
1157 the other example, |
|
1158 |
|
1159 256 |
|
1160 00:13:00,815 --> 00:13:03,830 |
|
1161 Python and derived Ruby max out |
|
1162 |
|
1163 257 |
|
1164 00:13:03,830 --> 00:13:07,220 |
|
1165 at a similar kind of |
|
1166 length of the strings. |
|
1167 |
|
1168 258 |
|
1169 00:13:07,220 --> 00:13:10,400 |
|
1170 Because then they use also |
|
1171 half a minute to decide |
|
1172 |
|
1173 259 |
|
1174 00:13:10,400 --> 00:13:13,940 |
|
1175 whether this rec expression |
|
1176 actually matches the string. |
|
1177 |
|
1178 260 |
|
1179 00:13:13,940 --> 00:13:16,790 |
|
1180 Contrast that with |
|
1181 the reg expression |
|
1182 |
|
1183 261 |
|
1184 00:13:16,790 --> 00:13:19,235 |
|
1185 which we are regular |
|
1186 expression mantra, |
|
1187 |
|
1188 262 |
|
1189 00:13:19,235 --> 00:13:21,470 |
|
1190 which we're going to implement. |
|
1191 |
|
1192 263 |
|
1193 00:13:21,470 --> 00:13:25,040 |
|
1194 This can match |
|
1195 approximately 10 thousand |
|
1196 |
|
1197 264 |
|
1198 00:13:25,040 --> 00:13:30,065 |
|
1199 a's in this example and |
|
1200 needs less than ten seconds. |
|
1201 |
|
1202 265 |
|
1203 00:13:30,065 --> 00:13:32,285 |
|
1204 Actually, there will be |
|
1205 two versions of that. |
|
1206 |
|
1207 266 |
|
1208 00:13:32,285 --> 00:13:34,850 |
|
1209 First version may be |
|
1210 also relatively slow. |
|
1211 |
|
1212 267 |
|
1213 00:13:34,850 --> 00:13:36,410 |
|
1214 But the second version, |
|
1215 |
|
1216 268 |
|
1217 00:13:36,410 --> 00:13:38,240 |
|
1218 in contrast to Python, |
|
1219 |
|
1220 269 |
|
1221 00:13:38,240 --> 00:13:40,295 |
|
1222 Ruby, we'll be blindingly fast. |
|
1223 |
|
1224 270 |
|
1225 00:13:40,295 --> 00:13:42,380 |
|
1226 And in the second example, |
|
1227 |
|
1228 271 |
|
1229 00:13:42,380 --> 00:13:45,740 |
|
1230 you have to be careful |
|
1231 about the x axis because |
|
1232 |
|
1233 272 |
|
1234 00:13:45,740 --> 00:13:49,385 |
|
1235 that means four times |
|
1236 ten to the power six. |
|
1237 |
|
1238 273 |
|
1239 00:13:49,385 --> 00:13:51,695 |
|
1240 It's actually 4 million A's. |
|
1241 |
|
1242 274 |
|
1243 00:13:51,695 --> 00:13:55,100 |
|
1244 So our regular |
|
1245 expression match or need |
|
1246 |
|
1247 275 |
|
1248 00:13:55,100 --> 00:13:57,635 |
|
1249 less than ten seconds to |
|
1250 |
|
1251 276 |
|
1252 00:13:57,635 --> 00:14:00,725 |
|
1253 match a string of length |
|
1254 of 4 million A's. |
|
1255 |
|
1256 277 |
|
1257 00:14:00,725 --> 00:14:04,430 |
|
1258 Contrast that Python, Java eight, |
|
1259 |
|
1260 278 |
|
1261 00:14:04,430 --> 00:14:06,770 |
|
1262 and JavaScript need half a minute |
|
1263 |
|
1264 279 |
|
1265 00:14:06,770 --> 00:14:09,905 |
|
1266 already for a string |
|
1267 of length just 30, |
|
1268 |
|
1269 280 |
|
1270 00:14:09,905 --> 00:14:12,365 |
|
1271 unless you're very |
|
1272 careful with Java eight. |
|
1273 |
|
1274 281 |
|
1275 00:14:12,365 --> 00:14:15,725 |
|
1276 Yes, Java nine and above, |
|
1277 |
|
1278 282 |
|
1279 00:14:15,725 --> 00:14:17,180 |
|
1280 they already have |
|
1281 |
|
1282 283 |
|
1283 00:14:17,180 --> 00:14:19,610 |
|
1284 a much better regular |
|
1285 expression matching engine, |
|
1286 |
|
1287 284 |
|
1288 00:14:19,610 --> 00:14:22,805 |
|
1289 but still we will be running |
|
1290 circles around them. |
|
1291 |
|
1292 285 |
|
1293 00:14:22,805 --> 00:14:27,050 |
|
1294 It's this data. I |
|
1295 call this slide. |
|
1296 |
|
1297 286 |
|
1298 00:14:27,050 --> 00:14:29,675 |
|
1299 Why bother with |
|
1300 regular expressions? |
|
1301 |
|
1302 287 |
|
1303 00:14:29,675 --> 00:14:33,515 |
|
1304 But you can probably |
|
1305 see these are |
|
1306 |
|
1307 288 |
|
1308 00:14:33,515 --> 00:14:34,910 |
|
1309 at least more times by |
|
1310 |
|
1311 289 |
|
1312 00:14:34,910 --> 00:14:38,015 |
|
1313 the existing regular |
|
1314 expression matching engines. |
|
1315 |
|
1316 290 |
|
1317 00:14:38,015 --> 00:14:40,070 |
|
1318 And it's actually |
|
1319 surprising that after |
|
1320 |
|
1321 291 |
|
1322 00:14:40,070 --> 00:14:42,695 |
|
1323 one lecture we can already |
|
1324 do substantially better. |
|
1325 |
|
1326 292 |
|
1327 00:14:42,695 --> 00:14:47,495 |
|
1328 And if you don't believe |
|
1329 in D times, I gave here, |
|
1330 |
|
1331 293 |
|
1332 00:14:47,495 --> 00:14:50,090 |
|
1333 please feel free to |
|
1334 play on your own |
|
1335 |
|
1336 294 |
|
1337 00:14:50,090 --> 00:14:52,865 |
|
1338 with the examples |
|
1339 I uploaded, Keats. |
|
1340 |
|
1341 295 |
|
1342 00:14:52,865 --> 00:14:55,235 |
|
1343 These are exactly the programs |
|
1344 |
|
1345 296 |
|
1346 00:14:55,235 --> 00:14:57,470 |
|
1347 are used here in the examples. |
|
1348 |
|
1349 297 |
|
1350 00:14:57,470 --> 00:14:59,255 |
|
1351 So feel free. |
|
1352 |
|
1353 298 |
|
1354 00:14:59,255 --> 00:15:01,970 |
|
1355 You might however now think, hmm. |
|
1356 |
|
1357 299 |
|
1358 00:15:01,970 --> 00:15:05,449 |
|
1359 These are two very |
|
1360 well chosen examples. |
|
1361 |
|
1362 300 |
|
1363 00:15:05,449 --> 00:15:07,145 |
|
1364 And I admit that's true. |
|
1365 |
|
1366 301 |
|
1367 00:15:07,145 --> 00:15:09,410 |
|
1368 And such problem there never |
|
1369 |
|
1370 302 |
|
1371 00:15:09,410 --> 00:15:12,540 |
|
1372 causing any problems |
|
1373 in real life. |
|
1374 |
|
1375 303 |
|
1376 00:15:13,300 --> 00:15:15,980 |
|
1377 Regular expressions are used very |
|
1378 |
|
1379 304 |
|
1380 00:15:15,980 --> 00:15:19,415 |
|
1381 frequently and they |
|
1382 do cause problems. |
|
1383 |
|
1384 305 |
|
1385 00:15:19,415 --> 00:15:21,410 |
|
1386 So here's my first example from |
|
1387 |
|
1388 306 |
|
1389 00:15:21,410 --> 00:15:23,885 |
|
1390 a company called cloudflare. |
|
1391 |
|
1392 307 |
|
1393 00:15:23,885 --> 00:15:27,560 |
|
1394 This is a huge hosting company |
|
1395 |
|
1396 308 |
|
1397 00:15:27,560 --> 00:15:30,935 |
|
1398 which host very |
|
1399 well-known web pages. |
|
1400 |
|
1401 309 |
|
1402 00:15:30,935 --> 00:15:34,970 |
|
1403 And they really try hard |
|
1404 to have no outage at all. |
|
1405 |
|
1406 310 |
|
1407 00:15:34,970 --> 00:15:37,340 |
|
1408 And they manage |
|
1409 that for six years. |
|
1410 |
|
1411 311 |
|
1412 00:15:37,340 --> 00:15:39,320 |
|
1413 But then a Rekha expression, |
|
1414 |
|
1415 312 |
|
1416 00:15:39,320 --> 00:15:41,180 |
|
1417 actually this one caused |
|
1418 a problem and you |
|
1419 |
|
1420 313 |
|
1421 00:15:41,180 --> 00:15:43,265 |
|
1422 can see they're also |
|
1423 like two stars. |
|
1424 |
|
1425 314 |
|
1426 00:15:43,265 --> 00:15:44,630 |
|
1427 They are at the end. |
|
1428 |
|
1429 315 |
|
1430 00:15:44,630 --> 00:15:46,955 |
|
1431 And because of that string needed |
|
1432 |
|
1433 316 |
|
1434 00:15:46,955 --> 00:15:49,865 |
|
1435 too much time to be matched. |
|
1436 |
|
1437 317 |
|
1438 00:15:49,865 --> 00:15:50,990 |
|
1439 And because of that, |
|
1440 |
|
1441 318 |
|
1442 00:15:50,990 --> 00:15:52,430 |
|
1443 they had some outage for, |
|
1444 |
|
1445 319 |
|
1446 00:15:52,430 --> 00:15:54,125 |
|
1447 I think several hours, |
|
1448 |
|
1449 320 |
|
1450 00:15:54,125 --> 00:15:57,920 |
|
1451 actually in their malware |
|
1452 detection subsystem. |
|
1453 |
|
1454 321 |
|
1455 00:15:57,920 --> 00:16:02,060 |
|
1456 And the second example |
|
1457 comes from 2016, |
|
1458 |
|
1459 322 |
|
1460 00:16:02,060 --> 00:16:04,040 |
|
1461 where Stack Exchange, |
|
1462 I guess you know |
|
1463 |
|
1464 323 |
|
1465 00:16:04,040 --> 00:16:06,650 |
|
1466 this webpage had |
|
1467 also an outage from, |
|
1468 |
|
1469 324 |
|
1470 00:16:06,650 --> 00:16:08,390 |
|
1471 I think at least an hour. |
|
1472 |
|
1473 325 |
|
1474 00:16:08,390 --> 00:16:13,070 |
|
1475 Because a regular expression |
|
1476 then needed to format posts, |
|
1477 |
|
1478 326 |
|
1479 00:16:13,070 --> 00:16:15,575 |
|
1480 needed too much time to |
|
1481 |
|
1482 327 |
|
1483 00:16:15,575 --> 00:16:19,010 |
|
1484 recognize whether this post |
|
1485 should be accepted or not. |
|
1486 |
|
1487 328 |
|
1488 00:16:19,010 --> 00:16:23,390 |
|
1489 And again, there was a |
|
1490 semi kind of problem. |
|
1491 |
|
1492 329 |
|
1493 00:16:23,390 --> 00:16:24,950 |
|
1494 And you can read |
|
1495 the stories behind |
|
1496 |
|
1497 330 |
|
1498 00:16:24,950 --> 00:16:28,080 |
|
1499 that on these two given links. |
|
1500 |
|
1501 331 |
|
1502 00:16:28,720 --> 00:16:31,730 |
|
1503 When I looked at |
|
1504 this the first time, |
|
1505 |
|
1506 332 |
|
1507 00:16:31,730 --> 00:16:34,175 |
|
1508 what surprised me is |
|
1509 that theoretician |
|
1510 |
|
1511 333 |
|
1512 00:16:34,175 --> 00:16:37,520 |
|
1513 who sometimes dedicate their |
|
1514 life to regular expression. |
|
1515 |
|
1516 334 |
|
1517 00:16:37,520 --> 00:16:39,440 |
|
1518 And no really a lot about |
|
1519 |
|
1520 335 |
|
1521 00:16:39,440 --> 00:16:41,690 |
|
1522 them didn't know |
|
1523 anything about this. |
|
1524 |
|
1525 336 |
|
1526 00:16:41,690 --> 00:16:43,610 |
|
1527 But engineers, they |
|
1528 already created |
|
1529 |
|
1530 337 |
|
1531 00:16:43,610 --> 00:16:46,160 |
|
1532 a name for that |
|
1533 regular expression, |
|
1534 |
|
1535 338 |
|
1536 00:16:46,160 --> 00:16:47,975 |
|
1537 denial of service attack. |
|
1538 |
|
1539 339 |
|
1540 00:16:47,975 --> 00:16:49,745 |
|
1541 Because what you can, |
|
1542 |
|
1543 340 |
|
1544 00:16:49,745 --> 00:16:51,230 |
|
1545 what can happen now is that |
|
1546 |
|
1547 341 |
|
1548 00:16:51,230 --> 00:16:54,920 |
|
1549 attackers look for |
|
1550 certain strings. |
|
1551 |
|
1552 342 |
|
1553 00:16:54,920 --> 00:16:56,780 |
|
1554 You make your regular expression |
|
1555 |
|
1556 343 |
|
1557 00:16:56,780 --> 00:16:59,105 |
|
1558 matching engine topple over. |
|
1559 |
|
1560 344 |
|
1561 00:16:59,105 --> 00:17:01,370 |
|
1562 And these kind of expressions, |
|
1563 |
|
1564 345 |
|
1565 00:17:01,370 --> 00:17:04,160 |
|
1566 regular expressions called |
|
1567 Eve of reg expression. |
|
1568 |
|
1569 346 |
|
1570 00:17:04,160 --> 00:17:06,350 |
|
1571 And actually there are |
|
1572 quite a number of them. |
|
1573 |
|
1574 347 |
|
1575 00:17:06,350 --> 00:17:08,495 |
|
1576 So you seen this one, |
|
1577 |
|
1578 348 |
|
1579 00:17:08,495 --> 00:17:11,255 |
|
1580 the first one, and the |
|
1581 second one already. |
|
1582 |
|
1583 349 |
|
1584 00:17:11,255 --> 00:17:13,400 |
|
1585 But there are many, many more. |
|
1586 |
|
1587 350 |
|
1588 00:17:13,400 --> 00:17:15,620 |
|
1589 And you can easily have in |
|
1590 |
|
1591 351 |
|
1592 00:17:15,620 --> 00:17:18,560 |
|
1593 your program one of |
|
1594 these reg expression. |
|
1595 |
|
1596 352 |
|
1597 00:17:18,560 --> 00:17:21,830 |
|
1598 And then you have the |
|
1599 problem that if you do have |
|
1600 |
|
1601 353 |
|
1602 00:17:21,830 --> 00:17:23,240 |
|
1603 this regular expression and |
|
1604 |
|
1605 354 |
|
1606 00:17:23,240 --> 00:17:25,640 |
|
1607 somebody finds the |
|
1608 corresponding string, |
|
1609 |
|
1610 355 |
|
1611 00:17:25,640 --> 00:17:29,945 |
|
1612 which make the records |
|
1613 matching engine topple over, |
|
1614 |
|
1615 356 |
|
1616 00:17:29,945 --> 00:17:31,820 |
|
1617 then you have a problem |
|
1618 |
|
1619 357 |
|
1620 00:17:31,820 --> 00:17:34,295 |
|
1621 because your webpage is |
|
1622 probably not variable. |
|
1623 |
|
1624 358 |
|
1625 00:17:34,295 --> 00:17:36,140 |
|
1626 This is also sometimes called |
|
1627 |
|
1628 359 |
|
1629 00:17:36,140 --> 00:17:39,350 |
|
1630 this phenomenon, |
|
1631 catastrophic backtracking. |
|
1632 |
|
1633 360 |
|
1634 00:17:39,350 --> 00:17:43,595 |
|
1635 In lecture three, we will |
|
1636 look at this more carefully. |
|
1637 |
|
1638 361 |
|
1639 00:17:43,595 --> 00:17:46,910 |
|
1640 And actually why that |
|
1641 is such a problem in |
|
1642 |
|
1643 362 |
|
1644 00:17:46,910 --> 00:17:50,795 |
|
1645 real life is actually |
|
1646 not to do with Lexus. |
|
1647 |
|
1648 363 |
|
1649 00:17:50,795 --> 00:17:53,180 |
|
1650 Yes, regular |
|
1651 expressions are used as |
|
1652 |
|
1653 364 |
|
1654 00:17:53,180 --> 00:17:55,040 |
|
1655 the basic tool for implementing |
|
1656 |
|
1657 365 |
|
1658 00:17:55,040 --> 00:17:57,185 |
|
1659 like source bad reg expressions, |
|
1660 |
|
1661 366 |
|
1662 00:17:57,185 --> 00:18:00,065 |
|
1663 of course, used in |
|
1664 a much wider area. |
|
1665 |
|
1666 367 |
|
1667 00:18:00,065 --> 00:18:03,770 |
|
1668 And they especially used for |
|
1669 network intrusion detection. |
|
1670 |
|
1671 368 |
|
1672 00:18:03,770 --> 00:18:06,590 |
|
1673 Remember, you having to |
|
1674 |
|
1675 369 |
|
1676 00:18:06,590 --> 00:18:10,130 |
|
1677 administer a big network |
|
1678 and you only want to let |
|
1679 |
|
1680 370 |
|
1681 00:18:10,130 --> 00:18:13,640 |
|
1682 in packets which you think are K |
|
1683 |
|
1684 371 |
|
1685 00:18:13,640 --> 00:18:14,930 |
|
1686 and you want to keep out |
|
1687 |
|
1688 372 |
|
1689 00:18:14,930 --> 00:18:17,645 |
|
1690 any package which might |
|
1691 hack into your network. |
|
1692 |
|
1693 373 |
|
1694 00:18:17,645 --> 00:18:22,670 |
|
1695 So what they have is they |
|
1696 have suites of thousands and |
|
1697 |
|
1698 374 |
|
1699 00:18:22,670 --> 00:18:25,745 |
|
1700 sometimes even more |
|
1701 regular expressions which |
|
1702 |
|
1703 375 |
|
1704 00:18:25,745 --> 00:18:27,755 |
|
1705 check whether this package |
|
1706 |
|
1707 376 |
|
1708 00:18:27,755 --> 00:18:30,065 |
|
1709 satisfies some patterns or not. |
|
1710 |
|
1711 377 |
|
1712 00:18:30,065 --> 00:18:31,460 |
|
1713 And in this case it will be left |
|
1714 |
|
1715 378 |
|
1716 00:18:31,460 --> 00:18:34,205 |
|
1717 out or it will be let in. |
|
1718 |
|
1719 379 |
|
1720 00:18:34,205 --> 00:18:36,335 |
|
1721 And with networks, |
|
1722 |
|
1723 380 |
|
1724 00:18:36,335 --> 00:18:39,080 |
|
1725 the problem is that our |
|
1726 hardware is already |
|
1727 |
|
1728 381 |
|
1729 00:18:39,080 --> 00:18:43,190 |
|
1730 so fast that the reg expressions |
|
1731 |
|
1732 382 |
|
1733 00:18:43,190 --> 00:18:45,169 |
|
1734 really become a bottleneck. |
|
1735 |
|
1736 383 |
|
1737 00:18:45,169 --> 00:18:47,060 |
|
1738 Because what do you do if now is |
|
1739 |
|
1740 384 |
|
1741 00:18:47,060 --> 00:18:49,880 |
|
1742 suddenly a reg expression |
|
1743 takes too much time |
|
1744 |
|
1745 385 |
|
1746 00:18:49,880 --> 00:18:52,670 |
|
1747 to just stop the matching |
|
1748 |
|
1749 386 |
|
1750 00:18:52,670 --> 00:18:55,100 |
|
1751 and let the package |
|
1752 in regardless? |
|
1753 |
|
1754 387 |
|
1755 00:18:55,100 --> 00:18:58,190 |
|
1756 Or do you just hold |
|
1757 the network up |
|
1758 |
|
1759 388 |
|
1760 00:18:58,190 --> 00:19:01,715 |
|
1761 and don't let anything in |
|
1762 until you decided that. |
|
1763 |
|
1764 389 |
|
1765 00:19:01,715 --> 00:19:04,895 |
|
1766 So that's actually a |
|
1767 really hard problem. |
|
1768 |
|
1769 390 |
|
1770 00:19:04,895 --> 00:19:06,650 |
|
1771 But the first time I came across |
|
1772 |
|
1773 391 |
|
1774 00:19:06,650 --> 00:19:09,965 |
|
1775 that problem was actually |
|
1776 by this engineer. |
|
1777 |
|
1778 392 |
|
1779 00:19:09,965 --> 00:19:13,820 |
|
1780 And it's always say that |
|
1781 Germans don't have any Yammer. |
|
1782 |
|
1783 393 |
|
1784 00:19:13,820 --> 00:19:16,985 |
|
1785 But I found that |
|
1786 video quite funny. |
|
1787 |
|
1788 394 |
|
1789 00:19:16,985 --> 00:19:19,145 |
|
1790 Maybe you have a |
|
1791 different opinion, |
|
1792 |
|
1793 395 |
|
1794 00:19:19,145 --> 00:19:21,095 |
|
1795 but feel free to |
|
1796 have a look which |
|
1797 |
|
1798 396 |
|
1799 00:19:21,095 --> 00:19:23,705 |
|
1800 explains exactly that problem. |
|
1801 |
|
1802 397 |
|
1803 00:19:23,705 --> 00:19:25,610 |
|
1804 So in the next video, |
|
1805 |
|
1806 398 |
|
1807 00:19:25,610 --> 00:19:28,445 |
|
1808 we will start to |
|
1809 implement this matcher. |
|
1810 |
|
1811 399 |
|
1812 00:19:28,445 --> 00:19:30,870 |
|
1813 So I hope to see you there. |