videos/01-evilregexes.srt
changeset 761 82a1315c128d
child 765 b294cfbb5c01
equal deleted inserted replaced
760:d41956ea544e 761:82a1315c128d
       
     1 1
       
     2 00:00:06,240 --> 00:00:11,050
       
     3 Welcome back. This video
       
     4 is about regular expressions.
       
     5 
       
     6 2
       
     7 00:00:11,050 --> 00:00:14,230
       
     8 We want to use regular
       
     9 expressions in our lexer.
       
    10 
       
    11 3
       
    12 00:00:14,230 --> 00:00:16,165
       
    13 And the purpose of the
       
    14 lexer is to find
       
    15 
       
    16 4
       
    17 00:00:16,165 --> 00:00:18,070
       
    18 out where the words in
       
    19 
       
    20 5
       
    21 00:00:18,070 --> 00:00:21,070
       
    22 our programs are. However
       
    23 regular expressions
       
    24 
       
    25 6
       
    26 00:00:21,070 --> 00:00:23,875
       
    27 are fundamental tool
       
    28 in computer science.
       
    29 
       
    30 7
       
    31 00:00:23,875 --> 00:00:27,910
       
    32 And I'm sure you've used them
       
    33 already on several occasions.
       
    34 
       
    35 8
       
    36 00:00:27,910 --> 00:00:30,370
       
    37 And one would expect that about
       
    38 
       
    39 9
       
    40 00:00:30,370 --> 00:00:31,750
       
    41 regular expressions since they are
       
    42 
       
    43 10
       
    44 00:00:31,750 --> 00:00:33,850
       
    45 so well-known and well studied,
       
    46 
       
    47 11
       
    48 00:00:33,850 --> 00:00:37,915
       
    49 that everything under the
       
    50 sun is known about them.
       
    51 
       
    52 12
       
    53 00:00:37,915 --> 00:00:41,080
       
    54 But actually there's
       
    55 still some surprising
       
    56 
       
    57 13
       
    58 00:00:41,080 --> 00:00:44,465
       
    59 and interesting
       
    60 problems with them.
       
    61 
       
    62 14
       
    63 00:00:44,465 --> 00:00:47,945
       
    64 And I want to show you
       
    65 them in this video.
       
    66 
       
    67 15
       
    68 00:00:47,945 --> 00:00:50,720
       
    69 I'm sure you've seen
       
    70 regular expressions
       
    71 
       
    72 16
       
    73 00:00:50,720 --> 00:00:52,445
       
    74 many, many times before.
       
    75 
       
    76 17
       
    77 00:00:52,445 --> 00:00:55,100
       
    78 But just to be on the same page,
       
    79 
       
    80 18
       
    81 00:00:55,100 --> 00:00:57,110
       
    82 let me just recap them.
       
    83 
       
    84 19
       
    85 00:00:57,110 --> 00:00:59,210
       
    86 So here in this line,
       
    87 
       
    88 20
       
    89 00:00:59,210 --> 00:01:01,790
       
    90 there is a regular expression
       
    91 which is supposed to
       
    92 
       
    93 21
       
    94 00:01:01,790 --> 00:01:05,285
       
    95 recognize some form
       
    96 of email addresses.
       
    97 
       
    98 22
       
    99 00:01:05,285 --> 00:01:07,745
       
   100 So an e-mail address
       
   101 
       
   102 23
       
   103 00:01:07,745 --> 00:01:11,000
       
   104 has part which is
       
   105 before the @ symbol,
       
   106 
       
   107 24
       
   108 00:01:11,000 --> 00:01:13,400
       
   109 which is the name of the person.
       
   110 
       
   111 25
       
   112 00:01:13,400 --> 00:01:16,880
       
   113 And that can be
       
   114 any number between
       
   115 
       
   116 26
       
   117 00:01:16,880 --> 00:01:20,195
       
   118 0 and 9, and letters between a and z.
       
   119 
       
   120 27
       
   121 00:01:20,195 --> 00:01:24,155
       
   122 Let's say we avoiding
       
   123 here capital letters.
       
   124 
       
   125 28
       
   126 00:01:24,155 --> 00:01:26,045
       
   127 There can be underscores.
       
   128 
       
   129 29
       
   130 00:01:26,045 --> 00:01:29,405
       
   131 There can be a dot and
       
   132 there can be hyphens.
       
   133 
       
   134 30
       
   135 00:01:29,405 --> 00:01:35,390
       
   136 And after the @ symbol
       
   137 comes the domain name.
       
   138 
       
   139 31
       
   140 00:01:35,390 --> 00:01:37,310
       
   141 So as you can see here,
       
   142 
       
   143 32
       
   144 00:01:37,310 --> 00:01:40,640
       
   145 we use things like star to
       
   146 
       
   147 33
       
   148 00:01:40,640 --> 00:01:44,314
       
   149 match letters
       
   150 zero or more times.
       
   151 
       
   152 34
       
   153 00:01:44,314 --> 00:01:45,985
       
   154 Or we have a plus,
       
   155 
       
   156 35
       
   157 00:01:45,985 --> 00:01:47,420
       
   158 which means you have to match
       
   159 
       
   160 36
       
   161 00:01:47,420 --> 00:01:52,489
       
   162 at least once or more
       
   163 times. Then we have.
       
   164 
       
   165 37
       
   166 00:01:52,489 --> 00:01:55,790
       
   167 question mark, which says you
       
   168 
       
   169 38
       
   170 00:01:55,790 --> 00:01:59,105
       
   171 match either it is there
       
   172 or it ss not there.
       
   173 
       
   174 39
       
   175 00:01:59,105 --> 00:02:01,340
       
   176 You are also regular
       
   177 expressions which
       
   178 
       
   179 40
       
   180 00:02:01,340 --> 00:02:03,755
       
   181 match exactly n-times.
       
   182 
       
   183 41
       
   184 00:02:03,755 --> 00:02:08,720
       
   185 Or this is a regular expression
       
   186 for between n and m times.
       
   187 
       
   188 42
       
   189 00:02:08,720 --> 00:02:12,065
       
   190 You can see in
       
   191 this email address,
       
   192 
       
   193 43
       
   194 00:02:12,065 --> 00:02:13,730
       
   195 the top-level domain
       
   196 
       
   197 44
       
   198 00:02:13,730 --> 00:02:16,130
       
   199 name can be any letter 
       
   200 
       
   201 45
       
   202 00:02:16,130 --> 00:02:19,265
       
   203 between a to z,
       
   204 and contain dots,
       
   205 
       
   206 46
       
   207 00:02:19,265 --> 00:02:22,340
       
   208 but can only be two
       
   209 characters long
       
   210 
       
   211 47
       
   212 00:02:22,340 --> 00:02:25,685
       
   213 up till six characters
       
   214 and not more.
       
   215 
       
   216 48
       
   217 00:02:25,685 --> 00:02:29,240
       
   218 Then you also have
       
   219 something like ranges.
       
   220 
       
   221 49
       
   222 00:02:29,240 --> 00:02:31,220
       
   223 So you can see, letters between a
       
   224 
       
   225 50
       
   226 00:02:31,220 --> 00:02:33,635
       
   227 and z and 0 to 9 and so on.
       
   228 
       
   229 51
       
   230 00:02:33,635 --> 00:02:36,545
       
   231 Here you also have regular
       
   232 expression which can
       
   233 
       
   234 52
       
   235 00:02:36,545 --> 00:02:40,070
       
   236 match something which
       
   237 isn't in this range.
       
   238 
       
   239 53
       
   240 00:02:40,070 --> 00:02:42,560
       
   241 So for example, if
       
   242 you want for example match,
       
   243 
       
   244 54
       
   245 00:02:42,560 --> 00:02:44,030
       
   246 letters but not numbers,
       
   247 
       
   248 55
       
   249 00:02:44,030 --> 00:02:45,800
       
   250 you would say, well, if
       
   251 
       
   252 56
       
   253 00:02:45,800 --> 00:02:48,990
       
   254 this is a number that
       
   255 should not match.
       
   256 
       
   257 57
       
   258 00:02:49,090 --> 00:02:52,804
       
   259 Typically you also
       
   260 have these ranges.
       
   261 
       
   262 58
       
   263 00:02:52,804 --> 00:02:55,565
       
   264 Lowercase letters,
       
   265 capital letters.
       
   266 
       
   267 59
       
   268 00:02:55,565 --> 00:02:58,550
       
   269 Then you have some
       
   270 special regular expressions
       
   271 
       
   272 60
       
   273 00:02:58,550 --> 00:03:02,195
       
   274 like this one is only
       
   275 supposed to match digits.
       
   276 
       
   277 61
       
   278 00:03:02,195 --> 00:03:05,674
       
   279 A dot is supposed to
       
   280 match any character.
       
   281 
       
   282 62
       
   283 00:03:05,674 --> 00:03:07,370
       
   284 And then they have also something
       
   285 
       
   286 63
       
   287 00:03:07,370 --> 00:03:09,800
       
   288 called groups which
       
   289 is supposed to be
       
   290 
       
   291 64
       
   292 00:03:09,800 --> 00:03:12,799
       
   293 used when you are
       
   294 trying to extract
       
   295 
       
   296 65
       
   297 00:03:12,799 --> 00:03:15,605
       
   298 a string you've matched.
       
   299 
       
   300 66
       
   301 00:03:15,605 --> 00:03:19,925
       
   302 Okay, so these are the
       
   303 typical regular expressions.
       
   304 
       
   305 67
       
   306 00:03:19,925 --> 00:03:23,075
       
   307 And here's a particular one.
       
   308 
       
   309 68
       
   310 00:03:23,075 --> 00:03:25,820
       
   311 Trying to match something
       
   312 
       
   313 69
       
   314 00:03:25,820 --> 00:03:28,770
       
   315 which resembles
       
   316 an email address.
       
   317 
       
   318 70
       
   319 00:03:29,590 --> 00:03:33,065
       
   320 Clearly that should be all easy.
       
   321 
       
   322 71
       
   323 00:03:33,065 --> 00:03:36,230
       
   324 And our technology should
       
   325 be on top of that.
       
   326 
       
   327 72
       
   328 00:03:36,230 --> 00:03:37,865
       
   329 That we can take a
       
   330 
       
   331 73
       
   332 00:03:37,865 --> 00:03:41,015
       
   333 regular expressions and
       
   334 we can take a string,
       
   335 
       
   336 74
       
   337 00:03:41,015 --> 00:03:43,340
       
   338 and we should have programs to
       
   339 
       
   340 75
       
   341 00:03:43,340 --> 00:03:45,680
       
   342 decide whether this
       
   343 string is matched
       
   344 
       
   345 76
       
   346 00:03:45,680 --> 00:03:50,330
       
   347 by a regular expression or
       
   348 not and should be easy-peasy, no?
       
   349 
       
   350 77
       
   351 00:03:50,330 --> 00:03:56,150
       
   352 Well, let's have a
       
   353 look at two examples.
       
   354 
       
   355 78
       
   356 00:03:56,150 --> 00:04:00,860
       
   357 The first regular expression
       
   358 is a star star b.
       
   359 
       
   360 79
       
   361 00:04:00,860 --> 00:04:02,990
       
   362 And it is supposed
       
   363 to match strings of
       
   364 
       
   365 80
       
   366 00:04:02,990 --> 00:04:05,825
       
   367 the form 0 or more a's,
       
   368 
       
   369 81
       
   370 00:04:05,825 --> 00:04:10,385
       
   371 followed by a b. The parentheses
       
   372 you can ignore.
       
   373 
       
   374 82
       
   375 00:04:10,385 --> 00:04:11,990
       
   376 And a star star
       
   377 
       
   378 83
       
   379 00:04:11,990 --> 00:04:14,120
       
   380 also doesn't
       
   381 make any difference
       
   382 
       
   383 84
       
   384 00:04:14,120 --> 00:04:16,505
       
   385 to what kind of strings
       
   386 that can be matched.
       
   387 
       
   388 85
       
   389 00:04:16,505 --> 00:04:21,635
       
   390 It can only make 0 more
       
   391 a's followed by a b.
       
   392 
       
   393 86
       
   394 00:04:21,635 --> 00:04:23,900
       
   395 And the other regular expression
       
   396 
       
   397 87
       
   398 00:04:23,900 --> 00:04:26,990
       
   399 is possibly a character a,
       
   400 
       
   401 88
       
   402 00:04:26,990 --> 00:04:32,930
       
   403 n times, followed by character
       
   404 a axactly n-times.
       
   405 
       
   406 89
       
   407 00:04:32,930 --> 00:04:35,570
       
   408 And we will try out
       
   409 
       
   410 90
       
   411 00:04:35,570 --> 00:04:38,360
       
   412 these two regular expressions
       
   413 with strings of the form a,
       
   414 
       
   415 91
       
   416 00:04:38,360 --> 00:04:39,890
       
   417 aa, and so on,
       
   418 
       
   419 92
       
   420 00:04:39,890 --> 00:04:45,770
       
   421 and up to the length of n. And
       
   422 
       
   423 93
       
   424 00:04:45,770 --> 00:04:49,130
       
   425 this regular expression should
       
   426 actually not match any of
       
   427 
       
   428 94
       
   429 00:04:49,130 --> 00:04:53,315
       
   430 the strings because the
       
   431 final b is missing.
       
   432 
       
   433 95
       
   434 00:04:53,315 --> 00:04:56,150
       
   435 But that is
       
   436 okay. For example
       
   437 
       
   438 96
       
   439 00:04:56,150 --> 00:04:57,425
       
   440 if you have a regular expression
       
   441 
       
   442 97
       
   443 00:04:57,425 --> 00:05:00,110
       
   444 that is supposed to
       
   445 check whether a string is
       
   446 
       
   447 98
       
   448 00:05:00,110 --> 00:05:01,490
       
   449 an email address and the user
       
   450 
       
   451 99
       
   452 00:05:01,490 --> 00:05:03,380
       
   453 gives some random
       
   454 strings in there,
       
   455 
       
   456 100
       
   457 00:05:03,380 --> 00:05:06,545
       
   458 then this regular expression
       
   459 should not match that string.
       
   460 
       
   461 101
       
   462 00:05:06,545 --> 00:05:08,420
       
   463 And for this regular expression
       
   464 
       
   465 102
       
   466 00:05:08,420 --> 00:05:11,195
       
   467 you have to scratch a
       
   468 little bit of your head,
       
   469 
       
   470 103
       
   471 00:05:11,195 --> 00:05:12,620
       
   472 what it can actually match.
       
   473 
       
   474 104
       
   475 00:05:12,620 --> 00:05:14,720
       
   476 But after a little bit
       
   477 of head scratching,
       
   478 
       
   479 105
       
   480 00:05:14,720 --> 00:05:18,260
       
   481 you find out can match
       
   482 any string which is of
       
   483 
       
   484 106
       
   485 00:05:18,260 --> 00:05:22,580
       
   486 the length n a's up
       
   487 to 2n of a's.
       
   488 
       
   489 107
       
   490 00:05:22,580 --> 00:05:24,290
       
   491 So anything in this range,
       
   492 
       
   493 108
       
   494 00:05:24,290 --> 00:05:27,185
       
   495 this regular expression
       
   496 can actually match.
       
   497 
       
   498 109
       
   499 00:05:27,185 --> 00:05:30,395
       
   500 Okay, let's
       
   501 take a random tool,
       
   502 
       
   503 110
       
   504 00:05:30,395 --> 00:05:32,630
       
   505 maybe for example Python.
       
   506 
       
   507 111
       
   508 00:05:32,630 --> 00:05:35,240
       
   509 So here's a little
       
   510 Python program.
       
   511 
       
   512 112
       
   513 00:05:35,240 --> 00:05:38,690
       
   514 It uses the library
       
   515 function of Python to
       
   516 
       
   517 113
       
   518 00:05:38,690 --> 00:05:42,935
       
   519 match the regular expressions of
       
   520 a star star b.
       
   521 
       
   522 114
       
   523 00:05:42,935 --> 00:05:46,805
       
   524 And we measure time with longer
       
   525 and longer strings of a.
       
   526 
       
   527 115
       
   528 00:05:46,805 --> 00:05:48,770
       
   529 And so conveniently we can give
       
   530 
       
   531 116
       
   532 00:05:48,770 --> 00:05:51,140
       
   533 the number of a's here
       
   534 on the command line.
       
   535 
       
   536 117
       
   537 00:05:51,140 --> 00:05:56,900
       
   538 If I just call
       
   539 this on the command line,
       
   540 
       
   541 118
       
   542 00:05:56,900 --> 00:05:59,900
       
   543 Let's say we first
       
   544 start with five a's.
       
   545 
       
   546 119
       
   547 00:05:59,900 --> 00:06:03,920
       
   548 And I get also the times which
       
   549 in this case is next to nothing.
       
   550 
       
   551 120
       
   552 00:06:03,920 --> 00:06:05,960
       
   553 And here's the string
       
   554 we just matched.
       
   555 
       
   556 121
       
   557 00:06:05,960 --> 00:06:07,640
       
   558 And obviously the
       
   559 regular expression
       
   560 
       
   561 122
       
   562 00:06:07,640 --> 00:06:09,110
       
   563 did not match the string.
       
   564 
       
   565 123
       
   566 00:06:09,110 --> 00:06:11,255
       
   567 That's indicated by this none.
       
   568 
       
   569 124
       
   570 00:06:11,255 --> 00:06:13,925
       
   571 Let's take ten a's.
       
   572 
       
   573 125
       
   574 00:06:13,925 --> 00:06:16,490
       
   575 It's also pretty quick.
       
   576 
       
   577 126
       
   578 00:06:16,490 --> 00:06:20,780
       
   579 Fifteen a's, even quicker,
       
   580 
       
   581 127
       
   582 00:06:20,780 --> 00:06:23,180
       
   583 but these times always need to
       
   584 
       
   585 128
       
   586 00:06:23,180 --> 00:06:25,820
       
   587 be taken with a grain of salt.
       
   588 
       
   589 129
       
   590 00:06:25,820 --> 00:06:28,040
       
   591 They are not 100
       
   592 percent accurate.
       
   593 
       
   594 130
       
   595 00:06:28,040 --> 00:06:31,490
       
   596 So 15 is also a let's take
       
   597 
       
   598 131
       
   599 00:06:31,490 --> 00:06:36,965
       
   600 28th notes already
       
   601 double the time.
       
   602 
       
   603 132
       
   604 00:06:36,965 --> 00:06:42,440
       
   605 Twenty-five longer.
       
   606 
       
   607 133
       
   608 00:06:42,440 --> 00:06:45,680
       
   609 Okay, that suddenly
       
   610 from 02 seconds,
       
   611 
       
   612 134
       
   613 00:06:45,680 --> 00:06:48,960
       
   614 it takes almost four seconds.
       
   615 
       
   616 135
       
   617 00:06:49,600 --> 00:06:54,890
       
   618 Six this
       
   619 
       
   620 136
       
   621 00:06:54,890 --> 00:07:01,415
       
   622 takes six seconds
       
   623 already Double, okay?
       
   624 
       
   625 137
       
   626 00:07:01,415 --> 00:07:07,229
       
   627 Go to 28. That would be now.
       
   628 
       
   629 138
       
   630 00:07:08,890 --> 00:07:11,840
       
   631 You see the string
       
   632 isn't very long,
       
   633 
       
   634 139
       
   635 00:07:11,840 --> 00:07:13,340
       
   636 so that could be easily like
       
   637 
       
   638 140
       
   639 00:07:13,340 --> 00:07:16,070
       
   640 just the size of
       
   641 an email address.
       
   642 
       
   643 141
       
   644 00:07:16,070 --> 00:07:19,280
       
   645 And the regular
       
   646 expression matching
       
   647 
       
   648 142
       
   649 00:07:19,280 --> 00:07:22,550
       
   650 engine in Python needs
       
   651 quite a long time
       
   652 
       
   653 143
       
   654 00:07:22,550 --> 00:07:24,710
       
   655 to find out that
       
   656 this string of 28
       
   657 
       
   658 144
       
   659 00:07:24,710 --> 00:07:26,570
       
   660 AES is actually not much
       
   661 
       
   662 145
       
   663 00:07:26,570 --> 00:07:28,490
       
   664 by that you see it's
       
   665 still not finished.
       
   666 
       
   667 146
       
   668 00:07:28,490 --> 00:07:32,900
       
   669 I think it should take
       
   670 approximately like 20 seconds.
       
   671 
       
   672 147
       
   673 00:07:32,900 --> 00:07:34,400
       
   674 Okay. Already 30.
       
   675 
       
   676 148
       
   677 00:07:34,400 --> 00:07:36,530
       
   678 And if we would try
       
   679 
       
   680 149
       
   681 00:07:36,530 --> 00:07:40,805
       
   682 30 would be already
       
   683 more than a minute.
       
   684 
       
   685 150
       
   686 00:07:40,805 --> 00:07:43,940
       
   687 And if I could read
       
   688 something like hundreds,
       
   689 
       
   690 151
       
   691 00:07:43,940 --> 00:07:46,220
       
   692 you remember if a doubling in
       
   693 
       
   694 152
       
   695 00:07:46,220 --> 00:07:48,770
       
   696 each step or the second step,
       
   697 
       
   698 153
       
   699 00:07:48,770 --> 00:07:50,720
       
   700 the story with the chess board,
       
   701 
       
   702 154
       
   703 00:07:50,720 --> 00:07:53,855
       
   704 we probably would sit here
       
   705 until the next century.
       
   706 
       
   707 155
       
   708 00:07:53,855 --> 00:07:56,820
       
   709 So something strange here.
       
   710 
       
   711 156
       
   712 00:07:57,580 --> 00:08:01,355
       
   713 Okay, that might be just
       
   714 a problem of Python.
       
   715 
       
   716 157
       
   717 00:08:01,355 --> 00:08:02,990
       
   718 Let's have a look at another
       
   719 
       
   720 158
       
   721 00:08:02,990 --> 00:08:04,985
       
   722 regular expression
       
   723 matching engine.
       
   724 
       
   725 159
       
   726 00:08:04,985 --> 00:08:06,890
       
   727 This time from JavaScript,
       
   728 
       
   729 160
       
   730 00:08:06,890 --> 00:08:10,040
       
   731 also are pretty well-known
       
   732 programming language.
       
   733 
       
   734 161
       
   735 00:08:10,040 --> 00:08:13,610
       
   736 So here you can see
       
   737 it's still a star,
       
   738 
       
   739 162
       
   740 00:08:13,610 --> 00:08:16,235
       
   741 star followed by b,
       
   742 
       
   743 163
       
   744 00:08:16,235 --> 00:08:18,920
       
   745 by direct expression is
       
   746 supposed to match that from
       
   747 
       
   748 164
       
   749 00:08:18,920 --> 00:08:21,830
       
   750 the beginning of the
       
   751 string up till the end.
       
   752 
       
   753 165
       
   754 00:08:21,830 --> 00:08:23,930
       
   755 So there's not any difference
       
   756 
       
   757 166
       
   758 00:08:23,930 --> 00:08:26,150
       
   759 in the strings this work
       
   760 expression matches.
       
   761 
       
   762 167
       
   763 00:08:26,150 --> 00:08:28,610
       
   764 We'll just start at the
       
   765 beginning of the string
       
   766 
       
   767 168
       
   768 00:08:28,610 --> 00:08:31,460
       
   769 and finish at the
       
   770 end of the string.
       
   771 
       
   772 169
       
   773 00:08:31,460 --> 00:08:35,285
       
   774 And we again, we just use
       
   775 repeated A's for that.
       
   776 
       
   777 170
       
   778 00:08:35,285 --> 00:08:38,195
       
   779 And similarly, we can
       
   780 
       
   781 171
       
   782 00:08:38,195 --> 00:08:41,930
       
   783 call it on the command line
       
   784 and can do some timing.
       
   785 
       
   786 172
       
   787 00:08:41,930 --> 00:08:44,540
       
   788 So ten SBA, good.
       
   789 
       
   790 173
       
   791 00:08:44,540 --> 00:08:46,340
       
   792 Here's the string.
       
   793 
       
   794 174
       
   795 00:08:46,340 --> 00:08:48,320
       
   796 It cannot match that string.
       
   797 
       
   798 175
       
   799 00:08:48,320 --> 00:08:50,525
       
   800 And it's pretty fast.
       
   801 
       
   802 176
       
   803 00:08:50,525 --> 00:08:54,725
       
   804 Friendly. Although pretty fast.
       
   805 
       
   806 177
       
   807 00:08:54,725 --> 00:08:59,120
       
   808 Five, again,
       
   809 
       
   810 178
       
   811 00:08:59,120 --> 00:09:06,650
       
   812 somehow is kind of
       
   813 threshold that is 25, 26.
       
   814 
       
   815 179
       
   816 00:09:06,650 --> 00:09:09,485
       
   817 Suddenly it takes much longer.
       
   818 
       
   819 180
       
   820 00:09:09,485 --> 00:09:14,360
       
   821 And it has essentially the
       
   822 same problem as with Python.
       
   823 
       
   824 181
       
   825 00:09:14,360 --> 00:09:17,165
       
   826 So you'll see in now from 26 on,
       
   827 
       
   828 182
       
   829 00:09:17,165 --> 00:09:19,250
       
   830 the Times has always
       
   831 doubling from
       
   832 
       
   833 183
       
   834 00:09:19,250 --> 00:09:21,860
       
   835 three seconds to seven seconds.
       
   836 
       
   837 184
       
   838 00:09:21,860 --> 00:09:23,330
       
   839 So you can imagine what that
       
   840 
       
   841 185
       
   842 00:09:23,330 --> 00:09:24,890
       
   843 roughly takes when I put your
       
   844 
       
   845 186
       
   846 00:09:24,890 --> 00:09:30,230
       
   847 27 and you see the
       
   848 string isn't very long.
       
   849 
       
   850 187
       
   851 00:09:30,230 --> 00:09:32,165
       
   852 Let's choose twenties or maize.
       
   853 
       
   854 188
       
   855 00:09:32,165 --> 00:09:35,419
       
   856 Imagine you have to
       
   857 search a database
       
   858 
       
   859 189
       
   860 00:09:35,419 --> 00:09:38,720
       
   861 with kilobytes of data.
       
   862 
       
   863 190
       
   864 00:09:38,720 --> 00:09:42,260
       
   865 This, these regular
       
   866 expressions that would years
       
   867 
       
   868 191
       
   869 00:09:42,260 --> 00:09:48,150
       
   870 need years to go through with
       
   871 these regular expressions.
       
   872 
       
   873 192
       
   874 00:09:48,630 --> 00:09:51,850
       
   875 Okay, maybe the people in
       
   876 
       
   877 193
       
   878 00:09:51,850 --> 00:09:55,435
       
   879 Python and JavaScript,
       
   880 they're just idiots.
       
   881 
       
   882 194
       
   883 00:09:55,435 --> 00:09:58,180
       
   884 Surely Java must do much better.
       
   885 
       
   886 195
       
   887 00:09:58,180 --> 00:10:01,045
       
   888 So here's a program.
       
   889 
       
   890 196
       
   891 00:10:01,045 --> 00:10:03,415
       
   892 You can see this again
       
   893 
       
   894 197
       
   895 00:10:03,415 --> 00:10:05,980
       
   896 is the reg expression
       
   897 and we just having
       
   898 
       
   899 198
       
   900 00:10:05,980 --> 00:10:08,320
       
   901 some scaffolding to generate
       
   902 
       
   903 199
       
   904 00:10:08,320 --> 00:10:11,905
       
   905 strings from five up till 28.
       
   906 
       
   907 200
       
   908 00:10:11,905 --> 00:10:14,305
       
   909 And if we run that,
       
   910 
       
   911 201
       
   912 00:10:14,305 --> 00:10:16,660
       
   913 actually does that automatically.
       
   914 
       
   915 202
       
   916 00:10:16,660 --> 00:10:19,900
       
   917 So uphill 19, pretty fast,
       
   918 
       
   919 203
       
   920 00:10:19,900 --> 00:10:24,925
       
   921 but then starting from
       
   922 23, skidding pretty slow.
       
   923 
       
   924 204
       
   925 00:10:24,925 --> 00:10:27,445
       
   926 So the question is
       
   927 what's going on?
       
   928 
       
   929 205
       
   930 00:10:27,445 --> 00:10:29,230
       
   931 By the way, I'm not quoting here.
       
   932 
       
   933 206
       
   934 00:10:29,230 --> 00:10:33,755
       
   935 Scala, using internally
       
   936 the regular expression
       
   937 
       
   938 207
       
   939 00:10:33,755 --> 00:10:36,665
       
   940 matching engine from Java.
       
   941 
       
   942 208
       
   943 00:10:36,665 --> 00:10:39,065
       
   944 So would have exactly
       
   945 the same problem.
       
   946 
       
   947 209
       
   948 00:10:39,065 --> 00:10:41,480
       
   949 Also, I have been
       
   950 here very careful,
       
   951 
       
   952 210
       
   953 00:10:41,480 --> 00:10:43,550
       
   954 I'm using here Scala aid,
       
   955 
       
   956 211
       
   957 00:10:43,550 --> 00:10:46,085
       
   958 which nowadays is quite old.
       
   959 
       
   960 212
       
   961 00:10:46,085 --> 00:10:50,765
       
   962 But you will see also
       
   963 current Java versions.
       
   964 
       
   965 213
       
   966 00:10:50,765 --> 00:10:55,490
       
   967 We will see we can out-compete
       
   968 them by magnitudes.
       
   969 
       
   970 214
       
   971 00:10:55,490 --> 00:10:57,605
       
   972 So I think I can that.
       
   973 
       
   974 215
       
   975 00:10:57,605 --> 00:10:59,165
       
   976 Now, just finish here.
       
   977 
       
   978 216
       
   979 00:10:59,165 --> 00:11:04,025
       
   980 You see the problem. Just
       
   981 for completeness sake.
       
   982 
       
   983 217
       
   984 00:11:04,025 --> 00:11:07,010
       
   985 Here is a Ruby program.
       
   986 
       
   987 218
       
   988 00:11:07,010 --> 00:11:09,935
       
   989 This is using the other
       
   990 regular expression.
       
   991 
       
   992 219
       
   993 00:11:09,935 --> 00:11:12,935
       
   994 In this case the
       
   995 string should match.
       
   996 
       
   997 220
       
   998 00:11:12,935 --> 00:11:20,300
       
   999 And again it tries out
       
  1000 strings between 130 here.
       
  1001 
       
  1002 221
       
  1003 00:11:20,300 --> 00:11:23,450
       
  1004 That's a program actually
       
  1005 a former student produced.
       
  1006 
       
  1007 222
       
  1008 00:11:23,450 --> 00:11:25,565
       
  1009 And you can see four a's
       
  1010 
       
  1011 223
       
  1012 00:11:25,565 --> 00:11:29,780
       
  1013 of links up till 20
       
  1014 AES is pretty fast.
       
  1015 
       
  1016 224
       
  1017 00:11:29,780 --> 00:11:32,495
       
  1018 But then starting at 26,
       
  1019 
       
  1020 225
       
  1021 00:11:32,495 --> 00:11:35,285
       
  1022 it's getting really slow.
       
  1023 
       
  1024 226
       
  1025 00:11:35,285 --> 00:11:37,100
       
  1026 So in this case,
       
  1027 remember the string
       
  1028 
       
  1029 227
       
  1030 00:11:37,100 --> 00:11:38,870
       
  1031 is actually matched by
       
  1032 the regular expression.
       
  1033 
       
  1034 228
       
  1035 00:11:38,870 --> 00:11:40,130
       
  1036 So it has nothing to do
       
  1037 
       
  1038 229
       
  1039 00:11:40,130 --> 00:11:41,540
       
  1040 with a regular
       
  1041 expression actually
       
  1042 
       
  1043 230
       
  1044 00:11:41,540 --> 00:11:45,485
       
  1045 matches a string or does
       
  1046 not match a string.
       
  1047 
       
  1048 231
       
  1049 00:11:45,485 --> 00:11:48,260
       
  1050 I admit though these
       
  1051 regular expressions
       
  1052 
       
  1053 232
       
  1054 00:11:48,260 --> 00:11:49,610
       
  1055 are carefully chosen,
       
  1056 
       
  1057 233
       
  1058 00:11:49,610 --> 00:11:52,250
       
  1059 as you will see later on.
       
  1060 
       
  1061 234
       
  1062 00:11:52,250 --> 00:11:55,620
       
  1063 Hey, I also just stop that here.
       
  1064 
       
  1065 235
       
  1066 00:11:55,710 --> 00:12:00,985
       
  1067 Okay, this slight collect
       
  1068 this information about times.
       
  1069 
       
  1070 236
       
  1071 00:12:00,985 --> 00:12:03,400
       
  1072 On the right hand side will
       
  1073 
       
  1074 237
       
  1075 00:12:03,400 --> 00:12:05,860
       
  1076 be our regular expression mantra,
       
  1077 
       
  1078 238
       
  1079 00:12:05,860 --> 00:12:08,290
       
  1080 which we implement next week.
       
  1081 
       
  1082 239
       
  1083 00:12:08,290 --> 00:12:10,795
       
  1084 On the left-hand side,
       
  1085 are these times by
       
  1086 
       
  1087 240
       
  1088 00:12:10,795 --> 00:12:14,260
       
  1089 barriers than regular
       
  1090 expression matching engines?
       
  1091 
       
  1092 241
       
  1093 00:12:14,260 --> 00:12:17,809
       
  1094 On the top is this
       
  1095 regular expression.
       
  1096 
       
  1097 242
       
  1098 00:12:19,080 --> 00:12:23,335
       
  1099 Possible a n times a n times.
       
  1100 
       
  1101 243
       
  1102 00:12:23,335 --> 00:12:26,890
       
  1103 And on the lowest
       
  1104 is a star, star b.
       
  1105 
       
  1106 244
       
  1107 00:12:26,890 --> 00:12:30,370
       
  1108 And the x-axis show here
       
  1109 
       
  1110 245
       
  1111 00:12:30,370 --> 00:12:35,335
       
  1112 the length of the
       
  1113 string. How many a's.
       
  1114 
       
  1115 246
       
  1116 00:12:35,335 --> 00:12:38,925
       
  1117 And on the y axis is the time.
       
  1118 
       
  1119 247
       
  1120 00:12:38,925 --> 00:12:41,660
       
  1121 They need to decide whether
       
  1122 
       
  1123 248
       
  1124 00:12:41,660 --> 00:12:44,615
       
  1125 the string is matched by
       
  1126 the rate expression or not.
       
  1127 
       
  1128 249
       
  1129 00:12:44,615 --> 00:12:46,415
       
  1130 So you can see here, Python,
       
  1131 
       
  1132 250
       
  1133 00:12:46,415 --> 00:12:47,945
       
  1134 Java eight in JavaScript,
       
  1135 
       
  1136 251
       
  1137 00:12:47,945 --> 00:12:52,250
       
  1138 they max out approximately
       
  1139 at between 2530.
       
  1140 
       
  1141 252
       
  1142 00:12:52,250 --> 00:12:53,900
       
  1143 The kristin, it takes already
       
  1144 
       
  1145 253
       
  1146 00:12:53,900 --> 00:12:55,160
       
  1147 a half a minute to
       
  1148 
       
  1149 254
       
  1150 00:12:55,160 --> 00:12:57,410
       
  1151 decide whether the string
       
  1152 is matched or not.
       
  1153 
       
  1154 255
       
  1155 00:12:57,410 --> 00:13:00,815
       
  1156 And similarly, in
       
  1157 the other example,
       
  1158 
       
  1159 256
       
  1160 00:13:00,815 --> 00:13:03,830
       
  1161 Python and derived Ruby max out
       
  1162 
       
  1163 257
       
  1164 00:13:03,830 --> 00:13:07,220
       
  1165 at a similar kind of
       
  1166 length of the strings.
       
  1167 
       
  1168 258
       
  1169 00:13:07,220 --> 00:13:10,400
       
  1170 Because then they use also
       
  1171 half a minute to decide
       
  1172 
       
  1173 259
       
  1174 00:13:10,400 --> 00:13:13,940
       
  1175 whether this rec expression
       
  1176 actually matches the string.
       
  1177 
       
  1178 260
       
  1179 00:13:13,940 --> 00:13:16,790
       
  1180 Contrast that with
       
  1181 the reg expression
       
  1182 
       
  1183 261
       
  1184 00:13:16,790 --> 00:13:19,235
       
  1185 which we are regular
       
  1186 expression mantra,
       
  1187 
       
  1188 262
       
  1189 00:13:19,235 --> 00:13:21,470
       
  1190 which we're going to implement.
       
  1191 
       
  1192 263
       
  1193 00:13:21,470 --> 00:13:25,040
       
  1194 This can match
       
  1195 approximately 10 thousand
       
  1196 
       
  1197 264
       
  1198 00:13:25,040 --> 00:13:30,065
       
  1199 a's in this example and
       
  1200 needs less than ten seconds.
       
  1201 
       
  1202 265
       
  1203 00:13:30,065 --> 00:13:32,285
       
  1204 Actually, there will be
       
  1205 two versions of that.
       
  1206 
       
  1207 266
       
  1208 00:13:32,285 --> 00:13:34,850
       
  1209 First version may be
       
  1210 also relatively slow.
       
  1211 
       
  1212 267
       
  1213 00:13:34,850 --> 00:13:36,410
       
  1214 But the second version,
       
  1215 
       
  1216 268
       
  1217 00:13:36,410 --> 00:13:38,240
       
  1218 in contrast to Python,
       
  1219 
       
  1220 269
       
  1221 00:13:38,240 --> 00:13:40,295
       
  1222 Ruby, we'll be blindingly fast.
       
  1223 
       
  1224 270
       
  1225 00:13:40,295 --> 00:13:42,380
       
  1226 And in the second example,
       
  1227 
       
  1228 271
       
  1229 00:13:42,380 --> 00:13:45,740
       
  1230 you have to be careful
       
  1231 about the x axis because
       
  1232 
       
  1233 272
       
  1234 00:13:45,740 --> 00:13:49,385
       
  1235 that means four times
       
  1236 ten to the power six.
       
  1237 
       
  1238 273
       
  1239 00:13:49,385 --> 00:13:51,695
       
  1240 It's actually 4 million A's.
       
  1241 
       
  1242 274
       
  1243 00:13:51,695 --> 00:13:55,100
       
  1244 So our regular
       
  1245 expression match or need
       
  1246 
       
  1247 275
       
  1248 00:13:55,100 --> 00:13:57,635
       
  1249 less than ten seconds to
       
  1250 
       
  1251 276
       
  1252 00:13:57,635 --> 00:14:00,725
       
  1253 match a string of length
       
  1254 of 4 million A's.
       
  1255 
       
  1256 277
       
  1257 00:14:00,725 --> 00:14:04,430
       
  1258 Contrast that Python, Java eight,
       
  1259 
       
  1260 278
       
  1261 00:14:04,430 --> 00:14:06,770
       
  1262 and JavaScript need half a minute
       
  1263 
       
  1264 279
       
  1265 00:14:06,770 --> 00:14:09,905
       
  1266 already for a string
       
  1267 of length just 30,
       
  1268 
       
  1269 280
       
  1270 00:14:09,905 --> 00:14:12,365
       
  1271 unless you're very
       
  1272 careful with Java eight.
       
  1273 
       
  1274 281
       
  1275 00:14:12,365 --> 00:14:15,725
       
  1276 Yes, Java nine and above,
       
  1277 
       
  1278 282
       
  1279 00:14:15,725 --> 00:14:17,180
       
  1280 they already have
       
  1281 
       
  1282 283
       
  1283 00:14:17,180 --> 00:14:19,610
       
  1284 a much better regular
       
  1285 expression matching engine,
       
  1286 
       
  1287 284
       
  1288 00:14:19,610 --> 00:14:22,805
       
  1289 but still we will be running
       
  1290 circles around them.
       
  1291 
       
  1292 285
       
  1293 00:14:22,805 --> 00:14:27,050
       
  1294 It's this data. I
       
  1295 call this slide.
       
  1296 
       
  1297 286
       
  1298 00:14:27,050 --> 00:14:29,675
       
  1299 Why bother with
       
  1300 regular expressions?
       
  1301 
       
  1302 287
       
  1303 00:14:29,675 --> 00:14:33,515
       
  1304 But you can probably
       
  1305 see these are
       
  1306 
       
  1307 288
       
  1308 00:14:33,515 --> 00:14:34,910
       
  1309 at least more times by
       
  1310 
       
  1311 289
       
  1312 00:14:34,910 --> 00:14:38,015
       
  1313 the existing regular
       
  1314 expression matching engines.
       
  1315 
       
  1316 290
       
  1317 00:14:38,015 --> 00:14:40,070
       
  1318 And it's actually
       
  1319 surprising that after
       
  1320 
       
  1321 291
       
  1322 00:14:40,070 --> 00:14:42,695
       
  1323 one lecture we can already
       
  1324 do substantially better.
       
  1325 
       
  1326 292
       
  1327 00:14:42,695 --> 00:14:47,495
       
  1328 And if you don't believe
       
  1329 in D times, I gave here,
       
  1330 
       
  1331 293
       
  1332 00:14:47,495 --> 00:14:50,090
       
  1333 please feel free to
       
  1334 play on your own
       
  1335 
       
  1336 294
       
  1337 00:14:50,090 --> 00:14:52,865
       
  1338 with the examples
       
  1339 I uploaded, Keats.
       
  1340 
       
  1341 295
       
  1342 00:14:52,865 --> 00:14:55,235
       
  1343 These are exactly the programs
       
  1344 
       
  1345 296
       
  1346 00:14:55,235 --> 00:14:57,470
       
  1347 are used here in the examples.
       
  1348 
       
  1349 297
       
  1350 00:14:57,470 --> 00:14:59,255
       
  1351 So feel free.
       
  1352 
       
  1353 298
       
  1354 00:14:59,255 --> 00:15:01,970
       
  1355 You might however now think, hmm.
       
  1356 
       
  1357 299
       
  1358 00:15:01,970 --> 00:15:05,449
       
  1359 These are two very
       
  1360 well chosen examples.
       
  1361 
       
  1362 300
       
  1363 00:15:05,449 --> 00:15:07,145
       
  1364 And I admit that's true.
       
  1365 
       
  1366 301
       
  1367 00:15:07,145 --> 00:15:09,410
       
  1368 And such problem there never
       
  1369 
       
  1370 302
       
  1371 00:15:09,410 --> 00:15:12,540
       
  1372 causing any problems
       
  1373 in real life.
       
  1374 
       
  1375 303
       
  1376 00:15:13,300 --> 00:15:15,980
       
  1377 Regular expressions are used very
       
  1378 
       
  1379 304
       
  1380 00:15:15,980 --> 00:15:19,415
       
  1381 frequently and they
       
  1382 do cause problems.
       
  1383 
       
  1384 305
       
  1385 00:15:19,415 --> 00:15:21,410
       
  1386 So here's my first example from
       
  1387 
       
  1388 306
       
  1389 00:15:21,410 --> 00:15:23,885
       
  1390 a company called cloudflare.
       
  1391 
       
  1392 307
       
  1393 00:15:23,885 --> 00:15:27,560
       
  1394 This is a huge hosting company
       
  1395 
       
  1396 308
       
  1397 00:15:27,560 --> 00:15:30,935
       
  1398 which host very
       
  1399 well-known web pages.
       
  1400 
       
  1401 309
       
  1402 00:15:30,935 --> 00:15:34,970
       
  1403 And they really try hard
       
  1404 to have no outage at all.
       
  1405 
       
  1406 310
       
  1407 00:15:34,970 --> 00:15:37,340
       
  1408 And they manage
       
  1409 that for six years.
       
  1410 
       
  1411 311
       
  1412 00:15:37,340 --> 00:15:39,320
       
  1413 But then a Rekha expression,
       
  1414 
       
  1415 312
       
  1416 00:15:39,320 --> 00:15:41,180
       
  1417 actually this one caused
       
  1418 a problem and you
       
  1419 
       
  1420 313
       
  1421 00:15:41,180 --> 00:15:43,265
       
  1422 can see they're also
       
  1423 like two stars.
       
  1424 
       
  1425 314
       
  1426 00:15:43,265 --> 00:15:44,630
       
  1427 They are at the end.
       
  1428 
       
  1429 315
       
  1430 00:15:44,630 --> 00:15:46,955
       
  1431 And because of that string needed
       
  1432 
       
  1433 316
       
  1434 00:15:46,955 --> 00:15:49,865
       
  1435 too much time to be matched.
       
  1436 
       
  1437 317
       
  1438 00:15:49,865 --> 00:15:50,990
       
  1439 And because of that,
       
  1440 
       
  1441 318
       
  1442 00:15:50,990 --> 00:15:52,430
       
  1443 they had some outage for,
       
  1444 
       
  1445 319
       
  1446 00:15:52,430 --> 00:15:54,125
       
  1447 I think several hours,
       
  1448 
       
  1449 320
       
  1450 00:15:54,125 --> 00:15:57,920
       
  1451 actually in their malware
       
  1452 detection subsystem.
       
  1453 
       
  1454 321
       
  1455 00:15:57,920 --> 00:16:02,060
       
  1456 And the second example
       
  1457 comes from 2016,
       
  1458 
       
  1459 322
       
  1460 00:16:02,060 --> 00:16:04,040
       
  1461 where Stack Exchange,
       
  1462 I guess you know
       
  1463 
       
  1464 323
       
  1465 00:16:04,040 --> 00:16:06,650
       
  1466 this webpage had
       
  1467 also an outage from,
       
  1468 
       
  1469 324
       
  1470 00:16:06,650 --> 00:16:08,390
       
  1471 I think at least an hour.
       
  1472 
       
  1473 325
       
  1474 00:16:08,390 --> 00:16:13,070
       
  1475 Because a regular expression
       
  1476 then needed to format posts,
       
  1477 
       
  1478 326
       
  1479 00:16:13,070 --> 00:16:15,575
       
  1480 needed too much time to
       
  1481 
       
  1482 327
       
  1483 00:16:15,575 --> 00:16:19,010
       
  1484 recognize whether this post
       
  1485 should be accepted or not.
       
  1486 
       
  1487 328
       
  1488 00:16:19,010 --> 00:16:23,390
       
  1489 And again, there was a
       
  1490 semi kind of problem.
       
  1491 
       
  1492 329
       
  1493 00:16:23,390 --> 00:16:24,950
       
  1494 And you can read
       
  1495 the stories behind
       
  1496 
       
  1497 330
       
  1498 00:16:24,950 --> 00:16:28,080
       
  1499 that on these two given links.
       
  1500 
       
  1501 331
       
  1502 00:16:28,720 --> 00:16:31,730
       
  1503 When I looked at
       
  1504 this the first time,
       
  1505 
       
  1506 332
       
  1507 00:16:31,730 --> 00:16:34,175
       
  1508 what surprised me is
       
  1509 that theoretician
       
  1510 
       
  1511 333
       
  1512 00:16:34,175 --> 00:16:37,520
       
  1513 who sometimes dedicate their
       
  1514 life to regular expression.
       
  1515 
       
  1516 334
       
  1517 00:16:37,520 --> 00:16:39,440
       
  1518 And no really a lot about
       
  1519 
       
  1520 335
       
  1521 00:16:39,440 --> 00:16:41,690
       
  1522 them didn't know
       
  1523 anything about this.
       
  1524 
       
  1525 336
       
  1526 00:16:41,690 --> 00:16:43,610
       
  1527 But engineers, they
       
  1528 already created
       
  1529 
       
  1530 337
       
  1531 00:16:43,610 --> 00:16:46,160
       
  1532 a name for that
       
  1533 regular expression,
       
  1534 
       
  1535 338
       
  1536 00:16:46,160 --> 00:16:47,975
       
  1537 denial of service attack.
       
  1538 
       
  1539 339
       
  1540 00:16:47,975 --> 00:16:49,745
       
  1541 Because what you can,
       
  1542 
       
  1543 340
       
  1544 00:16:49,745 --> 00:16:51,230
       
  1545 what can happen now is that
       
  1546 
       
  1547 341
       
  1548 00:16:51,230 --> 00:16:54,920
       
  1549 attackers look for
       
  1550 certain strings.
       
  1551 
       
  1552 342
       
  1553 00:16:54,920 --> 00:16:56,780
       
  1554 You make your regular expression
       
  1555 
       
  1556 343
       
  1557 00:16:56,780 --> 00:16:59,105
       
  1558 matching engine topple over.
       
  1559 
       
  1560 344
       
  1561 00:16:59,105 --> 00:17:01,370
       
  1562 And these kind of expressions,
       
  1563 
       
  1564 345
       
  1565 00:17:01,370 --> 00:17:04,160
       
  1566 regular expressions called
       
  1567 Eve of reg expression.
       
  1568 
       
  1569 346
       
  1570 00:17:04,160 --> 00:17:06,350
       
  1571 And actually there are
       
  1572 quite a number of them.
       
  1573 
       
  1574 347
       
  1575 00:17:06,350 --> 00:17:08,495
       
  1576 So you seen this one,
       
  1577 
       
  1578 348
       
  1579 00:17:08,495 --> 00:17:11,255
       
  1580 the first one, and the
       
  1581 second one already.
       
  1582 
       
  1583 349
       
  1584 00:17:11,255 --> 00:17:13,400
       
  1585 But there are many, many more.
       
  1586 
       
  1587 350
       
  1588 00:17:13,400 --> 00:17:15,620
       
  1589 And you can easily have in
       
  1590 
       
  1591 351
       
  1592 00:17:15,620 --> 00:17:18,560
       
  1593 your program one of
       
  1594 these reg expression.
       
  1595 
       
  1596 352
       
  1597 00:17:18,560 --> 00:17:21,830
       
  1598 And then you have the
       
  1599 problem that if you do have
       
  1600 
       
  1601 353
       
  1602 00:17:21,830 --> 00:17:23,240
       
  1603 this regular expression and
       
  1604 
       
  1605 354
       
  1606 00:17:23,240 --> 00:17:25,640
       
  1607 somebody finds the
       
  1608 corresponding string,
       
  1609 
       
  1610 355
       
  1611 00:17:25,640 --> 00:17:29,945
       
  1612 which make the records
       
  1613 matching engine topple over,
       
  1614 
       
  1615 356
       
  1616 00:17:29,945 --> 00:17:31,820
       
  1617 then you have a problem
       
  1618 
       
  1619 357
       
  1620 00:17:31,820 --> 00:17:34,295
       
  1621 because your webpage is
       
  1622 probably not variable.
       
  1623 
       
  1624 358
       
  1625 00:17:34,295 --> 00:17:36,140
       
  1626 This is also sometimes called
       
  1627 
       
  1628 359
       
  1629 00:17:36,140 --> 00:17:39,350
       
  1630 this phenomenon,
       
  1631 catastrophic backtracking.
       
  1632 
       
  1633 360
       
  1634 00:17:39,350 --> 00:17:43,595
       
  1635 In lecture three, we will
       
  1636 look at this more carefully.
       
  1637 
       
  1638 361
       
  1639 00:17:43,595 --> 00:17:46,910
       
  1640 And actually why that
       
  1641 is such a problem in
       
  1642 
       
  1643 362
       
  1644 00:17:46,910 --> 00:17:50,795
       
  1645 real life is actually
       
  1646 not to do with Lexus.
       
  1647 
       
  1648 363
       
  1649 00:17:50,795 --> 00:17:53,180
       
  1650 Yes, regular
       
  1651 expressions are used as
       
  1652 
       
  1653 364
       
  1654 00:17:53,180 --> 00:17:55,040
       
  1655 the basic tool for implementing
       
  1656 
       
  1657 365
       
  1658 00:17:55,040 --> 00:17:57,185
       
  1659 like source bad reg expressions,
       
  1660 
       
  1661 366
       
  1662 00:17:57,185 --> 00:18:00,065
       
  1663 of course, used in
       
  1664 a much wider area.
       
  1665 
       
  1666 367
       
  1667 00:18:00,065 --> 00:18:03,770
       
  1668 And they especially used for
       
  1669 network intrusion detection.
       
  1670 
       
  1671 368
       
  1672 00:18:03,770 --> 00:18:06,590
       
  1673 Remember, you having to
       
  1674 
       
  1675 369
       
  1676 00:18:06,590 --> 00:18:10,130
       
  1677 administer a big network
       
  1678 and you only want to let
       
  1679 
       
  1680 370
       
  1681 00:18:10,130 --> 00:18:13,640
       
  1682 in packets which you think are K
       
  1683 
       
  1684 371
       
  1685 00:18:13,640 --> 00:18:14,930
       
  1686 and you want to keep out
       
  1687 
       
  1688 372
       
  1689 00:18:14,930 --> 00:18:17,645
       
  1690 any package which might
       
  1691 hack into your network.
       
  1692 
       
  1693 373
       
  1694 00:18:17,645 --> 00:18:22,670
       
  1695 So what they have is they
       
  1696 have suites of thousands and
       
  1697 
       
  1698 374
       
  1699 00:18:22,670 --> 00:18:25,745
       
  1700 sometimes even more
       
  1701 regular expressions which
       
  1702 
       
  1703 375
       
  1704 00:18:25,745 --> 00:18:27,755
       
  1705 check whether this package
       
  1706 
       
  1707 376
       
  1708 00:18:27,755 --> 00:18:30,065
       
  1709 satisfies some patterns or not.
       
  1710 
       
  1711 377
       
  1712 00:18:30,065 --> 00:18:31,460
       
  1713 And in this case it will be left
       
  1714 
       
  1715 378
       
  1716 00:18:31,460 --> 00:18:34,205
       
  1717 out or it will be let in.
       
  1718 
       
  1719 379
       
  1720 00:18:34,205 --> 00:18:36,335
       
  1721 And with networks,
       
  1722 
       
  1723 380
       
  1724 00:18:36,335 --> 00:18:39,080
       
  1725 the problem is that our
       
  1726 hardware is already
       
  1727 
       
  1728 381
       
  1729 00:18:39,080 --> 00:18:43,190
       
  1730 so fast that the reg expressions
       
  1731 
       
  1732 382
       
  1733 00:18:43,190 --> 00:18:45,169
       
  1734 really become a bottleneck.
       
  1735 
       
  1736 383
       
  1737 00:18:45,169 --> 00:18:47,060
       
  1738 Because what do you do if now is
       
  1739 
       
  1740 384
       
  1741 00:18:47,060 --> 00:18:49,880
       
  1742 suddenly a reg expression
       
  1743 takes too much time
       
  1744 
       
  1745 385
       
  1746 00:18:49,880 --> 00:18:52,670
       
  1747 to just stop the matching
       
  1748 
       
  1749 386
       
  1750 00:18:52,670 --> 00:18:55,100
       
  1751 and let the package
       
  1752 in regardless?
       
  1753 
       
  1754 387
       
  1755 00:18:55,100 --> 00:18:58,190
       
  1756 Or do you just hold
       
  1757 the network up
       
  1758 
       
  1759 388
       
  1760 00:18:58,190 --> 00:19:01,715
       
  1761 and don't let anything in
       
  1762 until you decided that.
       
  1763 
       
  1764 389
       
  1765 00:19:01,715 --> 00:19:04,895
       
  1766 So that's actually a
       
  1767 really hard problem.
       
  1768 
       
  1769 390
       
  1770 00:19:04,895 --> 00:19:06,650
       
  1771 But the first time I came across
       
  1772 
       
  1773 391
       
  1774 00:19:06,650 --> 00:19:09,965
       
  1775 that problem was actually
       
  1776 by this engineer.
       
  1777 
       
  1778 392
       
  1779 00:19:09,965 --> 00:19:13,820
       
  1780 And it's always say that
       
  1781 Germans don't have any Yammer.
       
  1782 
       
  1783 393
       
  1784 00:19:13,820 --> 00:19:16,985
       
  1785 But I found that
       
  1786 video quite funny.
       
  1787 
       
  1788 394
       
  1789 00:19:16,985 --> 00:19:19,145
       
  1790 Maybe you have a
       
  1791 different opinion,
       
  1792 
       
  1793 395
       
  1794 00:19:19,145 --> 00:19:21,095
       
  1795 but feel free to
       
  1796 have a look which
       
  1797 
       
  1798 396
       
  1799 00:19:21,095 --> 00:19:23,705
       
  1800 explains exactly that problem.
       
  1801 
       
  1802 397
       
  1803 00:19:23,705 --> 00:19:25,610
       
  1804 So in the next video,
       
  1805 
       
  1806 398
       
  1807 00:19:25,610 --> 00:19:28,445
       
  1808 we will start to
       
  1809 implement this matcher.
       
  1810 
       
  1811 399
       
  1812 00:19:28,445 --> 00:19:30,870
       
  1813 So I hope to see you there.