videos/02-prelims.srt
changeset 766 e8402d8ec8e6
equal deleted inserted replaced
765:b294cfbb5c01 766:e8402d8ec8e6
       
     1 1
       
     2 00:00:09,990 --> 00:00:13,465
       
     3 Welcome back to this
       
     4 week's lecture.
       
     5 
       
     6 2
       
     7 00:00:13,465 --> 00:00:16,450
       
     8 The task this week is to actually
       
     9 
       
    10 3
       
    11 00:00:16,450 --> 00:00:20,320
       
    12 implement a regular
       
    13 expression matcher.
       
    14 
       
    15 4
       
    16 00:00:20,320 --> 00:00:22,510
       
    17 And we want to be a bit
       
    18 
       
    19 5
       
    20 00:00:22,510 --> 00:00:25,315
       
    21 faster than the regular
       
    22 expression matcher
       
    23 
       
    24 6
       
    25 00:00:25,315 --> 00:00:29,380
       
    26 in Python, Ruby,
       
    27 Javascript, and Java.
       
    28 
       
    29 7
       
    30 00:00:29,380 --> 00:00:31,960
       
    31 Remember this regular expression
       
    32 
       
    33 8
       
    34 00:00:31,960 --> 00:00:35,785
       
    35 and strings which are
       
    36 just a number of a's,
       
    37 
       
    38 9
       
    39 00:00:35,785 --> 00:00:39,670
       
    40 these regular expression matchers
       
    41 where abysmally slow.
       
    42 
       
    43 10
       
    44 00:00:39,670 --> 00:00:43,170
       
    45 They could only match
       
    46 approximately 30 a's in
       
    47 
       
    48 11
       
    49 00:00:43,170 --> 00:00:45,665
       
    50 something like half a minute.
       
    51 
       
    52 12
       
    53 00:00:45,665 --> 00:00:49,460
       
    54 What we like to do with
       
    55 our regular expression matcher.
       
    56 
       
    57 13
       
    58 00:00:49,460 --> 00:00:51,890
       
    59 We will try a few times,
       
    60 
       
    61 14
       
    62 00:00:51,890 --> 00:00:55,505
       
    63 but our second trial will already
       
    64 be much, much better.
       
    65 
       
    66 15
       
    67 00:00:55,505 --> 00:00:58,400
       
    68 It will probably
       
    69 match around maybe
       
    70 
       
    71 16
       
    72 00:00:58,400 --> 00:01:02,030
       
    73 thousand a's in something
       
    74 in half a minute.
       
    75 
       
    76 17
       
    77 00:01:02,030 --> 00:01:03,920
       
    78 But our third version in
       
    79 
       
    80 18
       
    81 00:01:03,920 --> 00:01:06,230
       
    82 comparison will be
       
    83 blindingly fast.
       
    84 
       
    85 19
       
    86 00:01:06,230 --> 00:01:08,180
       
    87 And we'll be able to match
       
    88 
       
    89 20
       
    90 00:01:08,180 --> 00:01:10,145
       
    91 strings of length 10,000 a's
       
    92 
       
    93 21
       
    94 00:01:10,145 --> 00:01:12,230
       
    95 and will not require
       
    96 
       
    97 22
       
    98 00:01:12,230 --> 00:01:14,975
       
    99 more than five seconds.
       
   100 So let's go ahead with that.
       
   101 
       
   102 23
       
   103 00:01:14,975 --> 00:01:18,095
       
   104 We will first implement
       
   105 
       
   106 24
       
   107 00:01:18,095 --> 00:01:19,880
       
   108 our regular expression
       
   109 matcher only
       
   110 
       
   111 25
       
   112 00:01:19,880 --> 00:01:22,069
       
   113 for the basic
       
   114 regular expressions.
       
   115 
       
   116 26
       
   117 00:01:22,069 --> 00:01:25,430
       
   118 So we will have only six
       
   119 cases to think about.
       
   120 
       
   121 27
       
   122 00:01:25,430 --> 00:01:27,620
       
   123 This will simplify matters at first.
       
   124 
       
   125 28
       
   126 00:01:27,620 --> 00:01:30,350
       
   127 Later we can
       
   128 think about how to
       
   129 
       
   130 29
       
   131 00:01:30,350 --> 00:01:34,100
       
   132 extend that to the extended
       
   133 regular expressions.
       
   134 
       
   135 30
       
   136 00:01:34,100 --> 00:01:37,625
       
   137 Unfortunately, before we can
       
   138 start our implementation,
       
   139 
       
   140 31
       
   141 00:01:37,625 --> 00:01:39,290
       
   142 we have to discuss
       
   143 
       
   144 32
       
   145 00:01:39,290 --> 00:01:42,470
       
   146 when two regular
       
   147 expressions are equivalent.
       
   148 
       
   149 33
       
   150 00:01:42,470 --> 00:01:46,595
       
   151 And we use here this notation
       
   152 of the triple equals.
       
   153 
       
   154 34
       
   155 00:01:46,595 --> 00:01:48,215
       
   156 We say a regular expression
       
   157 
       
   158 35
       
   159 00:01:48,215 --> 00:01:50,180
       
   160 r1 and r2 are
       
   161 
       
   162 36
       
   163 00:01:50,180 --> 00:01:56,059
       
   164 equivalent if and only
       
   165 if the language of r1 is
       
   166 
       
   167 37
       
   168 00:01:56,059 --> 00:01:59,360
       
   169 equal to the language of r2.
       
   170 
       
   171 38
       
   172 00:01:59,360 --> 00:02:02,915
       
   173 Sounds complicated,
       
   174 but essentially means
       
   175 
       
   176 39
       
   177 00:02:02,915 --> 00:02:04,700
       
   178 if the two regular expressions can
       
   179 
       
   180 40
       
   181 00:02:04,700 --> 00:02:06,950
       
   182 match exactly the same strings,
       
   183 
       
   184 41
       
   185 00:02:06,950 --> 00:02:09,800
       
   186 then we want to regard
       
   187 them as equivalent.
       
   188 
       
   189 42
       
   190 00:02:09,800 --> 00:02:14,300
       
   191 This equivalence justifies
       
   192 why we can be sloppy
       
   193 
       
   194 43
       
   195 00:02:14,300 --> 00:02:16,040
       
   196 with parentheses.
       
   197 
       
   198 44
       
   199 00:02:16,040 --> 00:02:23,870
       
   200 For example, if we have
       
   201 (r1 + r2) + r3,
       
   202 
       
   203 45
       
   204 00:02:23,870 --> 00:02:32,270
       
   205 then this will be equivalent
       
   206 to r1 + (r2 + r3).
       
   207 
       
   208 46
       
   209 00:02:32,270 --> 00:02:34,910
       
   210 Remember, regular
       
   211 expressions are trees,
       
   212 
       
   213 47
       
   214 00:02:34,910 --> 00:02:37,940
       
   215 so these are two different
       
   216 regular expressions,
       
   217 
       
   218 48
       
   219 00:02:37,940 --> 00:02:41,540
       
   220 but they can match
       
   221 the same strings.
       
   222 
       
   223 49
       
   224 00:02:41,540 --> 00:02:45,695
       
   225 Because, well, if we take
       
   226 here the meaning of that,
       
   227 
       
   228 50
       
   229 00:02:45,695 --> 00:02:54,350
       
   230 that would be just L(r1 + r2 + r3),
       
   231 
       
   232 
       
   233 51
       
   234 00:02:54,350 --> 00:03:04,100
       
   235 which is equal to L(r1 + r2) u L(r3).
       
   236 
       
   237 
       
   238 52
       
   239 00:03:04,100 --> 00:03:10,595
       
   240 And that is equal to of 
       
   241 
       
   242 53
       
   243 00:03:10,595 --> 00:03:15,710
       
   244 L(r1) u L(r2) u L(r3).
       
   245 
       
   246 
       
   247 54
       
   248 00:03:15,710 --> 00:03:17,930
       
   249 And if you do the
       
   250 same calculation
       
   251 
       
   252 55
       
   253 00:03:17,930 --> 00:03:19,445
       
   254 for that regular expression,
       
   255 
       
   256 56
       
   257 00:03:19,445 --> 00:03:22,940
       
   258 you will find out the
       
   259 two sets are the same.
       
   260 
       
   261 57
       
   262 00:03:22,940 --> 00:03:26,195
       
   263 So both regular expressions
       
   264 can match the same strings.
       
   265 
       
   266 58
       
   267 00:03:26,195 --> 00:03:28,805
       
   268 So even if they're different
       
   269 regular expressions,
       
   270 
       
   271 59
       
   272 00:03:28,805 --> 00:03:31,220
       
   273 we can regard them as eqivalent.
       
   274 
       
   275 60
       
   276 00:03:31,220 --> 00:03:33,290
       
   277 And so that's why we can forget
       
   278 
       
   279 61
       
   280 00:03:33,290 --> 00:03:35,270
       
   281 about writing these parentheses.
       
   282 
       
   283 62
       
   284 00:03:35,270 --> 00:03:40,205
       
   285 Let's have a look
       
   286 at some more examples.
       
   287 
       
   288 63
       
   289 00:03:40,205 --> 00:03:41,870
       
   290 So the first one here,
       
   291 
       
   292 64
       
   293 00:03:41,870 --> 00:03:43,205
       
   294 that is now clear.
       
   295 
       
   296 65
       
   297 00:03:43,205 --> 00:03:45,200
       
   298 We did this calculation already
       
   299 
       
   300 66
       
   301 00:03:45,200 --> 00:03:47,480
       
   302 for arbitrary regular expressions.
       
   303 
       
   304 67
       
   305 00:03:47,480 --> 00:03:49,520
       
   306 Here it is for
       
   307 concrete ones a, b and c.
       
   308 
       
   309 68
       
   310 00:03:49,520 --> 00:03:52,690
       
   311 Here on the left-hand side,
       
   312 
       
   313 69
       
   314 00:03:52,690 --> 00:03:54,895
       
   315 the regular expression
       
   316 has the same meaning
       
   317 
       
   318 70
       
   319 00:03:54,895 --> 00:03:56,620
       
   320 as on the right-hand side.
       
   321 
       
   322 71
       
   323 00:03:56,620 --> 00:03:58,420
       
   324 So they are equivalent.
       
   325 
       
   326 72
       
   327 00:03:58,420 --> 00:04:01,375
       
   328 We can ignore the parentheses.
       
   329 
       
   330 73
       
   331 00:04:01,375 --> 00:04:03,220
       
   332 If I have a choice,
       
   333 
       
   334 74
       
   335 00:04:03,220 --> 00:04:06,610
       
   336 but the choice is
       
   337 the same a or a,
       
   338 
       
   339 75
       
   340 00:04:06,610 --> 00:04:09,265
       
   341 then this is
       
   342 equivalent to just a.
       
   343 
       
   344 76
       
   345 00:04:09,265 --> 00:04:13,075
       
   346 So the same choice doesn't
       
   347 introduce anything more.
       
   348 
       
   349 77
       
   350 00:04:13,075 --> 00:04:15,370
       
   351 In the next case, if I have
       
   352 
       
   353 78
       
   354 00:04:15,370 --> 00:04:19,450
       
   355 a regular expression
       
   356 which can match a or b,
       
   357 
       
   358 79
       
   359 00:04:19,450 --> 00:04:23,860
       
   360 that can match the same
       
   361 strings as b or a.
       
   362 
       
   363 80
       
   364 00:04:23,860 --> 00:04:27,325
       
   365 So you have a kind of
       
   366 commutativity for the plus,
       
   367 
       
   368 81
       
   369 00:04:27,325 --> 00:04:29,485
       
   370 which is like on natural numbers.
       
   371 
       
   372 82
       
   373 00:04:29,485 --> 00:04:32,880
       
   374 But you would not have a
       
   375 communitivity for the sequence:
       
   376 
       
   377 83
       
   378 00:04:32,880 --> 00:04:37,070
       
   379 if you have
       
   380 first a and then b,
       
   381 
       
   382 84
       
   383 00:04:37,070 --> 00:04:40,850
       
   384 that's not the same as
       
   385 matching first b and then a.
       
   386 
       
   387 85
       
   388 00:04:40,850 --> 00:04:42,800
       
   389 So for the sequence ...
       
   390 
       
   391 86
       
   392 00:04:42,800 --> 00:04:44,615
       
   393 they are not equivalent.
       
   394 
       
   395 87
       
   396 00:04:44,615 --> 00:04:49,475
       
   397 However, for the sequence I
       
   398 can, like for the plus, ignore
       
   399 
       
   400 88
       
   401 00:04:49,475 --> 00:04:51,245
       
   402 prarentheses.
       
   403 
       
   404 89
       
   405 00:04:51,245 --> 00:04:55,070
       
   406 If I have the parentheses
       
   407 and this way,
       
   408 
       
   409 90
       
   410 00:04:55,070 --> 00:04:57,785
       
   411 or I have it in this way.
       
   412 
       
   413 91
       
   414 00:04:57,785 --> 00:05:00,605
       
   415 These are two different
       
   416 regular expressions,
       
   417 
       
   418 92
       
   419 00:05:00,605 --> 00:05:02,120
       
   420 but they have the same meaning.
       
   421 
       
   422 93
       
   423 00:05:02,120 --> 00:05:03,815
       
   424 So they are equivalent.
       
   425 
       
   426 94
       
   427 00:05:03,815 --> 00:05:05,780
       
   428 And so I can leave out parentheses
       
   429 
       
   430 95
       
   431 00:05:05,780 --> 00:05:09,170
       
   432 for times as well.
       
   433 
       
   434 96
       
   435 00:05:09,170 --> 00:05:12,185
       
   436 The next one is a slightly
       
   437 more interesting one.
       
   438 
       
   439 97
       
   440 00:05:12,185 --> 00:05:15,020
       
   441 If I have a regular
       
   442 expression which can match
       
   443 
       
   444 98
       
   445 00:05:15,020 --> 00:05:19,655
       
   446 c first followed by (a + b),
       
   447 
       
   448 99
       
   449 00:05:19,655 --> 00:05:25,355
       
   450 then this is the same as
       
   451 first c followed by a,
       
   452 
       
   453 100
       
   454 00:05:25,355 --> 00:05:28,640
       
   455 or c followed by b.
       
   456 
       
   457 101
       
   458 00:05:28,640 --> 00:05:31,265
       
   459 So that's the kind of
       
   460 distributivity law
       
   461 
       
   462 102
       
   463 00:05:31,265 --> 00:05:33,545
       
   464 on regular expressions.
       
   465 
       
   466 103
       
   467 00:05:33,545 --> 00:05:36,350
       
   468 But they're also regular expressions
       
   469 which are not equivalent.
       
   470 
       
   471 104
       
   472 00:05:36,350 --> 00:05:38,990
       
   473 If I have the regular
       
   474 expression which can
       
   475 
       
   476 105
       
   477 00:05:38,990 --> 00:05:42,800
       
   478 match the string containing
       
   479 exactly two a's.
       
   480 
       
   481 106
       
   482 00:05:42,800 --> 00:05:44,240
       
   483 That is not equivalent
       
   484 
       
   485 107
       
   486 00:05:44,240 --> 00:05:46,730
       
   487 to the regular
       
   488 expression which can just match
       
   489 
       
   490 108
       
   491 00:05:46,730 --> 00:05:49,250
       
   492 a single a. And similarly
       
   493 
       
   494 109
       
   495 00:05:49,250 --> 00:05:51,680
       
   496 in this case, if I can match
       
   497 
       
   498 110
       
   499 00:05:51,680 --> 00:05:56,075
       
   500 a or the string b followed by c,
       
   501 
       
   502 111
       
   503 00:05:56,075 --> 00:05:59,825
       
   504 that is not the same as a or b,
       
   505 
       
   506 112
       
   507 00:05:59,825 --> 00:06:02,000
       
   508 followed by a or c.
       
   509 
       
   510 113
       
   511 00:06:02,000 --> 00:06:05,990
       
   512 I'll let you think about
       
   513 why is that not equivalent.
       
   514 
       
   515 114
       
   516 00:06:05,990 --> 00:06:08,060
       
   517 Essentially you
       
   518 have to find out what's
       
   519 
       
   520 115
       
   521 00:06:08,060 --> 00:06:10,475
       
   522 the meaning of both
       
   523 regular expressions.
       
   524 
       
   525 116
       
   526 00:06:10,475 --> 00:06:14,090
       
   527 And are they the
       
   528 same sets or not?
       
   529 
       
   530 117
       
   531 00:06:14,090 --> 00:06:17,435
       
   532 There are again some
       
   533 interesting corner cases.
       
   534 
       
   535 118
       
   536 00:06:17,435 --> 00:06:20,540
       
   537 If I have a regular expression
       
   538 that can match a,
       
   539 
       
   540 119
       
   541 00:06:20,540 --> 00:06:23,450
       
   542 but followed by the regular
       
   543 expression which cannot match
       
   544 
       
   545 120
       
   546 00:06:23,450 --> 00:06:25,670
       
   547 anything, that is not equivalent
       
   548 
       
   549 121
       
   550 00:06:25,670 --> 00:06:29,255
       
   551 to the regular
       
   552 expression which can match a.
       
   553 
       
   554 122
       
   555 00:06:29,255 --> 00:06:31,340
       
   556 Again here you have
       
   557 to do the calculation
       
   558 
       
   559 123
       
   560 00:06:31,340 --> 00:06:32,915
       
   561 to convince you of that.
       
   562 
       
   563 124
       
   564 00:06:32,915 --> 00:06:35,840
       
   565 Similarly, if I have a regular
       
   566 expression which can
       
   567 
       
   568 125
       
   569 00:06:35,840 --> 00:06:38,750
       
   570 match a or the empty string,
       
   571 
       
   572 126
       
   573 00:06:38,750 --> 00:06:40,640
       
   574 this is not equivalent to
       
   575 
       
   576 127
       
   577 00:06:40,640 --> 00:06:43,355
       
   578 the regular expression
       
   579 which can just match a.
       
   580 
       
   581 128
       
   582 00:06:43,355 --> 00:06:46,925
       
   583 Here are some interesting
       
   584 ones with zeros and ones.
       
   585 
       
   586 129
       
   587 00:06:46,925 --> 00:06:48,860
       
   588 So if I have the regular expression
       
   589 
       
   590 130
       
   591 00:06:48,860 --> 00:06:50,975
       
   592 that can match the empty string,
       
   593 
       
   594 131
       
   595 00:06:50,975 --> 00:06:53,540
       
   596 this will be actually equivalent to
       
   597 
       
   598 132
       
   599 00:06:53,540 --> 00:06:56,435
       
   600 the regular expression
       
   601 which can match nothing,
       
   602 
       
   603 133
       
   604 00:06:56,435 --> 00:06:59,405
       
   605 but star of that.
       
   606 
       
   607 134
       
   608 00:06:59,405 --> 00:07:01,970
       
   609 Remember the star
       
   610 always introduces,
       
   611 
       
   612 135
       
   613 00:07:01,970 --> 00:07:04,370
       
   614 no matter what, the empty string.
       
   615 
       
   616 136
       
   617 00:07:04,370 --> 00:07:06,170
       
   618 So this regular expression can match
       
   619 
       
   620 137
       
   621 00:07:06,170 --> 00:07:08,930
       
   622 the empty string,
       
   623 just like the 1.
       
   624 
       
   625 138
       
   626 00:07:08,930 --> 00:07:12,125
       
   627 So these two expressions
       
   628 are not the same,
       
   629 
       
   630 139
       
   631 00:07:12,125 --> 00:07:14,210
       
   632 but they are equivalent.
       
   633 
       
   634 140
       
   635 00:07:14,210 --> 00:07:16,700
       
   636 And it doesn't matter
       
   637 whether you take
       
   638 
       
   639 141
       
   640 00:07:16,700 --> 00:07:20,090
       
   641 the empty string to  the star,
       
   642 
       
   643 142
       
   644 00:07:20,090 --> 00:07:23,855
       
   645 because if I can match 0 or
       
   646 more times the empty string,
       
   647 
       
   648 143
       
   649 00:07:23,855 --> 00:07:27,450
       
   650 that will be equivalent to 
       
   651 just the empty string.
       
   652 
       
   653 144
       
   654 00:07:27,520 --> 00:07:32,510
       
   655 Slightly similar to the
       
   656 third case is this one.
       
   657 
       
   658 145
       
   659 00:07:32,510 --> 00:07:35,720
       
   660 Zero star is not equivalent to 0
       
   661 
       
   662 146
       
   663 00:07:35,720 --> 00:07:40,025
       
   664 because that can match
       
   665 exactly the empty string.
       
   666 
       
   667 147
       
   668 00:07:40,025 --> 00:07:42,740
       
   669 This cannot match anything.
       
   670 
       
   671 148
       
   672 00:07:42,740 --> 00:07:44,839
       
   673 While the previous
       
   674 
       
   675 149
       
   676 00:07:44,839 --> 00:07:47,540
       
   677 equivalences are all very
       
   678 instructive to look at,
       
   679 
       
   680 150
       
   681 00:07:47,540 --> 00:07:49,910
       
   682 these are the
       
   683 equivalences we need
       
   684 
       
   685 151
       
   686 00:07:49,910 --> 00:07:52,685
       
   687 in our regular expression matchers.
       
   688 
       
   689 152
       
   690 00:07:52,685 --> 00:07:54,350
       
   691 And they are defined for
       
   692 
       
   693 153
       
   694 00:07:54,350 --> 00:07:58,160
       
   695 all regular expressions r.
       
   696 So r plus 0,
       
   697 
       
   698 154
       
   699 00:07:58,160 --> 00:08:00,470
       
   700 no matter what r looks
       
   701 like is equivalent
       
   702 
       
   703 155
       
   704 00:08:00,470 --> 00:08:02,945
       
   705 to just r. Similarly
       
   706 
       
   707 156
       
   708 00:08:02,945 --> 00:08:05,930
       
   709 0 plus r is also
       
   710 equivalent to r.
       
   711 
       
   712 157
       
   713 00:08:05,930 --> 00:08:08,525
       
   714 The order of this
       
   715 choice doesn't matter;
       
   716 
       
   717 158
       
   718 00:08:08,525 --> 00:08:11,495
       
   719 r followed by 1,
       
   720 
       
   721 159
       
   722 00:08:11,495 --> 00:08:14,030
       
   723 that's the sequence
       
   724 regular expression, is
       
   725 
       
   726 160
       
   727 00:08:14,030 --> 00:08:16,760
       
   728 equivalent to r. The
       
   729 sam, r followed by
       
   730 
       
   731 161
       
   732 00:08:16,760 --> 00:08:19,010
       
   733 r is equivalent to r.
       
   734 
       
   735 162
       
   736 00:08:19,010 --> 00:08:20,990
       
   737 And r followed by
       
   738 
       
   739 163
       
   740 00:08:20,990 --> 00:08:23,390
       
   741 the regular expression which
       
   742 cannot match anything,
       
   743 
       
   744 164
       
   745 00:08:23,390 --> 00:08:27,455
       
   746 is equivalent to just 0.
       
   747 
       
   748 165
       
   749 00:08:27,455 --> 00:08:30,740
       
   750 And 0 followed by r is also equivalent to 0.
       
   751 
       
   752 166
       
   753 00:08:30,740 --> 00:08:33,615
       
   754 And if you have the
       
   755 choice of r plus r,
       
   756 
       
   757 167
       
   758 00:08:33,615 --> 00:08:37,210
       
   759 that is also
       
   760 equivalent to just r.
       
   761 
       
   762 168
       
   763 00:08:37,210 --> 00:08:40,270
       
   764 What we're going to do
       
   765 with these equivalences is
       
   766 
       
   767 169
       
   768 00:08:40,270 --> 00:08:42,820
       
   769 that we regard them as
       
   770 simplification rules.
       
   771 
       
   772 170
       
   773 00:08:42,820 --> 00:08:43,930
       
   774 So whenever we see
       
   775 
       
   776 171
       
   777 00:08:43,930 --> 00:08:46,465
       
   778 this regular expression
       
   779 in our algorithm,
       
   780 
       
   781 172
       
   782 00:08:46,465 --> 00:08:48,580
       
   783 we will replace it by
       
   784 the right-hand side.
       
   785 
       
   786 173
       
   787 00:08:48,580 --> 00:08:51,715
       
   788 So we will orient
       
   789 these equivalences.
       
   790 
       
   791 174
       
   792 00:08:51,715 --> 00:08:53,650
       
   793 If we see, this regular expression,
       
   794 
       
   795 175
       
   796 00:08:53,650 --> 00:08:55,225
       
   797 we will replace it by that one.
       
   798 
       
   799 176
       
   800 00:08:55,225 --> 00:08:58,945
       
   801 And it will not matter
       
   802 because the left-hand sides
       
   803 
       
   804 177
       
   805 00:08:58,945 --> 00:09:01,255
       
   806 can match exactly
       
   807 the same strings
       
   808 
       
   809 178
       
   810 00:09:01,255 --> 00:09:03,475
       
   811 as the right-hand sides.
       
   812 
       
   813 179
       
   814 00:09:03,475 --> 00:09:06,370
       
   815 Here two quick examples.
       
   816 
       
   817 180
       
   818 00:09:06,370 --> 00:09:08,680
       
   819 The first one, let's
       
   820 assume you have
       
   821 
       
   822 181
       
   823 00:09:08,680 --> 00:09:10,270
       
   824 a regular expression r and
       
   825 
       
   826 182
       
   827 00:09:10,270 --> 00:09:11,905
       
   828 there is something
       
   829 in front of it.
       
   830 
       
   831 183
       
   832 00:09:11,905 --> 00:09:13,720
       
   833 This "something in front of it"
       
   834 
       
   835 184
       
   836 00:09:13,720 --> 00:09:15,870
       
   837 can be simplified as follows.
       
   838 
       
   839 185
       
   840 00:09:15,870 --> 00:09:20,000
       
   841 This 1 times b
       
   842 can be simplified to b.
       
   843 
       
   844 186
       
   845 00:09:20,000 --> 00:09:23,555
       
   846 Then this b plus 0 can
       
   847 be simplified to b.
       
   848 
       
   849 187
       
   850 00:09:23,555 --> 00:09:25,490
       
   851 And assuming that nothing can
       
   852 
       
   853 188
       
   854 00:09:25,490 --> 00:09:27,470
       
   855 be simplified inside this r,
       
   856 
       
   857 189
       
   858 00:09:27,470 --> 00:09:28,700
       
   859 then this would be
       
   860 
       
   861 190
       
   862 00:09:28,700 --> 00:09:33,050
       
   863 the simplified version
       
   864 of this regular expression.
       
   865 
       
   866 191
       
   867 00:09:33,050 --> 00:09:34,820
       
   868 And because the simplification
       
   869 
       
   870 192
       
   871 00:09:34,820 --> 00:09:36,965
       
   872 rules preserve what can be matched,
       
   873 
       
   874 193
       
   875 00:09:36,965 --> 00:09:39,080
       
   876 we can be sure that
       
   877 this regular expression
       
   878 
       
   879 194
       
   880 00:09:39,080 --> 00:09:41,255
       
   881 can match exactly the strings
       
   882 
       
   883 195
       
   884 00:09:41,255 --> 00:09:43,940
       
   885 this regular expression can match.
       
   886 
       
   887 196
       
   888 00:09:43,940 --> 00:09:45,740
       
   889 Here is the other example.
       
   890 
       
   891 197
       
   892 00:09:45,740 --> 00:09:49,550
       
   893 This has just a tiny change
       
   894 that this becomes here as 0.
       
   895 
       
   896 198
       
   897 00:09:49,550 --> 00:09:54,485
       
   898 Then 0 times b can be
       
   899 replaced by just 0.
       
   900 
       
   901 199
       
   902 00:09:54,485 --> 00:09:56,705
       
   903 Then they are actually two
       
   904 rules which could apply
       
   905 
       
   906 200
       
   907 00:09:56,705 --> 00:09:59,014
       
   908 either that we have
       
   909 the same choice
       
   910 
       
   911 201
       
   912 00:09:59,014 --> 00:10:01,160
       
   913 or we can just say something plus
       
   914 
       
   915 202
       
   916 00:10:01,160 --> 00:10:04,100
       
   917 0 will always go to something.
       
   918 
       
   919 203
       
   920 00:10:04,100 --> 00:10:06,245
       
   921 So we can simplify it to that.
       
   922 
       
   923 204
       
   924 00:10:06,245 --> 00:10:08,510
       
   925 And then we have
       
   926 0 times r again,
       
   927 
       
   928 205
       
   929 00:10:08,510 --> 00:10:10,460
       
   930 and that can be simplified to 0.
       
   931 
       
   932 206
       
   933 00:10:10,460 --> 00:10:12,080
       
   934 So actually what we find out with
       
   935 
       
   936 207
       
   937 00:10:12,080 --> 00:10:14,645
       
   938 this calculation is that
       
   939 this regular expression,
       
   940 
       
   941 208
       
   942 00:10:14,645 --> 00:10:16,835
       
   943 even if it looks
       
   944 quite complicated,
       
   945 
       
   946 209
       
   947 00:10:16,835 --> 00:10:19,295
       
   948 actually doesn't
       
   949 match any string.
       
   950 
       
   951 210
       
   952 00:10:19,295 --> 00:10:21,290
       
   953 Because it matches exactly
       
   954 
       
   955 211
       
   956 00:10:21,290 --> 00:10:23,420
       
   957 those string this regular
       
   958 expression can match.
       
   959 
       
   960 212
       
   961 00:10:23,420 --> 00:10:26,285
       
   962 And this reg expression
       
   963 cannot match any.
       
   964 
       
   965 213
       
   966 00:10:26,285 --> 00:10:29,930
       
   967 We need one more
       
   968 operation on languages.
       
   969 
       
   970 214
       
   971 00:10:29,930 --> 00:10:31,700
       
   972 I call this operation
       
   973 
       
   974 215
       
   975 00:10:31,700 --> 00:10:35,165
       
   976 the semantic derivative
       
   977 of a language.
       
   978 
       
   979 216
       
   980 00:10:35,165 --> 00:10:37,775
       
   981 This operation takes
       
   982 two arguments.
       
   983 
       
   984 217
       
   985 00:10:37,775 --> 00:10:40,670
       
   986 It takes a character
       
   987 which we take
       
   988 
       
   989 218
       
   990 00:10:40,670 --> 00:10:43,925
       
   991 the semantic derivative
       
   992 according to,
       
   993 
       
   994 219
       
   995 00:10:43,925 --> 00:10:45,980
       
   996 and it takes a language.
       
   997 
       
   998 220
       
   999 00:10:45,980 --> 00:10:49,579
       
  1000 And what it does is it
       
  1001 looks into this language
       
  1002 
       
  1003 221
       
  1004 00:10:49,579 --> 00:10:51,365
       
  1005 and looks for all the strings
       
  1006 
       
  1007 222
       
  1008 00:10:51,365 --> 00:10:53,735
       
  1009 that start with this character,
       
  1010 
       
  1011 223
       
  1012 00:10:53,735 --> 00:10:56,405
       
  1013 let's say c, and then
       
  1014 
       
  1015 224
       
  1016 00:10:56,405 --> 00:11:00,920
       
  1017 just strips off that c.
       
  1018 So here's an example.
       
  1019 
       
  1020 225
       
  1021 00:11:00,920 --> 00:11:02,975
       
  1022 Suppose you have a language A,
       
  1023 
       
  1024 226
       
  1025 00:11:02,975 --> 00:11:04,460
       
  1026 with the strings
       
  1027 
       
  1028 227
       
  1029 00:11:04,460 --> 00:11:06,965
       
  1030 foo, bar and frak.
       
  1031 
       
  1032 228
       
  1033 00:11:06,965 --> 00:11:10,835
       
  1034 And you take the semantic
       
  1035 derivative according to f,
       
  1036 
       
  1037 229
       
  1038 00:11:10,835 --> 00:11:14,450
       
  1039 then we discard all the
       
  1040 strings which do not
       
  1041 
       
  1042 230
       
  1043 00:11:14,450 --> 00:11:18,320
       
  1044 start with an f. So
       
  1045 bar will be discarded.
       
  1046 
       
  1047 231
       
  1048 00:11:18,320 --> 00:11:22,850
       
  1049 Foo and frac start with
       
  1050 an f. So we keep them
       
  1051 
       
  1052 232
       
  1053 00:11:22,850 --> 00:11:25,265
       
  1054 but we strip off the first f.
       
  1055 
       
  1056 233
       
  1057 00:11:25,265 --> 00:11:29,045
       
  1058 So here we have only oo and rak.
       
  1059 
       
  1060 234
       
  1061 00:11:29,045 --> 00:11:31,609
       
  1062 If you take the
       
  1063 semantic derivative
       
  1064 
       
  1065 235
       
  1066 00:11:31,609 --> 00:11:33,830
       
  1067 of that language according to b,
       
  1068 
       
  1069 236
       
  1070 00:11:33,830 --> 00:11:37,190
       
  1071 then we will discard foo and
       
  1072 frak because they don't
       
  1073 
       
  1074 237
       
  1075 00:11:37,190 --> 00:11:40,940
       
  1076 start with b, and just keep bar.
       
  1077 
       
  1078 238
       
  1079 00:11:40,940 --> 00:11:44,585
       
  1080 But again, we have to
       
  1081 strip off this first b.
       
  1082 
       
  1083 239
       
  1084 00:11:44,585 --> 00:11:49,305
       
  1085 And if you take the semantic
       
  1086 derivative according to a,
       
  1087 
       
  1088 240
       
  1089 00:11:49,305 --> 00:11:52,585
       
  1090 then none of these
       
  1091 strings start with a.
       
  1092 
       
  1093 241
       
  1094 00:11:52,585 --> 00:11:56,800
       
  1095 So this will be defined
       
  1096 as just the empty set.
       
  1097 
       
  1098 242
       
  1099 00:11:56,800 --> 00:11:59,695
       
  1100 You can slightly 
       
  1101 generalized this
       
  1102 
       
  1103 243
       
  1104 00:11:59,695 --> 00:12:02,560
       
  1105 semantic derivative
       
  1106 to also strings.
       
  1107 
       
  1108 244
       
  1109 00:12:02,560 --> 00:12:05,170
       
  1110 So imagine you have
       
  1111 a language A and you
       
  1112 
       
  1113 245
       
  1114 00:12:05,170 --> 00:12:08,274
       
  1115 build the semantic derivative
       
  1116 according to a string s.
       
  1117 
       
  1118 246
       
  1119 00:12:08,274 --> 00:12:10,990
       
  1120 Then you look in
       
  1121 this language and
       
  1122 
       
  1123 247
       
  1124 00:12:10,990 --> 00:12:13,420
       
  1125 you look for all the
       
  1126 strings that start with
       
  1127 
       
  1128 248
       
  1129 00:12:13,420 --> 00:12:19,555
       
  1130 this s. But you strip
       
  1131 off that s. Ok?
       
  1132 
       
  1133 249
       
  1134 00:12:19,555 --> 00:12:23,830
       
  1135 So if you have a string
       
  1136 starting with the s,
       
  1137 
       
  1138 250
       
  1139 00:12:23,830 --> 00:12:26,065
       
  1140 then you keep that string,
       
  1141 
       
  1142 251
       
  1143 00:12:26,065 --> 00:12:27,490
       
  1144 but you keep only the rest...
       
  1145 
       
  1146 252
       
  1147 00:12:27,490 --> 00:12:28,810
       
  1148 what's coming after this s.
       
  1149 
       
  1150 253
       
  1151 00:12:28,810 --> 00:12:32,010
       
  1152 That is here indicated
       
  1153 with this s'.
       
  1154 
       
  1155 254
       
  1156 00:12:32,010 --> 00:12:35,330
       
  1157 I also have this convention,
       
  1158 this personal convention
       
  1159 
       
  1160 255
       
  1161 00:12:35,330 --> 00:12:40,055
       
  1162 I have to say, that everything
       
  1163 which works on lists,
       
  1164 
       
  1165 256
       
  1166 00:12:40,055 --> 00:12:42,185
       
  1167 remember strings are
       
  1168 lists of characters.
       
  1169 
       
  1170 257
       
  1171 00:12:42,185 --> 00:12:46,655
       
  1172 I just put there an 's'. So
       
  1173 here's the one for characters.
       
  1174 
       
  1175 258
       
  1176 00:12:46,655 --> 00:12:48,680
       
  1177 I just call it Der with a capital
       
  1178 
       
  1179 259
       
  1180 00:12:48,680 --> 00:12:51,740
       
  1181 D. And I call that Ders,
       
  1182 
       
  1183 260
       
  1184 00:12:51,740 --> 00:12:54,350
       
  1185 because that works over strings.
       
  1186 
       
  1187 261
       
  1188 00:12:54,350 --> 00:12:55,865
       
  1189 And you can see how it would
       
  1190 
       
  1191 262
       
  1192 00:12:55,865 --> 00:12:59,750
       
  1193 be defined if you only take this
       
  1194 as a primitive operation.
       
  1195 
       
  1196 263
       
  1197 00:12:59,750 --> 00:13:01,340
       
  1198 We would just need to iterate
       
  1199 
       
  1200 264
       
  1201 00:13:01,340 --> 00:13:04,050
       
  1202 that essentially for this Ders.
       
  1203 
       
  1204 265
       
  1205 00:13:04,060 --> 00:13:07,970
       
  1206 Ok, we now have all
       
  1207 the theory in place.
       
  1208 
       
  1209 266
       
  1210 00:13:07,970 --> 00:13:11,345
       
  1211 And we can finally start
       
  1212 implementing our algorithm.
       
  1213 
       
  1214 267
       
  1215 00:13:11,345 --> 00:13:14,580
       
  1216 That's when we'll do
       
  1217 in the next video.