videos/01-basics1.srt
changeset 763 4e628958c01a
equal deleted inserted replaced
762:e70df76926c0 763:4e628958c01a
       
     1 1
       
     2 00:00:06,710 --> 00:00:09,225
       
     3 Thanks for tuning in again.
       
     4 
       
     5 2
       
     6 00:00:09,225 --> 00:00:11,640
       
     7 In this video, we want to specify
       
     8 
       
     9 3
       
    10 00:00:11,640 --> 00:00:14,370
       
    11 what problem our regular
       
    12 expression matcher
       
    13 
       
    14 4
       
    15 00:00:14,370 --> 00:00:16,155
       
    16 is actually supposed to solve.
       
    17 
       
    18 5
       
    19 00:00:16,155 --> 00:00:18,900
       
    20 The reason is that
       
    21 we know that some of
       
    22 
       
    23 6
       
    24 00:00:18,900 --> 00:00:21,585
       
    25 the existing regular
       
    26 expression matching engines
       
    27 
       
    28 7
       
    29 00:00:21,585 --> 00:00:25,200
       
    30 are not just abysmally
       
    31 slow in some examples,
       
    32 
       
    33 8
       
    34 00:00:25,200 --> 00:00:27,105
       
    35 as you've seen in the
       
    36 previous video,
       
    37 
       
    38 9
       
    39 00:00:27,105 --> 00:00:30,570
       
    40 but also produce sometimes
       
    41 incorrect results.
       
    42 
       
    43 10
       
    44 00:00:30,570 --> 00:00:33,330
       
    45 In order to avoid
       
    46 this with our matcher,
       
    47 
       
    48 11
       
    49 00:00:33,330 --> 00:00:35,325
       
    50 we need to somehow explain
       
    51 
       
    52 12
       
    53 00:00:35,325 --> 00:00:39,255
       
    54 precisely what is the problem
       
    55 our algorithm solves.
       
    56 
       
    57 13
       
    58 00:00:39,255 --> 00:00:41,935
       
    59 This will require
       
    60 a bit of theory, but
       
    61 
       
    62 14
       
    63 00:00:41,935 --> 00:00:45,335
       
    64 I hope it is nevertheless
       
    65 a bit of fun.
       
    66 
       
    67 15
       
    68 00:00:45,335 --> 00:00:47,915
       
    69 First, we have to specify
       
    70 
       
    71 16
       
    72 00:00:47,915 --> 00:00:50,585
       
    73 what we mean by a
       
    74 regular expression.
       
    75 
       
    76 17
       
    77 00:00:50,585 --> 00:00:53,210
       
    78 You've seen earlier some
       
    79 examples. They were
       
    80 
       
    81 18
       
    82 00:00:53,210 --> 00:00:56,060
       
    83 actually taken or
       
    84 inspired by what
       
    85 
       
    86 19
       
    87 00:00:56,060 --> 00:00:58,850
       
    88 is available in standard
       
    89 regular expression matching
       
    90 
       
    91 20
       
    92 00:00:58,850 --> 00:01:02,330
       
    93 engines, like star, plus and n-times.
       
    94 
       
    95 21
       
    96 00:01:02,330 --> 00:01:05,690
       
    97 But for many tasks,
       
    98 for our algorithm,
       
    99 
       
   100 22
       
   101 00:01:05,690 --> 00:01:10,174
       
   102 we will focus only what I call
       
   103 basic regular expressions.
       
   104 
       
   105 23
       
   106 00:01:10,174 --> 00:01:11,840
       
   107 Since I'm lazy, I will call
       
   108 
       
   109 24
       
   110 00:01:11,840 --> 00:01:13,550
       
   111 these basic regular expressions
       
   112 
       
   113 25
       
   114 00:01:13,550 --> 00:01:15,485
       
   115 just as regular expressions.
       
   116 
       
   117 26
       
   118 00:01:15,485 --> 00:01:17,405
       
   119 And the ones you've seen earlier
       
   120 
       
   121 27
       
   122 00:01:17,405 --> 00:01:19,400
       
   123 as extended regular expressions.
       
   124 
       
   125 28
       
   126 00:01:19,400 --> 00:01:22,940
       
   127 So the basic regulare expressions,
       
   128 or just regular expressions,
       
   129 
       
   130 29
       
   131 00:01:22,940 --> 00:01:25,280
       
   132 they will have characters.
       
   133 
       
   134 30
       
   135 00:01:25,280 --> 00:01:27,170
       
   136 So you can match any character,
       
   137 
       
   138 31
       
   139 00:01:27,170 --> 00:01:31,370
       
   140 a,b,c to z or 0 to 9.
       
   141 
       
   142 32
       
   143 00:01:31,370 --> 00:01:35,525
       
   144 Any Ascii character. 'c' here
       
   145 is just a representative.
       
   146 
       
   147 33
       
   148 00:01:35,525 --> 00:01:38,825
       
   149 So we can match
       
   150 single characters.
       
   151 
       
   152 34
       
   153 00:01:38,825 --> 00:01:42,440
       
   154 Then we can match alternatives.
       
   155 
       
   156 35
       
   157 00:01:42,440 --> 00:01:44,930
       
   158 That means a string
       
   159 is either matched
       
   160 
       
   161 36
       
   162 00:01:44,930 --> 00:01:46,730
       
   163 by the regular expression r1
       
   164 
       
   165 37
       
   166 00:01:46,730 --> 00:01:49,324
       
   167 or by the regular expression r2.
       
   168 
       
   169 38
       
   170 00:01:49,324 --> 00:01:52,790
       
   171 And for the
       
   172 alternative we write +.
       
   173 
       
   174 39
       
   175 00:01:52,790 --> 00:01:55,175
       
   176 Then we also have sequence.
       
   177 
       
   178 40
       
   179 00:01:55,175 --> 00:01:57,410
       
   180 This sequence regular
       
   181 expression essentially
       
   182 
       
   183 41
       
   184 00:01:57,410 --> 00:01:59,915
       
   185 says that a string needs to be matched
       
   186 
       
   187 42
       
   188 00:01:59,915 --> 00:02:02,210
       
   189 the first part by
       
   190 the regular expression r1
       
   191 
       
   192 43
       
   193 00:02:02,210 --> 00:02:06,275
       
   194 and then the second
       
   195 part by the r2.
       
   196 
       
   197 44
       
   198 00:02:06,275 --> 00:02:10,190
       
   199 And then we have also the
       
   200 star regular expression,
       
   201 
       
   202 45
       
   203 00:02:10,190 --> 00:02:12,980
       
   204 which says the regular
       
   205 expression needs to match
       
   206 
       
   207 46
       
   208 00:02:12,980 --> 00:02:16,520
       
   209 the string with zero
       
   210 or more copies.
       
   211 
       
   212 47
       
   213 00:02:16,520 --> 00:02:18,140
       
   214 And then we also have some
       
   215 
       
   216 48
       
   217 00:02:18,140 --> 00:02:20,060
       
   218 slightly strange
       
   219 regular expressions.
       
   220 
       
   221 49
       
   222 00:02:20,060 --> 00:02:22,505
       
   223 We have the regular expression 1,
       
   224 
       
   225 50
       
   226 00:02:22,505 --> 00:02:25,910
       
   227 which can only match
       
   228 the empty string.
       
   229 
       
   230 51
       
   231 00:02:25,910 --> 00:02:29,075
       
   232 I'm using here the
       
   233 notation 1 for that
       
   234 
       
   235 52
       
   236 00:02:29,075 --> 00:02:31,340
       
   237 and in my writing I will always
       
   238 
       
   239 53
       
   240 00:02:31,340 --> 00:02:33,440
       
   241 make sure that for the
       
   242 regular expression
       
   243 
       
   244 54
       
   245 00:02:33,440 --> 00:02:35,765
       
   246 I will write the
       
   247 1 in a bold font.
       
   248 
       
   249 55
       
   250 00:02:35,765 --> 00:02:38,510
       
   251 So whenever you see
       
   252 a 1 in bold font,
       
   253 
       
   254 56
       
   255 00:02:38,510 --> 00:02:40,395
       
   256 this is not the 1, but
       
   257 
       
   258 57
       
   259 00:02:40,395 --> 00:02:44,300
       
   260 the regular expression which
       
   261 can match the empty string.
       
   262 
       
   263 58
       
   264 00:02:44,300 --> 00:02:48,050
       
   265 And we also have the
       
   266 regular expression 0,
       
   267 
       
   268 59
       
   269 00:02:48,050 --> 00:02:50,315
       
   270 which cannot match
       
   271 anything at all.
       
   272 
       
   273 60
       
   274 00:02:50,315 --> 00:02:51,695
       
   275 You might think, well,
       
   276 
       
   277 61
       
   278 00:02:51,695 --> 00:02:54,635
       
   279 that's not much use if it cannot
       
   280 match anything at all,
       
   281 
       
   282 62
       
   283 00:02:54,635 --> 00:02:58,130
       
   284 but you will see why that
       
   285 one is important later on.
       
   286 
       
   287 63
       
   288 00:02:58,130 --> 00:03:00,785
       
   289 So our basic regular expressions,
       
   290 
       
   291 64
       
   292 00:03:00,785 --> 00:03:02,375
       
   293 they will be 0,
       
   294 
       
   295 65
       
   296 00:03:02,375 --> 00:03:08,390
       
   297 1, characters, alternatives,
       
   298 sequences and stars.
       
   299 
       
   300 66
       
   301 00:03:08,390 --> 00:03:12,170
       
   302 And these are all the
       
   303 basic regular expressions.
       
   304 
       
   305 67
       
   306 00:03:12,170 --> 00:03:16,280
       
   307 If this definition is a
       
   308 bit too abstract for you,
       
   309 
       
   310 68
       
   311 00:03:16,280 --> 00:03:18,560
       
   312 we can also look at
       
   313 the concrete code,
       
   314 
       
   315 69
       
   316 00:03:18,560 --> 00:03:23,060
       
   317 how that would pan out when
       
   318 actually writing some Scala.
       
   319 
       
   320 70
       
   321 00:03:23,060 --> 00:03:28,040
       
   322 I promised you, I show
       
   323 you always my code in Scala.
       
   324 
       
   325 71
       
   326 00:03:28,040 --> 00:03:29,480
       
   327 So here you would have
       
   328 
       
   329 72
       
   330 00:03:29,480 --> 00:03:32,885
       
   331 first an abstract class
       
   332 for regular expressions.
       
   333 
       
   334 73
       
   335 00:03:32,885 --> 00:03:37,580
       
   336 Then you have one regular
       
   337 expression for 0, 
       
   338 
       
   339 74
       
   340 00:03:37,580 --> 00:03:41,540
       
   341 one regular expression for 1, 
       
   342 
       
   343 75
       
   344 00:03:41,540 --> 00:03:42,875
       
   345 one regular expression, which
       
   346 takes an argument,
       
   347 
       
   348 76
       
   349 00:03:42,875 --> 00:03:45,050
       
   350 the character you want to match,
       
   351 
       
   352 77
       
   353 00:03:45,050 --> 00:03:47,915
       
   354 the characters a,b, c and so on.
       
   355 
       
   356 78
       
   357 00:03:47,915 --> 00:03:50,945
       
   358 Then we have an alternative
       
   359 regular expression,
       
   360 
       
   361 79
       
   362 00:03:50,945 --> 00:03:53,480
       
   363 which takes the first
       
   364 alternative and
       
   365 
       
   366 80
       
   367 00:03:53,480 --> 00:03:56,435
       
   368 the second alternative
       
   369 as arguments.
       
   370 
       
   371 81
       
   372 00:03:56,435 --> 00:03:59,690
       
   373 And we have a sequence
       
   374 regular expression. Again,
       
   375 
       
   376 82
       
   377 00:03:59,690 --> 00:04:01,850
       
   378 which takes the
       
   379 first component and
       
   380 
       
   381 83
       
   382 00:04:01,850 --> 00:04:04,730
       
   383 the second component
       
   384 as two arguments.
       
   385 
       
   386 84
       
   387 00:04:04,730 --> 00:04:07,249
       
   388 And we have the star
       
   389 regular expression,
       
   390 
       
   391 85
       
   392 00:04:07,249 --> 00:04:10,880
       
   393 which just take one regular
       
   394 expression as argument.
       
   395 
       
   396 86
       
   397 00:04:10,880 --> 00:04:16,115
       
   398 And all these reg expressions
       
   399 extend our abstract class.
       
   400 
       
   401 87
       
   402 00:04:16,115 --> 00:04:20,300
       
   403 For whatever I do in
       
   404 this module here I have
       
   405 
       
   406 88
       
   407 00:04:20,300 --> 00:04:23,300
       
   408 the convention that all
       
   409 the regular expressions
       
   410 
       
   411 89
       
   412 00:04:23,300 --> 00:04:25,550
       
   413 are written with capital letters.
       
   414 
       
   415 90
       
   416 00:04:25,550 --> 00:04:26,885
       
   417 As you can see that here,
       
   418 
       
   419 91
       
   420 00:04:26,885 --> 00:04:31,685
       
   421 O, 1,  character, these will be
       
   422 always regular expressions.
       
   423 
       
   424 92
       
   425 00:04:31,685 --> 00:04:34,370
       
   426 They have all capital letters.
       
   427 
       
   428 93
       
   429 00:04:34,370 --> 00:04:36,484
       
   430 Let's for a moment,
       
   431 
       
   432 94
       
   433 00:04:36,484 --> 00:04:38,720
       
   434 play around with this definition.
       
   435 
       
   436 95
       
   437 00:04:38,720 --> 00:04:41,945
       
   438 I'm using here the
       
   439 Ammonite REPL.
       
   440 
       
   441 96
       
   442 00:04:41,945 --> 00:04:46,950
       
   443 And I can evaluate
       
   444 this definition.
       
   445 
       
   446 97
       
   447 00:04:53,430 --> 00:04:55,810
       
   448 And now I can start to
       
   449 
       
   450 98
       
   451 00:04:55,810 --> 00:04:58,570
       
   452 define particular
       
   453 regular expressions.
       
   454 
       
   455 99
       
   456 00:04:58,570 --> 00:05:00,340
       
   457 For example, if I need
       
   458 a regular expression
       
   459 
       
   460 100
       
   461 00:05:00,340 --> 00:05:02,860
       
   462 which can recognise
       
   463 the character a,
       
   464 
       
   465 101
       
   466 00:05:02,860 --> 00:05:06,025
       
   467 then I would write
       
   468 something like this.
       
   469 
       
   470 102
       
   471 00:05:06,025 --> 00:05:08,710
       
   472 So this regular expression
       
   473 takes an argument,
       
   474 
       
   475 103
       
   476 00:05:08,710 --> 00:05:13,615
       
   477 the character 'a'  to specify
       
   478 which character to match.
       
   479 
       
   480 104
       
   481 00:05:13,615 --> 00:05:16,945
       
   482 We do this obviously also with 'b'.
       
   483 
       
   484 105
       
   485 00:05:16,945 --> 00:05:19,405
       
   486 And I can do that with
       
   487 
       
   488 106
       
   489 00:05:19,405 --> 00:05:22,975
       
   490 'c'. So now we have three
       
   491 regular expressions.
       
   492 
       
   493 107
       
   494 00:05:22,975 --> 00:05:25,570
       
   495 If you look very carefully
       
   496 at this definition,
       
   497 
       
   498 108
       
   499 00:05:25,570 --> 00:05:27,070
       
   500 you can actually see
       
   501 
       
   502 109
       
   503 00:05:27,070 --> 00:05:29,940
       
   504 these regular
       
   505 expressions are trees.
       
   506 
       
   507 110
       
   508 00:05:29,940 --> 00:05:33,365
       
   509 So no matter what we
       
   510 write down on paper,
       
   511 
       
   512 111
       
   513 00:05:33,365 --> 00:05:36,755
       
   514 they are behind the
       
   515 scenes always trees.
       
   516 
       
   517 112
       
   518 00:05:36,755 --> 00:05:40,010
       
   519 And you can see that
       
   520 actually in this definition.
       
   521 
       
   522 113
       
   523 00:05:40,010 --> 00:05:44,330
       
   524 If you define two regular
       
   525 expressions r1 and r2.
       
   526 
       
   527 114
       
   528 00:05:44,330 --> 00:05:49,310
       
   529 They are essentially
       
   530 the alternative of a, b and c.
       
   531 
       
   532 115
       
   533 00:05:49,310 --> 00:05:52,760
       
   534 Then this regular expression
       
   535 
       
   536 116
       
   537 00:05:52,760 --> 00:05:54,710
       
   538 can match either the character
       
   539 
       
   540 117
       
   541 00:05:54,710 --> 00:05:57,980
       
   542 a or the character b
       
   543 or the character c.
       
   544 
       
   545 118
       
   546 00:05:57,980 --> 00:06:01,640
       
   547 And the same for the
       
   548 regular expression r2.
       
   549 
       
   550 119
       
   551 00:06:01,640 --> 00:06:03,875
       
   552 So let me just evaluate that.
       
   553 
       
   554 120
       
   555 00:06:03,875 --> 00:06:05,690
       
   556 And even though these are
       
   557 
       
   558 121
       
   559 00:06:05,690 --> 00:06:07,175
       
   560 two regular expressions
       
   561 which can match
       
   562 
       
   563 122
       
   564 00:06:07,175 --> 00:06:11,750
       
   565 exactly the same things,
       
   566 they a different trees.
       
   567 
       
   568 123
       
   569 00:06:11,750 --> 00:06:14,195
       
   570 So if I ask Scala,
       
   571 
       
   572 124
       
   573 00:06:14,195 --> 00:06:16,460
       
   574 are these trees different?
       
   575 
       
   576 125
       
   577 00:06:16,460 --> 00:06:19,250
       
   578 Or ask if they're
       
   579 
       
   580 126
       
   581 00:06:19,250 --> 00:06:21,865
       
   582 the same, then Scala will say No,
       
   583 
       
   584 127
       
   585 00:06:21,865 --> 00:06:25,440
       
   586 they actually different trees.
       
   587 
       
   588 128
       
   589 00:06:25,450 --> 00:06:28,459
       
   590 Let's come back to
       
   591 this definition.
       
   592 
       
   593 129
       
   594 00:06:28,459 --> 00:06:31,760
       
   595 If we want to write down
       
   596 regular expressions on paper,
       
   597 
       
   598 130
       
   599 00:06:31,760 --> 00:06:33,620
       
   600 then we want to be sloppy as
       
   601 
       
   602 131
       
   603 00:06:33,620 --> 00:06:35,750
       
   604 mathematicians rather than as
       
   605 
       
   606 132
       
   607 00:06:35,750 --> 00:06:37,745
       
   608 precise as computer scientists.
       
   609 
       
   610 133
       
   611 00:06:37,745 --> 00:06:40,490
       
   612 So when we want to write down
       
   613 a regular expression which can
       
   614 
       
   615 134
       
   616 00:06:40,490 --> 00:06:43,955
       
   617 either match the character
       
   618 a or the character b,
       
   619 
       
   620 135
       
   621 00:06:43,955 --> 00:06:49,130
       
   622 then we would write down
       
   623 something like this, a plus b.
       
   624 
       
   625 136
       
   626 00:06:49,130 --> 00:06:51,170
       
   627 And if you want to have
       
   628 the regular expression
       
   629 
       
   630 137
       
   631 00:06:51,170 --> 00:06:52,625
       
   632 which can either match
       
   633 
       
   634 138
       
   635 00:06:52,625 --> 00:06:55,925
       
   636 the character a or b or c,
       
   637 
       
   638 139
       
   639 00:06:55,925 --> 00:06:58,340
       
   640 we will write
       
   641 something like this.
       
   642 
       
   643 140
       
   644 00:06:58,340 --> 00:07:01,370
       
   645 But of course behind the
       
   646 scenes, these are trees.
       
   647 
       
   648 141
       
   649 00:07:01,370 --> 00:07:04,460
       
   650 So we should have written
       
   651 them with parentheses.
       
   652 
       
   653 142
       
   654 00:07:04,460 --> 00:07:06,440
       
   655 And you can see
       
   656 actually, there are two
       
   657 
       
   658 143
       
   659 00:07:06,440 --> 00:07:08,990
       
   660 regular expressions I
       
   661 could have written down.
       
   662 
       
   663 144
       
   664 00:07:08,990 --> 00:07:11,270
       
   665 They're different.
       
   666 
       
   667 145
       
   668 00:07:11,270 --> 00:07:12,710
       
   669 Just by convention,
       
   670 
       
   671 146
       
   672 00:07:12,710 --> 00:07:15,575
       
   673 we on't write these parentheses.
       
   674 
       
   675 147
       
   676 00:07:15,575 --> 00:07:18,740
       
   677 And that is similar with sequences.
       
   678 
       
   679 148
       
   680 00:07:18,740 --> 00:07:20,000
       
   681 If I want to write down
       
   682 
       
   683 149
       
   684 00:07:20,000 --> 00:07:22,955
       
   685 the regular expression which
       
   686 can match first an 'a',
       
   687 
       
   688 150
       
   689 00:07:22,955 --> 00:07:25,010
       
   690 then a 'b', and then a 'c',
       
   691 
       
   692 151
       
   693 00:07:25,010 --> 00:07:28,160
       
   694 then I would write down
       
   695 something like this.
       
   696 
       
   697 152
       
   698 00:07:28,160 --> 00:07:32,120
       
   699 Just, there are again
       
   700 
       
   701 153
       
   702 00:07:32,120 --> 00:07:35,735
       
   703 two regular expressions I
       
   704 could have written down.
       
   705 
       
   706 154
       
   707 00:07:35,735 --> 00:07:38,480
       
   708 Again by convention we don't
       
   709 
       
   710 155
       
   711 00:07:38,480 --> 00:07:40,670
       
   712 write these parentheses though.
       
   713 
       
   714 156
       
   715 00:07:40,670 --> 00:07:42,350
       
   716 However, sometimes we have to be
       
   717 
       
   718 157
       
   719 00:07:42,350 --> 00:07:43,940
       
   720 very careful with parentheses,
       
   721 
       
   722 158
       
   723 00:07:43,940 --> 00:07:47,195
       
   724 especially with star. 
       
   725 
       
   726 159
       
   727 00:07:47,195 --> 00:07:50,525
       
   728 Because this regular expression
       
   729 is definitely not
       
   730 
       
   731 160
       
   732 00:07:50,525 --> 00:07:54,900
       
   733 the same as this regular expression.
       
   734 
       
   735 161
       
   736 00:07:56,100 --> 00:07:59,410
       
   737 The first one here can match
       
   738 
       
   739 162
       
   740 00:07:59,410 --> 00:08:03,610
       
   741 any strings containing a or b's.
       
   742 
       
   743 163
       
   744 00:08:03,610 --> 00:08:05,860
       
   745 While this regular expression can
       
   746 
       
   747 164
       
   748 00:08:05,860 --> 00:08:07,945
       
   749 only match the single character
       
   750 
       
   751 165
       
   752 00:08:07,945 --> 00:08:13,300
       
   753 a or any string
       
   754 containing only b's.
       
   755 
       
   756 166
       
   757 00:08:13,300 --> 00:08:15,265
       
   758 So to make the difference clear,
       
   759 
       
   760 167
       
   761 00:08:15,265 --> 00:08:20,065
       
   762 in this example, we would have
       
   763 to use the parentheses.
       
   764 
       
   765 168
       
   766 00:08:20,065 --> 00:08:23,140
       
   767 There's one more issue
       
   768 with this definition.
       
   769 
       
   770 169
       
   771 00:08:23,140 --> 00:08:26,635
       
   772 Why do we focus on these
       
   773 basic regular expressions?
       
   774 
       
   775 170
       
   776 00:08:26,635 --> 00:08:28,660
       
   777 Why don't we also include
       
   778 
       
   779 171
       
   780 00:08:28,660 --> 00:08:31,285
       
   781 the ones from the
       
   782 extended regular expressions.
       
   783 
       
   784 172
       
   785 00:08:31,285 --> 00:08:33,055
       
   786 The answers very easy.
       
   787 
       
   788 173
       
   789 00:08:33,055 --> 00:08:35,680
       
   790 These basic regular
       
   791 expressions can be used
       
   792 
       
   793 174
       
   794 00:08:35,680 --> 00:08:38,370
       
   795 to represent also
       
   796 the extended ones.
       
   797 
       
   798 175
       
   799 00:08:38,370 --> 00:08:40,220
       
   800 Let me give you some examples.
       
   801 
       
   802 176
       
   803 00:08:40,220 --> 00:08:44,225
       
   804 If I have a regular
       
   805 expression r+, for example,
       
   806 
       
   807 177
       
   808 00:08:44,225 --> 00:08:46,280
       
   809 then the meaning
       
   810 was I have to use
       
   811 
       
   812 178
       
   813 00:08:46,280 --> 00:08:49,115
       
   814 at least one or more copies
       
   815 
       
   816 179
       
   817 00:08:49,115 --> 00:08:51,200
       
   818 of this r to
       
   819 match a string. 
       
   820 
       
   821 180
       
   822 00:08:51,200 --> 00:08:53,810
       
   823 Well, one or more copies
       
   824 can be represented by
       
   825 
       
   826 181
       
   827 00:08:53,810 --> 00:08:58,385
       
   828 the basic ones as just
       
   829 r followed by r*.
       
   830 
       
   831 182
       
   832 00:08:58,385 --> 00:09:01,760
       
   833 Meaning I have to use one
       
   834 copy of r, followed by
       
   835 
       
   836 183
       
   837 00:09:01,760 --> 00:09:05,150
       
   838 0 or more copies of r.
       
   839 
       
   840 184
       
   841 00:09:05,150 --> 00:09:07,895
       
   842 Similarly, if I have the optional
       
   843 regular expression,
       
   844 
       
   845 185
       
   846 00:09:07,895 --> 00:09:10,715
       
   847 which is supposed to
       
   848 match a string
       
   849 
       
   850 186
       
   851 00:09:10,715 --> 00:09:13,865
       
   852 by using r, or match
       
   853 the empty string.
       
   854 
       
   855 187
       
   856 00:09:13,865 --> 00:09:19,295
       
   857 Then this can be obviously
       
   858 defined as r + 1.
       
   859 
       
   860 188
       
   861 00:09:19,295 --> 00:09:23,945
       
   862 So here is the bold
       
   863 regular expression 1,
       
   864 
       
   865 189
       
   866 00:09:23,945 --> 00:09:26,180
       
   867 which means it either can
       
   868 
       
   869 190
       
   870 00:09:26,180 --> 00:09:28,205
       
   871 recognize whatever
       
   872 r can recognize,
       
   873 
       
   874 191
       
   875 00:09:28,205 --> 00:09:30,470
       
   876 or it can recognize
       
   877 the empty string.
       
   878 
       
   879 192
       
   880 00:09:30,470 --> 00:09:35,150
       
   881 And if I have ranges, like a
       
   882 to z,  then I can define
       
   883 
       
   884 193
       
   885 00:09:35,150 --> 00:09:41,135
       
   886 that as a + b + c + ...
       
   887 and so on until z.
       
   888 
       
   889 194
       
   890 00:09:41,135 --> 00:09:45,920
       
   891 Maybe this definition is not
       
   892 good in terms of runtime,
       
   893 
       
   894 195
       
   895 00:09:45,920 --> 00:09:47,960
       
   896 but in terms of just being able
       
   897 
       
   898 196
       
   899 00:09:47,960 --> 00:09:50,780
       
   900 to recognize strings
       
   901 or match strings,
       
   902 
       
   903 197
       
   904 00:09:50,780 --> 00:09:54,680
       
   905 the basic regular expressions
       
   906 will be just sufficient.
       
   907 
       
   908 198
       
   909 00:09:54,680 --> 00:09:56,690
       
   910 Unfortunately, we
       
   911 also need to have
       
   912 
       
   913 199
       
   914 00:09:56,690 --> 00:09:58,850
       
   915 a quick chat about strings.
       
   916 
       
   917 200
       
   918 00:09:58,850 --> 00:10:02,255
       
   919 In Scala, it's crystal
       
   920 clear what a string is.
       
   921 
       
   922 201
       
   923 00:10:02,255 --> 00:10:05,480
       
   924 There's a separate datatype
       
   925 which is called string.
       
   926 
       
   927 202
       
   928 00:10:05,480 --> 00:10:07,895
       
   929 So here, for example,
       
   930 is a string.
       
   931 
       
   932 203
       
   933 00:10:07,895 --> 00:10:09,200
       
   934 And as you can see,
       
   935 
       
   936 204
       
   937 00:10:09,200 --> 00:10:11,105
       
   938 it is of the type string.
       
   939 
       
   940 205
       
   941 00:10:11,105 --> 00:10:13,985
       
   942 And the empty string
       
   943 will be just that.
       
   944 
       
   945 206
       
   946 00:10:13,985 --> 00:10:16,160
       
   947 However, when we write things down on
       
   948 
       
   949 207
       
   950 00:10:16,160 --> 00:10:18,320
       
   951 paper and think
       
   952 about our algorithm,
       
   953 
       
   954 208
       
   955 00:10:18,320 --> 00:10:22,790
       
   956 we want to think of strings
       
   957 as lists of characters.
       
   958 
       
   959 209
       
   960 00:10:22,790 --> 00:10:26,070
       
   961 So more something like this.
       
   962 
       
   963 210
       
   964 00:10:27,070 --> 00:10:31,745
       
   965 You can see here, this is actually
       
   966 a list of characters.
       
   967 
       
   968 211
       
   969 00:10:31,745 --> 00:10:35,150
       
   970 And the two operations
       
   971 we need are taking
       
   972 
       
   973 212
       
   974 00:10:35,150 --> 00:10:37,280
       
   975 the head of this list and
       
   976 
       
   977 213
       
   978 00:10:37,280 --> 00:10:39,770
       
   979 the rest of the list
       
   980 or tail of the list.
       
   981 
       
   982 214
       
   983 00:10:39,770 --> 00:10:41,720
       
   984 That's why we want
       
   985 to regard them as
       
   986 
       
   987 215
       
   988 00:10:41,720 --> 00:10:45,260
       
   989 lists rather than strings.
       
   990 
       
   991 216
       
   992 00:10:45,260 --> 00:10:48,200
       
   993 So if I'm using a
       
   994 string like this,
       
   995 
       
   996 217
       
   997 00:10:48,200 --> 00:10:51,935
       
   998 then on paper I always will
       
   999 write something like that.
       
  1000 
       
  1001 218
       
  1002 00:10:51,935 --> 00:10:54,575
       
  1003 Or since I'm lazy, just that.
       
  1004 
       
  1005 219
       
  1006 00:10:54,575 --> 00:10:56,675
       
  1007 And for the empty string,
       
  1008 
       
  1009 220
       
  1010 00:10:56,675 --> 00:10:59,210
       
  1011 I will write either
       
  1012 the empty list, with
       
  1013 
       
  1014 221
       
  1015 00:10:59,210 --> 00:11:03,920
       
  1016 two brackets or,
       
  1017 being lazy, just that.
       
  1018 
       
  1019 222
       
  1020 00:11:03,920 --> 00:11:06,620
       
  1021 Actually there is one
       
  1022 more operation we need on
       
  1023 
       
  1024 223
       
  1025 00:11:06,620 --> 00:11:09,410
       
  1026 strings and that
       
  1027 is concatenation.
       
  1028 
       
  1029 224
       
  1030 00:11:09,410 --> 00:11:11,255
       
  1031 If you have a string s1,
       
  1032 
       
  1033 225
       
  1034 00:11:11,255 --> 00:11:14,510
       
  1035 string s2, and put an
       
  1036 at symbol in between,
       
  1037 
       
  1038 226
       
  1039 00:11:14,510 --> 00:11:18,050
       
  1040 that means we want to
       
  1041 concatenate both strings.
       
  1042 
       
  1043 227
       
  1044 00:11:18,050 --> 00:11:22,625
       
  1045 So foo concatenated with
       
  1046 bar, would be foobar.
       
  1047 
       
  1048 228
       
  1049 00:11:22,625 --> 00:11:25,085
       
  1050 And any string concatenated with
       
  1051 
       
  1052 229
       
  1053 00:11:25,085 --> 00:11:27,950
       
  1054 the empty string
       
  1055 is left untouched.
       
  1056 
       
  1057 230
       
  1058 00:11:27,950 --> 00:11:31,310
       
  1059 So baz concatenated with
       
  1060 
       
  1061 231
       
  1062 00:11:31,310 --> 00:11:33,545
       
  1063 the empty string, is just baz.
       
  1064 
       
  1065 232
       
  1066 00:11:33,545 --> 00:11:37,295
       
  1067 So that's like if we have
       
  1068 strings as lists of characters,
       
  1069 
       
  1070 233
       
  1071 00:11:37,295 --> 00:11:39,755
       
  1072 that will be just list append.
       
  1073 
       
  1074 234
       
  1075 00:11:39,755 --> 00:11:41,480
       
  1076 In the next video,
       
  1077 
       
  1078 235
       
  1079 00:11:41,480 --> 00:11:43,160
       
  1080 we will use these definitions
       
  1081 
       
  1082 236
       
  1083 00:11:43,160 --> 00:11:45,050
       
  1084 and introduce the notion of what
       
  1085 
       
  1086 237
       
  1087 00:11:45,050 --> 00:11:46,850
       
  1088 a language is and
       
  1089 
       
  1090 238
       
  1091 00:11:46,850 --> 00:11:49,920
       
  1092 what the meaning of a
       
  1093 regular expression is.