videos/01-basics1.srt
changeset 763 4e628958c01a
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/videos/01-basics1.srt	Sat Sep 26 23:45:40 2020 +0100
@@ -0,0 +1,1093 @@
+1
+00:00:06,710 --> 00:00:09,225
+Thanks for tuning in again.
+
+2
+00:00:09,225 --> 00:00:11,640
+In this video, we want to specify
+
+3
+00:00:11,640 --> 00:00:14,370
+what problem our regular
+expression matcher
+
+4
+00:00:14,370 --> 00:00:16,155
+is actually supposed to solve.
+
+5
+00:00:16,155 --> 00:00:18,900
+The reason is that
+we know that some of
+
+6
+00:00:18,900 --> 00:00:21,585
+the existing regular
+expression matching engines
+
+7
+00:00:21,585 --> 00:00:25,200
+are not just abysmally
+slow in some examples,
+
+8
+00:00:25,200 --> 00:00:27,105
+as you've seen in the
+previous video,
+
+9
+00:00:27,105 --> 00:00:30,570
+but also produce sometimes
+incorrect results.
+
+10
+00:00:30,570 --> 00:00:33,330
+In order to avoid
+this with our matcher,
+
+11
+00:00:33,330 --> 00:00:35,325
+we need to somehow explain
+
+12
+00:00:35,325 --> 00:00:39,255
+precisely what is the problem
+our algorithm solves.
+
+13
+00:00:39,255 --> 00:00:41,935
+This will require
+a bit of theory, but
+
+14
+00:00:41,935 --> 00:00:45,335
+I hope it is nevertheless
+a bit of fun.
+
+15
+00:00:45,335 --> 00:00:47,915
+First, we have to specify
+
+16
+00:00:47,915 --> 00:00:50,585
+what we mean by a
+regular expression.
+
+17
+00:00:50,585 --> 00:00:53,210
+You've seen earlier some
+examples. They were
+
+18
+00:00:53,210 --> 00:00:56,060
+actually taken or
+inspired by what
+
+19
+00:00:56,060 --> 00:00:58,850
+is available in standard
+regular expression matching
+
+20
+00:00:58,850 --> 00:01:02,330
+engines, like star, plus and n-times.
+
+21
+00:01:02,330 --> 00:01:05,690
+But for many tasks,
+for our algorithm,
+
+22
+00:01:05,690 --> 00:01:10,174
+we will focus only what I call
+basic regular expressions.
+
+23
+00:01:10,174 --> 00:01:11,840
+Since I'm lazy, I will call
+
+24
+00:01:11,840 --> 00:01:13,550
+these basic regular expressions
+
+25
+00:01:13,550 --> 00:01:15,485
+just as regular expressions.
+
+26
+00:01:15,485 --> 00:01:17,405
+And the ones you've seen earlier
+
+27
+00:01:17,405 --> 00:01:19,400
+as extended regular expressions.
+
+28
+00:01:19,400 --> 00:01:22,940
+So the basic regulare expressions,
+or just regular expressions,
+
+29
+00:01:22,940 --> 00:01:25,280
+they will have characters.
+
+30
+00:01:25,280 --> 00:01:27,170
+So you can match any character,
+
+31
+00:01:27,170 --> 00:01:31,370
+a,b,c to z or 0 to 9.
+
+32
+00:01:31,370 --> 00:01:35,525
+Any Ascii character. 'c' here
+is just a representative.
+
+33
+00:01:35,525 --> 00:01:38,825
+So we can match
+single characters.
+
+34
+00:01:38,825 --> 00:01:42,440
+Then we can match alternatives.
+
+35
+00:01:42,440 --> 00:01:44,930
+That means a string
+is either matched
+
+36
+00:01:44,930 --> 00:01:46,730
+by the regular expression r1
+
+37
+00:01:46,730 --> 00:01:49,324
+or by the regular expression r2.
+
+38
+00:01:49,324 --> 00:01:52,790
+And for the
+alternative we write +.
+
+39
+00:01:52,790 --> 00:01:55,175
+Then we also have sequence.
+
+40
+00:01:55,175 --> 00:01:57,410
+This sequence regular
+expression essentially
+
+41
+00:01:57,410 --> 00:01:59,915
+says that a string needs to be matched
+
+42
+00:01:59,915 --> 00:02:02,210
+the first part by
+the regular expression r1
+
+43
+00:02:02,210 --> 00:02:06,275
+and then the second
+part by the r2.
+
+44
+00:02:06,275 --> 00:02:10,190
+And then we have also the
+star regular expression,
+
+45
+00:02:10,190 --> 00:02:12,980
+which says the regular
+expression needs to match
+
+46
+00:02:12,980 --> 00:02:16,520
+the string with zero
+or more copies.
+
+47
+00:02:16,520 --> 00:02:18,140
+And then we also have some
+
+48
+00:02:18,140 --> 00:02:20,060
+slightly strange
+regular expressions.
+
+49
+00:02:20,060 --> 00:02:22,505
+We have the regular expression 1,
+
+50
+00:02:22,505 --> 00:02:25,910
+which can only match
+the empty string.
+
+51
+00:02:25,910 --> 00:02:29,075
+I'm using here the
+notation 1 for that
+
+52
+00:02:29,075 --> 00:02:31,340
+and in my writing I will always
+
+53
+00:02:31,340 --> 00:02:33,440
+make sure that for the
+regular expression
+
+54
+00:02:33,440 --> 00:02:35,765
+I will write the
+1 in a bold font.
+
+55
+00:02:35,765 --> 00:02:38,510
+So whenever you see
+a 1 in bold font,
+
+56
+00:02:38,510 --> 00:02:40,395
+this is not the 1, but
+
+57
+00:02:40,395 --> 00:02:44,300
+the regular expression which
+can match the empty string.
+
+58
+00:02:44,300 --> 00:02:48,050
+And we also have the
+regular expression 0,
+
+59
+00:02:48,050 --> 00:02:50,315
+which cannot match
+anything at all.
+
+60
+00:02:50,315 --> 00:02:51,695
+You might think, well,
+
+61
+00:02:51,695 --> 00:02:54,635
+that's not much use if it cannot
+match anything at all,
+
+62
+00:02:54,635 --> 00:02:58,130
+but you will see why that
+one is important later on.
+
+63
+00:02:58,130 --> 00:03:00,785
+So our basic regular expressions,
+
+64
+00:03:00,785 --> 00:03:02,375
+they will be 0,
+
+65
+00:03:02,375 --> 00:03:08,390
+1, characters, alternatives,
+sequences and stars.
+
+66
+00:03:08,390 --> 00:03:12,170
+And these are all the
+basic regular expressions.
+
+67
+00:03:12,170 --> 00:03:16,280
+If this definition is a
+bit too abstract for you,
+
+68
+00:03:16,280 --> 00:03:18,560
+we can also look at
+the concrete code,
+
+69
+00:03:18,560 --> 00:03:23,060
+how that would pan out when
+actually writing some Scala.
+
+70
+00:03:23,060 --> 00:03:28,040
+I promised you, I show
+you always my code in Scala.
+
+71
+00:03:28,040 --> 00:03:29,480
+So here you would have
+
+72
+00:03:29,480 --> 00:03:32,885
+first an abstract class
+for regular expressions.
+
+73
+00:03:32,885 --> 00:03:37,580
+Then you have one regular
+expression for 0, 
+
+74
+00:03:37,580 --> 00:03:41,540
+one regular expression for 1, 
+
+75
+00:03:41,540 --> 00:03:42,875
+one regular expression, which
+takes an argument,
+
+76
+00:03:42,875 --> 00:03:45,050
+the character you want to match,
+
+77
+00:03:45,050 --> 00:03:47,915
+the characters a,b, c and so on.
+
+78
+00:03:47,915 --> 00:03:50,945
+Then we have an alternative
+regular expression,
+
+79
+00:03:50,945 --> 00:03:53,480
+which takes the first
+alternative and
+
+80
+00:03:53,480 --> 00:03:56,435
+the second alternative
+as arguments.
+
+81
+00:03:56,435 --> 00:03:59,690
+And we have a sequence
+regular expression. Again,
+
+82
+00:03:59,690 --> 00:04:01,850
+which takes the
+first component and
+
+83
+00:04:01,850 --> 00:04:04,730
+the second component
+as two arguments.
+
+84
+00:04:04,730 --> 00:04:07,249
+And we have the star
+regular expression,
+
+85
+00:04:07,249 --> 00:04:10,880
+which just take one regular
+expression as argument.
+
+86
+00:04:10,880 --> 00:04:16,115
+And all these reg expressions
+extend our abstract class.
+
+87
+00:04:16,115 --> 00:04:20,300
+For whatever I do in
+this module here I have
+
+88
+00:04:20,300 --> 00:04:23,300
+the convention that all
+the regular expressions
+
+89
+00:04:23,300 --> 00:04:25,550
+are written with capital letters.
+
+90
+00:04:25,550 --> 00:04:26,885
+As you can see that here,
+
+91
+00:04:26,885 --> 00:04:31,685
+O, 1,  character, these will be
+always regular expressions.
+
+92
+00:04:31,685 --> 00:04:34,370
+They have all capital letters.
+
+93
+00:04:34,370 --> 00:04:36,484
+Let's for a moment,
+
+94
+00:04:36,484 --> 00:04:38,720
+play around with this definition.
+
+95
+00:04:38,720 --> 00:04:41,945
+I'm using here the
+Ammonite REPL.
+
+96
+00:04:41,945 --> 00:04:46,950
+And I can evaluate
+this definition.
+
+97
+00:04:53,430 --> 00:04:55,810
+And now I can start to
+
+98
+00:04:55,810 --> 00:04:58,570
+define particular
+regular expressions.
+
+99
+00:04:58,570 --> 00:05:00,340
+For example, if I need
+a regular expression
+
+100
+00:05:00,340 --> 00:05:02,860
+which can recognise
+the character a,
+
+101
+00:05:02,860 --> 00:05:06,025
+then I would write
+something like this.
+
+102
+00:05:06,025 --> 00:05:08,710
+So this regular expression
+takes an argument,
+
+103
+00:05:08,710 --> 00:05:13,615
+the character 'a'  to specify
+which character to match.
+
+104
+00:05:13,615 --> 00:05:16,945
+We do this obviously also with 'b'.
+
+105
+00:05:16,945 --> 00:05:19,405
+And I can do that with
+
+106
+00:05:19,405 --> 00:05:22,975
+'c'. So now we have three
+regular expressions.
+
+107
+00:05:22,975 --> 00:05:25,570
+If you look very carefully
+at this definition,
+
+108
+00:05:25,570 --> 00:05:27,070
+you can actually see
+
+109
+00:05:27,070 --> 00:05:29,940
+these regular
+expressions are trees.
+
+110
+00:05:29,940 --> 00:05:33,365
+So no matter what we
+write down on paper,
+
+111
+00:05:33,365 --> 00:05:36,755
+they are behind the
+scenes always trees.
+
+112
+00:05:36,755 --> 00:05:40,010
+And you can see that
+actually in this definition.
+
+113
+00:05:40,010 --> 00:05:44,330
+If you define two regular
+expressions r1 and r2.
+
+114
+00:05:44,330 --> 00:05:49,310
+They are essentially
+the alternative of a, b and c.
+
+115
+00:05:49,310 --> 00:05:52,760
+Then this regular expression
+
+116
+00:05:52,760 --> 00:05:54,710
+can match either the character
+
+117
+00:05:54,710 --> 00:05:57,980
+a or the character b
+or the character c.
+
+118
+00:05:57,980 --> 00:06:01,640
+And the same for the
+regular expression r2.
+
+119
+00:06:01,640 --> 00:06:03,875
+So let me just evaluate that.
+
+120
+00:06:03,875 --> 00:06:05,690
+And even though these are
+
+121
+00:06:05,690 --> 00:06:07,175
+two regular expressions
+which can match
+
+122
+00:06:07,175 --> 00:06:11,750
+exactly the same things,
+they a different trees.
+
+123
+00:06:11,750 --> 00:06:14,195
+So if I ask Scala,
+
+124
+00:06:14,195 --> 00:06:16,460
+are these trees different?
+
+125
+00:06:16,460 --> 00:06:19,250
+Or ask if they're
+
+126
+00:06:19,250 --> 00:06:21,865
+the same, then Scala will say No,
+
+127
+00:06:21,865 --> 00:06:25,440
+they actually different trees.
+
+128
+00:06:25,450 --> 00:06:28,459
+Let's come back to
+this definition.
+
+129
+00:06:28,459 --> 00:06:31,760
+If we want to write down
+regular expressions on paper,
+
+130
+00:06:31,760 --> 00:06:33,620
+then we want to be sloppy as
+
+131
+00:06:33,620 --> 00:06:35,750
+mathematicians rather than as
+
+132
+00:06:35,750 --> 00:06:37,745
+precise as computer scientists.
+
+133
+00:06:37,745 --> 00:06:40,490
+So when we want to write down
+a regular expression which can
+
+134
+00:06:40,490 --> 00:06:43,955
+either match the character
+a or the character b,
+
+135
+00:06:43,955 --> 00:06:49,130
+then we would write down
+something like this, a plus b.
+
+136
+00:06:49,130 --> 00:06:51,170
+And if you want to have
+the regular expression
+
+137
+00:06:51,170 --> 00:06:52,625
+which can either match
+
+138
+00:06:52,625 --> 00:06:55,925
+the character a or b or c,
+
+139
+00:06:55,925 --> 00:06:58,340
+we will write
+something like this.
+
+140
+00:06:58,340 --> 00:07:01,370
+But of course behind the
+scenes, these are trees.
+
+141
+00:07:01,370 --> 00:07:04,460
+So we should have written
+them with parentheses.
+
+142
+00:07:04,460 --> 00:07:06,440
+And you can see
+actually, there are two
+
+143
+00:07:06,440 --> 00:07:08,990
+regular expressions I
+could have written down.
+
+144
+00:07:08,990 --> 00:07:11,270
+They're different.
+
+145
+00:07:11,270 --> 00:07:12,710
+Just by convention,
+
+146
+00:07:12,710 --> 00:07:15,575
+we on't write these parentheses.
+
+147
+00:07:15,575 --> 00:07:18,740
+And that is similar with sequences.
+
+148
+00:07:18,740 --> 00:07:20,000
+If I want to write down
+
+149
+00:07:20,000 --> 00:07:22,955
+the regular expression which
+can match first an 'a',
+
+150
+00:07:22,955 --> 00:07:25,010
+then a 'b', and then a 'c',
+
+151
+00:07:25,010 --> 00:07:28,160
+then I would write down
+something like this.
+
+152
+00:07:28,160 --> 00:07:32,120
+Just, there are again
+
+153
+00:07:32,120 --> 00:07:35,735
+two regular expressions I
+could have written down.
+
+154
+00:07:35,735 --> 00:07:38,480
+Again by convention we don't
+
+155
+00:07:38,480 --> 00:07:40,670
+write these parentheses though.
+
+156
+00:07:40,670 --> 00:07:42,350
+However, sometimes we have to be
+
+157
+00:07:42,350 --> 00:07:43,940
+very careful with parentheses,
+
+158
+00:07:43,940 --> 00:07:47,195
+especially with star. 
+
+159
+00:07:47,195 --> 00:07:50,525
+Because this regular expression
+is definitely not
+
+160
+00:07:50,525 --> 00:07:54,900
+the same as this regular expression.
+
+161
+00:07:56,100 --> 00:07:59,410
+The first one here can match
+
+162
+00:07:59,410 --> 00:08:03,610
+any strings containing a or b's.
+
+163
+00:08:03,610 --> 00:08:05,860
+While this regular expression can
+
+164
+00:08:05,860 --> 00:08:07,945
+only match the single character
+
+165
+00:08:07,945 --> 00:08:13,300
+a or any string
+containing only b's.
+
+166
+00:08:13,300 --> 00:08:15,265
+So to make the difference clear,
+
+167
+00:08:15,265 --> 00:08:20,065
+in this example, we would have
+to use the parentheses.
+
+168
+00:08:20,065 --> 00:08:23,140
+There's one more issue
+with this definition.
+
+169
+00:08:23,140 --> 00:08:26,635
+Why do we focus on these
+basic regular expressions?
+
+170
+00:08:26,635 --> 00:08:28,660
+Why don't we also include
+
+171
+00:08:28,660 --> 00:08:31,285
+the ones from the
+extended regular expressions.
+
+172
+00:08:31,285 --> 00:08:33,055
+The answers very easy.
+
+173
+00:08:33,055 --> 00:08:35,680
+These basic regular
+expressions can be used
+
+174
+00:08:35,680 --> 00:08:38,370
+to represent also
+the extended ones.
+
+175
+00:08:38,370 --> 00:08:40,220
+Let me give you some examples.
+
+176
+00:08:40,220 --> 00:08:44,225
+If I have a regular
+expression r+, for example,
+
+177
+00:08:44,225 --> 00:08:46,280
+then the meaning
+was I have to use
+
+178
+00:08:46,280 --> 00:08:49,115
+at least one or more copies
+
+179
+00:08:49,115 --> 00:08:51,200
+of this r to
+match a string. 
+
+180
+00:08:51,200 --> 00:08:53,810
+Well, one or more copies
+can be represented by
+
+181
+00:08:53,810 --> 00:08:58,385
+the basic ones as just
+r followed by r*.
+
+182
+00:08:58,385 --> 00:09:01,760
+Meaning I have to use one
+copy of r, followed by
+
+183
+00:09:01,760 --> 00:09:05,150
+0 or more copies of r.
+
+184
+00:09:05,150 --> 00:09:07,895
+Similarly, if I have the optional
+regular expression,
+
+185
+00:09:07,895 --> 00:09:10,715
+which is supposed to
+match a string
+
+186
+00:09:10,715 --> 00:09:13,865
+by using r, or match
+the empty string.
+
+187
+00:09:13,865 --> 00:09:19,295
+Then this can be obviously
+defined as r + 1.
+
+188
+00:09:19,295 --> 00:09:23,945
+So here is the bold
+regular expression 1,
+
+189
+00:09:23,945 --> 00:09:26,180
+which means it either can
+
+190
+00:09:26,180 --> 00:09:28,205
+recognize whatever
+r can recognize,
+
+191
+00:09:28,205 --> 00:09:30,470
+or it can recognize
+the empty string.
+
+192
+00:09:30,470 --> 00:09:35,150
+And if I have ranges, like a
+to z,  then I can define
+
+193
+00:09:35,150 --> 00:09:41,135
+that as a + b + c + ...
+and so on until z.
+
+194
+00:09:41,135 --> 00:09:45,920
+Maybe this definition is not
+good in terms of runtime,
+
+195
+00:09:45,920 --> 00:09:47,960
+but in terms of just being able
+
+196
+00:09:47,960 --> 00:09:50,780
+to recognize strings
+or match strings,
+
+197
+00:09:50,780 --> 00:09:54,680
+the basic regular expressions
+will be just sufficient.
+
+198
+00:09:54,680 --> 00:09:56,690
+Unfortunately, we
+also need to have
+
+199
+00:09:56,690 --> 00:09:58,850
+a quick chat about strings.
+
+200
+00:09:58,850 --> 00:10:02,255
+In Scala, it's crystal
+clear what a string is.
+
+201
+00:10:02,255 --> 00:10:05,480
+There's a separate datatype
+which is called string.
+
+202
+00:10:05,480 --> 00:10:07,895
+So here, for example,
+is a string.
+
+203
+00:10:07,895 --> 00:10:09,200
+And as you can see,
+
+204
+00:10:09,200 --> 00:10:11,105
+it is of the type string.
+
+205
+00:10:11,105 --> 00:10:13,985
+And the empty string
+will be just that.
+
+206
+00:10:13,985 --> 00:10:16,160
+However, when we write things down on
+
+207
+00:10:16,160 --> 00:10:18,320
+paper and think
+about our algorithm,
+
+208
+00:10:18,320 --> 00:10:22,790
+we want to think of strings
+as lists of characters.
+
+209
+00:10:22,790 --> 00:10:26,070
+So more something like this.
+
+210
+00:10:27,070 --> 00:10:31,745
+You can see here, this is actually
+a list of characters.
+
+211
+00:10:31,745 --> 00:10:35,150
+And the two operations
+we need are taking
+
+212
+00:10:35,150 --> 00:10:37,280
+the head of this list and
+
+213
+00:10:37,280 --> 00:10:39,770
+the rest of the list
+or tail of the list.
+
+214
+00:10:39,770 --> 00:10:41,720
+That's why we want
+to regard them as
+
+215
+00:10:41,720 --> 00:10:45,260
+lists rather than strings.
+
+216
+00:10:45,260 --> 00:10:48,200
+So if I'm using a
+string like this,
+
+217
+00:10:48,200 --> 00:10:51,935
+then on paper I always will
+write something like that.
+
+218
+00:10:51,935 --> 00:10:54,575
+Or since I'm lazy, just that.
+
+219
+00:10:54,575 --> 00:10:56,675
+And for the empty string,
+
+220
+00:10:56,675 --> 00:10:59,210
+I will write either
+the empty list, with
+
+221
+00:10:59,210 --> 00:11:03,920
+two brackets or,
+being lazy, just that.
+
+222
+00:11:03,920 --> 00:11:06,620
+Actually there is one
+more operation we need on
+
+223
+00:11:06,620 --> 00:11:09,410
+strings and that
+is concatenation.
+
+224
+00:11:09,410 --> 00:11:11,255
+If you have a string s1,
+
+225
+00:11:11,255 --> 00:11:14,510
+string s2, and put an
+at symbol in between,
+
+226
+00:11:14,510 --> 00:11:18,050
+that means we want to
+concatenate both strings.
+
+227
+00:11:18,050 --> 00:11:22,625
+So foo concatenated with
+bar, would be foobar.
+
+228
+00:11:22,625 --> 00:11:25,085
+And any string concatenated with
+
+229
+00:11:25,085 --> 00:11:27,950
+the empty string
+is left untouched.
+
+230
+00:11:27,950 --> 00:11:31,310
+So baz concatenated with
+
+231
+00:11:31,310 --> 00:11:33,545
+the empty string, is just baz.
+
+232
+00:11:33,545 --> 00:11:37,295
+So that's like if we have
+strings as lists of characters,
+
+233
+00:11:37,295 --> 00:11:39,755
+that will be just list append.
+
+234
+00:11:39,755 --> 00:11:41,480
+In the next video,
+
+235
+00:11:41,480 --> 00:11:43,160
+we will use these definitions
+
+236
+00:11:43,160 --> 00:11:45,050
+and introduce the notion of what
+
+237
+00:11:45,050 --> 00:11:46,850
+a language is and
+
+238
+00:11:46,850 --> 00:11:49,920
+what the meaning of a
+regular expression is.