100:00:06,710 --> 00:00:09,225Thanks for tuning in again.200:00:09,225 --> 00:00:11,640In this video, we want to specify300:00:11,640 --> 00:00:14,370what problem our regularexpression matcher400:00:14,370 --> 00:00:16,155is actually supposed to solve.500:00:16,155 --> 00:00:18,900The reason is thatwe know that some of600:00:18,900 --> 00:00:21,585the existing regularexpression matching engines700:00:21,585 --> 00:00:25,200are not just abysmallyslow in some examples,800:00:25,200 --> 00:00:27,105as you've seen in theprevious video,900:00:27,105 --> 00:00:30,570but also produce sometimesincorrect results.1000:00:30,570 --> 00:00:33,330In order to avoidthis with our matcher,1100:00:33,330 --> 00:00:35,325we need to somehow explain1200:00:35,325 --> 00:00:39,255precisely what is the problemour algorithm solves.1300:00:39,255 --> 00:00:41,935This will requirea bit of theory, but1400:00:41,935 --> 00:00:45,335I hope it is neverthelessa bit of fun.1500:00:45,335 --> 00:00:47,915First, we have to specify1600:00:47,915 --> 00:00:50,585what we mean by aregular expression.1700:00:50,585 --> 00:00:53,210You've seen earlier someexamples. They were1800:00:53,210 --> 00:00:56,060actually taken orinspired by what1900:00:56,060 --> 00:00:58,850is available in standardregular expression matching2000:00:58,850 --> 00:01:02,330engines, like star, plus and n-times.2100:01:02,330 --> 00:01:05,690But for many tasks,for our algorithm,2200:01:05,690 --> 00:01:10,174we will focus only what I callbasic regular expressions.2300:01:10,174 --> 00:01:11,840Since I'm lazy, I will call2400:01:11,840 --> 00:01:13,550these basic regular expressions2500:01:13,550 --> 00:01:15,485just as regular expressions.2600:01:15,485 --> 00:01:17,405And the ones you've seen earlier2700:01:17,405 --> 00:01:19,400as extended regular expressions.2800:01:19,400 --> 00:01:22,940So the basic regulare expressions,or just regular expressions,2900:01:22,940 --> 00:01:25,280they will have characters.3000:01:25,280 --> 00:01:27,170So you can match any character,3100:01:27,170 --> 00:01:31,370a,b,c to z or 0 to 9.3200:01:31,370 --> 00:01:35,525Any Ascii character. 'c' hereis just a representative.3300:01:35,525 --> 00:01:38,825So we can matchsingle characters.3400:01:38,825 --> 00:01:42,440Then we can match alternatives.3500:01:42,440 --> 00:01:44,930That means a stringis either matched3600:01:44,930 --> 00:01:46,730by the regular expression r13700:01:46,730 --> 00:01:49,324or by the regular expression r2.3800:01:49,324 --> 00:01:52,790And for thealternative we write +.3900:01:52,790 --> 00:01:55,175Then we also have sequence.4000:01:55,175 --> 00:01:57,410This sequence regularexpression essentially4100:01:57,410 --> 00:01:59,915says that a string needs to be matched4200:01:59,915 --> 00:02:02,210the first part bythe regular expression r14300:02:02,210 --> 00:02:06,275and then the secondpart by the r2.4400:02:06,275 --> 00:02:10,190And then we have also thestar regular expression,4500:02:10,190 --> 00:02:12,980which says the regularexpression needs to match4600:02:12,980 --> 00:02:16,520the string with zeroor more copies.4700:02:16,520 --> 00:02:18,140And then we also have some4800:02:18,140 --> 00:02:20,060slightly strangeregular expressions.4900:02:20,060 --> 00:02:22,505We have the regular expression 1,5000:02:22,505 --> 00:02:25,910which can only matchthe empty string.5100:02:25,910 --> 00:02:29,075I'm using here thenotation 1 for that5200:02:29,075 --> 00:02:31,340and in my writing I will always5300:02:31,340 --> 00:02:33,440make sure that for theregular expression5400:02:33,440 --> 00:02:35,765I will write the1 in a bold font.5500:02:35,765 --> 00:02:38,510So whenever you seea 1 in bold font,5600:02:38,510 --> 00:02:40,395this is not the 1, but5700:02:40,395 --> 00:02:44,300the regular expression whichcan match the empty string.5800:02:44,300 --> 00:02:48,050And we also have theregular expression 0,5900:02:48,050 --> 00:02:50,315which cannot matchanything at all.6000:02:50,315 --> 00:02:51,695You might think, well,6100:02:51,695 --> 00:02:54,635that's not much use if it cannotmatch anything at all,6200:02:54,635 --> 00:02:58,130but you will see why thatone is important later on.6300:02:58,130 --> 00:03:00,785So our basic regular expressions,6400:03:00,785 --> 00:03:02,375they will be 0,6500:03:02,375 --> 00:03:08,3901, characters, alternatives,sequences and stars.6600:03:08,390 --> 00:03:12,170And these are all thebasic regular expressions.6700:03:12,170 --> 00:03:16,280If this definition is abit too abstract for you,6800:03:16,280 --> 00:03:18,560we can also look atthe concrete code,6900:03:18,560 --> 00:03:23,060how that would pan out whenactually writing some Scala.7000:03:23,060 --> 00:03:28,040I promised you, I showyou always my code in Scala.7100:03:28,040 --> 00:03:29,480So here you would have7200:03:29,480 --> 00:03:32,885first an abstract classfor regular expressions.7300:03:32,885 --> 00:03:37,580Then you have one regularexpression for 0, 7400:03:37,580 --> 00:03:41,540one regular expression for 1, 7500:03:41,540 --> 00:03:42,875one regular expression, whichtakes an argument,7600:03:42,875 --> 00:03:45,050the character you want to match,7700:03:45,050 --> 00:03:47,915the characters a,b, c and so on.7800:03:47,915 --> 00:03:50,945Then we have an alternativeregular expression,7900:03:50,945 --> 00:03:53,480which takes the firstalternative and8000:03:53,480 --> 00:03:56,435the second alternativeas arguments.8100:03:56,435 --> 00:03:59,690And we have a sequenceregular expression. Again,8200:03:59,690 --> 00:04:01,850which takes thefirst component and8300:04:01,850 --> 00:04:04,730the second componentas two arguments.8400:04:04,730 --> 00:04:07,249And we have the starregular expression,8500:04:07,249 --> 00:04:10,880which just take one regularexpression as argument.8600:04:10,880 --> 00:04:16,115And all these reg expressionsextend our abstract class.8700:04:16,115 --> 00:04:20,300For whatever I do inthis module here I have8800:04:20,300 --> 00:04:23,300the convention that allthe regular expressions8900:04:23,300 --> 00:04:25,550are written with capital letters.9000:04:25,550 --> 00:04:26,885As you can see that here,9100:04:26,885 --> 00:04:31,685O, 1, character, these will bealways regular expressions.9200:04:31,685 --> 00:04:34,370They have all capital letters.9300:04:34,370 --> 00:04:36,484Let's for a moment,9400:04:36,484 --> 00:04:38,720play around with this definition.9500:04:38,720 --> 00:04:41,945I'm using here theAmmonite REPL.9600:04:41,945 --> 00:04:46,950And I can evaluatethis definition.9700:04:53,430 --> 00:04:55,810And now I can start to9800:04:55,810 --> 00:04:58,570define particularregular expressions.9900:04:58,570 --> 00:05:00,340For example, if I needa regular expression10000:05:00,340 --> 00:05:02,860which can recognisethe character a,10100:05:02,860 --> 00:05:06,025then I would writesomething like this.10200:05:06,025 --> 00:05:08,710So this regular expressiontakes an argument,10300:05:08,710 --> 00:05:13,615the character 'a' to specifywhich character to match.10400:05:13,615 --> 00:05:16,945We do this obviously also with 'b'.10500:05:16,945 --> 00:05:19,405And I can do that with10600:05:19,405 --> 00:05:22,975'c'. So now we have threeregular expressions.10700:05:22,975 --> 00:05:25,570If you look very carefullyat this definition,10800:05:25,570 --> 00:05:27,070you can actually see10900:05:27,070 --> 00:05:29,940these regularexpressions are trees.11000:05:29,940 --> 00:05:33,365So no matter what wewrite down on paper,11100:05:33,365 --> 00:05:36,755they are behind thescenes always trees.11200:05:36,755 --> 00:05:40,010And you can see thatactually in this definition.11300:05:40,010 --> 00:05:44,330If you define two regularexpressions r1 and r2.11400:05:44,330 --> 00:05:49,310They are essentiallythe alternative of a, b and c.11500:05:49,310 --> 00:05:52,760Then this regular expression11600:05:52,760 --> 00:05:54,710can match either the character11700:05:54,710 --> 00:05:57,980a or the character bor the character c.11800:05:57,980 --> 00:06:01,640And the same for theregular expression r2.11900:06:01,640 --> 00:06:03,875So let me just evaluate that.12000:06:03,875 --> 00:06:05,690And even though these are12100:06:05,690 --> 00:06:07,175two regular expressionswhich can match12200:06:07,175 --> 00:06:11,750exactly the same things,they a different trees.12300:06:11,750 --> 00:06:14,195So if I ask Scala,12400:06:14,195 --> 00:06:16,460are these trees different?12500:06:16,460 --> 00:06:19,250Or ask if they're12600:06:19,250 --> 00:06:21,865the same, then Scala will say No,12700:06:21,865 --> 00:06:25,440they actually different trees.12800:06:25,450 --> 00:06:28,459Let's come back tothis definition.12900:06:28,459 --> 00:06:31,760If we want to write downregular expressions on paper,13000:06:31,760 --> 00:06:33,620then we want to be sloppy as13100:06:33,620 --> 00:06:35,750mathematicians rather than as13200:06:35,750 --> 00:06:37,745precise as computer scientists.13300:06:37,745 --> 00:06:40,490So when we want to write downa regular expression which can13400:06:40,490 --> 00:06:43,955either match the charactera or the character b,13500:06:43,955 --> 00:06:49,130then we would write downsomething like this, a plus b.13600:06:49,130 --> 00:06:51,170And if you want to havethe regular expression13700:06:51,170 --> 00:06:52,625which can either match13800:06:52,625 --> 00:06:55,925the character a or b or c,13900:06:55,925 --> 00:06:58,340we will writesomething like this.14000:06:58,340 --> 00:07:01,370But of course behind thescenes, these are trees.14100:07:01,370 --> 00:07:04,460So we should have writtenthem with parentheses.14200:07:04,460 --> 00:07:06,440And you can seeactually, there are two14300:07:06,440 --> 00:07:08,990regular expressions Icould have written down.14400:07:08,990 --> 00:07:11,270They're different.14500:07:11,270 --> 00:07:12,710Just by convention,14600:07:12,710 --> 00:07:15,575we on't write these parentheses.14700:07:15,575 --> 00:07:18,740And that is similar with sequences.14800:07:18,740 --> 00:07:20,000If I want to write down14900:07:20,000 --> 00:07:22,955the regular expression whichcan match first an 'a',15000:07:22,955 --> 00:07:25,010then a 'b', and then a 'c',15100:07:25,010 --> 00:07:28,160then I would write downsomething like this.15200:07:28,160 --> 00:07:32,120Just, there are again15300:07:32,120 --> 00:07:35,735two regular expressions Icould have written down.15400:07:35,735 --> 00:07:38,480Again by convention we don't15500:07:38,480 --> 00:07:40,670write these parentheses though.15600:07:40,670 --> 00:07:42,350However, sometimes we have to be15700:07:42,350 --> 00:07:43,940very careful with parentheses,15800:07:43,940 --> 00:07:47,195especially with star. 15900:07:47,195 --> 00:07:50,525Because this regular expressionis definitely not16000:07:50,525 --> 00:07:54,900the same as this regular expression.16100:07:56,100 --> 00:07:59,410The first one here can match16200:07:59,410 --> 00:08:03,610any strings containing a or b's.16300:08:03,610 --> 00:08:05,860While this regular expression can16400:08:05,860 --> 00:08:07,945only match the single character16500:08:07,945 --> 00:08:13,300a or any stringcontaining only b's.16600:08:13,300 --> 00:08:15,265So to make the difference clear,16700:08:15,265 --> 00:08:20,065in this example, we would haveto use the parentheses.16800:08:20,065 --> 00:08:23,140There's one more issuewith this definition.16900:08:23,140 --> 00:08:26,635Why do we focus on thesebasic regular expressions?17000:08:26,635 --> 00:08:28,660Why don't we also include17100:08:28,660 --> 00:08:31,285the ones from theextended regular expressions.17200:08:31,285 --> 00:08:33,055The answers very easy.17300:08:33,055 --> 00:08:35,680These basic regularexpressions can be used17400:08:35,680 --> 00:08:38,370to represent alsothe extended ones.17500:08:38,370 --> 00:08:40,220Let me give you some examples.17600:08:40,220 --> 00:08:44,225If I have a regularexpression r+, for example,17700:08:44,225 --> 00:08:46,280then the meaningwas I have to use17800:08:46,280 --> 00:08:49,115at least one or more copies17900:08:49,115 --> 00:08:51,200of this r tomatch a string. 18000:08:51,200 --> 00:08:53,810Well, one or more copiescan be represented by18100:08:53,810 --> 00:08:58,385the basic ones as justr followed by r*.18200:08:58,385 --> 00:09:01,760Meaning I have to use onecopy of r, followed by18300:09:01,760 --> 00:09:05,1500 or more copies of r.18400:09:05,150 --> 00:09:07,895Similarly, if I have the optionalregular expression,18500:09:07,895 --> 00:09:10,715which is supposed tomatch a string18600:09:10,715 --> 00:09:13,865by using r, or matchthe empty string.18700:09:13,865 --> 00:09:19,295Then this can be obviouslydefined as r + 1.18800:09:19,295 --> 00:09:23,945So here is the boldregular expression 1,18900:09:23,945 --> 00:09:26,180which means it either can19000:09:26,180 --> 00:09:28,205recognize whateverr can recognize,19100:09:28,205 --> 00:09:30,470or it can recognizethe empty string.19200:09:30,470 --> 00:09:35,150And if I have ranges, like ato z, then I can define19300:09:35,150 --> 00:09:41,135that as a + b + c + ...and so on until z.19400:09:41,135 --> 00:09:45,920Maybe this definition is notgood in terms of runtime,19500:09:45,920 --> 00:09:47,960but in terms of just being able19600:09:47,960 --> 00:09:50,780to recognize stringsor match strings,19700:09:50,780 --> 00:09:54,680the basic regular expressionswill be just sufficient.19800:09:54,680 --> 00:09:56,690Unfortunately, wealso need to have19900:09:56,690 --> 00:09:58,850a quick chat about strings.20000:09:58,850 --> 00:10:02,255In Scala, it's crystalclear what a string is.20100:10:02,255 --> 00:10:05,480There's a separate datatypewhich is called string.20200:10:05,480 --> 00:10:07,895So here, for example,is a string.20300:10:07,895 --> 00:10:09,200And as you can see,20400:10:09,200 --> 00:10:11,105it is of the type string.20500:10:11,105 --> 00:10:13,985And the empty stringwill be just that.20600:10:13,985 --> 00:10:16,160However, when we write things down on20700:10:16,160 --> 00:10:18,320paper and thinkabout our algorithm,20800:10:18,320 --> 00:10:22,790we want to think of stringsas lists of characters.20900:10:22,790 --> 00:10:26,070So more something like this.21000:10:27,070 --> 00:10:31,745You can see here, this is actuallya list of characters.21100:10:31,745 --> 00:10:35,150And the two operationswe need are taking21200:10:35,150 --> 00:10:37,280the head of this list and21300:10:37,280 --> 00:10:39,770the rest of the listor tail of the list.21400:10:39,770 --> 00:10:41,720That's why we wantto regard them as21500:10:41,720 --> 00:10:45,260lists rather than strings.21600:10:45,260 --> 00:10:48,200So if I'm using astring like this,21700:10:48,200 --> 00:10:51,935then on paper I always willwrite something like that.21800:10:51,935 --> 00:10:54,575Or since I'm lazy, just that.21900:10:54,575 --> 00:10:56,675And for the empty string,22000:10:56,675 --> 00:10:59,210I will write eitherthe empty list, with22100:10:59,210 --> 00:11:03,920two brackets or,being lazy, just that.22200:11:03,920 --> 00:11:06,620Actually there is onemore operation we need on22300:11:06,620 --> 00:11:09,410strings and thatis concatenation.22400:11:09,410 --> 00:11:11,255If you have a string s1,22500:11:11,255 --> 00:11:14,510string s2, and put anat symbol in between,22600:11:14,510 --> 00:11:18,050that means we want toconcatenate both strings.22700:11:18,050 --> 00:11:22,625So foo concatenated withbar, would be foobar.22800:11:22,625 --> 00:11:25,085And any string concatenated with22900:11:25,085 --> 00:11:27,950the empty stringis left untouched.23000:11:27,950 --> 00:11:31,310So baz concatenated with23100:11:31,310 --> 00:11:33,545the empty string, is just baz.23200:11:33,545 --> 00:11:37,295So that's like if we havestrings as lists of characters,23300:11:37,295 --> 00:11:39,755that will be just list append.23400:11:39,755 --> 00:11:41,480In the next video,23500:11:41,480 --> 00:11:43,160we will use these definitions23600:11:43,160 --> 00:11:45,050and introduce the notion of what23700:11:45,050 --> 00:11:46,850a language is and23800:11:46,850 --> 00:11:49,920what the meaning of aregular expression is.