|
1 1 |
|
2 00:00:09,990 --> 00:00:13,465 |
|
3 Welcome back to this |
|
4 week's lecture. |
|
5 |
|
6 2 |
|
7 00:00:13,465 --> 00:00:16,450 |
|
8 The task this week is to actually |
|
9 |
|
10 3 |
|
11 00:00:16,450 --> 00:00:20,320 |
|
12 implement a regular |
|
13 expression matcher. |
|
14 |
|
15 4 |
|
16 00:00:20,320 --> 00:00:22,510 |
|
17 And we want to be a bit |
|
18 |
|
19 5 |
|
20 00:00:22,510 --> 00:00:25,315 |
|
21 faster than the regular |
|
22 expression matcher |
|
23 |
|
24 6 |
|
25 00:00:25,315 --> 00:00:29,380 |
|
26 in Python, Ruby, |
|
27 Javascript, and Java. |
|
28 |
|
29 7 |
|
30 00:00:29,380 --> 00:00:31,960 |
|
31 Remember this regular expression |
|
32 |
|
33 8 |
|
34 00:00:31,960 --> 00:00:35,785 |
|
35 and strings which are |
|
36 just a number of a's, |
|
37 |
|
38 9 |
|
39 00:00:35,785 --> 00:00:39,670 |
|
40 these regular expression matchers |
|
41 where abysmally slow. |
|
42 |
|
43 10 |
|
44 00:00:39,670 --> 00:00:43,170 |
|
45 They could only match |
|
46 approximately 30 a's in |
|
47 |
|
48 11 |
|
49 00:00:43,170 --> 00:00:45,665 |
|
50 something like half a minute. |
|
51 |
|
52 12 |
|
53 00:00:45,665 --> 00:00:49,460 |
|
54 What we like to do with |
|
55 our regular expression matcher. |
|
56 |
|
57 13 |
|
58 00:00:49,460 --> 00:00:51,890 |
|
59 We will try a few times, |
|
60 |
|
61 14 |
|
62 00:00:51,890 --> 00:00:55,505 |
|
63 but our second trial will already |
|
64 be much, much better. |
|
65 |
|
66 15 |
|
67 00:00:55,505 --> 00:00:58,400 |
|
68 It will probably |
|
69 match around maybe |
|
70 |
|
71 16 |
|
72 00:00:58,400 --> 00:01:02,030 |
|
73 thousand a's in something |
|
74 in half a minute. |
|
75 |
|
76 17 |
|
77 00:01:02,030 --> 00:01:03,920 |
|
78 But our third version in |
|
79 |
|
80 18 |
|
81 00:01:03,920 --> 00:01:06,230 |
|
82 comparison will be |
|
83 blindingly fast. |
|
84 |
|
85 19 |
|
86 00:01:06,230 --> 00:01:08,180 |
|
87 And we'll be able to match |
|
88 |
|
89 20 |
|
90 00:01:08,180 --> 00:01:10,145 |
|
91 strings of length 10,000 a's |
|
92 |
|
93 21 |
|
94 00:01:10,145 --> 00:01:12,230 |
|
95 and will not require |
|
96 |
|
97 22 |
|
98 00:01:12,230 --> 00:01:14,975 |
|
99 more than five seconds. |
|
100 So let's go ahead with that. |
|
101 |
|
102 23 |
|
103 00:01:14,975 --> 00:01:18,095 |
|
104 We will first implement |
|
105 |
|
106 24 |
|
107 00:01:18,095 --> 00:01:19,880 |
|
108 our regular expression |
|
109 matcher only |
|
110 |
|
111 25 |
|
112 00:01:19,880 --> 00:01:22,069 |
|
113 for the basic |
|
114 regular expressions. |
|
115 |
|
116 26 |
|
117 00:01:22,069 --> 00:01:25,430 |
|
118 So we will have only six |
|
119 cases to think about. |
|
120 |
|
121 27 |
|
122 00:01:25,430 --> 00:01:27,620 |
|
123 This will simplify matters at first. |
|
124 |
|
125 28 |
|
126 00:01:27,620 --> 00:01:30,350 |
|
127 Later we can |
|
128 think about how to |
|
129 |
|
130 29 |
|
131 00:01:30,350 --> 00:01:34,100 |
|
132 extend that to the extended |
|
133 regular expressions. |
|
134 |
|
135 30 |
|
136 00:01:34,100 --> 00:01:37,625 |
|
137 Unfortunately, before we can |
|
138 start our implementation, |
|
139 |
|
140 31 |
|
141 00:01:37,625 --> 00:01:39,290 |
|
142 we have to discuss |
|
143 |
|
144 32 |
|
145 00:01:39,290 --> 00:01:42,470 |
|
146 when two regular |
|
147 expressions are equivalent. |
|
148 |
|
149 33 |
|
150 00:01:42,470 --> 00:01:46,595 |
|
151 And we use here this notation |
|
152 of the triple equals. |
|
153 |
|
154 34 |
|
155 00:01:46,595 --> 00:01:48,215 |
|
156 We say a regular expression |
|
157 |
|
158 35 |
|
159 00:01:48,215 --> 00:01:50,180 |
|
160 r1 and r2 are |
|
161 |
|
162 36 |
|
163 00:01:50,180 --> 00:01:56,059 |
|
164 equivalent if and only |
|
165 if the language of r1 is |
|
166 |
|
167 37 |
|
168 00:01:56,059 --> 00:01:59,360 |
|
169 equal to the language of r2. |
|
170 |
|
171 38 |
|
172 00:01:59,360 --> 00:02:02,915 |
|
173 Sounds complicated, |
|
174 but essentially means |
|
175 |
|
176 39 |
|
177 00:02:02,915 --> 00:02:04,700 |
|
178 if the two regular expressions can |
|
179 |
|
180 40 |
|
181 00:02:04,700 --> 00:02:06,950 |
|
182 match exactly the same strings, |
|
183 |
|
184 41 |
|
185 00:02:06,950 --> 00:02:09,800 |
|
186 then we want to regard |
|
187 them as equivalent. |
|
188 |
|
189 42 |
|
190 00:02:09,800 --> 00:02:14,300 |
|
191 This equivalence justifies |
|
192 why we can be sloppy |
|
193 |
|
194 43 |
|
195 00:02:14,300 --> 00:02:16,040 |
|
196 with parentheses. |
|
197 |
|
198 44 |
|
199 00:02:16,040 --> 00:02:23,870 |
|
200 For example, if we have |
|
201 (r1 + r2) + r3, |
|
202 |
|
203 45 |
|
204 00:02:23,870 --> 00:02:32,270 |
|
205 then this will be equivalent |
|
206 to r1 + (r2 + r3). |
|
207 |
|
208 46 |
|
209 00:02:32,270 --> 00:02:34,910 |
|
210 Remember, regular |
|
211 expressions are trees, |
|
212 |
|
213 47 |
|
214 00:02:34,910 --> 00:02:37,940 |
|
215 so these are two different |
|
216 regular expressions, |
|
217 |
|
218 48 |
|
219 00:02:37,940 --> 00:02:41,540 |
|
220 but they can match |
|
221 the same strings. |
|
222 |
|
223 49 |
|
224 00:02:41,540 --> 00:02:45,695 |
|
225 Because, well, if we take |
|
226 here the meaning of that, |
|
227 |
|
228 50 |
|
229 00:02:45,695 --> 00:02:54,350 |
|
230 that would be just L(r1 + r2 + r3), |
|
231 |
|
232 |
|
233 51 |
|
234 00:02:54,350 --> 00:03:04,100 |
|
235 which is equal to L(r1 + r2) u L(r3). |
|
236 |
|
237 |
|
238 52 |
|
239 00:03:04,100 --> 00:03:10,595 |
|
240 And that is equal to of |
|
241 |
|
242 53 |
|
243 00:03:10,595 --> 00:03:15,710 |
|
244 L(r1) u L(r2) u L(r3). |
|
245 |
|
246 |
|
247 54 |
|
248 00:03:15,710 --> 00:03:17,930 |
|
249 And if you do the |
|
250 same calculation |
|
251 |
|
252 55 |
|
253 00:03:17,930 --> 00:03:19,445 |
|
254 for that regular expression, |
|
255 |
|
256 56 |
|
257 00:03:19,445 --> 00:03:22,940 |
|
258 you will find out the |
|
259 two sets are the same. |
|
260 |
|
261 57 |
|
262 00:03:22,940 --> 00:03:26,195 |
|
263 So both regular expressions |
|
264 can match the same strings. |
|
265 |
|
266 58 |
|
267 00:03:26,195 --> 00:03:28,805 |
|
268 So even if they're different |
|
269 regular expressions, |
|
270 |
|
271 59 |
|
272 00:03:28,805 --> 00:03:31,220 |
|
273 we can regard them as eqivalent. |
|
274 |
|
275 60 |
|
276 00:03:31,220 --> 00:03:33,290 |
|
277 And so that's why we can forget |
|
278 |
|
279 61 |
|
280 00:03:33,290 --> 00:03:35,270 |
|
281 about writing these parentheses. |
|
282 |
|
283 62 |
|
284 00:03:35,270 --> 00:03:40,205 |
|
285 Let's have a look |
|
286 at some more examples. |
|
287 |
|
288 63 |
|
289 00:03:40,205 --> 00:03:41,870 |
|
290 So the first one here, |
|
291 |
|
292 64 |
|
293 00:03:41,870 --> 00:03:43,205 |
|
294 that is now clear. |
|
295 |
|
296 65 |
|
297 00:03:43,205 --> 00:03:45,200 |
|
298 We did this calculation already |
|
299 |
|
300 66 |
|
301 00:03:45,200 --> 00:03:47,480 |
|
302 for arbitrary regular expressions. |
|
303 |
|
304 67 |
|
305 00:03:47,480 --> 00:03:49,520 |
|
306 Here it is for |
|
307 concrete ones a, b and c. |
|
308 |
|
309 68 |
|
310 00:03:49,520 --> 00:03:52,690 |
|
311 Here on the left-hand side, |
|
312 |
|
313 69 |
|
314 00:03:52,690 --> 00:03:54,895 |
|
315 the regular expression |
|
316 has the same meaning |
|
317 |
|
318 70 |
|
319 00:03:54,895 --> 00:03:56,620 |
|
320 as on the right-hand side. |
|
321 |
|
322 71 |
|
323 00:03:56,620 --> 00:03:58,420 |
|
324 So they are equivalent. |
|
325 |
|
326 72 |
|
327 00:03:58,420 --> 00:04:01,375 |
|
328 We can ignore the parentheses. |
|
329 |
|
330 73 |
|
331 00:04:01,375 --> 00:04:03,220 |
|
332 If I have a choice, |
|
333 |
|
334 74 |
|
335 00:04:03,220 --> 00:04:06,610 |
|
336 but the choice is |
|
337 the same a or a, |
|
338 |
|
339 75 |
|
340 00:04:06,610 --> 00:04:09,265 |
|
341 then this is |
|
342 equivalent to just a. |
|
343 |
|
344 76 |
|
345 00:04:09,265 --> 00:04:13,075 |
|
346 So the same choice doesn't |
|
347 introduce anything more. |
|
348 |
|
349 77 |
|
350 00:04:13,075 --> 00:04:15,370 |
|
351 In the next case, if I have |
|
352 |
|
353 78 |
|
354 00:04:15,370 --> 00:04:19,450 |
|
355 a regular expression |
|
356 which can match a or b, |
|
357 |
|
358 79 |
|
359 00:04:19,450 --> 00:04:23,860 |
|
360 that can match the same |
|
361 strings as b or a. |
|
362 |
|
363 80 |
|
364 00:04:23,860 --> 00:04:27,325 |
|
365 So you have a kind of |
|
366 commutativity for the plus, |
|
367 |
|
368 81 |
|
369 00:04:27,325 --> 00:04:29,485 |
|
370 which is like on natural numbers. |
|
371 |
|
372 82 |
|
373 00:04:29,485 --> 00:04:32,880 |
|
374 But you would not have a |
|
375 communitivity for the sequence: |
|
376 |
|
377 83 |
|
378 00:04:32,880 --> 00:04:37,070 |
|
379 if you have |
|
380 first a and then b, |
|
381 |
|
382 84 |
|
383 00:04:37,070 --> 00:04:40,850 |
|
384 that's not the same as |
|
385 matching first b and then a. |
|
386 |
|
387 85 |
|
388 00:04:40,850 --> 00:04:42,800 |
|
389 So for the sequence ... |
|
390 |
|
391 86 |
|
392 00:04:42,800 --> 00:04:44,615 |
|
393 they are not equivalent. |
|
394 |
|
395 87 |
|
396 00:04:44,615 --> 00:04:49,475 |
|
397 However, for the sequence I |
|
398 can, like for the plus, ignore |
|
399 |
|
400 88 |
|
401 00:04:49,475 --> 00:04:51,245 |
|
402 prarentheses. |
|
403 |
|
404 89 |
|
405 00:04:51,245 --> 00:04:55,070 |
|
406 If I have the parentheses |
|
407 and this way, |
|
408 |
|
409 90 |
|
410 00:04:55,070 --> 00:04:57,785 |
|
411 or I have it in this way. |
|
412 |
|
413 91 |
|
414 00:04:57,785 --> 00:05:00,605 |
|
415 These are two different |
|
416 regular expressions, |
|
417 |
|
418 92 |
|
419 00:05:00,605 --> 00:05:02,120 |
|
420 but they have the same meaning. |
|
421 |
|
422 93 |
|
423 00:05:02,120 --> 00:05:03,815 |
|
424 So they are equivalent. |
|
425 |
|
426 94 |
|
427 00:05:03,815 --> 00:05:05,780 |
|
428 And so I can leave out parentheses |
|
429 |
|
430 95 |
|
431 00:05:05,780 --> 00:05:09,170 |
|
432 for times as well. |
|
433 |
|
434 96 |
|
435 00:05:09,170 --> 00:05:12,185 |
|
436 The next one is a slightly |
|
437 more interesting one. |
|
438 |
|
439 97 |
|
440 00:05:12,185 --> 00:05:15,020 |
|
441 If I have a regular |
|
442 expression which can match |
|
443 |
|
444 98 |
|
445 00:05:15,020 --> 00:05:19,655 |
|
446 c first followed by (a + b), |
|
447 |
|
448 99 |
|
449 00:05:19,655 --> 00:05:25,355 |
|
450 then this is the same as |
|
451 first c followed by a, |
|
452 |
|
453 100 |
|
454 00:05:25,355 --> 00:05:28,640 |
|
455 or c followed by b. |
|
456 |
|
457 101 |
|
458 00:05:28,640 --> 00:05:31,265 |
|
459 So that's the kind of |
|
460 distributivity law |
|
461 |
|
462 102 |
|
463 00:05:31,265 --> 00:05:33,545 |
|
464 on regular expressions. |
|
465 |
|
466 103 |
|
467 00:05:33,545 --> 00:05:36,350 |
|
468 But they're also regular expressions |
|
469 which are not equivalent. |
|
470 |
|
471 104 |
|
472 00:05:36,350 --> 00:05:38,990 |
|
473 If I have the regular |
|
474 expression which can |
|
475 |
|
476 105 |
|
477 00:05:38,990 --> 00:05:42,800 |
|
478 match the string containing |
|
479 exactly two a's. |
|
480 |
|
481 106 |
|
482 00:05:42,800 --> 00:05:44,240 |
|
483 That is not equivalent |
|
484 |
|
485 107 |
|
486 00:05:44,240 --> 00:05:46,730 |
|
487 to the regular |
|
488 expression which can just match |
|
489 |
|
490 108 |
|
491 00:05:46,730 --> 00:05:49,250 |
|
492 a single a. And similarly |
|
493 |
|
494 109 |
|
495 00:05:49,250 --> 00:05:51,680 |
|
496 in this case, if I can match |
|
497 |
|
498 110 |
|
499 00:05:51,680 --> 00:05:56,075 |
|
500 a or the string b followed by c, |
|
501 |
|
502 111 |
|
503 00:05:56,075 --> 00:05:59,825 |
|
504 that is not the same as a or b, |
|
505 |
|
506 112 |
|
507 00:05:59,825 --> 00:06:02,000 |
|
508 followed by a or c. |
|
509 |
|
510 113 |
|
511 00:06:02,000 --> 00:06:05,990 |
|
512 I'll let you think about |
|
513 why is that not equivalent. |
|
514 |
|
515 114 |
|
516 00:06:05,990 --> 00:06:08,060 |
|
517 Essentially you |
|
518 have to find out what's |
|
519 |
|
520 115 |
|
521 00:06:08,060 --> 00:06:10,475 |
|
522 the meaning of both |
|
523 regular expressions. |
|
524 |
|
525 116 |
|
526 00:06:10,475 --> 00:06:14,090 |
|
527 And are they the |
|
528 same sets or not? |
|
529 |
|
530 117 |
|
531 00:06:14,090 --> 00:06:17,435 |
|
532 There are again some |
|
533 interesting corner cases. |
|
534 |
|
535 118 |
|
536 00:06:17,435 --> 00:06:20,540 |
|
537 If I have a regular expression |
|
538 that can match a, |
|
539 |
|
540 119 |
|
541 00:06:20,540 --> 00:06:23,450 |
|
542 but followed by the regular |
|
543 expression which cannot match |
|
544 |
|
545 120 |
|
546 00:06:23,450 --> 00:06:25,670 |
|
547 anything, that is not equivalent |
|
548 |
|
549 121 |
|
550 00:06:25,670 --> 00:06:29,255 |
|
551 to the regular |
|
552 expression which can match a. |
|
553 |
|
554 122 |
|
555 00:06:29,255 --> 00:06:31,340 |
|
556 Again here you have |
|
557 to do the calculation |
|
558 |
|
559 123 |
|
560 00:06:31,340 --> 00:06:32,915 |
|
561 to convince you of that. |
|
562 |
|
563 124 |
|
564 00:06:32,915 --> 00:06:35,840 |
|
565 Similarly, if I have a regular |
|
566 expression which can |
|
567 |
|
568 125 |
|
569 00:06:35,840 --> 00:06:38,750 |
|
570 match a or the empty string, |
|
571 |
|
572 126 |
|
573 00:06:38,750 --> 00:06:40,640 |
|
574 this is not equivalent to |
|
575 |
|
576 127 |
|
577 00:06:40,640 --> 00:06:43,355 |
|
578 the regular expression |
|
579 which can just match a. |
|
580 |
|
581 128 |
|
582 00:06:43,355 --> 00:06:46,925 |
|
583 Here are some interesting |
|
584 ones with zeros and ones. |
|
585 |
|
586 129 |
|
587 00:06:46,925 --> 00:06:48,860 |
|
588 So if I have the regular expression |
|
589 |
|
590 130 |
|
591 00:06:48,860 --> 00:06:50,975 |
|
592 that can match the empty string, |
|
593 |
|
594 131 |
|
595 00:06:50,975 --> 00:06:53,540 |
|
596 this will be actually equivalent to |
|
597 |
|
598 132 |
|
599 00:06:53,540 --> 00:06:56,435 |
|
600 the regular expression |
|
601 which can match nothing, |
|
602 |
|
603 133 |
|
604 00:06:56,435 --> 00:06:59,405 |
|
605 but star of that. |
|
606 |
|
607 134 |
|
608 00:06:59,405 --> 00:07:01,970 |
|
609 Remember the star |
|
610 always introduces, |
|
611 |
|
612 135 |
|
613 00:07:01,970 --> 00:07:04,370 |
|
614 no matter what, the empty string. |
|
615 |
|
616 136 |
|
617 00:07:04,370 --> 00:07:06,170 |
|
618 So this regular expression can match |
|
619 |
|
620 137 |
|
621 00:07:06,170 --> 00:07:08,930 |
|
622 the empty string, |
|
623 just like the 1. |
|
624 |
|
625 138 |
|
626 00:07:08,930 --> 00:07:12,125 |
|
627 So these two expressions |
|
628 are not the same, |
|
629 |
|
630 139 |
|
631 00:07:12,125 --> 00:07:14,210 |
|
632 but they are equivalent. |
|
633 |
|
634 140 |
|
635 00:07:14,210 --> 00:07:16,700 |
|
636 And it doesn't matter |
|
637 whether you take |
|
638 |
|
639 141 |
|
640 00:07:16,700 --> 00:07:20,090 |
|
641 the empty string to the star, |
|
642 |
|
643 142 |
|
644 00:07:20,090 --> 00:07:23,855 |
|
645 because if I can match 0 or |
|
646 more times the empty string, |
|
647 |
|
648 143 |
|
649 00:07:23,855 --> 00:07:27,450 |
|
650 that will be equivalent to |
|
651 just the empty string. |
|
652 |
|
653 144 |
|
654 00:07:27,520 --> 00:07:32,510 |
|
655 Slightly similar to the |
|
656 third case is this one. |
|
657 |
|
658 145 |
|
659 00:07:32,510 --> 00:07:35,720 |
|
660 Zero star is not equivalent to 0 |
|
661 |
|
662 146 |
|
663 00:07:35,720 --> 00:07:40,025 |
|
664 because that can match |
|
665 exactly the empty string. |
|
666 |
|
667 147 |
|
668 00:07:40,025 --> 00:07:42,740 |
|
669 This cannot match anything. |
|
670 |
|
671 148 |
|
672 00:07:42,740 --> 00:07:44,839 |
|
673 While the previous |
|
674 |
|
675 149 |
|
676 00:07:44,839 --> 00:07:47,540 |
|
677 equivalences are all very |
|
678 instructive to look at, |
|
679 |
|
680 150 |
|
681 00:07:47,540 --> 00:07:49,910 |
|
682 these are the |
|
683 equivalences we need |
|
684 |
|
685 151 |
|
686 00:07:49,910 --> 00:07:52,685 |
|
687 in our regular expression matchers. |
|
688 |
|
689 152 |
|
690 00:07:52,685 --> 00:07:54,350 |
|
691 And they are defined for |
|
692 |
|
693 153 |
|
694 00:07:54,350 --> 00:07:58,160 |
|
695 all regular expressions r. |
|
696 So r plus 0, |
|
697 |
|
698 154 |
|
699 00:07:58,160 --> 00:08:00,470 |
|
700 no matter what r looks |
|
701 like is equivalent |
|
702 |
|
703 155 |
|
704 00:08:00,470 --> 00:08:02,945 |
|
705 to just r. Similarly |
|
706 |
|
707 156 |
|
708 00:08:02,945 --> 00:08:05,930 |
|
709 0 plus r is also |
|
710 equivalent to r. |
|
711 |
|
712 157 |
|
713 00:08:05,930 --> 00:08:08,525 |
|
714 The order of this |
|
715 choice doesn't matter; |
|
716 |
|
717 158 |
|
718 00:08:08,525 --> 00:08:11,495 |
|
719 r followed by 1, |
|
720 |
|
721 159 |
|
722 00:08:11,495 --> 00:08:14,030 |
|
723 that's the sequence |
|
724 regular expression, is |
|
725 |
|
726 160 |
|
727 00:08:14,030 --> 00:08:16,760 |
|
728 equivalent to r. The |
|
729 sam, r followed by |
|
730 |
|
731 161 |
|
732 00:08:16,760 --> 00:08:19,010 |
|
733 r is equivalent to r. |
|
734 |
|
735 162 |
|
736 00:08:19,010 --> 00:08:20,990 |
|
737 And r followed by |
|
738 |
|
739 163 |
|
740 00:08:20,990 --> 00:08:23,390 |
|
741 the regular expression which |
|
742 cannot match anything, |
|
743 |
|
744 164 |
|
745 00:08:23,390 --> 00:08:27,455 |
|
746 is equivalent to just 0. |
|
747 |
|
748 165 |
|
749 00:08:27,455 --> 00:08:30,740 |
|
750 And 0 followed by r is also equivalent to 0. |
|
751 |
|
752 166 |
|
753 00:08:30,740 --> 00:08:33,615 |
|
754 And if you have the |
|
755 choice of r plus r, |
|
756 |
|
757 167 |
|
758 00:08:33,615 --> 00:08:37,210 |
|
759 that is also |
|
760 equivalent to just r. |
|
761 |
|
762 168 |
|
763 00:08:37,210 --> 00:08:40,270 |
|
764 What we're going to do |
|
765 with these equivalences is |
|
766 |
|
767 169 |
|
768 00:08:40,270 --> 00:08:42,820 |
|
769 that we regard them as |
|
770 simplification rules. |
|
771 |
|
772 170 |
|
773 00:08:42,820 --> 00:08:43,930 |
|
774 So whenever we see |
|
775 |
|
776 171 |
|
777 00:08:43,930 --> 00:08:46,465 |
|
778 this regular expression |
|
779 in our algorithm, |
|
780 |
|
781 172 |
|
782 00:08:46,465 --> 00:08:48,580 |
|
783 we will replace it by |
|
784 the right-hand side. |
|
785 |
|
786 173 |
|
787 00:08:48,580 --> 00:08:51,715 |
|
788 So we will orient |
|
789 these equivalences. |
|
790 |
|
791 174 |
|
792 00:08:51,715 --> 00:08:53,650 |
|
793 If we see, this regular expression, |
|
794 |
|
795 175 |
|
796 00:08:53,650 --> 00:08:55,225 |
|
797 we will replace it by that one. |
|
798 |
|
799 176 |
|
800 00:08:55,225 --> 00:08:58,945 |
|
801 And it will not matter |
|
802 because the left-hand sides |
|
803 |
|
804 177 |
|
805 00:08:58,945 --> 00:09:01,255 |
|
806 can match exactly |
|
807 the same strings |
|
808 |
|
809 178 |
|
810 00:09:01,255 --> 00:09:03,475 |
|
811 as the right-hand sides. |
|
812 |
|
813 179 |
|
814 00:09:03,475 --> 00:09:06,370 |
|
815 Here two quick examples. |
|
816 |
|
817 180 |
|
818 00:09:06,370 --> 00:09:08,680 |
|
819 The first one, let's |
|
820 assume you have |
|
821 |
|
822 181 |
|
823 00:09:08,680 --> 00:09:10,270 |
|
824 a regular expression r and |
|
825 |
|
826 182 |
|
827 00:09:10,270 --> 00:09:11,905 |
|
828 there is something |
|
829 in front of it. |
|
830 |
|
831 183 |
|
832 00:09:11,905 --> 00:09:13,720 |
|
833 This "something in front of it" |
|
834 |
|
835 184 |
|
836 00:09:13,720 --> 00:09:15,870 |
|
837 can be simplified as follows. |
|
838 |
|
839 185 |
|
840 00:09:15,870 --> 00:09:20,000 |
|
841 This 1 times b |
|
842 can be simplified to b. |
|
843 |
|
844 186 |
|
845 00:09:20,000 --> 00:09:23,555 |
|
846 Then this b plus 0 can |
|
847 be simplified to b. |
|
848 |
|
849 187 |
|
850 00:09:23,555 --> 00:09:25,490 |
|
851 And assuming that nothing can |
|
852 |
|
853 188 |
|
854 00:09:25,490 --> 00:09:27,470 |
|
855 be simplified inside this r, |
|
856 |
|
857 189 |
|
858 00:09:27,470 --> 00:09:28,700 |
|
859 then this would be |
|
860 |
|
861 190 |
|
862 00:09:28,700 --> 00:09:33,050 |
|
863 the simplified version |
|
864 of this regular expression. |
|
865 |
|
866 191 |
|
867 00:09:33,050 --> 00:09:34,820 |
|
868 And because the simplification |
|
869 |
|
870 192 |
|
871 00:09:34,820 --> 00:09:36,965 |
|
872 rules preserve what can be matched, |
|
873 |
|
874 193 |
|
875 00:09:36,965 --> 00:09:39,080 |
|
876 we can be sure that |
|
877 this regular expression |
|
878 |
|
879 194 |
|
880 00:09:39,080 --> 00:09:41,255 |
|
881 can match exactly the strings |
|
882 |
|
883 195 |
|
884 00:09:41,255 --> 00:09:43,940 |
|
885 this regular expression can match. |
|
886 |
|
887 196 |
|
888 00:09:43,940 --> 00:09:45,740 |
|
889 Here is the other example. |
|
890 |
|
891 197 |
|
892 00:09:45,740 --> 00:09:49,550 |
|
893 This has just a tiny change |
|
894 that this becomes here as 0. |
|
895 |
|
896 198 |
|
897 00:09:49,550 --> 00:09:54,485 |
|
898 Then 0 times b can be |
|
899 replaced by just 0. |
|
900 |
|
901 199 |
|
902 00:09:54,485 --> 00:09:56,705 |
|
903 Then they are actually two |
|
904 rules which could apply |
|
905 |
|
906 200 |
|
907 00:09:56,705 --> 00:09:59,014 |
|
908 either that we have |
|
909 the same choice |
|
910 |
|
911 201 |
|
912 00:09:59,014 --> 00:10:01,160 |
|
913 or we can just say something plus |
|
914 |
|
915 202 |
|
916 00:10:01,160 --> 00:10:04,100 |
|
917 0 will always go to something. |
|
918 |
|
919 203 |
|
920 00:10:04,100 --> 00:10:06,245 |
|
921 So we can simplify it to that. |
|
922 |
|
923 204 |
|
924 00:10:06,245 --> 00:10:08,510 |
|
925 And then we have |
|
926 0 times r again, |
|
927 |
|
928 205 |
|
929 00:10:08,510 --> 00:10:10,460 |
|
930 and that can be simplified to 0. |
|
931 |
|
932 206 |
|
933 00:10:10,460 --> 00:10:12,080 |
|
934 So actually what we find out with |
|
935 |
|
936 207 |
|
937 00:10:12,080 --> 00:10:14,645 |
|
938 this calculation is that |
|
939 this regular expression, |
|
940 |
|
941 208 |
|
942 00:10:14,645 --> 00:10:16,835 |
|
943 even if it looks |
|
944 quite complicated, |
|
945 |
|
946 209 |
|
947 00:10:16,835 --> 00:10:19,295 |
|
948 actually doesn't |
|
949 match any string. |
|
950 |
|
951 210 |
|
952 00:10:19,295 --> 00:10:21,290 |
|
953 Because it matches exactly |
|
954 |
|
955 211 |
|
956 00:10:21,290 --> 00:10:23,420 |
|
957 those string this regular |
|
958 expression can match. |
|
959 |
|
960 212 |
|
961 00:10:23,420 --> 00:10:26,285 |
|
962 And this reg expression |
|
963 cannot match any. |
|
964 |
|
965 213 |
|
966 00:10:26,285 --> 00:10:29,930 |
|
967 We need one more |
|
968 operation on languages. |
|
969 |
|
970 214 |
|
971 00:10:29,930 --> 00:10:31,700 |
|
972 I call this operation |
|
973 |
|
974 215 |
|
975 00:10:31,700 --> 00:10:35,165 |
|
976 the semantic derivative |
|
977 of a language. |
|
978 |
|
979 216 |
|
980 00:10:35,165 --> 00:10:37,775 |
|
981 This operation takes |
|
982 two arguments. |
|
983 |
|
984 217 |
|
985 00:10:37,775 --> 00:10:40,670 |
|
986 It takes a character |
|
987 which we take |
|
988 |
|
989 218 |
|
990 00:10:40,670 --> 00:10:43,925 |
|
991 the semantic derivative |
|
992 according to, |
|
993 |
|
994 219 |
|
995 00:10:43,925 --> 00:10:45,980 |
|
996 and it takes a language. |
|
997 |
|
998 220 |
|
999 00:10:45,980 --> 00:10:49,579 |
|
1000 And what it does is it |
|
1001 looks into this language |
|
1002 |
|
1003 221 |
|
1004 00:10:49,579 --> 00:10:51,365 |
|
1005 and looks for all the strings |
|
1006 |
|
1007 222 |
|
1008 00:10:51,365 --> 00:10:53,735 |
|
1009 that start with this character, |
|
1010 |
|
1011 223 |
|
1012 00:10:53,735 --> 00:10:56,405 |
|
1013 let's say c, and then |
|
1014 |
|
1015 224 |
|
1016 00:10:56,405 --> 00:11:00,920 |
|
1017 just strips off that c. |
|
1018 So here's an example. |
|
1019 |
|
1020 225 |
|
1021 00:11:00,920 --> 00:11:02,975 |
|
1022 Suppose you have a language A, |
|
1023 |
|
1024 226 |
|
1025 00:11:02,975 --> 00:11:04,460 |
|
1026 with the strings |
|
1027 |
|
1028 227 |
|
1029 00:11:04,460 --> 00:11:06,965 |
|
1030 foo, bar and frak. |
|
1031 |
|
1032 228 |
|
1033 00:11:06,965 --> 00:11:10,835 |
|
1034 And you take the semantic |
|
1035 derivative according to f, |
|
1036 |
|
1037 229 |
|
1038 00:11:10,835 --> 00:11:14,450 |
|
1039 then we discard all the |
|
1040 strings which do not |
|
1041 |
|
1042 230 |
|
1043 00:11:14,450 --> 00:11:18,320 |
|
1044 start with an f. So |
|
1045 bar will be discarded. |
|
1046 |
|
1047 231 |
|
1048 00:11:18,320 --> 00:11:22,850 |
|
1049 Foo and frac start with |
|
1050 an f. So we keep them |
|
1051 |
|
1052 232 |
|
1053 00:11:22,850 --> 00:11:25,265 |
|
1054 but we strip off the first f. |
|
1055 |
|
1056 233 |
|
1057 00:11:25,265 --> 00:11:29,045 |
|
1058 So here we have only oo and rak. |
|
1059 |
|
1060 234 |
|
1061 00:11:29,045 --> 00:11:31,609 |
|
1062 If you take the |
|
1063 semantic derivative |
|
1064 |
|
1065 235 |
|
1066 00:11:31,609 --> 00:11:33,830 |
|
1067 of that language according to b, |
|
1068 |
|
1069 236 |
|
1070 00:11:33,830 --> 00:11:37,190 |
|
1071 then we will discard foo and |
|
1072 frak because they don't |
|
1073 |
|
1074 237 |
|
1075 00:11:37,190 --> 00:11:40,940 |
|
1076 start with b, and just keep bar. |
|
1077 |
|
1078 238 |
|
1079 00:11:40,940 --> 00:11:44,585 |
|
1080 But again, we have to |
|
1081 strip off this first b. |
|
1082 |
|
1083 239 |
|
1084 00:11:44,585 --> 00:11:49,305 |
|
1085 And if you take the semantic |
|
1086 derivative according to a, |
|
1087 |
|
1088 240 |
|
1089 00:11:49,305 --> 00:11:52,585 |
|
1090 then none of these |
|
1091 strings start with a. |
|
1092 |
|
1093 241 |
|
1094 00:11:52,585 --> 00:11:56,800 |
|
1095 So this will be defined |
|
1096 as just the empty set. |
|
1097 |
|
1098 242 |
|
1099 00:11:56,800 --> 00:11:59,695 |
|
1100 You can slightly |
|
1101 generalized this |
|
1102 |
|
1103 243 |
|
1104 00:11:59,695 --> 00:12:02,560 |
|
1105 semantic derivative |
|
1106 to also strings. |
|
1107 |
|
1108 244 |
|
1109 00:12:02,560 --> 00:12:05,170 |
|
1110 So imagine you have |
|
1111 a language A and you |
|
1112 |
|
1113 245 |
|
1114 00:12:05,170 --> 00:12:08,274 |
|
1115 build the semantic derivative |
|
1116 according to a string s. |
|
1117 |
|
1118 246 |
|
1119 00:12:08,274 --> 00:12:10,990 |
|
1120 Then you look in |
|
1121 this language and |
|
1122 |
|
1123 247 |
|
1124 00:12:10,990 --> 00:12:13,420 |
|
1125 you look for all the |
|
1126 strings that start with |
|
1127 |
|
1128 248 |
|
1129 00:12:13,420 --> 00:12:19,555 |
|
1130 this s. But you strip |
|
1131 off that s. Ok? |
|
1132 |
|
1133 249 |
|
1134 00:12:19,555 --> 00:12:23,830 |
|
1135 So if you have a string |
|
1136 starting with the s, |
|
1137 |
|
1138 250 |
|
1139 00:12:23,830 --> 00:12:26,065 |
|
1140 then you keep that string, |
|
1141 |
|
1142 251 |
|
1143 00:12:26,065 --> 00:12:27,490 |
|
1144 but you keep only the rest... |
|
1145 |
|
1146 252 |
|
1147 00:12:27,490 --> 00:12:28,810 |
|
1148 what's coming after this s. |
|
1149 |
|
1150 253 |
|
1151 00:12:28,810 --> 00:12:32,010 |
|
1152 That is here indicated |
|
1153 with this s'. |
|
1154 |
|
1155 254 |
|
1156 00:12:32,010 --> 00:12:35,330 |
|
1157 I also have this convention, |
|
1158 this personal convention |
|
1159 |
|
1160 255 |
|
1161 00:12:35,330 --> 00:12:40,055 |
|
1162 I have to say, that everything |
|
1163 which works on lists, |
|
1164 |
|
1165 256 |
|
1166 00:12:40,055 --> 00:12:42,185 |
|
1167 remember strings are |
|
1168 lists of characters. |
|
1169 |
|
1170 257 |
|
1171 00:12:42,185 --> 00:12:46,655 |
|
1172 I just put there an 's'. So |
|
1173 here's the one for characters. |
|
1174 |
|
1175 258 |
|
1176 00:12:46,655 --> 00:12:48,680 |
|
1177 I just call it Der with a capital |
|
1178 |
|
1179 259 |
|
1180 00:12:48,680 --> 00:12:51,740 |
|
1181 D. And I call that Ders, |
|
1182 |
|
1183 260 |
|
1184 00:12:51,740 --> 00:12:54,350 |
|
1185 because that works over strings. |
|
1186 |
|
1187 261 |
|
1188 00:12:54,350 --> 00:12:55,865 |
|
1189 And you can see how it would |
|
1190 |
|
1191 262 |
|
1192 00:12:55,865 --> 00:12:59,750 |
|
1193 be defined if you only take this |
|
1194 as a primitive operation. |
|
1195 |
|
1196 263 |
|
1197 00:12:59,750 --> 00:13:01,340 |
|
1198 We would just need to iterate |
|
1199 |
|
1200 264 |
|
1201 00:13:01,340 --> 00:13:04,050 |
|
1202 that essentially for this Ders. |
|
1203 |
|
1204 265 |
|
1205 00:13:04,060 --> 00:13:07,970 |
|
1206 Ok, we now have all |
|
1207 the theory in place. |
|
1208 |
|
1209 266 |
|
1210 00:13:07,970 --> 00:13:11,345 |
|
1211 And we can finally start |
|
1212 implementing our algorithm. |
|
1213 |
|
1214 267 |
|
1215 00:13:11,345 --> 00:13:14,580 |
|
1216 That's when we'll do |
|
1217 in the next video. |