73 increasingly wacky platforms. It’s effectively a perpetual |
73 increasingly wacky platforms. It’s effectively a perpetual |
74 employment act for solid compiler hackers.'' |
74 employment act for solid compiler hackers.'' |
75 \end{quote} |
75 \end{quote} |
76 |
76 |
77 \noindent |
77 \noindent |
78 So the goal of this module is to become a solid (beginner) compiler |
78 Given this, the goal of this module is to become a solid (beginner) compiler |
79 hacker and as part of the coursework to implement a small |
79 hacker and as part of the coursework to implement two small |
80 compiler for a very small programming language. |
80 compilers for two very small programming languages. |
81 |
81 |
82 The first part of the module is about the problem of text processing, |
82 The first part of the module is about the problem of text processing, |
83 which is needed for compilers, but also for dictionaries, DNA-data, |
83 which is needed for compilers, but also for dictionaries, DNA-data, |
84 spam-filters and so on. When looking for a particular string, say |
84 spam-filters and so on. When looking for a particular string, say |
85 \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt |
85 \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt |
103 Often we also face the problem that we are given a string, for example |
103 Often we also face the problem that we are given a string, for example |
104 some user input, and we want to know whether it matches a particular |
104 some user input, and we want to know whether it matches a particular |
105 pattern---is it an email address, for example. In this way we can |
105 pattern---is it an email address, for example. In this way we can |
106 exclude user input that would otherwise have nasty effects on our |
106 exclude user input that would otherwise have nasty effects on our |
107 program (crashing it or making it go into an infinite loop, if not |
107 program (crashing it or making it go into an infinite loop, if not |
108 worse). This kind of ``inspecting'' mechanism is also implemented in |
108 worse). This kind of ``vetting'' mechanism is also implemented in |
109 popular network security tools such as Snort and |
109 popular network security tools such as Snort and |
110 Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming |
110 Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming |
111 network traffic for computer viruses or malicious packets. Similarly |
111 network traffic for computer viruses or malicious packets. Similarly |
112 filtering out spam usually involves looking for some signature |
112 filtering out spam usually involves looking for some signature |
113 (essentially a string pattern). The point is that the fast |
113 (essentially a string pattern). The point is that the fast |
114 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
114 Knuth-Morris-Pratt algorithm for strings is not good enough for such |
115 string \emph{patterns}.\smallskip |
115 string \emph{patterns}.\smallskip |
298 \end{axis} |
298 \end{axis} |
299 \end{tikzpicture} |
299 \end{tikzpicture} |
300 \end{tabular} |
300 \end{tabular} |
301 \end{center} |
301 \end{center} |
302 |
302 |
303 \noindent This first graph shows that Python, JavaScript and Java 8 need |
303 \noindent The first graph shows that Python, JavaScript and Java 8 need |
304 approximately 30 seconds to find out that the regular expression |
304 approximately 30 seconds to find out that the regular expression |
305 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly, |
305 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly, |
306 the second shows that Python and Ruby need approximately 29 seconds for finding |
306 the second shows that Python and Ruby need approximately 29 seconds for finding |
307 out whether a string of 28 \texttt{a}s matches the regular expression |
307 out whether a string of 28 \texttt{a}s matches the regular expression |
308 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this |
308 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this |
568 \emph{meaning} of a regular expression is. This is not good |
568 \emph{meaning} of a regular expression is. This is not good |
569 enough for specifications of what algorithms are supposed to |
569 enough for specifications of what algorithms are supposed to |
570 do or which problems they are supposed to solve. |
570 do or which problems they are supposed to solve. |
571 |
571 |
572 To define the meaning of a regular expression we will |
572 To define the meaning of a regular expression we will |
573 associate with every regular expression a language, or set of |
573 associate with every regular expression a language---a set of |
574 strings. This language contains all the strings the regular |
574 strings. This language contains all the strings the regular |
575 expression is supposed to match. To understand what is going |
575 expression is supposed to match. To understand what is going |
576 on here it is crucial that you have read the handout |
576 on here it is crucial that you have read the handout |
577 about basic mathematical notations. |
577 about basic mathematical notations. |
578 |
578 |
624 \noindent |
624 \noindent |
625 That means both languages are the same. |
625 That means both languages are the same. |
626 The point of the definition of $L$ is that we can use it to |
626 The point of the definition of $L$ is that we can use it to |
627 precisely specify when a string $s$ is matched by a regular |
627 precisely specify when a string $s$ is matched by a regular |
628 expression $r$, namely if and only if $s \in L(r)$. In fact we |
628 expression $r$, namely if and only if $s \in L(r)$. In fact we |
629 will write a program \pcode{match} that takes any string $s$ |
629 will write a program \pcode{match} that takes a string $s$ |
630 and any regular expression $r$ as arguments and returns |
630 and a regular expression $r$ as arguments and returns |
631 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
631 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in |
632 L(r)$. We leave this for the next lecture. |
632 L(r)$. We leave this for the next lecture. |
633 |
633 |
634 There is one more feature of regular expressions that is worth |
634 There is one more feature of regular expressions that is worth |
635 mentioning. Given some strings, there are in general many |
635 mentioning here. Given some strings, there are in general many |
636 different regular expressions that can recognise these |
636 different regular expressions that can recognise these |
637 strings. This is obvious with the regular expression $a + b$ |
637 strings. This is obvious with the regular expression $a + b$ |
638 which can match the strings $a$ and $b$. But also the regular |
638 which can match the strings $a$ and $b$. But also the regular |
639 expression $b + a$ would match the same strings. However, |
639 expression $b + a$ would match the same strings. However, |
640 sometimes it is not so obvious whether two regular expressions |
640 sometimes it is not so obvious whether two regular expressions |
647 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
647 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2) |
648 \] |
648 \] |
649 |
649 |
650 \noindent That means two regular expressions are said to be |
650 \noindent That means two regular expressions are said to be |
651 equivalent if they match the same set of strings. That is |
651 equivalent if they match the same set of strings. That is |
652 their meanings is the same. Therefore we |
652 their meanings are the same. Therefore we |
653 do not really distinguish between the different regular |
653 do not really distinguish between the different regular |
654 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
654 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, |
655 because they are equivalent. I leave you to the question |
655 because they are equivalent. I leave you to the question |
656 whether |
656 whether |
657 |
657 |
728 To conclude: Despite my early ignorance about regular expressions, I |
728 To conclude: Despite my early ignorance about regular expressions, I |
729 find them now extremely interesting. They have practical importance |
729 find them now extremely interesting. They have practical importance |
730 (remember the shocking runtime of the regular expression matchers in |
730 (remember the shocking runtime of the regular expression matchers in |
731 Python, Ruby and Java in some instances and the problems in Stack |
731 Python, Ruby and Java in some instances and the problems in Stack |
732 Exchange and the Atom editor). They are used in tools like Snort and |
732 Exchange and the Atom editor). They are used in tools like Snort and |
733 Bro in order to monitor network traffic. They have a beautiful mathematical |
733 Zeek in order to monitor network traffic. They have a beautiful mathematical |
734 theory behind them, which can be sometimes quite deep and which |
734 theory behind them, which can be sometimes quite deep and which |
735 sometimes contains hidden snares. People who are not very familiar |
735 sometimes contains hidden snares. People who are not very familiar |
736 with the mathematical background of regular expressions get them |
736 with the mathematical background of regular expressions get them |
737 consistently wrong (this is surprising given they are a supposed to be |
737 consistently wrong (this is surprising given they are a supposed to be |
738 core skill for computer scientists). The hope is that we can do better |
738 a core skill for computer scientists). The hope is that we can do better |
739 in the future---for example by proving that the algorithms actually |
739 in the future---for example by proving that the algorithms actually |
740 satisfy their specification and that the corresponding implementations |
740 satisfy their specification and that the corresponding implementations |
741 do not contain any bugs. We are close, but not yet quite there. |
741 do not contain any bugs. We are close, but not yet quite there. |
742 |
742 |
743 Notwithstanding my fascination, I am also happy to admit that regular |
743 Notwithstanding my fascination, I am also happy to admit that regular |
761 |
761 |
762 \noindent But they admit that by using this regular expression |
762 \noindent But they admit that by using this regular expression |
763 they wilfully violate the RFC 5322 standard, which specifies |
763 they wilfully violate the RFC 5322 standard, which specifies |
764 the syntax of email addresses. With their proposed regular |
764 the syntax of email addresses. With their proposed regular |
765 expression they are too strict in some cases and too lax in |
765 expression they are too strict in some cases and too lax in |
766 others. Not a good situation to be in. A regular expression |
766 others\ldots{}not a good situation to be in. A regular expression |
767 that is claimed to be closer to the standard is shown in |
767 that is claimed to be closer to the standard is shown in |
768 Figure~\ref{monster}. Whether this claim is true or not, I |
768 Figure~\ref{monster}. Whether this claim is true or not, I |
769 would not know---the only thing I can say about this regular |
769 would not know---the only thing I can say about this regular |
770 expression is it is a monstrosity. However, this might |
770 expression is it is a monstrosity. However, this might |
771 actually be an argument against the RFC standard, rather than |
771 actually be an argument against the RFC standard, rather than |