handouts/ho01.tex
changeset 872 5f5e165c9a57
parent 871 94b84d880c2b
child 876 771396fa6cc4
equal deleted inserted replaced
871:94b84d880c2b 872:5f5e165c9a57
    27 % http://regexr.com
    27 % http://regexr.com
    28 
    28 
    29 %% emacs regexes
    29 %% emacs regexes
    30 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
    30 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
    31 
    31 
    32 %%  reasons for a new programming language
    32 %% reasons for a new programming language
    33 %% http://beautifulracket.com
    33 %% http://beautifulracket.com
    34 
    34 
    35 % compiler explorer
    35 % compiler explorer
    36 % https://gcc.godbolt.org
    36 % https://gcc.godbolt.org
    37 
    37 
    73   increasingly wacky platforms. It’s effectively a perpetual
    73   increasingly wacky platforms. It’s effectively a perpetual
    74   employment act for solid compiler hackers.''
    74   employment act for solid compiler hackers.''
    75 \end{quote}  
    75 \end{quote}  
    76 
    76 
    77 \noindent
    77 \noindent
    78 So the goal of this module is to become a solid (beginner) compiler
    78 Given this, the goal of this module is to become a solid (beginner) compiler
    79 hacker and as part of the coursework to implement a small
    79 hacker and as part of the coursework to implement two small
    80 compiler for a very small programming language.
    80 compilers for two very small programming languages.
    81 
    81 
    82 The first part of the module is about the problem of text processing,
    82 The first part of the module is about the problem of text processing,
    83 which is needed for compilers, but also for dictionaries, DNA-data,
    83 which is needed for compilers, but also for dictionaries, DNA-data,
    84 spam-filters and so on. When looking for a particular string, say
    84 spam-filters and so on. When looking for a particular string, say
    85 \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
    85 \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
   103 Often we also face the problem that we are given a string, for example
   103 Often we also face the problem that we are given a string, for example
   104 some user input, and we want to know whether it matches a particular
   104 some user input, and we want to know whether it matches a particular
   105 pattern---is it an email address, for example. In this way we can
   105 pattern---is it an email address, for example. In this way we can
   106 exclude user input that would otherwise have nasty effects on our
   106 exclude user input that would otherwise have nasty effects on our
   107 program (crashing it or making it go into an infinite loop, if not
   107 program (crashing it or making it go into an infinite loop, if not
   108 worse). This kind of ``inspecting'' mechanism is also implemented in
   108 worse). This kind of ``vetting'' mechanism is also implemented in
   109 popular network security tools such as Snort and
   109 popular network security tools such as Snort and
   110 Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
   110 Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming
   111 network traffic for computer viruses or malicious packets. Similarly
   111 network traffic for computer viruses or malicious packets. Similarly
   112 filtering out spam usually involves looking for some signature
   112 filtering out spam usually involves looking for some signature
   113 (essentially a string pattern). The point is that the fast
   113 (essentially a string pattern). The point is that the fast
   114 Knuth-Morris-Pratt algorithm for strings is not good enough for such
   114 Knuth-Morris-Pratt algorithm for strings is not good enough for such
   115 string \emph{patterns}.\smallskip
   115 string \emph{patterns}.\smallskip
   298 \end{axis}
   298 \end{axis}
   299 \end{tikzpicture}
   299 \end{tikzpicture}
   300 \end{tabular}
   300 \end{tabular}
   301 \end{center}
   301 \end{center}
   302 
   302 
   303 \noindent This first graph shows that Python, JavaScript and Java 8 need
   303 \noindent The first graph shows that Python, JavaScript and Java 8 need
   304 approximately 30 seconds to find out that the regular expression
   304 approximately 30 seconds to find out that the regular expression
   305 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
   305 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
   306 the second shows that Python and Ruby need approximately 29 seconds for finding
   306 the second shows that Python and Ruby need approximately 29 seconds for finding
   307 out whether a string of 28 \texttt{a}s matches the regular expression
   307 out whether a string of 28 \texttt{a}s matches the regular expression
   308 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
   308 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
   568 \emph{meaning} of a regular expression is. This is not good
   568 \emph{meaning} of a regular expression is. This is not good
   569 enough for specifications of what algorithms are supposed to
   569 enough for specifications of what algorithms are supposed to
   570 do or which problems they are supposed to solve.
   570 do or which problems they are supposed to solve.
   571 
   571 
   572 To define the meaning of a regular expression we will
   572 To define the meaning of a regular expression we will
   573 associate with every regular expression a language, or set of
   573 associate with every regular expression a language---a set of
   574 strings. This language contains all the strings the regular
   574 strings. This language contains all the strings the regular
   575 expression is supposed to match. To understand what is going
   575 expression is supposed to match. To understand what is going
   576 on here it is crucial that you have read the handout
   576 on here it is crucial that you have read the handout
   577 about basic mathematical notations.
   577 about basic mathematical notations.
   578 
   578 
   624 \noindent
   624 \noindent
   625 That means both languages are the same.
   625 That means both languages are the same.
   626 The point of the definition of $L$ is that we can use it to
   626 The point of the definition of $L$ is that we can use it to
   627 precisely specify when a string $s$ is matched by a regular
   627 precisely specify when a string $s$ is matched by a regular
   628 expression $r$, namely if and only if $s \in L(r)$. In fact we
   628 expression $r$, namely if and only if $s \in L(r)$. In fact we
   629 will write a program \pcode{match} that takes any string $s$
   629 will write a program \pcode{match} that takes a string $s$
   630 and any regular expression $r$ as arguments and returns
   630 and a regular expression $r$ as arguments and returns
   631 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   631 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
   632 L(r)$. We leave this for the next lecture.
   632 L(r)$. We leave this for the next lecture.
   633 
   633 
   634 There is one more feature of regular expressions that is worth
   634 There is one more feature of regular expressions that is worth
   635 mentioning. Given some strings, there are in general many
   635 mentioning here. Given some strings, there are in general many
   636 different regular expressions that can recognise these
   636 different regular expressions that can recognise these
   637 strings. This is obvious with the regular expression $a + b$
   637 strings. This is obvious with the regular expression $a + b$
   638 which can match the strings $a$ and $b$. But also the regular
   638 which can match the strings $a$ and $b$. But also the regular
   639 expression $b + a$ would match the same strings. However,
   639 expression $b + a$ would match the same strings. However,
   640 sometimes it is not so obvious whether two regular expressions
   640 sometimes it is not so obvious whether two regular expressions
   647 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   647 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
   648 \]
   648 \]
   649 
   649 
   650 \noindent That means two regular expressions are said to be
   650 \noindent That means two regular expressions are said to be
   651 equivalent if they match the same set of strings. That is
   651 equivalent if they match the same set of strings. That is
   652 their meanings is the same. Therefore we
   652 their meanings are the same. Therefore we
   653 do not really distinguish between the different regular
   653 do not really distinguish between the different regular
   654 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
   654 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
   655 because they are equivalent. I leave you to the question
   655 because they are equivalent. I leave you to the question
   656 whether
   656 whether
   657 
   657 
   728 To conclude: Despite my early ignorance about regular expressions, I
   728 To conclude: Despite my early ignorance about regular expressions, I
   729 find them now extremely interesting. They have practical importance
   729 find them now extremely interesting. They have practical importance
   730 (remember the shocking runtime of the regular expression matchers in
   730 (remember the shocking runtime of the regular expression matchers in
   731 Python, Ruby and Java in some instances and the problems in Stack
   731 Python, Ruby and Java in some instances and the problems in Stack
   732 Exchange and the Atom editor). They are used in tools like Snort and
   732 Exchange and the Atom editor). They are used in tools like Snort and
   733 Bro in order to monitor network traffic. They have a beautiful mathematical
   733 Zeek in order to monitor network traffic. They have a beautiful mathematical
   734 theory behind them, which can be sometimes quite deep and which
   734 theory behind them, which can be sometimes quite deep and which
   735 sometimes contains hidden snares.  People who are not very familiar
   735 sometimes contains hidden snares.  People who are not very familiar
   736 with the mathematical background of regular expressions get them
   736 with the mathematical background of regular expressions get them
   737 consistently wrong (this is surprising given they are a supposed to be
   737 consistently wrong (this is surprising given they are a supposed to be
   738 core skill for computer scientists). The hope is that we can do better
   738 a core skill for computer scientists). The hope is that we can do better
   739 in the future---for example by proving that the algorithms actually
   739 in the future---for example by proving that the algorithms actually
   740 satisfy their specification and that the corresponding implementations
   740 satisfy their specification and that the corresponding implementations
   741 do not contain any bugs. We are close, but not yet quite there.
   741 do not contain any bugs. We are close, but not yet quite there.
   742 
   742 
   743 Notwithstanding my fascination, I am also happy to admit that regular
   743 Notwithstanding my fascination, I am also happy to admit that regular
   761 
   761 
   762 \noindent But they admit that by using this regular expression
   762 \noindent But they admit that by using this regular expression
   763 they wilfully violate the RFC 5322 standard, which specifies
   763 they wilfully violate the RFC 5322 standard, which specifies
   764 the syntax of email addresses. With their proposed regular
   764 the syntax of email addresses. With their proposed regular
   765 expression they are too strict in some cases and too lax in
   765 expression they are too strict in some cases and too lax in
   766 others. Not a good situation to be in. A regular expression
   766 others\ldots{}not a good situation to be in. A regular expression
   767 that is claimed to be closer to the standard is shown in
   767 that is claimed to be closer to the standard is shown in
   768 Figure~\ref{monster}. Whether this claim is true or not, I
   768 Figure~\ref{monster}. Whether this claim is true or not, I
   769 would not know---the only thing I can say about this regular
   769 would not know---the only thing I can say about this regular
   770 expression is it is a monstrosity. However, this might
   770 expression is it is a monstrosity. However, this might
   771 actually be an argument against the RFC standard, rather than
   771 actually be an argument against the RFC standard, rather than