diff -r 94b84d880c2b -r 5f5e165c9a57 handouts/ho01.tex
--- a/handouts/ho01.tex	Tue Mar 22 00:36:18 2022 +0000
+++ b/handouts/ho01.tex	Sat Apr 02 01:52:43 2022 +0100
@@ -29,7 +29,7 @@
 %% emacs regexes
 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
 
-%% reasons for a new programming language 
+%% reasons for a new programming language
 %% http://beautifulracket.com
 
 % compiler explorer
@@ -75,9 +75,9 @@
 \end{quote}
 
 \noindent
-So the goal of this module is to become a solid (beginner) compiler
-hacker and as part of the coursework to implement a small
-compiler for a very small programming language.
+Given this, the goal of this module is to become a solid (beginner) compiler
+hacker and as part of the coursework to implement two small
+compilers for two very small programming languages.
 
 The first part of the module is about the problem of text
 processing, which is needed for compilers, but also for dictionaries, DNA-data,
@@ -105,9 +105,9 @@
 pattern---is it an email address, for example. In this way we can
 exclude user input that would otherwise have nasty effects on our
 program (crashing it or making it go into an infinite loop, if not
-worse). This kind of ``inspecting'' mechanism is also implemented in
+worse). This kind of ``vetting'' mechanism is also implemented in
 popular network security tools such as Snort and
-Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
+Zeek.\here{www.snort.org}\here{www.zeek.org} They scan incoming
 network traffic for computer viruses or malicious packets. Similarly
 filtering out spam usually involves looking for some signature
 (essentially a string pattern). The point is that the fast
@@ -300,7 +300,7 @@
 \end{tabular}
 \end{center}
 
-\noindent This first graph shows that Python, JavaScript and Java 8 need
+\noindent The first graph shows that Python, JavaScript and Java 8 need
 approximately 30 seconds to find out that the regular expression
 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
 the second shows that Python and Ruby need approximately 29 seconds for finding
@@ -570,7 +570,7 @@
 do or which problems they are supposed to solve.
 
 To define the meaning of a regular expression we will
-associate with every regular expression a language, or set of
+associate with every regular expression a language---a set of
 strings. This language contains all the strings the regular
 expression is supposed to match. To understand what is going on
 here it is crucial that you have read the handout
@@ -626,13 +626,13 @@
 The point of the definition of $L$ is that we can use it to
 precisely specify when a string $s$ is matched by a regular
 expression $r$, namely if and only if $s \in L(r)$. In fact we
-will write a program \pcode{match} that takes any string $s$
-and any regular expression $r$ as arguments and returns
+will write a program \pcode{match} that takes a string $s$
+and a regular expression $r$ as arguments and returns
 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in L(r)$. We
 leave this for the next lecture.
 
 There is one more feature of regular expressions that is worth
-mentioning. Given some strings, there are in general many
+mentioning here. Given some strings, there are in general many
 different regular expressions that can recognise these strings.
 This is obvious with the regular expression $a + b$ which can match
 the strings $a$ and $b$. But also the regular
@@ -649,7 +649,7 @@
 \noindent
 That means two regular expressions are said to be
 equivalent if they match the same set of strings. That is
-their meanings is the same. Therefore we
+their meanings are the same. Therefore we
 do not really distinguish between the different regular
-expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because
+expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$, because
 they are equivalent. I leave you to the question
@@ -730,12 +730,12 @@
 (remember the shocking runtime of the regular expression matchers
 in Python, Ruby and Java in some instances and the problems in Stack
 Exchange and the Atom editor). They are used in tools like Snort and
-Bro in order to monitor network traffic. They have a beautiful mathematical
+Zeek in order to monitor network traffic. They have a beautiful mathematical
 theory behind them, which can be sometimes quite deep and which
 sometimes contains hidden snares. People who are not very familiar
 with the mathematical background of regular expressions get them
-consistently wrong (this is surprising given they are a supposed to be
-core skill for computer scientists). The hope is that we can do better
+consistently wrong (this is surprising given they are supposed to be
+a core skill for computer scientists). The hope is that we can do better
 in the future---for example by proving that the algorithms actually
 satisfy their specification and that the corresponding implementations
 do not contain any bugs. We are close, but not yet quite there.
@@ -763,7 +763,7 @@
 they wilfully violate the RFC 5322 standard, which specifies the
 syntax of email addresses. With their proposed regular expression
 they are too strict in some cases and too lax in
-others. Not a good situation to be in. A regular expression
+others\ldots{}not a good situation to be in. A regular expression
 that is claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I would
 not know---the only thing I can say about this regular
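The handout's claim that matchers need around 30 seconds to reject 28 \texttt{a}s against $\texttt{(a*)*\,b}$ can be reproduced in miniature with Python's backtracking `re` engine. This is only an illustrative sketch: the absolute timings depend on machine and Python version, and the function names are my own.

```python
# Sketch: exponential backtracking of Python's re engine on the
# regular expression (a*)*b from the handout. A string of n 'a's
# contains no 'b', so the engine tries every way of splitting the
# a's between the nested stars before failing -- roughly 2^n attempts.
import re
import time

EVIL = re.compile(r'(a*)*b')

def time_reject(n):
    """Seconds taken to discover that a string of n 'a's does not match."""
    s = 'a' * n
    start = time.perf_counter()
    assert EVIL.fullmatch(s) is None   # the match must fail: there is no 'b'
    return time.perf_counter() - start

for n in range(14, 23, 2):
    # the time roughly doubles with each extra 'a'
    print(f"n = {n:2d}: {time_reject(n):.3f}s")
```

Extrapolating the doubling to n = 28 gives times on the order of the 30 seconds reported in the graphs, whereas a matcher based on the theory developed later in the module rejects such strings essentially instantly.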
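The meaning function $L$ and the equivalence of $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$ can also be made concrete with a small sketch. Since $L(r)$ is in general an infinite set, the sketch below only enumerates the strings in $L(r)$ up to a length bound; the constructor names are my own choices, and the actual \pcode{match} program is left for the next lecture, as the handout says.

```python
# Sketch of the meaning function L from the handout, cut off at
# strings of length <= k so the sets stay finite.
ZERO = ('zero',)                          # matches nothing
ONE  = ('one',)                           # matches only the empty string
def CHR(c):      return ('chr', c)        # matches the single character c
def ALT(r1, r2): return ('alt', r1, r2)   # r1 + r2
def SEQ(r1, r2): return ('seq', r1, r2)   # r1 . r2
def STAR(r):     return ('star', r)       # r*

def concat(A, B, k):
    """All concatenations of a string from A with one from B, up to length k."""
    return {x + y for x in A for y in B if len(x) + len(y) <= k}

def L(r, k):
    """The language of r, restricted to strings of length <= k."""
    tag = r[0]
    if tag == 'zero': return set()
    if tag == 'one':  return {''}
    if tag == 'chr':  return {r[1]} if k >= 1 else set()
    if tag == 'alt':  return L(r[1], k) | L(r[2], k)
    if tag == 'seq':  return concat(L(r[1], k), L(r[2], k), k)
    if tag == 'star':
        base, out = L(r[1], k), {''}
        while True:                       # iterate until the bounded set stabilises
            nxt = out | concat(base, out, k)
            if nxt == out:
                return out
            out = nxt

a, b, c = CHR('a'), CHR('b'), CHR('c')
# Two syntactically different regular expressions, one language:
print(L(ALT(ALT(a, b), c), 3) == L(ALT(a, ALT(b, c)), 3))  # True
```

The final check mirrors the definition of equivalence in the text: the two associations of $+$ are different expressions, but the sets of strings they match agree, which is why the handout does not distinguish between them.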