--- a/handouts/ho01.tex Tue Mar 22 00:36:18 2022 +0000
+++ b/handouts/ho01.tex Sat Apr 02 01:52:43 2022 +0100
@@ -29,7 +29,7 @@
%% emacs regexes
%% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
-%% reasons for a new programming language
+%% reasons for a new programming language
%% http://beautifulracket.com
% compiler explorer
@@ -75,9 +75,9 @@
\end{quote}
\noindent
-So the goal of this module is to become a solid (beginner) compiler
-hacker and as part of the coursework to implement a small
-compiler for a very small programming language.
+Given this, the goal of this module is to become a solid (beginner) compiler
+hacker and as part of the coursework to implement two small
+compilers for two very small programming languages.
The first part of the module is about the problem of text processing,
which is needed for compilers, but also for dictionaries, DNA-data,
@@ -105,9 +105,9 @@
pattern---is it an email address, for example. In this way we can
exclude user input that would otherwise have nasty effects on our
program (crashing it or making it go into an infinite loop, if not
-worse). This kind of ``inspecting'' mechanism is also implemented in
+worse). This kind of ``vetting'' mechanism is also implemented in
popular network security tools such as Snort and
-Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
+Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming
network traffic for computer viruses or malicious packets. Similarly
filtering out spam usually involves looking for some signature
(essentially a string pattern). The point is that the fast
@@ -300,7 +300,7 @@
\end{tabular}
\end{center}
-\noindent This first graph shows that Python, JavaScript and Java 8 need
+\noindent The first graph shows that Python, JavaScript and Java 8 need
approximately 30 seconds to find out that the regular expression
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
the second shows that Python and Ruby need approximately 29 seconds for finding
@@ -570,7 +570,7 @@
do or which problems they are supposed to solve.
To define the meaning of a regular expression we will
-associate with every regular expression a language, or set of
+associate with every regular expression a language---a set of
strings. This language contains all the strings the regular
expression is supposed to match. To understand what is going
on here it is crucial that you have read the handout
@@ -626,13 +626,13 @@
The point of the definition of $L$ is that we can use it to
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
-will write a program \pcode{match} that takes any string $s$
-and any regular expression $r$ as arguments and returns
+will write a program \pcode{match} that takes a string $s$
+and a regular expression $r$ as arguments and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.
There is one more feature of regular expressions that is worth
-mentioning. Given some strings, there are in general many
+mentioning here. Given some strings, there are in general many
different regular expressions that can recognise these
strings. This is obvious with the regular expression $a + b$
which can match the strings $a$ and $b$. But also the regular
@@ -649,7 +649,7 @@
\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings. That is
-their meanings is the same. Therefore we
+their meanings are the same. Therefore we
do not really distinguish between the different regular
expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave you to the question
@@ -730,12 +730,12 @@
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). They are used in tools like Snort and
-Bro in order to monitor network traffic. They have a beautiful mathematical
+Zeek in order to monitor network traffic. They have a beautiful mathematical
theory behind them, which can be sometimes quite deep and which
sometimes contains hidden snares. People who are not very familiar
with the mathematical background of regular expressions get them
consistently wrong (this is surprising given they are a supposed to be
-core skill for computer scientists). The hope is that we can do better
+a core skill for computer scientists). The hope is that we can do better
in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.
@@ -763,7 +763,7 @@
they wilfully violate the RFC 5322 standard, which specifies
the syntax of email addresses. With their proposed regular
expression they are too strict in some cases and too lax in
-others. Not a good situation to be in. A regular expression
+others\ldots{}not a good situation to be in. A regular expression
that is claimed to be closer to the standard is shown in
Figure~\ref{monster}. Whether this claim is true or not, I
would not know---the only thing I can say about this regular