afl-material: comparison handouts/ho01.tex

-:94b84d880c2b
+:5f5e165c9a57
 % http://regexr.com
 %% emacs regexes
 %% https://www.gnu.org/software/emacs/manual/html_node/elisp/Regular-Expressions.html
-%%  reasons for a new programming language
+%% reasons for a new programming language
 %% http://beautifulracket.com
 % compiler explorer
 % https://gcc.godbolt.org
 increasingly wacky platforms. It’s effectively a perpetual
 employment act for solid compiler hackers.''
 \end{quote}
 \noindent
-So the goal of this module is to become a solid (beginner) compiler
+Given this, the goal of this module is to become a solid (beginner) compiler
-hacker and as part of the coursework to implement a small
+hacker and as part of the coursework to implement two small
-compiler for a very small programming language.
+compilers for two very small programming languages.
 The first part of the module is about the problem of text processing,
 which is needed for compilers, but also for dictionaries, DNA-data,
 spam-filters and so on. When looking for a particular string, say
 \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
 Often we also face the problem that we are given a string, for example
 some user input, and we want to know whether it matches a particular
 pattern---is it an email address, for example. In this way we can
 exclude user input that would otherwise have nasty effects on our
 program (crashing it or making it go into an infinite loop, if not
-worse). This kind of ``inspecting'' mechanism is also implemented in
+worse). This kind of ``vetting'' mechanism is also implemented in
 popular network security tools such as Snort and
-Bro.\here{www.snort.org}\here{www.bro.org} They scan incoming
+Zeek.\here{www.snort.org}\here{www.bro.org} They scan incoming
 network traffic for computer viruses or malicious packets. Similarly
 filtering out spam usually involves looking for some signature
 (essentially a string pattern). The point is that the fast
 Knuth-Morris-Pratt algorithm for strings is not good enough for such
 string \emph{patterns}.\smallskip
 \end{axis}
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\noindent This first graph shows that Python, JavaScript and Java 8 need
+\noindent The first graph shows that Python, JavaScript and Java 8 need
 approximately 30 seconds to find out that the regular expression
 $\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s. Similarly,
 the second shows that Python and Ruby need approximately 29 seconds for finding
 out whether a string of 28 \texttt{a}s matches the regular expression
 \texttt{a?\{28\}\,a\{28\}}.\footnote{In this
 \emph{meaning} of a regular expression is. This is not good
 enough for specifications of what algorithms are supposed to
 do or which problems they are supposed to solve.
 To define the meaning of a regular expression we will
-associate with every regular expression a language, or set of
+associate with every regular expression a language---a set of
 strings. This language contains all the strings the regular
 expression is supposed to match. To understand what is going
 on here it is crucial that you have read the handout
 about basic mathematical notations.
 \noindent
 That means both languages are the same.
 The point of the definition of $L$ is that we can use it to
 precisely specify when a string $s$ is matched by a regular
 expression $r$, namely if and only if $s \in L(r)$. In fact we
-will write a program \pcode{match} that takes any string $s$
+will write a program \pcode{match} that takes a string $s$
-and any regular expression $r$ as arguments and returns
+and a regular expression $r$ as arguments and returns
 \emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
 L(r)$. We leave this for the next lecture.
 There is one more feature of regular expressions that is worth
-mentioning. Given some strings, there are in general many
+mentioning here. Given some strings, there are in general many
 different regular expressions that can recognise these
 strings. This is obvious with the regular expression $a + b$
 which can match the strings $a$ and $b$. But also the regular
 expression $b + a$ would match the same strings. However,
 sometimes it is not so obvious whether two regular expressions
 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
 \]
 \noindent That means two regular expressions are said to be
 equivalent if they match the same set of strings. That is
-their meanings is the same. Therefore we
+their meanings are the same. Therefore we
 do not really distinguish between the different regular
 expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
 because they are equivalent. I leave you with the question
 whether
 To conclude: Despite my early ignorance about regular expressions, I
 find them now extremely interesting. They have practical importance
 (remember the shocking runtime of the regular expression matchers in
 Python, Ruby and Java in some instances and the problems in Stack
 Exchange and the Atom editor). They are used in tools like Snort and
-Bro in order to monitor network traffic. They have a beautiful mathematical
+Zeek in order to monitor network traffic. They have a beautiful mathematical
 theory behind them, which can sometimes be quite deep and which
 sometimes contains hidden snares.  People who are not very familiar
 with the mathematical background of regular expressions get them
 consistently wrong (this is surprising given they are supposed to be
-core skill for computer scientists). The hope is that we can do better
+a core skill for computer scientists). The hope is that we can do better
 in the future---for example by proving that the algorithms actually
 satisfy their specification and that the corresponding implementations
 do not contain any bugs. We are close, but not yet quite there.
 Notwithstanding my fascination, I am also happy to admit that regular
 \noindent But they admit that by using this regular expression
 they wilfully violate the RFC 5322 standard, which specifies
 the syntax of email addresses. With their proposed regular
 expression they are too strict in some cases and too lax in
-others. Not a good situation to be in. A regular expression
+others\ldots{}not a good situation to be in. A regular expression
 that is claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I
 would not know---the only thing I can say about this regular
 expression is that it is a monstrosity. However, this might
 actually be an argument against the RFC standard, rather than

changeset 872	5f5e165c9a57
parent 871	94b84d880c2b
child 876	771396fa6cc4