--- a/handouts/ho01.tex Thu Apr 16 19:15:46 2020 +0100
+++ b/handouts/ho01.tex Wed May 06 15:37:31 2020 +0100
@@ -35,7 +35,6 @@
% compiler explorer
% https://gcc.godbolt.org
-%https://www.youtube.com/watch?v=gmhMQfFQu20
% good article how languages have been influenced
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
@@ -47,12 +46,13 @@
\section*{Handout 1}
-The purpose of a compiler is to transform a program a human can read
-and write into code the machine can run as fast as
-possible. Developping a compiler is an old craft going back to 1952
-with the first compiler written by Grace Hopper. Why studiying
-compilers nowadays? An interesting answer is given by John Regher in
-his compiler
+The purpose of a compiler is to transform a program a human can read and
+write into code the machine can run as fast as possible. Developing a
+compiler is an old craft going back to 1952 with the first compiler
+written by Grace Hopper.\footnote{Who many years ago was invited on a
+talk show hosted by David Letterman, see
+\url{https://youtu.be/3N_ywhx6_K0?t=31}.} Why study compilers
+nowadays? An interesting answer is given by John Regehr in his compiler
blog:\footnote{\url{http://blog.regehr.org/archives/1419}}
\begin{quote}\it{}``We can start off with a couple of observations
@@ -73,18 +73,23 @@
employment act for solid compiler hackers.''
\end{quote}
+\noindent
+So the goal of this module is for you to become a solid (beginner)
+compiler hacker and, as part of the coursework, to implement a small
+compiler for a very small programming language.
-The first part of this module is about text processing, be it for
-compilers, dictionaries, DNA-data, spam-filters and so on.
-When looking for a particular string, say \pcode{"foobar"}, in a large
-text we can use the Knuth-Morris-Pratt algorithm, which is currently the
-most efficient general string search algorithm. But often we do
-\emph{not} just look for a particular string, but for string patterns.
-For example, in program code we need to identify what are the keywords
-(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what
-are the identifiers (variable names). A pattern for identifiers could be
-stated as: they start with a letter, followed by zero or more letters,
-numbers and underscores.
+The first part of the module is about the problem of text processing,
+which is needed for compilers, but also for dictionaries, DNA-data,
+spam-filters and so on. When looking for a particular string, say
+\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
+algorithm, which is currently the most efficient general string search
+algorithm. But often we do \emph{not} just look for a particular string,
+but for string patterns. For example, in program code we need to
+identify the keywords (\texttt{if}, \texttt{then}, \texttt{while},
+\texttt{for}, etc) and the identifiers (variable names). A pattern for
+identifiers could be stated as: they
+start with a letter, followed by zero or more letters, numbers and
+underscores.
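+
+To preview the regular expression notation discussed below, one possible
+(though by no means the only) way of writing down this pattern is
+\[
+\pcode{[a-zA-Z][a-zA-Z0-9_]*}
+\]
+
+\noindent
+meaning a single letter at the start, followed by zero or more letters,
+digits or underscores.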
%You might also be surprised what
%constraints programming languages impose about numbers: for example
@@ -200,25 +205,23 @@
\end{tabular}
\end{center}
-\noindent With this table you can figure out the purpose of
-the regular expressions in the web-crawlers shown Figures
-\ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}. Note, however, the regular expression for
-http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
-is intended to be
+\noindent With this table you can figure out the purpose of the regular
+expressions in the web-crawlers shown in Figures \ref{crawler1} and
+\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
+the regular expression for http-addresses in Line~\ref{httpline}. It is
+intended to be
\[
\pcode{"https?://[^"]*"}
\]
-\noindent It specifies that web-addresses need to start with a
-double quote, then comes \texttt{http} followed by an optional
-\texttt{s} and so on until the closing double quote comes.
-Usually we would have to escape the double quotes in order to
-make sure we interpret the double quote as character, not as
-double quote for a string. But Scala's trick with triple
-quotes allows us to omit this kind of escaping. As a result we
-can just write:
+\noindent specifying that http-addresses need to start with a double
+quote, then comes \texttt{http} followed by an optional \texttt{s} and
+so on\ldots{}until the closing double quote comes at the end of the
+address. Normally we would have to escape the double quotes in order to
+make sure we interpret the double quote as a character, not as the
+double quote that delimits a string. But Scala's trick with triple
+quotes allows us to
+omit this kind of ugly escaping. As a result we can just write:
\[
\pcode{""""https?://[^"]*"""".r}
@@ -348,17 +351,17 @@
\end{center}
Such troublesome regular expressions are sometimes called \emph{evil
- regular expressions} because they have the potential to make regular
- expression matching engines to topple over, like in Python, Ruby,
- JavaScript and Java 8. This ``toppling over'' is also sometimes called
- \emph{catastrophic backtracking}. I have also seen the term
- \emph{eternal matching} used for this. The problem with evil regular
- expressions is that they can have some serious consequences, for
- example, if you use them in your web-application. The reason is that
- hackers can look for these instances where the matching engine behaves
- badly and then mount a nice DoS-attack against your application. These
- attacks are already have their own name: \emph{Regular Expression
- Denial of Service Attacks (ReDoS)}.
+regular expressions} because they have the potential to make regular
+expression matching engines topple over, like in Python, Ruby,
+JavaScript and Java 8. This ``toppling over'' is also sometimes called
+\emph{catastrophic backtracking}. I have also seen the term
+\emph{eternal matching} used for this. The problem with evil regular
+expressions and catastrophic backtracking is that they can have some
+serious consequences, for example, if you use them in your
+web-application. The reason is that hackers can look for these instances
+where the matching engine behaves badly and then mount a nice DoS-attack
+against your application. These attacks already have their own name:
+\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
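+
+To see this ``toppling over'' yourself, you can time an evil pattern,
+for instance \pcode{(a*)*b}, against longer and longer strings of
+\texttt{a}'s. The snippet below is only a rough sketch of such an
+experiment (it is not the code used for the measurements in this
+module, and the exact runtimes will depend on which Java version sits
+underneath your Scala):
+
+\begin{lstlisting}[numbers=none]
+// (a*)*b is one example of an evil pattern: no string consisting only
+// of a's can match it, but the built-in backtracking matcher from
+// java.util.regex only finds that out after trying a huge number of
+// ways to split the a's between the two stars
+val evil = "(a*)*b"
+for (n <- 5 to 25 by 5) {
+  val input = "a" * n
+  val start = System.nanoTime()
+  input.matches(evil)
+  println("n = " + n + " took " +
+          (System.nanoTime() - start) / 1.0e9 + " secs")
+}
+\end{lstlisting}
+
+\noindent
+On Java 8 the runtime grows dramatically with the number of
+\texttt{a}'s, and it is exactly this behaviour that a ReDoS attack
+exploits.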
It will be instructive to look behind the ``scenes'' to find out why
Python and Ruby (and others) behave so badly when matching strings with
@@ -420,6 +423,11 @@
\end{tikzpicture}
\end{center}
+\noindent
+You might have wondered above why I looked at the (now) old Java 8: the
+reason is that Java 9 and later versions are a bit better, but we will
+still beat them hands down with our regex matcher.
+
\subsection*{Basic Regular Expressions}
The regular expressions shown earlier for Scala, we
@@ -776,34 +784,34 @@
{../progs/crawler1.scala}
\caption{The Scala code for a simple web-crawler that checks
-for broken links in a web-page. It uses the regular expression
+for broken links in web-pages. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
URL-addresses. It finds all links using the library function
-\texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
+\texttt{findAllIn} in Line~\ref{findallline} (this function
+is part of Scala's regular expression library).\label{crawler1}}
\end{figure}
+
+%\begin{figure}[p]\small
+% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+% {../progs/crawler2.scala}
+%
+%\caption{A version of the web-crawler that only follows links
+%in ``my'' domain---since these are the ones I am interested in
+%to fix. It uses the regular expression \texttt{my\_urls} in
+%Line~\ref{myurlline} to check for my name in the links. The
+%main change is in
+%Lines~\ref{changestartline}--\ref{changeendline} where there
+%is a test whether URL is in ``my'' domain or
+%not.\label{crawler2}}
+%\end{figure}
\begin{figure}[p]\small
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler2.scala}
-\caption{A version of the web-crawler that only follows links
-in ``my'' domain---since these are the ones I am interested in
-to fix. It uses the regular expression \texttt{my\_urls} in
-Line~\ref{myurlline} to check for my name in the links. The
-main change is in
-Lines~\ref{changestartline}--\ref{changeendline} where there
-is a test whether URL is in ``my'' domain or
-not.\label{crawler2}}
-
-\end{figure}
-
-\begin{figure}[p]\small
- \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
- {../progs/crawler3.scala}
-
\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression