handouts/ho01.tex
changeset 722 14914b57e207
parent 716 df7d47a507f8
child 723 16db16950593
--- a/handouts/ho01.tex	Thu Apr 16 19:15:46 2020 +0100
+++ b/handouts/ho01.tex	Wed May 06 15:37:31 2020 +0100
@@ -35,7 +35,6 @@
 % compiler explorer
 % https://gcc.godbolt.org
 
-%https://www.youtube.com/watch?v=gmhMQfFQu20
 
 % good article how languages have been influenced
 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
@@ -47,12 +46,13 @@
 
 \section*{Handout 1}
 
-The purpose of a compiler is to transform a program a human can read
-and write into code the machine can run as fast as
-possible. Developping a compiler is an old craft going back to 1952
-with the first compiler written by Grace Hopper.  Why studiying
-compilers nowadays?  An interesting answer is given by John Regher in
-his compiler
+The purpose of a compiler is to transform a program a human can read and
+write into code the machine can run as fast as possible. Developing a
+compiler is an old craft going back to 1952 with the first compiler
+written by Grace Hopper.\footnote{Who many years ago was invited on a
+talk show hosted by David Letterman, see
+\url{https://youtu.be/3N_ywhx6_K0?t=31}.} Why study compilers
+nowadays? An interesting answer is given by John Regehr in his compiler
 blog:\footnote{\url{http://blog.regehr.org/archives/1419}}
 
 \begin{quote}\it{}``We can start off with a couple of observations
@@ -73,18 +73,23 @@
   employment act for solid compiler hackers.''
 \end{quote}  
 
+\noindent
+So the goal of this module is to become a solid (beginner) compiler
+hacker and, as part of the coursework, to implement a small
+compiler for a very small programming language.
 
-The first part of this module is about text processing, be it for
-compilers, dictionaries, DNA-data, spam-filters and so on.
-When looking for a particular string, say \pcode{"foobar"}, in a large
-text we can use the Knuth-Morris-Pratt algorithm, which is currently the
-most efficient general string search algorithm. But often we do
-\emph{not} just look for a particular string, but for string patterns.
-For example, in program code we need to identify what are the keywords
-(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what
-are the identifiers (variable names). A pattern for identifiers could be
-stated as: they start with a letter, followed by zero or more letters,
-numbers and underscores. 
+The first part of the module is about the problem of text processing,
+which is needed for compilers, but also for dictionaries, DNA-data,
+spam-filters and so on. When looking for a particular string, say
+\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
+algorithm, which is currently the most efficient general string search
+algorithm. But often we do \emph{not} just look for a particular string,
+but for string patterns. For example, in program code we need to
+identify which words are the keywords (\texttt{if}, \texttt{then},
+\texttt{while}, \texttt{for}, etc) and which are the identifiers
+(variable names). A pattern for identifiers could be stated as: they
+start with a letter, followed by zero or more letters, numbers and
+underscores. 
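+
+To get a first taste of Scala's regular expression library (which the
+web-crawlers later in this handout also use), here is a small
+illustrative sketch of such an identifier pattern; the name
+\texttt{id\_pattern} is made up for this example:
+
+\begin{lstlisting}[numbers=none]
+// an identifier: a letter, followed by zero or more
+// letters, digits or underscores
+val id_pattern = """[a-zA-Z][a-zA-Z0-9_]*""".r
+
+// findAllIn returns all matching substrings
+id_pattern.findAllIn("if x_1 > 0 then foo2 else bar").toList
+// => List(if, x_1, then, foo2, else, bar)
+\end{lstlisting}
+
+\noindent Note that the keywords \texttt{if}, \texttt{then} and
+\texttt{else} also fit the identifier pattern---this is why lexers
+usually give keyword patterns precedence over the pattern for
+identifiers.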
 
 %You might also be surprised what
 %constraints programming languages impose about numbers: for example
@@ -200,25 +205,23 @@
 \end{tabular}
 \end{center}
 
-\noindent With this table you can figure out the purpose of
-the regular expressions in the web-crawlers shown Figures
-\ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}. Note, however, the regular expression for
-http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
-is intended to be
+\noindent With this table you can figure out the purpose of the regular
+expressions in the web-crawlers shown in Figures \ref{crawler1} and
+\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
+the regular expression for http-addresses in Line~\ref{httpline}. It is
+intended to be
 
 \[
 \pcode{"https?://[^"]*"}
 \]
 
-\noindent It specifies that web-addresses need to start with a
-double quote, then comes \texttt{http} followed by an optional
-\texttt{s} and so on until the closing double quote comes.
-Usually we would have to escape the double quotes in order to
-make sure we interpret the double quote as character, not as
-double quote for a string. But Scala's trick with triple
-quotes allows us to omit this kind of escaping. As a result we
-can just write:
+\noindent specifying that http-addresses need to start with a double
+quote, then comes \texttt{http} followed by an optional \texttt{s} and
+so on\ldots{}until the closing double quote at the end of the address.
+Normally we would have to escape the double quotes in order to make sure
+they are interpreted as characters, not as the delimiters of a string.
+But Scala's trick with triple quotes allows us to omit this kind of ugly
+escaping. As a result we can just write:
 
 \[
 \pcode{""""https?://[^"]*"""".r}
@@ -348,17 +351,17 @@
 \end{center}  
 
 Such troublesome regular expressions are sometimes called \emph{evil
-  regular expressions} because they have the potential to make regular
-  expression matching engines to topple over, like in Python, Ruby,
-  JavaScript and Java 8. This ``toppling over'' is also sometimes called
-  \emph{catastrophic backtracking}.  I have also seen the term
-  \emph{eternal matching} used for this.  The problem with evil regular
-  expressions is that they can have some serious consequences, for
-  example, if you use them in your web-application. The reason is that
-  hackers can look for these instances where the matching engine behaves
-  badly and then mount a nice DoS-attack against your application. These
-  attacks are already have their own name: \emph{Regular Expression
-  Denial of Service Attacks (ReDoS)}.
+regular expressions} because they have the potential to make regular
+expression matching engines topple over, like in Python, Ruby,
+JavaScript and Java 8. This ``toppling over'' is also sometimes called
+\emph{catastrophic backtracking}.  I have also seen the term
+\emph{eternal matching} used for this.  The problem with evil regular
+expressions and catastrophic backtracking is that they can have some
+serious consequences, for example, if you use them in your
+web-application. The reason is that hackers can look for these instances
+where the matching engine behaves badly and then mount a nice DoS-attack
+against your application. These attacks already have their own name:
+\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
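+
+\noindent To get a feeling for this ``toppling over'', here is a small
+sketch of how one could measure it in Scala (whose \texttt{.r} regular
+expressions are run by the JVM's backtracking engine). The pattern, the
+helper function \texttt{time\_needed} and the string sizes are only
+illustrative, and the actual running times depend on the Java version:
+
+\begin{lstlisting}[numbers=none]
+// an "evil" regular expression: matching it against strings
+// consisting of a's only must fail, but a backtracking engine
+// tries out a huge number of ways to split up the a's
+val evil = "(a*)*b".r
+
+def time_needed[T](code: => T) : Double = {
+  val start = System.nanoTime()
+  code
+  (System.nanoTime() - start) / 1.0e9
+}
+
+for (n <- 10 to 30 by 5)
+  println(n + " a's: " + time_needed(evil.findFirstIn("a" * n)) + " secs")
+\end{lstlisting}
+
+\noindent Already for quite short strings of a's the times grow very
+quickly on Java 8; later JVM versions fare somewhat better, as mentioned
+below.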
 
 It will be instructive to look behind the ``scenes'' to find out why
 Python and Ruby (and others) behave so badly when matching strings with
@@ -420,6 +423,11 @@
 \end{tikzpicture}
 \end{center}
 
+\noindent
+You might have wondered above why I looked at the (now) old Java 8: the
+reason is that Java 9 and later versions are a bit better, but we will
+still beat them hands down with our regex matcher.
+
 \subsection*{Basic Regular Expressions}
 
 The regular expressions shown earlier for Scala, we
@@ -776,34 +784,34 @@
                   {../progs/crawler1.scala}
 
 \caption{The Scala code for a simple web-crawler that checks
-for broken links in a web-page. It uses the regular expression
+for broken links in web-pages. It uses the regular expression
 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
 URL-addresses. It finds all links using the library function
-\texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
+\texttt{findAllIn} in Line~\ref{findallline} (this function 
+is part of Scala's regular expression library).\label{crawler1}}
 
 \end{figure}
 
 
+%\begin{figure}[p]\small
+%  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+%                  {../progs/crawler2.scala}
+%
+%\caption{A version of the web-crawler that only follows links
+%in ``my'' domain---since these are the ones I am interested in
+%to fix. It uses the regular expression \texttt{my\_urls} in
+%Line~\ref{myurlline} to check for my name in the links. The
+%main change is in
+%Lines~\ref{changestartline}--\ref{changeendline} where there
+%is a test whether URL is in ``my'' domain or
+%not.\label{crawler2}}
+%\end{figure}
 
 \begin{figure}[p]\small
   \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                   {../progs/crawler2.scala}
 
-\caption{A version of the web-crawler that only follows links
-in ``my'' domain---since these are the ones I am interested in
-to fix. It uses the regular expression \texttt{my\_urls} in
-Line~\ref{myurlline} to check for my name in the links. The
-main change is in
-Lines~\ref{changestartline}--\ref{changeendline} where there
-is a test whether URL is in ``my'' domain or
-not.\label{crawler2}}
-
-\end{figure}
-
-\begin{figure}[p]\small
-  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
-                  {../progs/crawler3.scala}
-
 \caption{A small email harvester---whenever we download a
 web-page, we also check whether it contains any email
 addresses. For this we use the regular expression