diff -r e3c64f22dd31 -r 14914b57e207 handouts/ho01.tex
--- a/handouts/ho01.tex Thu Apr 16 19:15:46 2020 +0100
+++ b/handouts/ho01.tex Wed May 06 15:37:31 2020 +0100
@@ -35,7 +35,6 @@
 % compiler explorer
 % https://gcc.godbolt.org
 
-%https://www.youtube.com/watch?v=gmhMQfFQu20
 
 % good article how languages have been influenced
 % 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
@@ -47,12 +46,13 @@
 
 \section*{Handout 1}
 
-The purpose of a compiler is to transform a program a human can read
-and write into code the machine can run as fast as
-possible. Developping a compiler is an old craft going back to 1952
-with the first compiler written by Grace Hopper. Why studiying
-compilers nowadays? An interesting answer is given by John Regher in
-his compiler
+The purpose of a compiler is to transform a program a human can read and
+write into code the machine can run as fast as possible. Developing a
+compiler is an old craft going back to 1952 with the first compiler
+written by Grace Hopper.\footnote{Who many years ago was invited on a
+talk show hosted by David Letterman, see
+\url{https://youtu.be/3N_ywhx6_K0?t=31}.} Why study compilers
+nowadays? An interesting answer is given by John Regehr in his compiler
 blog:\footnote{\url{http://blog.regehr.org/archives/1419}}
 
 \begin{quote}\it{}``We can start off with a couple of observations
@@ -73,18 +73,23 @@
 employment act for solid compiler hackers.''
 \end{quote}
 
+\noindent
+So the goal of this module is to become a solid (beginner) compiler
+hacker and, as part of the coursework, to implement a small
+compiler for a very small programming language.
 
-The first part of this module is about text processing, be it for
-compilers, dictionaries, DNA-data, spam-filters and so on.
-When looking for a particular string, say \pcode{"foobar"}, in a large
-text we can use the Knuth-Morris-Pratt algorithm, which is currently the
-most efficient general string search algorithm. But often we do
-\emph{not} just look for a particular string, but for string patterns.
-For example, in program code we need to identify what are the keywords
-(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc) and what
-are the identifiers (variable names). A pattern for identifiers could be
-stated as: they start with a letter, followed by zero or more letters,
-numbers and underscores.
+The first part of the module is about the problem of text processing,
+which is needed for compilers, but also for dictionaries, DNA-data,
+spam-filters and so on. When looking for a particular string, say
+\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
+algorithm, which is currently the most efficient general string search
+algorithm. But often we do \emph{not} just look for a particular string,
+but for string patterns. For example, in program code we need to
+identify what the keywords are (\texttt{if}, \texttt{then},
+\texttt{while}, \texttt{for}, etc) and what the identifiers are
+(variable names). A pattern for identifiers could be stated as: they
+start with a letter, followed by zero or more letters, numbers and
+underscores.
 
 %You might also be surprised what
 %constraints programming languages impose about numbers: for example
@@ -200,25 +205,23 @@
 \end{tabular}
 \end{center}
 
-\noindent With this table you can figure out the purpose of
-the regular expressions in the web-crawlers shown Figures
-\ref{crawler1}, \ref{crawler2} and
-\ref{crawler3}. Note, however, the regular expression for
-http-addresses in web-pages in Figure~\ref{crawler1}, Line 15,
-is intended to be
+\noindent With this table you can figure out the purpose of the regular
+expressions in the web-crawlers shown in Figures \ref{crawler1} and
+\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
+the regular expression for http-addresses in Line~\ref{httpline}. It is
+intended to be
 
 \[
 \pcode{"https?://[^"]*"}
 \]
 
-\noindent It specifies that web-addresses need to start with a
-double quote, then comes \texttt{http} followed by an optional
-\texttt{s} and so on until the closing double quote comes.
-Usually we would have to escape the double quotes in order to
-make sure we interpret the double quote as character, not as
-double quote for a string. But Scala's trick with triple
-quotes allows us to omit this kind of escaping. As a result we
-can just write:
+\noindent specifying that http-addresses need to start with a double
+quote, then comes \texttt{http} followed by an optional \texttt{s} and
+so on\ldots{}until the closing double quote comes at the end of the
+address. Normally we would have to escape the double quotes in order to
+make sure we interpret the double quote as a character, not as a double
+quote for a string. But Scala's trick with triple quotes allows us to
+omit this kind of ugly escaping. As a result we can just write:
 
 \[
 \pcode{""""https?://[^"]*"""".r}
 \]
 
@@ -348,17 +351,17 @@
 \end{center}
 
 Such troublesome regular expressions are sometimes called \emph{evil
- regular expressions} because they have the potential to make regular
- expression matching engines to topple over, like in Python, Ruby,
- JavaScript and Java 8. This ``toppling over'' is also sometimes called
- \emph{catastrophic backtracking}. I have also seen the term
- \emph{eternal matching} used for this. The problem with evil regular
- expressions is that they can have some serious consequences, for
- example, if you use them in your web-application. The reason is that
- hackers can look for these instances where the matching engine behaves
- badly and then mount a nice DoS-attack against your application. These
- attacks are already have their own name: \emph{Regular Expression
- Denial of Service Attacks (ReDoS)}.
+regular expressions} because they have the potential to make regular
+expression matching engines topple over, like in Python, Ruby,
+JavaScript and Java 8. This ``toppling over'' is also sometimes called
+\emph{catastrophic backtracking}. I have also seen the term
+\emph{eternal matching} used for this. The problem with evil regular
+expressions and catastrophic backtracking is that they can have some
+serious consequences, for example, if you use them in your
+web-application. The reason is that hackers can look for these instances
+where the matching engine behaves badly and then mount a nice DoS-attack
+against your application. These attacks already have their own name:
+\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
 
 It will be instructive to look behind the ``scenes'' to find out why
 Python and Ruby (and others) behave so badly when matching strings with
@@ -420,6 +423,11 @@
 \end{tikzpicture}
 \end{center}
 
+\noindent
+You might have wondered above why I looked at the (now) old Java 8: the
+reason is that Java 9 and later versions are a bit better, but we will
+still beat them hands down with our regex matcher.
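The hunk above only describes catastrophic backtracking in prose, so here is a small, self-contained Scala sketch that makes the slow-down visible. It is not part of the handout's code: it assumes the classic evil expression (a*)*b (which may differ from the examples the handout displays just above the elided \end{center}), and it relies on Java's built-in backtracking regex engine, which Scala's Regex class wraps. The input lengths at which the matching becomes painfully slow depend on the Java version and the machine.

// Sketch only (not from ho01.tex): catastrophic backtracking with the
// evil regular expression (a*)*b and Java's backtracking regex engine,
// which Scala's Regex class builds on.
object CatastrophicBacktracking {

  // rough wall-clock timing of a computation, in seconds
  def time[T](block: => T): Double = {
    val start = System.nanoTime()
    block
    (System.nanoTime() - start) / 1.0e9
  }

  def main(args: Array[String]): Unit = {
    val evil = "(a*)*b".r

    // none of these strings match, but the engine tries exponentially
    // many ways of splitting up the a's before it can report failure
    for (n <- 20 to 30) {
      val input = "a" * n
      println(f"n = $n%2d  time = ${time(evil.findFirstIn(input))}%.3f secs")
    }
  }
}

This also illustrates the remark about Java 8 versus Java 9: newer engines push these times down somewhat, but, as the handout says, the regex matcher developed in this module will still beat them hands down.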
+
 
 \subsection*{Basic Regular Expressions}
 
 The regular expressions shown earlier for Scala, we
@@ -776,34 +784,34 @@
   {../progs/crawler1.scala}
 
 \caption{The Scala code for a simple web-crawler that checks
-for broken links in a web-page. It uses the regular expression
+for broken links in web-pages. It uses the regular expression
 \texttt{http\_pattern} in Line~\ref{httpline} for recognising
 URL-addresses. It finds all links using the library function
-\texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
+\texttt{findAllIn} in Line~\ref{findallline} (this function
+is part of Scala's regular expression library).\label{crawler1}}
 \end{figure}
 
+
+%\begin{figure}[p]\small
+%  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
+%  {../progs/crawler2.scala}
+%
+%\caption{A version of the web-crawler that only follows links
+%in ``my'' domain---since these are the ones I am interested in
+%to fix. It uses the regular expression \texttt{my\_urls} in
+%Line~\ref{myurlline} to check for my name in the links. The
+%main change is in
+%Lines~\ref{changestartline}--\ref{changeendline} where there
+%is a test whether URL is in ``my'' domain or
+%not.\label{crawler2}}
+%\end{figure}
 
 \begin{figure}[p]\small
   \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
   {../progs/crawler2.scala}
 
-\caption{A version of the web-crawler that only follows links
-in ``my'' domain---since these are the ones I am interested in
-to fix. It uses the regular expression \texttt{my\_urls} in
-Line~\ref{myurlline} to check for my name in the links. The
-main change is in
-Lines~\ref{changestartline}--\ref{changeendline} where there
-is a test whether URL is in ``my'' domain or
-not.\label{crawler2}}
-
-\end{figure}
-
-\begin{figure}[p]\small
-  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
-  {../progs/crawler3.scala}
-
 \caption{A small email harvester---whenever we download a web-page, we
 also check whether it contains any email addresses. For this we use the
 regular expression
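Since this excerpt shows only the figure captions and not crawler1.scala itself, the following Scala sketch illustrates the step those captions describe: using the triple-quoted http_pattern from above together with findAllIn to collect every URL-address occurring in a downloaded page. The helper names (get_page, unquote, get_all_URLs) and the example URL are illustrative assumptions, not code taken from the handout.

// Sketch only: an approximation of the link-extraction step described
// in the captions; crawler1.scala itself is not shown in this excerpt.
import scala.io.Source
import scala.util.matching.Regex

object LinkExtractionSketch {

  // the regular expression discussed in the text: a double quote,
  // then http with an optional s, then anything up to the closing quote
  val http_pattern: Regex = """"https?://[^"]*"""".r

  // download a web-page as a string; return "" if anything goes wrong
  // (the name and the error handling are assumptions for this sketch)
  def get_page(url: String): String =
    try Source.fromURL(url, "ISO-8859-1").mkString
    catch { case _: Exception => "" }

  // strip the surrounding double quotes from a matched link
  def unquote(s: String): String = s.drop(1).dropRight(1)

  // findAllIn returns all non-overlapping matches found in the page
  def get_all_URLs(page: String): Set[String] =
    http_pattern.findAllIn(page).map(unquote).toSet

  def main(args: Array[String]): Unit =
    // example.com is a placeholder, not a URL used in the handout
    get_all_URLs(get_page("https://example.com")).foreach(println)
}

The real crawler presumably goes on to follow the links it finds and to check whether they can be fetched (that is what "checking for broken links" in the caption refers to); that recursive part is omitted here.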