handouts/ho01.tex
changeset 722 14914b57e207
parent 716 df7d47a507f8
child 723 16db16950593
%% http://beautifulracket.com

% compiler explorer
% https://gcc.godbolt.org

% good article how languages have been influenced
% 10 MOST(LY DEAD) INFLUENTIAL PROGRAMMING LANGUAGES
% https://www.hillelwayne.com/post/influential-dead-languages/

\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}

\section*{Handout 1}

The purpose of a compiler is to transform a program a human can read and
write into code the machine can run as fast as possible. Developing a
compiler is an old craft going back to 1952 with the first compiler
written by Grace Hopper.\footnote{Who many years ago was invited on a
talk show hosted by David Letterman, see
\url{https://youtu.be/3N_ywhx6_K0?t=31}.} Why study compilers
nowadays? An interesting answer is given by John Regehr in his compiler
blog:\footnote{\url{http://blog.regehr.org/archives/1419}}

\begin{quote}\it{}``We can start off with a couple of observations
  about the role of compilers. First, hardware is getting weirder
  rather than getting clocked faster: almost all processors are
  and growing gap between increasingly high-level languages and
  increasingly wacky platforms. It’s effectively a perpetual
  employment act for solid compiler hackers.''
\end{quote}

\noindent
So the goal of this module is to become a solid (beginner) compiler
hacker and, as part of the coursework, to implement a small
compiler for a very small programming language.

The first part of the module is about the problem of text processing,
which is needed not only for compilers, but also for dictionaries,
DNA-data, spam-filters and so on. When looking for a particular string,
say \pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
algorithm, which is currently the most efficient general string search
algorithm. But often we do \emph{not} just look for a particular string,
but for string patterns. For example, in program code we need to
identify the keywords (\texttt{if}, \texttt{then}, \texttt{while},
\texttt{for}, etc) and the identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
zero or more letters, numbers and underscores.
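
For example, in Scala this identifier pattern could be written as
follows (a small sketch; the character ranges are just one way of
spelling out ``letters'' and ``numbers''):

\begin{lstlisting}[numbers=none]
// a letter, followed by zero or more letters, digits or underscores
val id_pattern = """[a-zA-Z][a-zA-Z0-9_]*""".r

// finds "x1" and "foo_bar", but not "_tmp" as a whole
id_pattern.findAllIn("x1 = foo_bar + _tmp").toList
\end{lstlisting}

\noindent Here \texttt{findAllIn} returns all (non-overlapping)
substrings that match the pattern. Note that it would still match the
\texttt{tmp}-part of \texttt{\_tmp}, because nothing forces a match to
start at a word boundary.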

%You might also be surprised what
%constraints programming languages impose about numbers: for example
%123 in JSON is OK, but 0123 is not.
%
\pcode{(re)}	& groups regular expressions and remembers 
matched text
\end{tabular}
\end{center}
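
\noindent For instance, the grouping construct in the last row can be
used in Scala to extract parts of a match (a small sketch; the pattern
and input below are made up for illustration):

\begin{lstlisting}[numbers=none]
val url = """(https?)://([^/"]+)""".r
val m = url.findFirstMatchIn("see http://example.com/index.html")

m.map(_.group(1))   // Some("http"),        first group
m.map(_.group(2))   // Some("example.com"), second group
\end{lstlisting}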

\noindent With this table you can figure out the purpose of the regular
expressions in the web-crawlers shown in Figures \ref{crawler1} and
\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
the regular expression for http-addresses in Line~\ref{httpline}. It is
intended to be

\[
\pcode{"https?://[^"]*"}
\]

\noindent specifying that http-addresses need to start with a double
quote, then comes \texttt{http} followed by an optional \texttt{s} and
so on\ldots{}until the closing double quote comes at the end of the
address. Normally we would have to escape the double quotes in order to
make sure we interpret the double quote as a character, not as the
delimiter of a string. But Scala's trick with triple quotes allows us to
omit this kind of ugly escaping. As a result we can just write:

\[
\pcode{""""https?://[^"]*"""".r}
\]

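\noindent With \texttt{findAllIn} this regular expression can then be
applied to a whole web-page (a small sketch; the page content below is
made up for illustration):

\begin{lstlisting}[numbers=none]
val http_pattern = """"https?://[^"]*"""".r

val page = """<a href="https://example.com">a link</a>"""

// the matches include the enclosing double quotes
http_pattern.findAllIn(page).toList
\end{lstlisting}
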
\begin{center}
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
\end{center}

Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby,
JavaScript and Java 8. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}.  I have also seen the term
\emph{eternal matching} used for this.  The problem with evil regular
expressions and catastrophic backtracking is that they can have some
serious consequences, for example, if you use them in your
web-application. The reason is that hackers can look for these instances
where the matching engine behaves badly and then mount a nice DoS-attack
against your application. These attacks already have their own name:
\emph{Regular Expression Denial of Service Attacks (ReDoS)}.

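A classic example of such an evil regular expression is one with nested
stars, for example \pcode{(a*)*b}. The following snippet is a small
sketch of how catastrophic backtracking can be observed with Scala's
standard (backtracking) regex engine, which is inherited from the JVM;
the exact timings vary between Java versions:

\begin{lstlisting}[numbers=none]
val evil = "(a*)*b".r

// no 'b' in the input, so the match must fail -- but the engine
// first tries exponentially many ways of splitting up the a's
val input = "a" * 30
evil.findFirstIn(input)   // can take a very long time
\end{lstlisting}
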
It will be instructive to look behind the ``scenes'' to find out why
Python and Ruby (and others) behave so badly when matching strings with
evil regular expressions. But we will also look at a relatively simple
algorithm that solves this problem much better than Python and Ruby
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent
You might have wondered above why I looked at the (now) old Java 8: the
reason is that Java 9 and later versions are a bit better, but we will
still beat them hands down with our regex matcher.

\subsection*{Basic Regular Expressions}

The regular expressions shown earlier for Scala we
will call \emph{extended regular expressions}. The ones we
will mainly study in this module are \emph{basic regular
\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler1.scala}

\caption{The Scala code for a simple web-crawler that checks
for broken links in web-pages. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
URL-addresses. It finds all links using the library function
\texttt{findAllIn} in Line~\ref{findallline} (this function
is part of Scala's regular expression library).\label{crawler1}}

\end{figure}


%\begin{figure}[p]\small
%  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
%                  {../progs/crawler2.scala}
%
%\caption{A version of the web-crawler that only follows links
%in ``my'' domain---since these are the ones I am interested
%in fixing. It uses the regular expression \texttt{my\_urls} in
%Line~\ref{myurlline} to check for my name in the links. The
%main change is in
%Lines~\ref{changestartline}--\ref{changeendline} where there
%is a test whether the URL is in ``my'' domain or
%not.\label{crawler2}}
%\end{figure}

\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler2.scala}

\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~\ref{emailline}. The main