\begin{document}

\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}

\section*{Handout 1}

The purpose of a compiler is to transform a program a human can read and
write into code the machine can run as fast as possible. Developing a
compiler is an old craft going back to 1952 with the first compiler
written by Grace Hopper.\footnote{Who many years ago was invited on a
talk show hosted by David Letterman, see
\url{https://youtu.be/3N_ywhx6_K0?t=31}.} Why study compilers
nowadays? An interesting answer is given by John Regehr in his compiler
blog:\footnote{\url{http://blog.regehr.org/archives/1419}}

\begin{quote}\it{}``We can start off with a couple of observations
about the role of compilers. First, hardware is getting weirder
rather than getting clocked faster: almost all processors are
[\ldots] the large
and growing gap between increasingly high-level languages and
increasingly wacky platforms. It’s effectively a perpetual
employment act for solid compiler hackers.''
\end{quote}

\noindent
So the goal of this module is to become a solid (beginner) compiler
hacker and, as part of the coursework, to implement a small
compiler for a very small programming language.

The first part of the module is about the problem of text processing,
which is needed for compilers, but also for dictionaries, DNA-data,
spam-filters and so on. When looking for a particular string, say
\pcode{"foobar"}, in a large text we can use the Knuth-Morris-Pratt
algorithm, which is currently the most efficient general string search
algorithm. But often we do \emph{not} just look for a particular string,
but for string patterns. For example, in program code we need to
identify what are the keywords (\texttt{if}, \texttt{then},
\texttt{while}, \texttt{for}, etc) and what are the identifiers
(variable names). A pattern for identifiers could be stated as: they
start with a letter, followed by zero or more letters, numbers and
underscores.
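
This identifier pattern can be written down directly as a Scala regular
expression. The following is our own sketch (the name
\texttt{id\_pattern} and the test strings are not taken from the
handout's code):

```scala
// Our own sketch of the identifier pattern described above: a letter,
// followed by zero or more letters, digits or underscores.
object IdExample {
  val id_pattern = """[a-zA-Z][a-zA-Z0-9_]*""".r

  def main(args: Array[String]): Unit = {
    // matches tests whether the whole string fits the pattern
    println(id_pattern.matches("foo_bar1"))   // true
    println(id_pattern.matches("1foo"))       // false: starts with a digit
  }
}
```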

%You might also be surprised what
%constraints programming languages impose about numbers: for example
%123 in JSON is OK, but 0123 is not.
%
\pcode{(re)} & groups regular expressions and remembers
matched text
\end{tabular}
\end{center}
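
To see the grouping operator \pcode{(re)} from the table in action, here
is a small sketch of our own (the pattern and names are illustrative,
not from the handout): each group remembers the text it matched, and
Scala can bind that text to names via pattern matching.

```scala
// Our own example of groups remembering matched text: each (...) in
// the pattern can be extracted by pattern matching against the regex.
object GroupExample {
  val date = """(\d{4})-(\d{2})-(\d{2})""".r

  def main(args: Array[String]): Unit = {
    "2019-07-02" match {
      case date(year, month, day) => println(s"year $year, month $month, day $day")
      case _                      => println("no match")
    }
  }
}
```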

\noindent With this table you can figure out the purpose of the regular
expressions in the web-crawlers shown in Figures \ref{crawler1} and
\ref{crawler3}. In Figure~\ref{crawler1}, however, be careful with
the regular expression for http-addresses in Line~\ref{httpline}. It is
intended to be

\[
\pcode{"https?://[^"]*"}
\]

\noindent specifying that http-addresses need to start with a double
quote, then comes \texttt{http} followed by an optional \texttt{s} and
so on\ldots{}until the closing double quote comes at the end of the
address. Normally we would have to escape the double quotes in order to
make sure we interpret the double quote as a character, not as the
double quote delimiting a string. But Scala's trick with triple quotes
allows us to omit this kind of ugly escaping. As a result we can just
write:

\[
\pcode{""""https?://[^"]*"""".r}
\]

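To sketch how this regular expression might be used (a hypothetical
snippet of our own, not taken from the crawler code), \texttt{findAllIn}
returns all substrings of a text that match the pattern:

```scala
// Hypothetical usage of the triple-quoted http-regex: findAllIn
// returns every (quoted) http-address occurring in a piece of text.
object HttpExample {
  val http_pattern = """"https?://[^"]*"""".r

  def main(args: Array[String]): Unit = {
    val page = """<a href="http://www.foo.org">foo</a>
                  <a href="https://bar.net/x">bar</a>"""
    // prints the two URLs, double quotes included
    for (url <- http_pattern.findAllIn(page)) println(url)
  }
}
```

Note that the matched strings include the surrounding double quotes,
since the pattern itself begins and ends with one.
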
\begin{center}
\url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019}
\end{center}

Such troublesome regular expressions are sometimes called \emph{evil
regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby,
JavaScript and Java 8. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}. I have also seen the term
\emph{eternal matching} used for this. The problem with evil regular
expressions and catastrophic backtracking is that they can have some
serious consequences, for example, if you use them in your
web-application. The reason is that hackers can look for these instances
where the matching engine behaves badly and then mount a nice DoS-attack
against your application. These attacks already have their own name:
\emph{Regular Expression Denial of Service Attacks (ReDoS)}.
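
As a concrete illustration (our own example, not from this handout), a
classic evil regular expression is \pcode{(a*)*b}: on a string
consisting only of \texttt{a}'s, a backtracking matcher tries
exponentially many ways of splitting the string between the two stars
before it can report that no \texttt{b} was found.

```scala
// Our own sketch of an evil regular expression: the nested stars in
// (a*)*b make backtracking engines explore very many ways of carving
// up a string of a's before failing.
object EvilExample {
  val evil = """(a*)*b""".r

  def main(args: Array[String]): Unit = {
    // short strings are still harmless...
    println(evil.matches("a" * 15))   // false
    // ...but feed a much longer string of a's to a backtracking
    // matcher, say in Python or Java 8, and it will (practically)
    // never come back
  }
}
```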

It will be instructive to look behind the ``scenes'' to find out why
Python and Ruby (and others) behave so badly when matching strings with
evil regular expressions. But we will also look at a relatively simple
algorithm that solves this problem much better than Python and Ruby
\addplot[black,mark=square*,mark options={fill=white}] table {re3a.data};
\end{axis}
\end{tikzpicture}
\end{center}

\noindent
You might have wondered above why I looked at the (now) old Java 8: the
reason is that Java 9 and later versions are a bit better, but we will
still beat them hands down with our regex matcher.

\subsection*{Basic Regular Expressions}

The regular expressions shown earlier for Scala we will call
\emph{extended regular expressions}. The ones we will mainly
study in this module are \emph{basic regular
\begin{figure}[p]\small
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler1.scala}

\caption{The Scala code for a simple web-crawler that checks
for broken links in web-pages. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
URL-addresses. It finds all links using the library function
\texttt{findAllIn} in Line~\ref{findallline} (this function
is part of Scala's regular expression library).\label{crawler1}}

\end{figure}


%\begin{figure}[p]\small
% \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
% {../progs/crawler2.scala}
%
%\caption{A version of the web-crawler that only follows links
%in ``my'' domain---since these are the ones I am interested in
%to fix. It uses the regular expression \texttt{my\_urls} in
%Line~\ref{myurlline} to check for my name in the links. The
%main change is in
%Lines~\ref{changestartline}--\ref{changeendline} where there
%is a test whether the URL is in ``my'' domain or
%not.\label{crawler2}}
%\end{figure}

\begin{figure}[p]\small
\lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
{../progs/crawler2.scala}

\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression
\texttt{email\_pattern} in Line~\ref{emailline}. The main