handouts/ho01.tex
changeset 550 71fc4a7a7039
parent 507 fdbc7d0ec04f
child 551 bd551ede2be6
% compiler explorer
% https://gcc.godbolt.org

%https://www.youtube.com/watch?v=gmhMQfFQu20
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}

\section*{Handout 1}

This module is about text processing, be it for web-crawlers,
compilers, dictionaries, DNA-data, ad filters and so on.  When looking
for a particular string, like $abc$, in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
for a particular string, but for string patterns. For example, in
program code we need to identify which parts are keywords (\texttt{if}, \texttt{then},
\texttt{while}, \texttt{for}, etc) and which are identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
zero or more letters, numbers and underscores.  Often we also face the
problem that we are given a string (for example some user input) and
want to know whether it matches a particular pattern---be it an email
address, for example. In this way we can exclude user input that would
\end{lstlisting}

\noindent according to the regular expression we specified in line
\eqref{email} above. Whether this is intended or not is a different
question (the second email above is actually an acceptable email
address according to the RFC 5322 standard for email addresses).

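For concreteness, such a check can be sketched in Scala as a small script; note that the character classes below are simplified stand-ins of my own choosing, not the exact classes of the pattern in \eqref{email}:

```scala
// A simplified stand-in for an email pattern: local part, @,
// domain, a dot, and a 2-6 character top-level part. The classes
// are approximations, not the ones from the handout's pattern.
val emailPattern = "[a-z0-9_.-]+@[a-z0-9.-]+\\.[a-z.]{2,6}"

// String.matches checks the *whole* string against the pattern
def isEmail(s: String): Boolean = s.matches(emailPattern)

println(isEmail("christian.urban@kcl.ac.uk"))                    // true
println(isEmail("disposable.style.email.with+symbol@example.com")) // false
```

The second address is rejected only because \texttt{+} is not in the local-part class, which illustrates how such ad-hoc patterns can disagree with RFC 5322.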
As mentioned above, identifiers, or variables, in program code
are often required to satisfy the constraint that they start
with a letter and then can be followed by zero or more letters
or numbers and also can include underscores, but not as the
\pcode{re?} &	 matches 0 or 1 occurrences of the preceding 
expression\\
\pcode{re\{n\}}	& matches exactly \pcode{n} occurrences of 
the preceding expression\\
\pcode{re\{n,m\}} & matches at least \pcode{n} and at most \pcode{m}
occurrences of the preceding expression\\
\pcode{[...]} & matches any single character inside the 
brackets\\
\pcode{[^...]} & matches any single character not inside the 
brackets\\
\pcode{...-...} & character ranges\\
much ubiquitous in computer science. There are many libraries
implementing regular expressions. I am sure you have come across them
before (remember the PRA module?). Why on earth then is there any
interest in studying them again in depth in this module? Well, one
answer is in the following two graphs about regular expression
matching in Python, Ruby and Java (Version 8).

\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
\begin{tikzpicture}
\begin{axis}[
    ytick={0,5,...,30},
    scaled ticks=false,
    axis lines=left,
    width=5.5cm,
    height=4.5cm, 
    legend entries={Python, Java 8},  
    legend pos=north west,
    legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
\end{axis}
\begin{center}
\url{https://vimeo.com/112065252}
\end{center}

\noindent
A similar problem also occurred in the Atom editor:

\begin{center}
\url{http://davidvgalbraith.com/how-i-fixed-atom/}
\end{center}

\noindent
Such troublesome regular expressions are sometimes called \emph{evil
  regular expressions} because they have the potential to make regular
expression matching engines topple over, like in Python, Ruby and
Java. This ``toppling over'' is also sometimes called
\emph{catastrophic backtracking}.  I have also seen the term
\emph{eternal matching} used for this.  The problem with evil regular
expressions is that they can have some serious consequences, for
example, if you use them in your web-application. The reason is that
hackers can look for these instances where the matching engine behaves
badly and then mount a nice DoS-attack against your application. These
attacks already have their own name: \emph{Regular Expression
  Denial of Service Attacks (ReDoS)}.

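To see the effect yourself, here is a small Scala script (Scala uses Java's regex engine underneath). The pattern \texttt{(a*)*b} is a standard example of an evil expression, my choice here rather than necessarily the one behind the graphs above:

```scala
import java.util.regex.Pattern

// An "evil" regular expression: the nested star lets a backtracking
// matcher try exponentially many ways of splitting up the a's.
val evil = Pattern.compile("(a*)*b")

def evilMatch(s: String): Boolean = evil.matcher(s).matches()

// helper: wall-clock time of a computation in seconds
def time[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1e9)
}

// Strings consisting only of a's can never match (there is no b),
// but the matcher discovers this only after exhausting every split.
for (n <- List(5, 10, 15, 20)) {
  val (res, secs) = time(evilMatch("a" * n))
  println(f"n = $n%2d   result = $res   time = $secs%.4f s")
}
```

Pushing $n$ much beyond 25 makes the running time blow up visibly, since the number of explored splits grows exponentially in $n$.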
It will be instructive to look behind the ``scenes'' to find
out why Python and Ruby (and others) behave so badly when
matching strings with evil regular expressions. But we will also look
at a relatively simple algorithm that solves this problem much
structure of regular expressions is always clear. But for
writing them down in a more mathematical fashion, parentheses
will be helpful. For example we will write $(r_1 + r_2)^*$,
which is different from, say, $r_1 + (r_2)^*$. The former means
roughly zero or more times $r_1$ or $r_2$, while the latter
means $r_1$, or zero or more times $r_2$. This will turn out to
be two different patterns, which in general match different
strings. We should also write $(r_1 + r_2) + r_3$, which is
different from the regular expression $r_1 + (r_2 + r_3)$, but
in the case of $+$ and $\cdot$ we actually do not care about the
order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
\cdot r_3$, respectively. The reasons for this carelessness will become
clear shortly. 

In the literature you will often find that the choice $r_1 +
r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
$\epsilon$, respectively. Following the convention in the
literature, we will often omit the $\cdot$. This
is to make some concrete regular expressions more readable.
For example the regular expression for email addresses shown
in \eqref{email} would, fully expanded, look like

\[
\texttt{[...]+} \;\cdot\;  \texttt{@} \;\cdot\; 
\texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\; 
\texttt{[...]\{2,6\}}
\[
r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
\]

\noindent That means two regular expressions are said to be
equivalent if they match the same set of strings. That is,
their meaning is the same. Therefore we
do not really distinguish between the different regular
expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
because they are equivalent. I leave you with the question
whether

even have the vague recollection that I did not quite understand them
during my undergraduate study. If I remember correctly,\footnote{That
  was really a long time ago.} I got utterly confused about $\ONE$
(which my lecturer wrote as $\epsilon$) and the empty string (which he
also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
  been bad ;o)} Since then, I have used regular expressions for
implementing lexers and parsers as I have always been interested in
all kinds of programming languages and compilers, which invariably
need regular expressions in some form or shape.

To understand my fascination \emph{nowadays} with regular
for the last 17 years has been with theorem provers. I am a
core developer of one of
them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
provers are systems in which you can formally reason about
mathematical concepts, but also about programs. In this way
theorem provers can help with the menacing problem of writing
bug-free code. Theorem provers have already proved their value
in a number of cases (even in terms of hard cash), but they are
still clunky and difficult for average programmers to use.

Anyway, in about 2011 I came across the notion of \defn{derivatives of
  regular expressions}. This notion allows one to do almost all
calculations with regular expressions on the level of regular
expressions, not needing any automata (you will see we only touch
briefly on automata in lecture 3). Automata are usually the main
object of study in formal language courses.  The avoidance of automata
is crucial for me because automata are graphs and it is rather difficult to
reason about graphs in theorem provers. In contrast, reasoning about
regular expressions is easy-peasy in theorem provers. Is this
important? I think yes, because according to Kuklewicz nearly all
POSIX-based regular expression matchers are
buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
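As a taster, the core of a derivative-based matcher can be sketched in a few lines of Scala; the datatype and function names below are my own choices, and later lectures may set things up slightly differently:

```scala
// Regular expressions as a datatype
abstract class Rexp
case object ZERO extends Rexp                    // matches nothing
case object ONE extends Rexp                     // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
case class STAR(r: Rexp) extends Rexp            // r*

// nullable(r): can r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

// der(c, r): what is left of r after matching the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// matching a whole string: take derivatives character by character,
// then test whether the remaining regular expression is nullable
def matches(r: Rexp, s: String): Boolean =
  nullable(s.foldLeft(r)((r, c) => der(c, r)))
```

Notice that everything is a calculation on regular expressions themselves; no automaton is ever constructed, which is exactly what makes this approach pleasant in a theorem prover.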
expressions and the notion of derivatives. Earlier versions of
this theorem always used automata in the proof. Using this
theorem we can show that regular languages are closed under
complementation, something which Gasarch in his
blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
shown via automata. So even somebody who has written a 700+-page
book\footnote{\url{http://goo.gl/fD0eHx}} on regular
expressions did not know better. Well, we showed it can also be
done with regular expressions only.\footnote{\url{http://nms.kcl.ac.uk/christian.urban/Publications/posix.pdf}}
What a feeling when you are an outsider to the subject!

To conclude: Despite my early ignorance about regular expressions, I
find them now extremely interesting. They have practical importance
(remember the shocking runtime of the regular expression matchers in
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). They are used in tools like Snort and
Bro in order to monitor network traffic. They have a beautiful mathematical
theory behind them, which can be sometimes quite deep and which
sometimes contains hidden snares.  People who are not very familiar
with the mathematical background of regular expressions get them
consistently wrong (this is surprising given they are supposed to be a
core skill for computer scientists). The hope is that we can do better
in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.

Notwithstanding my fascination, I am also happy to admit that regular
expressions have their shortcomings. There are some well-known
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means,
for example, that recognising whether parentheses are well-nested
in an expression is impossible with (basic) regular expressions.  I am
not so bothered by these shortcomings. What I am bothered about is
when regular expressions are in the way of practical programming. For
example, it turns out that the regular expression for email addresses
shown in \eqref{email} is hopelessly inadequate for recognising all of
  problems.''
\end{quote}  


\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler1.scala}

\caption{The Scala code for a simple web-crawler that checks
for broken links in a web-page. It uses the regular expression
\texttt{http\_pattern} in Line~\ref{httpline} for recognising
\end{figure}



\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler2.scala}

\caption{A version of the web-crawler that only follows links
in ``my'' domain---since these are the ones I am interested in
fixing. It uses the regular expression \texttt{my\_urls} in
not.\label{crawler2}}

\end{figure}

\begin{figure}[p]\small
  \lstinputlisting[numbers=left,linebackgroundcolor={\ifodd\value{lstnumber}\color{capri!3}\fi}]
                  {../progs/crawler3.scala}

\caption{A small email harvester---whenever we download a
web-page, we also check whether it contains any email
addresses. For this we use the regular expression