afl-material: comparison handouts/ho01.tex


-:1f4e81950ab4
+:9476086849ad
 % compiler explorer
 % https://gcc.godbolt.org
 \begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
 \section*{Handout 1}
 This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for
+compilers, dictionaries, DNA-data and so on. When looking for a
-a particular string, like $abc$ in a large text we can use the
+particular string, like $abc$ in a large text we can use the
-Knuth-Morris-Pratt algorithm, which is currently the most
+Knuth-Morris-Pratt algorithm, which is currently the most efficient
-efficient general string search algorithm. But often we do
+general string search algorithm. But often we do \emph{not} just look
-\emph{not} just look for a particular string, but for string
+for a particular string, but for string patterns. For example in
-patterns. For example in program code we need to identify what
+program code we need to identify what are the keywords (if, then,
-are the keywords, what are the identifiers etc. A pattern for
+while, etc), what are the identifiers (variable names). A pattern for
-identifiers could be stated as: they start with a letter,
+identifiers could be stated as: they start with a letter, followed by
-followed by zero or more letters, numbers and underscores.
+zero or more letters, numbers and underscores.  Also often we face the
-Also often we face the problem that we are given a string (for
+problem that we are given a string (for example some user input) and
-example some user input) and want to know whether it matches a
+want to know whether it matches a particular pattern---be it an email
-particular pattern---be it an email address, for example. In
+address, for example. In this way we can exclude user input that would
-this way we can exclude user input that would otherwise have
+otherwise have nasty effects on our program (crashing it or making it
-nasty effects on our program (crashing it or making it go into
+go into an infinite loop, if not worse). The point is that the fast
-an infinite loop, if not worse).\smallskip
+Knuth-Morris-Pratt algorithm for strings is not good enough for such
+string patterns.\smallskip
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
 they are a simple method for describing languages (or sets of
 strings)\ldots at least languages we are interested in in
 \end{equation}
 \noindent where the first part, the user name, matches one or more lowercase
 letters (\pcode{a-z}), digits (\pcode{0-9}), underscores, dots
 and hyphens. The \pcode{+} at the end of the brackets ensures
-the ``one or more''. Then comes the \pcode{@}-sign, followed
+the ``one or more''. Then comes the email \pcode{@}-sign, followed
 by the domain name which must be one or more lowercase
 letters, digits, dots or hyphens. Note there
 cannot be an underscore in the domain name. Finally there must
 be a dot followed by the toplevel domain. This toplevel domain
 must be 2 to 6 lowercase letters including the dot. Example
 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
 user@localserver
 disposable.style.email.with+symbol@example.com
 \end{lstlisting}
-\noindent according to the regular expression we specified in
+\noindent according to the regular expression we specified in line
-\eqref{email}. Whether this is intended or not is a different
+\eqref{email} above. Whether this is intended or not is a different
-question (the second email above is actually an acceptable
+question (the second email above is actually an acceptable email
-email address acording to the RFC 5322 standard for email
+address according to the RFC 5322 standard for email addresses).
-addresses).
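The description of the email pattern above can be tried out directly. The following is only a sketch: the pattern string is a transcription of the prose description (including the note that the domain name contains no underscore), not a copy of the handout's equation, and the name \pcode{niceandsimple@example.org} is a made-up test input.

```python
import re

# Assumed transcription of the pattern described in the text:
#   user name: one or more of a-z, 0-9, underscore, dot, hyphen
#   domain:    one or more of a-z, 0-9, dot, hyphen (no underscore)
#   toplevel:  a dot, then 2 to 6 characters from a-z or dot
EMAIL = re.compile(r"[a-z0-9_.-]+@[a-z0-9.-]+\.[a-z.]{2,6}")


def is_email(s: str) -> bool:
    """Return True if the whole string s matches the simple email pattern."""
    return EMAIL.fullmatch(s) is not None


candidates = [
    "niceandsimple@example.org",  # hypothetical input, not from the handout
    "user@localserver",
    "disposable.style.email.with+symbol@example.com",
]

for s in candidates:
    print(f"{s}: {is_email(s)}")
```

Under this transcription the two strings from the listing are rejected: the first has no dot after the domain name, and the second contains a \pcode{+}, which is not among the characters allowed in the user name.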
 As mentioned above, identifiers, or variables, in program code
 are often required to satisfy the constraints that they start
 with a letter and then can be followed by zero or more letters
 or numbers and also can include underscores, but not as the
 leave it to you to ponder whether this regular expression
 really captures all possible web-addresses.
 \subsection*{Why Study Regular Expressions?}
-Regular expressions were introduced by Kleene in the 1950ies
+Regular expressions were introduced by Kleene in the 1950s and they
-and they have been object of intense study since then. They
+have been the object of intense study since then. They are nowadays pretty
-are nowadays pretty much ubiquitous in computer science. There
+much ubiquitous in computer science. There are many libraries
-are many libraries implementing regular expressions. I am sure
+implementing regular expressions. I am sure you have come across them
-you have come across them before (remember PRA?). Why on earth
+before (remember the PRA module?). Why on earth then is there any
-then is there any interest in studying them again in depth in
+interest in studying them again in depth in this module? Well, one
-this module? Well, one answer is in the following two graphs about
+answer is in the following two graphs about regular expression
-regular expression matching in Python, Ruby and Java.
+matching in Python, Ruby and Java.
 \begin{center}
 \begin{tabular}{@{\hspace{-1mm}}c@{\hspace{-1mm}}c@{}}
 \begin{tikzpicture}
 \begin{axis}[
 \texttt{[...]+} \;\cdot\; \texttt{.} \;\cdot\;
 \texttt{[...]\{2,6\}}
 \]
 \noindent
-which is much less readable than \eqref{email}. Similarly for
+which is much less readable than the regular expression in
-the regular expression that matches the string $hello$ we
+\eqref{email}. Similarly for the regular expression that matches the
-should write
+string $hello$ we should write
 \[
 h \cdot e \cdot l \cdot l \cdot o
 \]
 issue, then the latter can be seen as syntactic sugar of
 the former. For example we could replace
 \begin{center}
 \begin{tabular}{rcl}
-$r+$ & $\mapsto$ & $r\cdot r^*$\\
+$r^+$ & $\mapsto$ & $r\cdot r^*$\\
-$r?$ & $\mapsto$ & $\ONE + r$\\
+$r^?$ & $\mapsto$ & $\ONE + r$\\
 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
 \end{tabular}
 \end{center}
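The translations in the table can be read as small functions that build basic regular expressions from extended ones. The sketch below assumes a tiny representation of regular expressions as Python dataclasses; all class and function names are made up for illustration and are not taken from the module's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class One:
    """Matches only the empty string."""


@dataclass(frozen=True)
class Char:
    c: str


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Seq:
    first: object
    second: object


@dataclass(frozen=True)
class Star:
    r: object


def plus(r):
    """r^+  is replaced by  r . r^*"""
    return Seq(r, Star(r))


def opt(r):
    """r^?  is replaced by  ONE + r"""
    return Alt(One(), r)


def char_range(lo: str, hi: str):
    """[lo - hi]  is replaced by  lo + ... + hi (nested to the right)."""
    chars = [Char(chr(i)) for i in range(ord(lo), ord(hi) + 1)]
    result = chars[-1]
    for c in reversed(chars[:-1]):
        result = Alt(c, result)
    return result


def digit():
    """\\d  is replaced by  0 + 1 + ... + 9"""
    return char_range("0", "9")
```

For instance, \pcode{plus(Char("a"))} produces the same value as \pcode{Seq(Char("a"), Star(Char("a")))}, so code that only understands the basic constructors never needs to know that the extended form existed.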
 simplify regular expressions, which will mean we can speed
 up the calculations.
 \subsection*{My Fascination for Regular Expressions}
-Up until a few years ago I was not really interested in
+Up until a few years ago I was not really interested in regular
-regular expressions. They have been studied for the last 60
+expressions. They have been studied for the last 60 years (by smarter
-years (by smarter people than me)---surely nothing new can be
+people than me)---surely nothing new can be found out about them. I
-found out about them. I even have the vague recollection that
+even have the vague recollection that I did not quite understand them
-I did not quite understand them during my undergraduate study. If I remember
+during my undergraduate study. If I remember correctly,\footnote{That
-correctly,\footnote{That was really a long time ago.} I got
+was really a long time ago.} I got utterly confused about $\ONE$
-utterly confused about $\ONE$ (which my lecturer wrote as
+(which my lecturer wrote as $\epsilon$) and the empty string (which he
-$\epsilon$) and the empty string.\footnote{Obviously the
+also wrote as $\epsilon$).\footnote{Obviously the lecturer must have
-lecturer must have been bad.} Since my then, I have used
+been bad ;o)} Since then, I have used regular expressions for
-regular expressions for implementing lexers and parsers as I
+implementing lexers and parsers as I have always been interested in
-have always been interested in all kinds of programming
+all kinds of programming languages and compilers, which invariably
-languages and compilers, which invariably need regular
+need regular expressions in some form or shape.
-expressions in some form or shape.
 To understand my fascination \emph{nowadays} with regular
 expressions, you need to know that my main scientific interest
-for the last 14 years has been with theorem provers. I am a
+for the last 17 years has been with theorem provers. I am a
 core developer of one of
 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
 provers are systems in which you can formally reason about
 mathematical concepts, but also about programs. In this way
-theorem prover can help with the manacing problem of writing bug-free code. Theorem provers have
+theorem provers can help with the menacing problem of writing bug-free code. Theorem provers have
 already proved their value in a number of cases (even in
 terms of hard cash), but they are still clunky and difficult
 to use by average programmers.
-Anyway, in about 2011 I came across the notion of
+Anyway, in about 2011 I came across the notion of \defn{derivatives of
-\defn{derivatives of regular expressions}. This notion allows
+regular expressions}. This notion allows one to do almost all
-one to do almost all calculations in regular language theory
+calculations with regular expressions on the level of regular
-on the level of regular expressions, not needing any automata (you will
+expressions, not needing any automata (you will see we only touch
-see we only touch briefly on automata in lecture 3).
+briefly on automata in lecture 3). Automata are usually the main
-This is crucial because automata are graphs and it is rather
+object of study in formal language courses.  The avoidance of automata
-difficult to reason about graphs in theorem provers. In
+is crucial because automata are graphs and it is rather difficult to
-contrast, reasoning about regular expressions is easy-peasy in
+reason about graphs in theorem provers. In contrast, reasoning about
-theorem provers. Is this important? I think yes, because
+regular expressions is easy-peasy in theorem provers. Is this
-according to Kuklewicz nearly all POSIX-based regular
+important? I think yes, because according to Kuklewicz nearly all
-expression matchers are
+POSIX-based regular expression matchers are
 buggy.\footnote{\url{http://www.haskell.org/haskellwiki/Regex_Posix}}
-With my PhD student Fahad Ausaf I proved
+With my PhD student Fahad Ausaf I proved the correctness for one such
-the correctness for one such matcher that was
+matcher that was proposed by Sulzmann and Lu in
-proposed by Sulzmann and Lu in
+2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can prove that
-2014.\footnote{\url{http://goo.gl/bz0eHp}} Hopefully we can
+the machine code(!)  that implements this code efficiently is correct
-prove that the machine code(!)
+also. Writing programs in this way does not leave any room for
-that implements this code efficiently is correct also. Writing
+potential errors or bugs. How nice!
-programs in this way does not leave any room for potential
-errors or bugs. How nice!
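To give a first impression of what calculating ``on the level of regular expressions'' can look like, here is a minimal sketch of a matcher based on the textbook derivative construction going back to Brzozowski. It is not the Sulzmann and Lu matcher mentioned above, it does no simplification, and all names in it are chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zero:
    """Matches no string at all."""


@dataclass(frozen=True)
class One:
    """Matches only the empty string."""


@dataclass(frozen=True)
class Char:
    c: str


@dataclass(frozen=True)
class Alt:
    left: object
    right: object


@dataclass(frozen=True)
class Seq:
    first: object
    second: object


@dataclass(frozen=True)
class Star:
    r: object


def nullable(r) -> bool:
    """Can r match the empty string?"""
    if isinstance(r, (One, Star)):
        return True
    if isinstance(r, (Zero, Char)):
        return False
    if isinstance(r, Alt):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Seq):
        return nullable(r.first) and nullable(r.second)
    raise TypeError(f"not a regular expression: {r!r}")


def der(c: str, r):
    """Derivative of r with respect to c: matches what r matches after a leading c."""
    if isinstance(r, (Zero, One)):
        return Zero()
    if isinstance(r, Char):
        return One() if r.c == c else Zero()
    if isinstance(r, Alt):
        return Alt(der(c, r.left), der(c, r.right))
    if isinstance(r, Seq):
        step = Seq(der(c, r.first), r.second)
        # If the first part can be empty, c may also be consumed by the second part.
        return Alt(step, der(c, r.second)) if nullable(r.first) else step
    if isinstance(r, Star):
        return Seq(der(c, r.r), r)
    raise TypeError(f"not a regular expression: {r!r}")


def matches(r, s: str) -> bool:
    """Take the derivative for each character in turn, then test for nullability."""
    for c in s:
        r = der(c, r)
    return nullable(r)
```

With \pcode{r = Seq(Char("a"), Star(Char("b")))} the call \pcode{matches(r, "abbb")} succeeds while \pcode{matches(r, "ba")} fails. No automaton is built at any point: every intermediate value is again a regular expression, which is what makes this style amenable to reasoning by structural induction.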
 What also helped with my fascination with regular expressions
 is that we could indeed find out new things about them that
 have surprised some experts. Together with two colleagues from China, I was
 able to prove the Myhill-Nerode theorem by only using regular
 blog\footnote{\url{http://goo.gl/2R11Fw}} assumed can only be
 shown via automata. Even somebody who has written a 700+-page
 book\footnote{\url{http://goo.gl/fD0eHx}} on regular
 expressions did not know better. Well, we showed it can also be
 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}}
-What a feeling if you are an outsider to the subject!
+What a feeling when you are an outsider to the subject!
-To conclude: Despite my early ignorance about regular
+To conclude: Despite my early ignorance about regular expressions, I
-expressions, I find them now very interesting. They have a
+find them now very interesting. They have a beautiful mathematical
-beautiful mathematical theory behind them, which can be
+theory behind them, which can sometimes be quite deep and which
-sometimes quite deep and which sometimes contains hidden snares. They have
+sometimes contains hidden snares. They have practical importance
-practical importance (remember the shocking runtime of the
+(remember the shocking runtime of the regular expression matchers in
-regular expression matchers in Python, Ruby and Java in some
+Python, Ruby and Java in some instances and the problems in Stack
-instances and the problems in Stack Exchange and the Atom editor).
+Exchange and the Atom editor).  People who are not very familiar with
-People who are not very familiar with the
+the mathematical background of regular expressions get them
-mathematical background of regular expressions get them
+consistently wrong (surprising given they are supposed to be a core
-consistently wrong. The hope is that we can do better in the
+skill for computer scientists). The hope is that we can do better in
-future---for example by proving that the algorithms actually
+the future---for example by proving that the algorithms actually
-satisfy their specification and that the corresponding
+satisfy their specification and that the corresponding implementations
-implementations do not contain any bugs. We are close, but not
+do not contain any bugs. We are close, but not yet quite there.
-yet quite there.
 Notwithstanding my fascination, I am also happy to admit that regular
 expressions have their shortcomings. There are some well-known
-``theoretical'' shortcomings, for example recognising strings
+``theoretical'' shortcomings, for example recognising strings of the
-of the form $a^{n}b^{n}$. I am not so bothered by them. What I
+form $a^{n}b^{n}$ is not possible with regular expressions. This means
-am bothered about is when regular expressions are in the way
+for example that recognising whether parentheses are well-nested
-of practical programming. For example, it turns out that the
+is impossible with (basic) regular expressions.  I am not so bothered
-regular expression for email addresses shown in \eqref{email}
+by these shortcomings. What I am bothered about is when regular
-is hopelessly inadequate for recognising all of them (despite
+expressions are in the way of practical programming. For example, it
-being touted as something every computer scientist should know
+turns out that the regular expression for email addresses shown in
-about). The W3 Consortium (which standardises the Web)
+\eqref{email} is hopelessly inadequate for recognising all of them
-proposes to use the following, already more complicated
+(despite being touted as something every computer scientist should
-regular expressions for email addresses:
+know about). The W3 Consortium (which standardises the Web) proposes
+to use the following, already more complicated regular expression for
+email addresses:
 {\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
 [a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
 \end{lstlisting}}
 \noindent But they admit that by using this regular expression
-they wilfully violate the RFC 5322 standard which specifies
+they wilfully violate the RFC 5322 standard, which specifies
 the syntax of email addresses. With their proposed regular
 expression they are too strict in some cases and too lax in
 others. Not a good situation to be in. A regular expression
 that is claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I
 \noindent which explains some of the crazier parts of email
 addresses. Still it is good to know that some tasks in text
 processing just cannot be achieved by using regular
-expressions.
+expressions. But for what we want to use them for (lexing), they are
+pretty good.
 \begin{figure}[p]
 \lstinputlisting{../progs/crawler1.scala}

changeset 471	9476086849ad
parent 452	b93f4d2aeee1
child 473	dc528091eb70
child 503	f2d7b885b3e3