afl-material: comparison handouts/ho01.tex


-:c8ce95067c1a
+:5c1fbb39c93e
 %%https://jex.im/regulex/
 %%https://www.reddit.com/r/ProgrammingLanguages/comments/42dlem/mona_compiler_development_part_2_parsing/
 %%https://www.reddit.com/r/ProgrammingLanguages/comments/43wlkq/formal_grammar_for_csh_tsch_sh_or_bash/
 \begin{document}
-\fnote{\copyright{} Christian Urban, 2014, 2015}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 \section*{Handout 1}
 This module is about text processing, be it for web-crawlers,
 compilers, dictionaries, DNA-data and so on. When looking for
 are the keywords, what are the identifiers etc. A pattern for
 identifiers could be stated as: they start with a letter,
 followed by zero or more letters, numbers and underscores.
 Also often we face the problem that we are given a string (for
 example some user input) and want to know whether it matches a
-particular pattern. In this way we can, for example, exclude
+particular pattern---be it an email address, for example. In
-user input that would otherwise have nasty effects on our
+this way we can exclude user input that would otherwise have
-program (crashing it or making it go into an infinite loop, if
+nasty effects on our program (crashing it or making it go into
-not worse).
+an infinite loop, if not worse).
 \defn{Regular expressions} help with conveniently specifying
 such patterns. The idea behind regular expressions is that
 they are a simple method for describing languages (or sets of
 strings)\ldots at least languages we are interested in in
 computer science. For example there is no convenient regular
 expression for describing the English language short of
 enumerating all English words. But they seem useful for
-describing for example email addresses.\footnote{See ``8
+describing for example simple email addresses.\footnote{See
-Regular Expressions You Should Know''
+``8 Regular Expressions You Should Know''
 \url{http://goo.gl/5LoVX7}} Consider the following regular
 expression
 \begin{equation}\label{email}
 \texttt{[a-z0-9\_.-]+ @ [a-z0-9.-]+ . [a-z.]\{2,6\}}
 other.email-with-dash@example.edu
 \end{lstlisting}
 \noindent
-But for example the following two do not:
+But for example the following two do not
 \begin{lstlisting}[language={},numbers=none,keywordstyle=\color{black}]
 user@localserver
 disposable.style.email.with+symbol@example.com
 \end{lstlisting}
+\noindent according to the regular expression we specified in
+\eqref{email}. Whether this is intended or not is a different
+question (the second email above is actually an acceptable
+email address according to the RFC 5322 standard for email
+addresses).
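\noindent The behaviour described above can be checked with a small script. The following is only a sketch in Python (not the code used for this handout): it assumes that the spaces in \eqref{email} are there for readability only, that the dot before the last component is meant literally, and that the whole string has to match. The names \texttt{EMAIL} and \texttt{matches\_email} are made up for this illustration.

```python
import re

# The regular expression (email) from the handout in Python syntax.
# Assumptions: spaces in the handout are only for readability, and the
# dot between the domain and the final component is a literal dot.
EMAIL = re.compile(r"[a-z0-9_.-]+@[a-z0-9.-]+\.[a-z.]{2,6}")

def matches_email(s: str) -> bool:
    """Return True if the *whole* string s matches the pattern."""
    return EMAIL.fullmatch(s) is not None

if __name__ == "__main__":
    candidates = [
        "other.email-with-dash@example.edu",
        "user@localserver",
        "disposable.style.email.with+symbol@example.com",
    ]
    for s in candidates:
        print(s, matches_email(s))
```

\noindent The second address is rejected because no literal dot follows the \texttt{@}, and the third because \texttt{+} is not in the character class for the part before the \texttt{@}.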
 As mentioned above, identifiers, or variables, in program code
 are often required to satisfy the constraints that they start
 with a letter and then can be followed by zero or more letters
 or numbers and also can include underscores, but not as the
 first character. Such identifiers can be recognised with the
 module? Well, one answer is in the following graph about
 regular expression matching in Python and in Ruby.
 \begin{center}
 \begin{tikzpicture}
-\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
+\begin{axis}[
+xlabel={strings of {\tt a}s},
+ylabel={time in secs},
 enlargelimits=false,
 xtick={0,5,...,30},
 xmax=33,
 ymax=35,
 ytick={0,5,...,30},
 seconds, while the second version will even be able to process
 up to 12,000 in less than 10(!) seconds, see the graph below:
 \begin{center}
 \begin{tikzpicture}
-\begin{axis}[xlabel={\pcode{a}s},ylabel={time in secs},
+\begin{axis}[
+xlabel={strings of \pcode{a}s},
+ylabel={time in secs},
 enlargelimits=false,
 xtick={0,3000,...,12000},
 xmax=12500,
 ymax=35,
 ytick={0,5,...,30},
 \end{tikzpicture}
 \end{center}
 \subsection*{Basic Regular Expressions}
-The regular expressions shown above for Scala, we
+The regular expressions shown earlier for Scala, we
 will call \emph{extended regular expressions}. The ones we
 will mainly study in this module are \emph{basic regular
 expressions}, which by convention we will just call
 \emph{regular expressions}, if it is clear what we mean. The
 attraction of (basic) regular expressions is that many
 (Basic) regular expressions are defined by the following
 grammar:
 \begin{center}
 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
-$r$ & $::=$ &   $\varnothing$         & null\\
+$r$ & $::=$ &    $\ZERO$          & null language\\
-& $\mid$ & $\epsilon$           & empty string / \texttt{""} / []\\
+& $\mid$ & $\ONE$           & empty string / \texttt{""} / []\\
 & $\mid$ & $c$                  & single character\\
 & $\mid$ & $r_1 + r_2$          & alternative / choice\\
 & $\mid$ & $r_1 \cdot r_2$      & sequence\\
 & $\mid$ & $r^*$                & star (zero or more)\\
 \end{tabular}
 \end{center}
 \noindent Because we overload our notation, there are some
 subtleties you should be aware of. When regular expressions
-are referred to then $\varnothing$ does not stand for the
+are referred to then $\ZERO$ (in bold font) does not stand for
-empty set: rather it is a particular pattern that does not
+the number zero: rather it is a particular pattern that does
-match any string. Similarly, in the context of regular
+not match any string. Similarly, in the context of regular
-expressions, $\epsilon$ does not stand for the empty string
+expressions, $\ONE$ does not stand for the number one but for
-(as in many places in the literature) but for a regular
+a regular expression that matches the empty string. The letter
-expression that matches the empty string. The letter $c$
+$c$ stands for any character from the alphabet at hand. Again
-stands for any character from the alphabet at hand. Again in
+in the context of regular expressions, it is a particular
-the context of regular expressions, it is a particular pattern
+pattern that can match the specified character. You should
-that can match the specified character. You should also be
+also be careful with our overloading of the star: assuming you
-careful with our overloading of the star: assuming you have
+have read the handout about our basic mathematical notation,
-read the handout about our basic mathematical notation, you
+you will see that in the context of languages (sets of
-will see that in the context of languages (sets of strings)
+strings) the star stands for an operation on languages. Here
-the star stands for an operation on languages. Here $r^*$
+$r^*$ stands for a regular expression, which is different from
-stands for a regular expression, which is different from the
+the operation on sets, which is defined as
-operation on sets is defined as
+\[
-\[
+A\star \dn \bigcup_{0\le n} A^n
-A^* \dn \bigcup_{0\le n} A^n
 \]
 \noindent
 Note that this expands to
 \[
-A^* \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
+A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup A^4 \cup \ldots
 \]
 \noindent which is equivalent to
 \[
-A^* \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
+A\star \dn \{[]\} \cup A \cup A@A \cup A@A@A \cup A@A@A@A \cup \ldots
 \]
 \noindent
 Remember that $A^0$ is always the set containing the empty
 string.
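\noindent Since the star of a language is in general an infinite set, it cannot be computed in full, but the union above can be cut off after finitely many powers. The following Python sketch does this for finite sets of strings; the function names \texttt{concat}, \texttt{power} and \texttt{star\_upto} as well as the bound \texttt{k} are assumptions made for this illustration.

```python
def concat(A, B):
    """Language concatenation A @ B: each string of A followed by each string of B."""
    return {a + b for a in A for b in B}

def power(A, n):
    """A^n; by definition A^0 is the set containing only the empty string."""
    result = {""}
    for _ in range(n):
        result = concat(result, A)
    return result

def star_upto(A, k):
    """Finite approximation of the star of A: the union A^0, A^1, ..., A^k."""
    result = set()
    for n in range(k + 1):
        result |= power(A, n)
    return result

if __name__ == "__main__":
    # All strings over {a, b} of length at most 2.
    print(sorted(star_upto({"a", "b"}, 2)))
```

\noindent Increasing \texttt{k} only ever adds strings, which mirrors the way the infinite union is built up from $A^0$, $A^1$, $A^2$ and so on.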
 strings. We should also write $(r_1 + r_2) + r_3$, which is
 different from the regular expression $r_1 + (r_2 + r_3)$, but
 in case of $+$ and $\cdot$ we actually do not care about the
 order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2
 \cdot r_3$, respectively. The reasons for this will become
-clear shortly. In the literature you will often find that the
+clear shortly.
-choice $r_1 + r_2$ is written as $r_1\mid{}r_2$ or
-$r_1\mid\mid{}r_2$. Also following the convention in the
+In the literature you will often find that the choice $r_1 +
+r_2$ is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also,
+often our $\ZERO$ and $\ONE$ are written $\varnothing$ and
+$\epsilon$, respectively. Following the convention in the
 literature, we will often omit the $\cdot$ altogether. This
 is to make some concrete regular expressions more readable.
 For example the regular expression for email addresses shown
 in \eqref{email} would look like
 classes relate as follows\footnote{More about Scala is
 in the handout about A Crash-Course on Scala.}
 \begin{center}
 \begin{tabular}{rcl}
-$\varnothing$ & $\mapsto$ & \texttt{NULL}\\
+$\ZERO$       & $\mapsto$ & \texttt{ZERO}\\
-$\epsilon$    & $\mapsto$ & \texttt{EMPTY}\\
+$\ONE$        & $\mapsto$ & \texttt{ONE}\\
 $c$           & $\mapsto$ & \texttt{CHAR(c)}\\
 $r_1 + r_2$   & $\mapsto$ & \texttt{ALT(r1, r2)}\\
 $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)}\\
 $r^*$         & $\mapsto$ & \texttt{STAR(r)}
 \end{tabular}
 the former. For example we could replace
 \begin{center}
 \begin{tabular}{rcl}
 $r+$ & $\mapsto$ & $r\cdot r^*$\\
-$r?$ & $\mapsto$ & $\epsilon + r$\\
+$r?$ & $\mapsto$ & $\ONE + r$\\
 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
 \end{tabular}
 \end{center}
 The \defn{meaning of a regular expression} can be defined
 by a recursive function called $L$ (for language), which
 is defined as follows
 \begin{center}
-\begin{tabular}{rcl}
+\begin{tabular}{rcll}
-$L(\varnothing)$  & $\dn$ & $\{\}$\\
+$L(\ZERO)$         & $\dn$ & $\{\}$\\
-$L(\epsilon)$     & $\dn$ & $\{[]\}$\\
+$L(\ONE)$          & $\dn$ & $\{[]\}$\\
-$L(c)$            & $\dn$ & $\{[c]\}$\\
+$L(c)$             & $\dn$ & $\{"c"\}$ & or equivalently $\dn \{[c]\}$\\
-$L(r_1+ r_2)$     & $\dn$ & $L(r_1) \cup L(r_2)$\\
+$L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
 $L(r_1 \cdot r_2)$ & $\dn$ & $L(r_1) \,@\, L(r_2)$\\
-$L(r^*)$           & $\dn$ & $(L(r))^*$\\
+$L(r^*)$           & $\dn$ & $(L(r))\star$\\
 \end{tabular}
 \end{center}
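\noindent The clauses of $L$ translate almost directly into a recursive program. The following is a sketch in Python rather than the Scala used elsewhere in this module: the class names mirror the constructors \texttt{ZERO}, \texttt{ONE}, \texttt{CHAR}, \texttt{ALT}, \texttt{SEQ} and \texttt{STAR}, while the function \texttt{lang} and its bound \texttt{k} are assumptions of this illustration. Because $(L(r))\star$ is in general infinite, the star is unrolled at most \texttt{k} times, so the result is only a finite part of the language.

```python
from dataclasses import dataclass

class Rexp:
    """Base class for (basic) regular expressions."""

@dataclass(frozen=True)
class ZERO(Rexp):
    pass

@dataclass(frozen=True)
class ONE(Rexp):
    pass

@dataclass(frozen=True)
class CHAR(Rexp):
    c: str

@dataclass(frozen=True)
class ALT(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class SEQ(Rexp):
    r1: Rexp
    r2: Rexp

@dataclass(frozen=True)
class STAR(Rexp):
    r: Rexp

def lang(r, k):
    """Strings in L(r), with every star unrolled at most k times."""
    if isinstance(r, ZERO):
        return set()                      # matches no string
    if isinstance(r, ONE):
        return {""}                       # matches only the empty string
    if isinstance(r, CHAR):
        return {r.c}
    if isinstance(r, ALT):
        return lang(r.r1, k) | lang(r.r2, k)
    if isinstance(r, SEQ):
        return {s1 + s2 for s1 in lang(r.r1, k) for s2 in lang(r.r2, k)}
    if isinstance(r, STAR):
        base = lang(r.r, k)
        result, layer = {""}, {""}        # start with the 0-th power
        for _ in range(k):
            layer = {s1 + s2 for s1 in layer for s2 in base}
            result |= layer
        return result
    raise TypeError(f"not a regular expression: {r!r}")

if __name__ == "__main__":
    print(lang(SEQ(CHAR("h"), CHAR("i")), 2))
    print(sorted(lang(STAR(CHAR("a")), 3)))
```

\noindent For regular expressions without a star the bound plays no role and \texttt{lang} returns exactly $L(r)$.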
 \noindent As a result we can now precisely state what the
 meaning, for example, of the regular expression $h \cdot
 different regular expressions that can recognise these
 strings. This is obvious with the regular expression $a + b$
 which can match the strings $a$ and $b$. But also the regular
 expression $b + a$ would match the same strings. However,
 sometimes it is not so obvious whether two regular expressions
-match the same strings: for example do $r^*$ and $\epsilon + r
+match the same strings: for example do $r^*$ and $\ONE + r
-\cdot r^*$ match the same strings? What about $\varnothing^*$
+\cdot r^*$ match the same strings? What about $\ZERO^*$
-and $\epsilon^*$? This suggests the following relation between
+and $\ONE^*$? This suggests the following relation between
 \defn{equivalent regular expressions}:
 \[
 r_1 \equiv r_2 \;\dn\; L(r_1) = L(r_2)
 \]
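\noindent As a small worked instance of this definition, unfolding $L$ for the regular expressions $a + b$ and $b + a$ mentioned above gives
\[
\begin{array}{rcl}
L(a + b) & = & L(a) \cup L(b) \;=\; \{"a"\} \cup \{"b"\} \;=\; \{"a", "b"\}\\
L(b + a) & = & L(b) \cup L(a) \;=\; \{"b"\} \cup \{"a"\} \;=\; \{"a", "b"\}
\end{array}
\]
\noindent Both languages are the same set, so $a + b \equiv b + a$ holds according to the definition.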
 expression $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$,
 because they are equivalent. I leave you to the question
 whether
 \[
-\varnothing^* \equiv \epsilon^*
+\ZERO^* \equiv \ONE^*
 \]
-\noindent holds? Such equivalences will be important for our
+\noindent holds or not? Such equivalences will be important
-matching algorithm, because we can use them to simplify
+for our matching algorithm, because we can use them to
-regular expressions.
+simplify regular expressions, which will mean we can speed
+up the calculations.
 \subsection*{My Fascination for Regular Expressions}
 Up until a few years ago I was not really interested in
 regular expressions. They have been studied for the last 60
 years (by smarter people than me)---surely nothing new can be
 found out about them. I even have the vague recollection that
 I did not quite understand them during my study. If I remember
 correctly,\footnote{That was really a long time ago.} I got
-utterly confused about $\epsilon$ and the empty
+utterly confused about $\ONE$ (which my lecturer wrote as
-string.\footnote{Obviously the lecturer must have been bad.}
+$\epsilon$) and the empty string.\footnote{Obviously the
-Since my study, I have used regular expressions for
+lecturer must have been bad.} Since my study, I have used
-implementing lexers and parsers as I have always been
+regular expressions for implementing lexers and parsers as I
-interested in all kinds of programming languages and
+have always been interested in all kinds of programming
-compilers, which invariably need regular expression in some
+languages and compilers, which invariably need regular
-form or shape.
+expressions in some form or shape.
-To understand my fascination nowadays with regular
+To understand my fascination \emph{nowadays} with regular
 expressions, you need to know that my main scientific interest
 for the last 14 years has been with theorem provers. I am a
 core developer of one of
 them.\footnote{\url{http://isabelle.in.tum.de}} Theorem
 provers are systems in which you can formally reason about
 expressions did not know better. Well, we showed it can also be
 done with regular expressions only.\footnote{\url{http://www.inf.kcl.ac.uk/staff/urbanc/Publications/rexp.pdf}}
 What a feeling if you are an outsider to the subject!
 To conclude: Despite my early ignorance about regular
-expressions, I find them now quite interesting. They have a
+expressions, I find them now very interesting. They have a
-beautiful mathematical theory behind them. They have practical
+beautiful mathematical theory behind them, which can be
-importance (remember the shocking runtime of the regular
+sometimes quite deep and contain hidden snares. They have
-expression matchers in Python and Ruby in some instances).
+practical importance (remember the shocking runtime of the
-People who are not very familiar with the mathematical
+regular expression matchers in Python and Ruby in some
-background of regular expressions get them consistently wrong.
+instances). People who are not very familiar with the
-The hope is that we can do better in the future---for example
+mathematical background of regular expressions get them
-by proving that the algorithms actually satisfy their
+consistently wrong. The hope is that we can do better in the
-specification and that the corresponding implementations do
+future---for example by proving that the algorithms actually
-not contain any bugs. We are close, but not yet quite there.
+satisfy their specification and that the corresponding
+implementations do not contain any bugs. We are close, but not
+yet quite there.
 Notwithstanding my fascination, I am also happy to admit that regular
 expressions have their shortcomings. There are some well-known
 ``theoretical'' shortcomings, for example recognising strings
 of the form $a^{n}b^{n}$. I am not so bothered by them. What I
 that is claimed to be closer to the standard is shown in
 Figure~\ref{monster}. Whether this claim is true or not, I
 would not know---the only thing I can say about this regular
 expression is that it is a monstrosity. However, this might
 actually be an argument against the RFC standard, rather than
-against regular expressions. Still it is good to know that
+against regular expressions. A similar argument is made in
-some tasks in text processing just cannot be achieved by using
-regular expressions.
+\begin{center}
+\url{https://elliot.land/validating-an-email-address}
+\end{center}
+\noindent which explains some of the crazier parts of email
+addresses. Still it is good to know that some tasks in text
+processing just cannot be achieved by using regular
+expressions.
 \begin{figure}[p]
 \lstinputlisting{../progs/crawler1.scala}
 \caption{The Scala code for a simple web-crawler that checks
 for broken links in a web-page. It uses the regular expression
-\texttt{http\_pattern} in Line~15 for recognising URL-addresses.
+\texttt{http\_pattern} in Line~\ref{httpline} for recognising
-It finds all links using the library function
+URL-addresses. It finds all links using the library function
-\texttt{findAllIn} in Line~21.\label{crawler1}}
+\texttt{findAllIn} in Line~\ref{findallline}.\label{crawler1}}
 \end{figure}
 \begin{figure}[p]
 \lstinputlisting{../progs/crawler2.scala}
 \caption{A version of the web-crawler that only follows links
 in ``my'' domain---since these are the ones I am interested in
 fixing. It uses the regular expression \texttt{my\_urls} in
-Line~16 to check for my name in the links. The main change is
+Line~\ref{myurlline} to check for my name in the links. The
-in Lines~24--28 where there is a test whether URL is in ``my''
+main change is in
-domain or not.\label{crawler2}}
+Lines~\ref{changestartline}--\ref{changeendline} where there
+is a test whether a URL is in ``my'' domain or
+not.\label{crawler2}}
 \end{figure}
 \begin{figure}[p]
 \lstinputlisting{../progs/crawler3.scala}
 \caption{A small email harvester---whenever we download a
 web-page, we also check whether it contains any email
 addresses. For this we use the regular expression
-\texttt{email\_pattern} in Line~15. The main change is in Line
+\texttt{email\_pattern} in Line~\ref{emailline}. The main
-30 where all email addresses that can be found in a page are
+change is in Line~\ref{mainline} where all email addresses
-printed.\label{crawler3}}
+that can be found in a page are printed.\label{crawler3}}
 \end{figure}
 \begin{figure}[p]
 \tiny
 \begin{center}
 \begin{minipage}{0.8\textwidth}
 \lstinputlisting[language={},keywordstyle=\color{black},numbers=none]{../progs/email-rexp}
 \end{minipage}
 \end{center}
-\caption{Nothing that can be said\ldots\label{monster}}
+\caption{Nothing that can be said about this\ldots\label{monster}}
 \end{figure}
 \end{document}

changeset 399	5c1fbb39c93e
parent 398	c8ce95067c1a
child 403	564f7584eff1