--- a/handouts/ho01.tex Sat Dec 29 10:30:27 2018 +0000
+++ b/handouts/ho01.tex Tue Feb 12 21:23:00 2019 +0000
@@ -36,29 +36,34 @@
%https://www.youtube.com/watch?v=gmhMQfFQu20
\begin{document}
-\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
+\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2019}
\section*{Handout 1}
This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data, ad filters and so on. When looking for a
-particular string, like $abc$ in a large text we can use the
-Knuth-Morris-Pratt algorithm, which is currently the most efficient
-general string search algorithm. But often we do \emph{not} just look
-for a particular string, but for string patterns. For example, in
-program code we need to identify what are the keywords (\texttt{if}, \texttt{then},
-\texttt{while}, \texttt{for}, etc), what are the identifiers (variable names). A pattern for
-identifiers could be stated as: they start with a letter, followed by
-zero or more letters, numbers and underscores. Often we also face the
-problem that we are given a string (for example some user input) and
-want to know whether it matches a particular pattern---be it an email
-address, for example. In this way we can exclude user input that would
-otherwise have nasty effects on our program (crashing it or making it
-go into an infinite loop, if not worse). In tools like Snort, scanning
-for computer viruses or filtering out spam usually involves scanning
-for some signature (essentially a string pattern). The point is that
-the fast Knuth-Morris-Pratt algorithm for strings is not good enough
-for such string \emph{patterns}.\smallskip
+compilers, dictionaries, DNA-data, ad filters and so on. When looking
+for a particular string, like \pcode{"foobar"} in a large text we can
+use the Knuth-Morris-Pratt algorithm, which is currently the most
+efficient general string search algorithm. But often we do \emph{not}
+just look for a particular string, but for string patterns. For
+example, in program code we need to identify what are the keywords
+(\texttt{if}, \texttt{then}, \texttt{while}, \texttt{for}, etc), what
+are the identifiers (variable names). A pattern for identifiers could
+be stated as: they start with a letter, followed by zero or more
+letters, numbers and underscores. You might also be surprised what
+constraints programming languages impose about numbers: for example
+123 in JSON is OK, but 0123 is not.
+
+Often we also face the problem that we are given a string (for example
+some user input) and want to know whether it matches a particular
+pattern---be it an email address, for example. In this way we can
+exclude user input that would otherwise have nasty effects on our
+program (crashing it or making it go into an infinite loop, if not
+worse). In tools like Snort, scanning for computer viruses or
+filtering out spam usually involves scanning for some signature
+(essentially a string pattern). The point is that the fast
+Knuth-Morris-Pratt algorithm for strings is not good enough for such
+string \emph{patterns}.\smallskip
\defn{Regular expressions} help with conveniently specifying
such patterns. The idea behind regular expressions is that
@@ -194,7 +199,7 @@
before (remember the PRA module?). Why on earth then is there any
interest in studying them again in depth in this module? Well, one
answer is in the following two graphs about regular expression
-matching in Python, Ruby and Java (Version 8).
+matching in Python, Ruby, JavaScript and Java (Version 8).
\begin{center}
\begin{tabular}{@{\hspace{-1mm}}c@{\hspace{1mm}}c@{}}
@@ -214,11 +219,12 @@
axis lines=left,
width=5.5cm,
height=4.5cm,
- legend entries={Python, Java 8},
+ legend entries={Python, Java 8, JavaScript},
legend pos=north west,
legend cell align=left]
\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};
\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};
+\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};
\end{axis}
\end{tikzpicture}
&
@@ -248,7 +254,7 @@
\end{tabular}
\end{center}
-\noindent This first graph shows that Python and Java 8 need
+\noindent This first graph shows that Python, JavaScript and Java 8 need
approximately 30 seconds to find out that the regular expression
$\texttt{(a*)*\,b}$ does not match strings of 28 \texttt{a}s.
Similarly, the second shows that Python needs approximately 29 seconds