\documentclass{article}\usepackage{charter}\usepackage{hyperref}\usepackage{amssymb}\usepackage{amsmath}\usepackage[T1]{fontenc}\usepackage{listings}\usepackage{xcolor}\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%\definecolor{javared}{rgb}{0.6,0,0} % for strings\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc\lstdefinelanguage{scala}{ morekeywords={abstract,case,catch,class,def,% do,else,extends,false,final,finally,% for,if,implicit,import,match,mixin,% new,null,object,override,package,% private,protected,requires,return,sealed,% super,this,throw,trait,true,try,% type,val,var,while,with,yield}, otherkeywords={=>,<-,<\%,<:,>:,\#,@}, sensitive=true, morecomment=[l]{//}, morecomment=[n]{/*}{*/}, morestring=[b]", morestring=[b]', morestring=[b]"""}\lstset{language=Scala, basicstyle=\ttfamily, keywordstyle=\color{javapurple}\bfseries, stringstyle=\color{javagreen}, commentstyle=\color{javagreen}, morecomment=[s][\color{javadocblue}]{/**}{*/}, numbers=left, numberstyle=\tiny\color{black}, stepnumber=1, numbersep=10pt, tabsize=2, showspaces=false, showstringspaces=false}\begin{document}\section*{Handout 1}This course is about the processing of strings. Lets start with what we mean by \emph{strings}. Strings(they are also sometimes referred to as \emph{words}) are lists of characters drawn from an \emph{alphabet}. If nothing else is specified, we usually assume the alphabet consists of just the lower-case letters $a$, $b$, \ldots, $z$. Sometimes, however, we explicitlyrestrict strings to contain, for example, only the letters $a$ and $b$. In this case we say the alphabet is the set $\{a, b\}$.There are many ways how we can write down strings. In programming languages, they are usually written as {\it "hello"} where the double quotes indicate that we dealing with a string. Essentially, strings are lists of characters which can be written for example as follows\[[\text{\it h, e, l, l, o}]\]\noindentThe important point is that we can always decompose strings. For example, we will often consider thefirst character of a string, say $h$, and the ``rest'' of a string say {\it "ello"} when making definitions about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenationgives {\it "foobar"}.We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$is\[\{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}\]\noindentAny set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behindthis choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like \[\{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}\]\noindentthen we have essentially described the English language, or more precisely allstrings that can be used in a sentence of the English language. French would be adifferent set of strings, and so on. In the context of this course, a language might not necessarily make sense from a natural language point of view. For examplethe set of all strings shown above is a language, as is the empty set (of strings). Theempty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a difference between the empty set, or empty language, and the set that contains only the empty string $\{\text{""}\}$: the former has no elements, whereas the latter has one element.As seen, there are languages which contain infinitely many strings, like the set of all strings.The ``natural'' languages like English, French and so on contain many but only finitely many strings (namely the ones listed in a good dictionary). It might be therefore be surprising that thelanguage consisting of all email addresses is infinite provided we assume it is defined by theregular expression\footnote{See \url{http://goo.gl/5LoVX7}}\[([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})\]\noindentThe reason is that for example before the $@$-sign there can be any string you want assuming it is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely manyof those. Similarly the string after the $@$-sign can be any string. However, this does not mean that every string is an email address. For example\[\text{\it foo}@\text{\it bar}.\text{\it c}\]\noindentis not, because the top-level-domains must be of length of at least two. (Note that there isthe convention that uppercase letters are treated in email-addresses as if they werelower-case.)\bigskipBefore we expand on the topic of regular expressions, let us review some operations onsets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. The union of two sets is written as usual as $A \cup B$. We also need to define theoperation of \emph{concatenating} two sets of strings. This can be defined as\[A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}\]\noindentwhich essentially means take the first string from the set $A$ and concatenate it with everystring in the set $B$, then take the second string from $A$ do the same and so on. You mightlike to think about what this definition means in case $A$ or $B$ is the empty set.We also need to definethe power of a set, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows\begin{center}\begin{tabular}{rcl}$A^0$ & $\dn$ & $\{[\,]\}$ \\$A^{n+1}$ & $\dn$ & $A @ A^n$\\\end{tabular}\end{center}\noindentFinally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the unionof every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is\[A^* \dn \bigcup_{0\le n} A^n\]\noindentThis definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore have \[A^* = \{"", "a", "aa", "aaa", \ldots\}\]\noindentBe aware that these operations sometimes have quite non-intuitive properties, for example\begin{center}\begin{tabular}{ccc}\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}$A \cup \varnothing$ & $=$ & $A$\\$A \cup A$ & $=$ & $A$\\$A \cup B$ & $=$ & $B \cup A$\\\end{tabular} &\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}$A @ B$ & $\not =$ & $B @ A$\\$A @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\$A @ \{""\}$ & $=$ & $\{""\} @ A = A$\\\end{tabular} &\begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}$\varnothing^*$ & $=$ & $\{""\}$\\$\{""\}^*$ & $=$ & $\{""\}$\\$A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\\end{tabular} \end{tabular}\end{center}\bigskip\noindent\emph{Regular expressions} are meant to conveniently describe languages...at least languageswe are interested in in Computer Science. For example there is no convenient regular expressionfor describing the English language short of enumerating all English words. But they seem useful for describing all permitted email addresses, as seen above. Regular expressions are given by the following grammar:\begin{center}\begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l} $r$ & $::=$ & $\varnothing$ & null\\ & $\mid$ & $\epsilon$ & empty string / "" / []\\ & $\mid$ & $c$ & single character\\ & $\mid$ & $r_1 \cdot r_2$ & sequence\\ & $\mid$ & $r_1 + r_2$ & alternative / choice\\ & $\mid$ & $r^*$ & star (zero or more)\\ \end{tabular}\end{center}\noindentBecause we overload our notation there are some subtleties you should be aware of. The letter $c$ stands for any character from thealphabet at hand. Second, we will use parentheses to disambiguateregular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times$r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice$r_1 + r_2$ is written as $r_1\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form\[([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots\]\noindentmeaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, thena domain name (specified by the regular expression $([\ldots])^+$), then a top-level domain. Similarly ifwe want to specify the regular expression for the string {\it "hello"} we should write\[{\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}\]\noindentbut often just write {\it hello}.Another source of confusion might arise from the fact that we use the term \emph{regular expressions} for the ones used in ``theory''and also the ones in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regularexpression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience as they can be seen as shorthand for\begin{center}\begin{tabular}{rcl}$r^+$ & $\mapsto$ & $r\cdot r^*$\\$r^?$ & $\mapsto$ & $\epsilon + r$\\$\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\$[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\\end{tabular}\end{center}We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regularexpression is supposed to stand for every string \emph{not} matched by a regular expression. We will writesuch not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand forthese kind of not-regular-expressions is. Try to write down the regular expression which can match anystring except {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include$\sim{}r$ in the definition or regular expressions. Whenever we do so, we will state it explicitly.So far we have only considered informally what the \emph{meaning} of a regular expression is. To do so more formally we will associate with every regular expression a set of strings that is supposed to be are matched by thisregular expression. This can be defined recursively as follows\begin{center}\begin{tabular}{rcl}$L(\varnothing)$ & $\dn$ & $\{\,\}$\\$L(\epsilon)$ & $\dn$ & $\{""\}$\\$L(c)$ & $\dn$ & $\{"c"\}$\\$L(r_1+ r_2)$ & $\dn$ & $L(r_1) \cup L(r_2)$\\$L(r_1 \cdot r_2)$ & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\$L(r^*)$ & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\\end{tabular}\end{center}\noindentThis means we can now precisely state what the meaning, for example, of the regular expression ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possiblybe matched by this choice. You can now also conclude why we do not make a differencebetween the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they are not the same regular expression, but have the same meaning. The point of the definition of $L$ is that we can now precisely specify when a string $s$ is matched by aregular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ andany regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},if $s \not\in L(r)$. We leave this for the next lecture.\begin{figure}[p]{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler1.scala}}}\caption{Scala code for a web-crawler that can detect broken links in a web-page. It usesthe regular expression {\tt http\_pattern} in Line~15 for recognising URL-addresses. It findsall links using the library function {\tt findAllIn} in Line~21.}\end{figure}\begin{figure}[p]{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler2.scala}}}\caption{A version of the web-crawler which only follows links in ``my'' domain---since these are theones I am interested in to fix. It uses the regular expression {\tt my\_urls} in Line~16.The main change is in Line~26 where there is a test whether URL is in ``my'' domain or not.}\end{figure}\begin{figure}[p]{\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler3.scala}}}\caption{A small email harvester---whenever we download a web-page, we also check whetherit contains any email addresses. For this we use the regular expression {\tt email\_pattern} inLine~17. The main change is in Lines 33 and 34 where all email addresses that can be found in a page are printed.}\end{figure}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: