handouts/scala-ho.tex
changeset 227 93bd75031ced
child 228 4df4404455d0
equal deleted inserted replaced
226:e3c454e31224 227:93bd75031ced
       
     1 \documentclass{article}
       
     2 \usepackage{hyperref}
       
     3 \usepackage{amssymb}
       
     4 \usepackage{alltt}
       
     5 \usepackage{menukeys}
       
     6 \usepackage{amsmath}
       
     7 \usepackage{../langs}
       
     8 \usepackage{mathpazo}
       
     9 \usepackage[scaled=.95]{helvet}
       
    10 
       
    11 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
       
    12 
       
    13 
       
    14 \begin{document}
       
    15 
       
    16 \section*{A Crash-Course on Scala}
       
    17 
       
    18 Scala is programming language that combines functional and
       
    19 object-oriented programming-styles. This language received in
       
    20 the last five years quite a bit of attention. One reason is
       
    21 that, like the Java programming language, it compiles to the
       
    22 Java Virtual Machine (JVM) and therefore can run under MacOSX,
       
    23 Linux and Windows.\footnote{There are also experimental
       
    24 backends for Android and JavaScript.} The main compiler can be
       
    25 downloaded from
       
    26 
       
    27 \begin{quote}
       
    28 \url{http://www.scala-lang.org}
       
    29 \end{quote}
       
    30 
       
    31 \noindent Why do I use Scala in this course? Actually, you can
       
    32 do any part of the programming coursework in \emph{any}
       
    33 programming language you like. I use Scale because its
       
    34 functional programming-style allows for some very small and
       
    35 elegant code. Since the compiler is free, you can download it
       
    36 and run every example I give. But if you prefer, you can also
       
    37 translate the examples into any other functional language, for
       
    38 example Haskell, ML, F\# and so on.
       
    39 
       
    40 Writing programs in Scala can be done with Eclipse IDE and
       
    41 also with IntelliJ, but I am using just the Emacs-editor and
       
    42 run programs on the command line. One advantage of Scala is
       
    43 that it has an interpreter (a REPL --- read-eval-print-loop)
       
    44 with which you can run and test small code-snippets without
       
    45 the need of the compiler. This helps a lot for interactively
       
    46 developing programs. Once you installed Scala correctly, you
       
    47 can start the interpreter by typing
       
    48 
       
    49 \begin{alltt}
       
    50 $ scala\small
       
    51 Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM).
       
    52 Type in expressions to have them evaluated.
       
    53 Type :help for more information.\normalsize
       
    54 
       
    55 scala>
       
    56 \end{alltt}
       
    57 
       
    58 \noindent At the scala prompt you can type things like {\tt 2 + 3}
       
    59 \keys{Ret}. The output will be
       
    60 
       
    61 \begin{alltt}
       
    62 scala> 2 + 3
       
    63 res0: Int = 5
       
    64 \end{alltt}
       
    65 
       
    66 \noindent
       
    67 indicating that the result is of type {\tt Int} and the result 
       
    68 of the addition is {\tt 5}. Another example you can type in 
       
    69 immediately is
       
    70 
       
    71 \begin{alltt}
       
    72 scala> print ("hello world")
       
    73 hello world
       
    74 \end{alltt}
       
    75 
       
    76 \noindent which prints out a string. Note that in this case
       
    77 there is no result: the reason is that {\tt print} does not
       
    78 produce any result indicated by {\tt res\_}, rather it is a
       
    79 function that causes a \emph{side-effect} of printing out a
       
    80 string. Once you are more familiar with the functional
       
    81 programming-style, you will immediately see what the
       
    82 difference is between a function that returns a result and a
       
    83 function that causes a side-effect (the latter always has as
       
    84 return type {\tt Unit}).
       
    85 
       
    86 \subsection*{Inductive Datatypes}
       
    87 
       
    88 The elegance and conciseness of Scala programs stems often
       
    89 from the fact that inductive datatypes can be easily defined.
       
    90 For example in ``Mathematics'' we would define regular 
       
    91 expressions by the grammar
       
    92 
       
    93 \begin{center}
       
    94 \begin{tabular}{r@{\hspace{2mm}}r@{\hspace{2mm}}l@{\hspace{13mm}}l}
       
    95   $r$ & $::=$ &   $\varnothing$         & null\\
       
    96         & $\mid$ & $\epsilon$           & empty string\\
       
    97         & $\mid$ & $c$                  & single character\\
       
    98         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
    99         & $\mid$ & $r_1 + r_2$          & alternative / choice\\
       
   100         & $\mid$ & $r^*$                & star (zero or more)\\
       
   101   \end{tabular}
       
   102 \end{center}
       
   103 
       
   104 \noindent This grammar specifies what regular expressions are
       
   105 (essentially a kind of tree-structure with three kinds of
       
   106 inner nodes and three leave nodes). If you are familiar with
       
   107 Java, it might be an instructive exercise to define this
       
   108 kind of inductive datatypes in Java.
       
   109 
       
   110 Implementing the regular expressions from above in Scala
       
   111 requires an \emph{abstract class}, say, {\tt Rexp}. The
       
   112 different kinds of regular expressions will be instances of
       
   113 this abstract class. The cases for $\varnothing$ and
       
   114 $\epsilon$ do not require any arguments, while in the other
       
   115 cases we do have arguments. For example the character regular
       
   116 expressions need to take as argument the character they are
       
   117 supposed to recognise.
       
   118 
       
   119 . is a relative recen programming language
       
   120 This course is about the processing of strings. Lets start
       
   121 with what we mean by \emph{strings}. Strings (they are also
       
   122 sometimes referred to as \emph{words}) are lists of characters
       
   123 drawn from an \emph{alphabet}. If nothing else is specified,
       
   124 we usually assume the alphabet consists of just the lower-case
       
   125 letters $a$, $b$, \ldots, $z$. Sometimes, however, we
       
   126 explicitly restrict strings to contain, for example, only the
       
   127 letters $a$ and $b$. In this case we say the alphabet is the
       
   128 set $\{a, b\}$.
       
   129 
       
   130 There are many ways how we can write down strings. In programming languages, they are usually 
       
   131 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
       
   132 Essentially, strings are lists of characters which can be written for example as follows
       
   133 
       
   134 \[
       
   135 [\text{\it h, e, l, l, o}]
       
   136 \]
       
   137 
       
   138 \noindent
       
   139 The important point is that we can always decompose strings. For example, we will often consider the
       
   140 first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions 
       
   141 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as 
       
   142 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, 
       
   143 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
       
   144 gives {\it "foobar"}.
       
   145 
       
   146 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
       
   147 is
       
   148 
       
   149 \[
       
   150 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
       
   151 \]
       
   152 
       
   153 \noindent
       
   154 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
       
   155 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like 
       
   156 
       
   157 \[
       
   158 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
       
   159 \]
       
   160 
       
   161 \noindent
       
   162 then we have essentially described the English language, or more precisely all
       
   163 strings that can be used in a sentence of the English language. French would be a
       
   164 different set of strings, and so on. In the context of this course, a language might 
       
   165 not necessarily make sense from a natural language point of view. For example
       
   166 the set of all strings shown above is a language, as is the empty set (of strings). The
       
   167 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
       
   168 difference between the empty set, or empty language, and the set that 
       
   169 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
       
   170 the latter has one element.
       
   171 
       
   172 As seen, there are languages which contain infinitely many strings, like the set of all strings.
       
   173 The ``natural'' languages like English, French and so on contain many but only finitely many 
       
   174 strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the
       
   175 language consisting of all email addresses is infinite provided we assume it is defined by the
       
   176 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
       
   177 
       
   178 \[
       
   179 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
       
   180 \]
       
   181 
       
   182 \noindent
       
   183 One reason is that before the $@$-sign there can be any string you want assuming it 
       
   184 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
       
   185 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
       
   186 that every string is an email address. For example
       
   187 
       
   188 \[
       
   189 "\text{\it foo}@\text{\it bar}.\text{\it c}"
       
   190 \]
       
   191 
       
   192 \noindent
       
   193 is not, because the top-level-domains must be of length of at least two. (Note that there is
       
   194 the convention that uppercase letters are treated in email-addresses as if they were
       
   195 lower-case.)
       
   196 \bigskip
       
   197 
       
   198 Before we expand on the topic of regular expressions, let us review some operations on
       
   199 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
       
   200 The union of two sets is written as usual as $A \cup B$. We also need to define the
       
   201 operation of \emph{concatenating} two sets of strings. This can be defined as
       
   202 
       
   203 \[
       
   204 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
       
   205 \]
       
   206 
       
   207 \noindent
       
   208 which essentially means take the first string from the set $A$ and concatenate it with every
       
   209 string in the set $B$, then take the second string from $A$ do the same and so on. You might
       
   210 like to think about what this definition means in case $A$ or $B$ is the empty set.
       
   211 
       
   212 We also need to define
       
   213 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
       
   214 
       
   215 \begin{center}
       
   216 \begin{tabular}{rcl}
       
   217 $A^0$ & $\dn$ & $\{[\,]\}$ \\
       
   218 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
       
   219 \end{tabular}
       
   220 \end{center}
       
   221 
       
   222 \noindent
       
   223 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
       
   224 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
       
   225 
       
   226 \[
       
   227 A^* \dn \bigcup_{0\le n} A^n
       
   228 \]
       
   229 
       
   230 \noindent
       
   231 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one 
       
   232 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore 
       
   233 have 
       
   234 
       
   235 \[
       
   236 A^* = \{"", "a", "aa", "aaa", \ldots\}
       
   237 \]
       
   238 
       
   239 \noindent
       
   240 Be aware that these operations sometimes have quite non-intuitive properties, for example
       
   241 
       
   242 \begin{center}
       
   243 \begin{tabular}{@{}ccc@{}}
       
   244 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   245 $A \cup \varnothing$ & $=$ & $A$\\
       
   246 $A \cup A$ & $=$ & $A$\\
       
   247 $A \cup B$ & $=$ & $B \cup A$\\
       
   248 \end{tabular} &
       
   249 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   250 $A @ B$ & $\not =$ & $B @ A$\\
       
   251 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
       
   252 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
       
   253 \end{tabular} &
       
   254 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   255 $\varnothing^*$ & $=$ & $\{""\}$\\
       
   256 $\{""\}^*$ & $=$ & $\{""\}$\\
       
   257 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
       
   258 \end{tabular} 
       
   259 \end{tabular}
       
   260 \end{center}
       
   261 \bigskip
       
   262 
       
   263 \noindent
       
   264 \emph{Regular expressions} are meant to conveniently describe languages...at least languages
       
   265 we are interested in in Computer Science.  For example there is no convenient regular expression
       
   266 for describing the English language short of enumerating all English words. 
       
   267 But they seem useful for describing all permitted email addresses, as seen above. 
       
   268 
       
   269 Regular expressions are given by the following grammar:
       
   270 
       
   271 \begin{center}
       
   272 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
       
   273   $r$ & $::=$ &   $\varnothing$         & null\\
       
   274         & $\mid$ & $\epsilon$              & empty string / "" / []\\
       
   275         & $\mid$ & $c$                         & single character\\
       
   276         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
   277         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
       
   278         & $\mid$ & $r^*$                      & star (zero or more)\\
       
   279   \end{tabular}
       
   280 \end{center}
       
   281 
       
   282 \noindent
       
   283 Because we overload our notation, there are some subtleties you should be aware of. First, the letter 
       
   284 $c$ stands for any character from the
       
   285 alphabet at hand. Second, we will use parentheses to disambiguate
       
   286 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
       
   287 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
       
   288 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
       
   289 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
       
   290 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
       
   291 $r_1 + r_2$  is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
       
   292 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
       
   293 
       
   294 \[
       
   295 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
       
   296 \]
       
   297 
       
   298 \noindent
       
   299 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
       
   300 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if
       
   301 we want to specify the regular expression for the string {\it "hello"} we should write
       
   302 
       
   303 \[
       
   304 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
       
   305 \]
       
   306 
       
   307 \noindent
       
   308 but often just write {\it hello}.
       
   309 
       
   310 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory''
       
   311 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
       
   312 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
       
   313 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
       
   314 as they can be seen as shorthand for
       
   315 
       
   316 \begin{center}
       
   317 \begin{tabular}{rcl}
       
   318 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
       
   319 $r^?$ & $\mapsto$ & $\epsilon + r$\\
       
   320 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
       
   321 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
       
   322 \end{tabular}
       
   323 \end{center}
       
   324 
       
   325 
       
   326 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
       
   327 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
       
   328 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
       
   329 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
       
   330 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
       
   331 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly.
       
   332 
       
   333 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
       
   334 To do so more formally we will associate with every regular expression a set of strings 
       
   335 that is supposed to be matched by this
       
   336 regular expression. This can be defined recursively  as follows
       
   337 
       
   338 \begin{center}
       
   339 \begin{tabular}{rcl}
       
   340 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
       
   341 $L(\epsilon)$       & $\dn$ & $\{""\}$\\
       
   342 $L(c)$                  & $\dn$ & $\{"c"\}$\\
       
   343 $L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
       
   344 $L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
       
   345 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
       
   346 \end{tabular}
       
   347 \end{center}
       
   348 
       
   349 \noindent
       
   350 As a result we can now precisely state what the meaning, for example, of the regular expression 
       
   351 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
       
   352 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
       
   353 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
       
   354 be matched by this choice. You can now also see why we do not make a difference
       
   355 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
       
   356 are not the same regular expression, but have the same meaning. 
       
   357 
       
   358 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
       
   359 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
       
   360 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
       
   361 if $s \not\in L(r)$. We leave this for the next lecture.
       
   362 
       
   363 \begin{figure}[p]
       
   364 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler1.scala}}}
       
   365 \caption{Scala code for a web-crawler that can detect broken links in a web-page. It uses
       
   366 the regular expression {\tt http\_pattern} in Line~15 for recognising URL-addresses. It finds
       
   367 all links using the library function {\tt findAllIn} in Line~21.}
       
   368 \end{figure}
       
   369 
       
   370 \begin{figure}[p]
       
   371 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler2.scala}}}
       
   372 \caption{A version of the web-crawler which only follows links in ``my'' domain---since these are the
       
   373 ones I am interested in to fix. It uses the regular expression {\tt my\_urls} in Line~16.
       
   374 The main change is in Line~26 where there is a test whether URL is in ``my'' domain or not.}
       
   375 
       
   376 \end{figure}
       
   377 
       
   378 \begin{figure}[p]
       
   379 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler3.scala}}}
       
   380 \caption{A small email harvester---whenever we download a web-page, we also check whether
       
   381 it contains any email addresses. For this we use the regular expression {\tt email\_pattern} in
       
   382 Line~17.  The main change is in Lines 33 and 34 where all email addresses that can be found in a page are printed.}
       
   383 \end{figure}
       
   384 
       
   385 \end{document}
       
   386 
       
   387 %%% Local Variables: 
       
   388 %%% mode: latex
       
   389 %%% TeX-master: t
       
   390 %%% End: