handouts/scala-ho.tex
changeset 228 4df4404455d0
parent 227 93bd75031ced
child 229 00c4fda3d6c5
equal deleted inserted replaced
227:93bd75031ced 228:4df4404455d0
     7 \usepackage{../langs}
     7 \usepackage{../langs}
     8 \usepackage{mathpazo}
     8 \usepackage{mathpazo}
     9 \usepackage[scaled=.95]{helvet}
     9 \usepackage[scaled=.95]{helvet}
    10 
    10 
    11 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
    11 \newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
    12 
    12 \definecolor{codegray}{gray}{0.9}
       
    13 \newcommand{\code}[1]{\colorbox{codegray}{\texttt{#1}}}
    13 
    14 
    14 \begin{document}
    15 \begin{document}
    15 
    16 
    16 \section*{A Crash-Course on Scala}
    17 \section*{A Crash-Course on Scala}
    17 
    18 
    18 Scala is programming language that combines functional and
    19 Scala is programming language that combines functional and
    19 object-oriented programming-styles. This language received in
    20 object-oriented programming-styles, and has received in the
    20 the last five years quite a bit of attention. One reason is
    21 last five years quite a bit of attention. One reason for this
    21 that, like the Java programming language, it compiles to the
    22 attention is that, like the Java programming language, it
    22 Java Virtual Machine (JVM) and therefore can run under MacOSX,
    23 compiles to the Java Virtual Machine (JVM) and therefore can
    23 Linux and Windows.\footnote{There are also experimental
    24 run under MacOSX, Linux and Windows.\footnote{There are also
    24 backends for Android and JavaScript.} The main compiler can be
    25 experimental backends for Android and JavaScript.} Unlike
       
    26 Java, however, Scala often allows programmers to write concise
       
    27 and elegant code; some therefore say Scala is the much better
       
    28 Java. If you want to try it out, the Scala compiler can be
    25 downloaded from
    29 downloaded from
    26 
    30 
    27 \begin{quote}
    31 \begin{quote}
    28 \url{http://www.scala-lang.org}
    32 \url{http://www.scala-lang.org}
    29 \end{quote}
    33 \end{quote}
    30 
    34 
    31 \noindent Why do I use Scala in this course? Actually, you can
    35 Why do I use Scala in the AFL course? Actually, you can do
    32 do any part of the programming coursework in \emph{any}
    36 \emph{any} part of the programming coursework in \emph{any}
    33 programming language you like. I use Scale because its
    37 programming language you like. I use Scala for showing you
    34 functional programming-style allows for some very small and
    38 code during the lectures because its functional
    35 elegant code. Since the compiler is free, you can download it
    39 programming-style allows me to implement some of the functions
    36 and run every example I give. But if you prefer, you can also
    40 we will discuss with very small and elegant code. Since the
    37 translate the examples into any other functional language, for
    41 compiler is free, you can download it and run every example I
    38 example Haskell, ML, F\# and so on.
    42 give. But if you prefer, you can also translate the examples
    39 
    43 into any other functional language, for example Haskell, ML,
    40 Writing programs in Scala can be done with Eclipse IDE and
    44 F\# and so on.
    41 also with IntelliJ, but I am using just the Emacs-editor and
    45 
    42 run programs on the command line. One advantage of Scala is
    46 Writing programs in Scala can be done with the Eclipse IDE and
    43 that it has an interpreter (a REPL --- read-eval-print-loop)
    47 also with IntelliJ, but for the small programs we will look at
    44 with which you can run and test small code-snippets without
    48 the Emacs-editor id good for me and I will run programs on the
    45 the need of the compiler. This helps a lot for interactively
    49 command line. One advantage of Scala over Java is that it
       
    50 includes an interpreter (a REPL, or Read-Eval-Print-Loop) with
       
    51 which you can run and test small code-snippets without the
       
    52 need of the compiler. This helps a lot for interactively
    46 developing programs. Once you installed Scala correctly, you
    53 developing programs. Once you installed Scala correctly, you
    47 can start the interpreter by typing
    54 can start the interpreter by typing
       
    55 
    48 
    56 
    49 \begin{alltt}
    57 \begin{alltt}
    50 $ scala\small
    58 $ scala\small
    51 Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM).
    59 Welcome to Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM).
    52 Type in expressions to have them evaluated.
    60 Type in expressions to have them evaluated.
    53 Type :help for more information.\normalsize
    61 Type :help for more information.\normalsize
    54 
    62 
    55 scala>
    63 scala>
    56 \end{alltt}
    64 \end{alltt}
    57 
    65 
    58 \noindent At the scala prompt you can type things like {\tt 2 + 3}
    66 \noindent The precise output may vary due to the platform
    59 \keys{Ret}. The output will be
    67 where you installed Scala. At the scala prompt you can type
       
    68 things like {\tt 2 + 3} \keys{Ret} and the output will be
    60 
    69 
    61 \begin{alltt}
    70 \begin{alltt}
    62 scala> 2 + 3
    71 scala> 2 + 3
    63 res0: Int = 5
    72 res0: Int = 5
    64 \end{alltt}
    73 \end{alltt}
    65 
    74 
    66 \noindent
    75 \noindent indicating that the result of the addition is of
    67 indicating that the result is of type {\tt Int} and the result 
    76 type {\tt Int} and the actual result is {\tt 5}. Another
    68 of the addition is {\tt 5}. Another example you can type in 
    77 example you can type in is
    69 immediately is
       
    70 
    78 
    71 \begin{alltt}
    79 \begin{alltt}
    72 scala> print ("hello world")
    80 scala> print ("hello world")
    73 hello world
    81 hello world
    74 \end{alltt}
    82 \end{alltt}
    75 
    83 
    76 \noindent which prints out a string. Note that in this case
    84 \noindent which prints out a string. Note that in this case
    77 there is no result: the reason is that {\tt print} does not
    85 there is no result: the reason is that {\tt print} does not
    78 produce any result indicated by {\tt res\_}, rather it is a
    86 actually produce a result (there is no {\tt res\_}), rather it
    79 function that causes a \emph{side-effect} of printing out a
    87 is a function that causes the \emph{side-effect} of printing
    80 string. Once you are more familiar with the functional
    88 out a string. Once you are more familiar with the functional
    81 programming-style, you will immediately see what the
    89 programming-style, you will know what the difference is
    82 difference is between a function that returns a result and a
    90 between a function that returns a result, like addition, and a
    83 function that causes a side-effect (the latter always has as
    91 function that causes a side-effect, like {\tt print}. We shall
    84 return type {\tt Unit}).
    92 come back to this later, but if you are curious, the latter
       
    93 kind of functions always have as return type {\tt Unit}.
    85 
    94 
    86 \subsection*{Inductive Datatypes}
    95 \subsection*{Inductive Datatypes}
    87 
    96 
    88 The elegance and conciseness of Scala programs stems often
    97 The elegance and conciseness of Scala programs stems often
    89 from the fact that inductive datatypes can be easily defined.
    98 from the fact that inductive datatypes can be easily defined.
    90 For example in ``Mathematics'' we would define regular 
    99 For example in ``every-day Mathematics'' we would define
    91 expressions by the grammar
   100 regular expressions simply by the grammar
    92 
   101 
    93 \begin{center}
   102 \begin{center}
    94 \begin{tabular}{r@{\hspace{2mm}}r@{\hspace{2mm}}l@{\hspace{13mm}}l}
   103 \begin{tabular}{r@{\hspace{2mm}}r@{\hspace{2mm}}l@{\hspace{13mm}}l}
    95   $r$ & $::=$ &   $\varnothing$         & null\\
   104   $r$ & $::=$ &   $\varnothing$         & null\\
    96         & $\mid$ & $\epsilon$           & empty string\\
   105         & $\mid$ & $\epsilon$           & empty string\\
   101   \end{tabular}
   110   \end{tabular}
   102 \end{center}
   111 \end{center}
   103 
   112 
   104 \noindent This grammar specifies what regular expressions are
   113 \noindent This grammar specifies what regular expressions are
   105 (essentially a kind of tree-structure with three kinds of
   114 (essentially a kind of tree-structure with three kinds of
   106 inner nodes and three leave nodes). If you are familiar with
   115 inner nodes and three kinds of leave nodes). If you are
   107 Java, it might be an instructive exercise to define this
   116 familiar with Java, it might be an instructive exercise to
   108 kind of inductive datatypes in Java.
   117 define this kind of inductive datatypes in
   109 
   118 Java.\footnote{Happy programming! ;o)}
   110 Implementing the regular expressions from above in Scala
   119 
   111 requires an \emph{abstract class}, say, {\tt Rexp}. The
   120 Implementing the regular expressions from above in Scala is
   112 different kinds of regular expressions will be instances of
   121 very simple: It first requires an \emph{abstract class}, say,
   113 this abstract class. The cases for $\varnothing$ and
   122 {\tt Rexp}. This will act as type for regular expressions.
   114 $\epsilon$ do not require any arguments, while in the other
   123 Second, it requires some instances. The cases for
   115 cases we do have arguments. For example the character regular
   124 $\varnothing$ and $\epsilon$ do not have any arguments, while
   116 expressions need to take as argument the character they are
   125 in the other cases we do have arguments. For example the
   117 supposed to recognise.
   126 character regular expression needs to take as an argument the
   118 
   127 character it is supposed to recognise. In Scala, the cases
   119 . is a relative recen programming language
   128 without arguments are called \emph{case objects}, while the
   120 This course is about the processing of strings. Lets start
   129 ones with arguments are \emph{case classes}. The corresponding
   121 with what we mean by \emph{strings}. Strings (they are also
   130 code is as follows:
   122 sometimes referred to as \emph{words}) are lists of characters
   131 
   123 drawn from an \emph{alphabet}. If nothing else is specified,
   132 \begin{quote}
   124 we usually assume the alphabet consists of just the lower-case
   133 \begin{lstlisting}[language=Scala]
   125 letters $a$, $b$, \ldots, $z$. Sometimes, however, we
   134 abstract class Rexp 
   126 explicitly restrict strings to contain, for example, only the
   135 case object NULL extends Rexp
   127 letters $a$ and $b$. In this case we say the alphabet is the
   136 case object EMPTY extends Rexp
   128 set $\{a, b\}$.
   137 case class CHAR (c: Char) extends Rexp
   129 
   138 case class SEQ (r1: Rexp, r2: Rexp) extends Rexp 
   130 There are many ways how we can write down strings. In programming languages, they are usually 
   139 case class ALT (r1: Rexp, r2: Rexp) extends Rexp 
   131 written as {\it "hello"} where the double quotes indicate that we dealing with a string. 
   140 case class STAR (r: Rexp) extends Rexp 
   132 Essentially, strings are lists of characters which can be written for example as follows
   141 \end{lstlisting}
   133 
   142 \end{quote}
   134 \[
   143 
   135 [\text{\it h, e, l, l, o}]
   144 \noindent Given the grammar above, I hope you can see the
   136 \]
   145 underlying pattern. In order to be an instance of {\tt Rexp},
   137 
   146 each case object or case class needs to extend {\tt Rexp}. If
   138 \noindent
   147 you want to play with such definitions, feel free to define
   139 The important point is that we can always decompose strings. For example, we will often consider the
   148 for example binary trees.
   140 first character of a string, say $h$, and the ``rest''  of a string say {\it "ello"} when making definitions 
   149 
   141 about strings. There are some subtleties with the empty string, sometimes written as {\it ""} but also as 
   150 Once you make a definition like the one above, you can 
   142 the empty list of characters $[\,]$. Two strings, for example $s_1$ and $s_2$, can be \emph{concatenated}, 
   151 represent, say, the regular expression for $a + b$ as
   143 which we write as $s_1 @ s_2$. Suppose we are given two strings {\it "foo"} and {\it "bar"}, then their concatenation
   152 {\tt ALT(CHAR('a'), CHAR('b'))}. If you want to assign 
   144 gives {\it "foobar"}.
   153 this regular expression to a variable, you can just type
   145 
   154 
   146 We often need to talk about sets of strings. For example the set of all strings over the alphabet $\{a, \ldots\, z\}$
   155 \begin{alltt}
   147 is
   156 scala> val r = ALT(CHAR('a'), CHAR('b'))
   148 
   157 r: ALT = ALT(CHAR(a),CHAR(b))
   149 \[
   158 \end{alltt}
   150 \{\text{\it "", "a", "b", "c",\ldots,"z", "aa", "ab", "ac", \ldots, "aaa", \ldots}\}
   159 
   151 \]
   160 \noindent In order to make such assignments there is no
   152 
   161 constructor need in the class (like in Java). However, if
   153 \noindent
   162 there is the need, you can of course define such a constructor
   154 Any set of strings, not just the set-of-all-strings, is often called a \emph{language}. The idea behind
   163 in Scala. 
   155 this choice of terminology is that if we enumerate, say, all words/strings from a dictionary, like 
   164 
   156 
   165 Note that Scala says the variable {\tt r} is of type {\tt
   157 \[
   166 ALT}, not {\tt Rexp}. Scala always tries to find the most
   158 \{\text{\it "the", "of", "milk", "name", "antidisestablishmentarianism", \ldots}\}
   167 general type that is needed for a variable, but does not
   159 \]
   168 ``over-generalise''. In this case there is no need to give
   160 
   169 {\tt r} the more general type of {\tt Rexp}. This is different
   161 \noindent
   170 if you want to form a list of regular expressions, for example
   162 then we have essentially described the English language, or more precisely all
   171 
   163 strings that can be used in a sentence of the English language. French would be a
   172 \begin{alltt}
   164 different set of strings, and so on. In the context of this course, a language might 
   173 scala> val ls = List(ALT(CHAR('a'), CHAR('b')), NULL)
   165 not necessarily make sense from a natural language point of view. For example
   174 ls: List[Rexp] = List(ALT(CHAR(a),CHAR(b)), NULL)
   166 the set of all strings shown above is a language, as is the empty set (of strings). The
   175 \end{alltt}
   167 empty set of strings is often written as $\varnothing$ or $\{\,\}$. Note that there is a 
   176 
   168 difference between the empty set, or empty language, and the set that 
   177 \noindent In this case Scala needs to assign a type to the
   169 contains only the empty string $\{\text{""}\}$: the former has no elements, whereas 
   178 regular expressions, so that it is compatible with the fact
   170 the latter has one element.
   179 that list can only contain elements of a single type, in this
   171 
   180 case this is {\tt Rexp}.\footnote{If you type in this example,
   172 As seen, there are languages which contain infinitely many strings, like the set of all strings.
   181 you will notice that the type contains some further
   173 The ``natural'' languages like English, French and so on contain many but only finitely many 
   182 information, but lets ignore this for the moment.} Note that if a type takes another
   174 strings (namely the ones listed in a good dictionary). It might be therefore be surprising that the
   183 type as argument, this is written for example as
   175 language consisting of all email addresses is infinite provided we assume it is defined by the
   184 {\tt List[Rexp]}.
   176 regular expression\footnote{See \url{http://goo.gl/5LoVX7}}
   185 
   177 
   186 \subsection*{Functions and Pattern-Matching}
   178 \[
   187 
   179 ([\text{\it{}a-z0-9\_.-}]^+)@([\text{\it a-z0-9.-}]^+).([\text{\it a-z.}]^{\{2,6\}})
   188 
   180 \]
   189 
   181 
   190 
   182 \noindent
   191 \subsection*{Types}
   183 One reason is that before the $@$-sign there can be any string you want assuming it 
   192 
   184 is made up from letters, digits, underscores, dots and hyphens---clearly there are infinitely many
   193 \subsection*{Cool Stuff}
   185 of those. Similarly the string after the $@$-sign can be any string. However, this does not mean 
   194 
   186 that every string is an email address. For example
       
   187 
       
   188 \[
       
   189 "\text{\it foo}@\text{\it bar}.\text{\it c}"
       
   190 \]
       
   191 
       
   192 \noindent
       
   193 is not, because the top-level-domains must be of length of at least two. (Note that there is
       
   194 the convention that uppercase letters are treated in email-addresses as if they were
       
   195 lower-case.)
       
   196 \bigskip
       
   197 
       
   198 Before we expand on the topic of regular expressions, let us review some operations on
       
   199 sets. We will use capital letters $A$, $B$, $\ldots$ to stand for sets of strings. 
       
   200 The union of two sets is written as usual as $A \cup B$. We also need to define the
       
   201 operation of \emph{concatenating} two sets of strings. This can be defined as
       
   202 
       
   203 \[
       
   204 A @ B \dn \{s_1@ s_2 | s_1 \in A \wedge s_2 \in B \}
       
   205 \]
       
   206 
       
   207 \noindent
       
   208 which essentially means take the first string from the set $A$ and concatenate it with every
       
   209 string in the set $B$, then take the second string from $A$ do the same and so on. You might
       
   210 like to think about what this definition means in case $A$ or $B$ is the empty set.
       
   211 
       
   212 We also need to define
       
   213 the power of a set of strings, written as $A^n$ with $n$ being a natural number. This is defined inductively as follows
       
   214 
       
   215 \begin{center}
       
   216 \begin{tabular}{rcl}
       
   217 $A^0$ & $\dn$ & $\{[\,]\}$ \\
       
   218 $A^{n+1}$ & $\dn$ & $A @ A^n$\\
       
   219 \end{tabular}
       
   220 \end{center}
       
   221 
       
   222 \noindent
       
   223 Finally we need the \emph{star} of a set of strings, written $A^*$. This is defined as the union
       
   224 of every power of $A^n$ with $n\ge 0$. The mathematical notation for this operation is
       
   225 
       
   226 \[
       
   227 A^* \dn \bigcup_{0\le n} A^n
       
   228 \]
       
   229 
       
   230 \noindent
       
   231 This definition implies that the star of a set $A$ contains always the empty string (that is $A^0$), one 
       
   232 copy of every string in $A$ (that is $A^1$), two copies in $A$ (that is $A^2$) and so on. In case $A=\{"a"\}$ we therefore 
       
   233 have 
       
   234 
       
   235 \[
       
   236 A^* = \{"", "a", "aa", "aaa", \ldots\}
       
   237 \]
       
   238 
       
   239 \noindent
       
   240 Be aware that these operations sometimes have quite non-intuitive properties, for example
       
   241 
       
   242 \begin{center}
       
   243 \begin{tabular}{@{}ccc@{}}
       
   244 \begin{tabular}{@{}r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   245 $A \cup \varnothing$ & $=$ & $A$\\
       
   246 $A \cup A$ & $=$ & $A$\\
       
   247 $A \cup B$ & $=$ & $B \cup A$\\
       
   248 \end{tabular} &
       
   249 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   250 $A @ B$ & $\not =$ & $B @ A$\\
       
   251 $A  @ \varnothing$ & $=$ & $\varnothing @ A = \varnothing$\\
       
   252 $A  @ \{""\}$ & $=$ & $\{""\} @ A = A$\\
       
   253 \end{tabular} &
       
   254 \begin{tabular}{r@{\hspace{1mm}}c@{\hspace{1mm}}l@{}}
       
   255 $\varnothing^*$ & $=$ & $\{""\}$\\
       
   256 $\{""\}^*$ & $=$ & $\{""\}$\\
       
   257 $A^\star$ & $=$ & $\{""\} \cup A\cdot A^*$\\
       
   258 \end{tabular} 
       
   259 \end{tabular}
       
   260 \end{center}
       
   261 \bigskip
       
   262 
       
   263 \noindent
       
   264 \emph{Regular expressions} are meant to conveniently describe languages...at least languages
       
   265 we are interested in in Computer Science.  For example there is no convenient regular expression
       
   266 for describing the English language short of enumerating all English words. 
       
   267 But they seem useful for describing all permitted email addresses, as seen above. 
       
   268 
       
   269 Regular expressions are given by the following grammar:
       
   270 
       
   271 \begin{center}
       
   272 \begin{tabular}{r@{\hspace{1mm}}r@{\hspace{1mm}}l@{\hspace{13mm}}l}
       
   273   $r$ & $::=$ &   $\varnothing$         & null\\
       
   274         & $\mid$ & $\epsilon$              & empty string / "" / []\\
       
   275         & $\mid$ & $c$                         & single character\\
       
   276         & $\mid$ & $r_1 \cdot r_2$      & sequence\\
       
   277         & $\mid$ & $r_1 + r_2$            & alternative / choice\\
       
   278         & $\mid$ & $r^*$                      & star (zero or more)\\
       
   279   \end{tabular}
       
   280 \end{center}
       
   281 
       
   282 \noindent
       
   283 Because we overload our notation, there are some subtleties you should be aware of. First, the letter 
       
   284 $c$ stands for any character from the
       
   285 alphabet at hand. Second, we will use parentheses to disambiguate
       
   286 regular expressions. For example we will write $(r_1 + r_2)^*$, which is different from, say $r_1 + (r_2)^*$.
       
   287 The former means roughly zero or more times $r_1$ or $r_2$, while the latter means $r_1$ or zero or more times
       
   288 $r_2$. We should also write $(r_1 + r_2) + r_3$, which is different from the regular expression $r_1 + (r_2 + r_3)$,
       
   289 but in case of $+$ and $\cdot$ we actually do not care about the order and just write $r_1 + r_2 + r_3$, or $r_1 \cdot r_2 \cdot r_3$,
       
   290 respectively. The reasons for this will become clear shortly. In the literature you will often find that the choice
       
   291 $r_1 + r_2$  is written as $r_1\mid{}r_2$ or $r_1\mid\mid{}r_2$. Also following the convention in the literature, we will in case of $\cdot$ even 
       
   292 often omit it all together. For example the regular expression for email addresses shown above is meant to be of the form
       
   293 
       
   294 \[
       
   295 ([\ldots])^+ \cdot @ \cdot ([\ldots])^+ \cdot . \cdot \ldots
       
   296 \]
       
   297 
       
   298 \noindent
       
   299 meaning first comes a name (specified by the regular expression $([\ldots])^+$), then an $@$-sign, then
       
   300 a domain name (specified by the regular expression $([\ldots])^+$), then a dot and then a top-level domain. Similarly if
       
   301 we want to specify the regular expression for the string {\it "hello"} we should write
       
   302 
       
   303 \[
       
   304 {\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}
       
   305 \]
       
   306 
       
   307 \noindent
       
   308 but often just write {\it hello}.
       
   309 
       
   310 Another source of confusion might arise from the fact that we use the term \emph{regular expression} for the regular expressions used in ``theory''
       
   311 and also the ones used in ``practice''. In this course we refer by default to the regular expressions defined by the grammar above. 
       
   312 In ``practice'' we often use $r^+$ to stand for one or more times, $\backslash{}d$ to stand for a digit, $r^?$ to stand for an optional regular
       
   313 expression, or ranges such as $[\text{\it a - z}]$ to stand for any lower case letter from $a$ to $z$. They are however mere convenience 
       
   314 as they can be seen as shorthand for
       
   315 
       
   316 \begin{center}
       
   317 \begin{tabular}{rcl}
       
   318 $r^+$ & $\mapsto$ & $r\cdot r^*$\\
       
   319 $r^?$ & $\mapsto$ & $\epsilon + r$\\
       
   320 $\backslash d$ & $\mapsto$ & $0 + 1 + 2 + \ldots + 9$\\
       
   321 $[\text{\it a - z}]$ & $\mapsto$ & $a + b + \ldots + z$\\
       
   322 \end{tabular}
       
   323 \end{center}
       
   324 
       
   325 
       
   326 We will see later that the \emph{not}-regular-expression can also be seen as convenience. This regular
       
   327 expression is supposed to stand for every string \emph{not} matched by a regular expression. We will write
       
   328 such not-regular-expressions as $\sim{}r$. While being ``convenience'' it is often not so clear what the shorthand for
       
   329 these kind of not-regular-expressions is. Try to write down the regular expression which can match any
       
   330 string except the two strings {\it "hello"} and {\it "world"}. It is possible in principle, but often it is easier to just include
       
   331 $\sim{}r$ in the definition of regular expressions. Whenever we do so, we will state it explicitly.
       
   332 
       
   333 So far we have only considered informally what the \emph{meaning} of a regular expression is.  
       
   334 To do so more formally we will associate with every regular expression a set of strings 
       
   335 that is supposed to be matched by this
       
   336 regular expression. This can be defined recursively  as follows
       
   337 
       
   338 \begin{center}
       
   339 \begin{tabular}{rcl}
       
   340 $L(\varnothing)$  & $\dn$ & $\{\,\}$\\
       
   341 $L(\epsilon)$       & $\dn$ & $\{""\}$\\
       
   342 $L(c)$                  & $\dn$ & $\{"c"\}$\\
       
   343 $L(r_1+ r_2)$      & $\dn$ & $L(r_1) \cup L(r_2)$\\
       
   344 $L(r_1 \cdot r_2)$  & $\dn$ & $\{s_1@ s_2 | s_1 \in L(r_1) \wedge s_2 \in L(r_2) \}$\\
       
   345 $L(r^*)$                   & $\dn$ & $\bigcup_{n \ge 0} L(r)^n$\\
       
   346 \end{tabular}
       
   347 \end{center}
       
   348 
       
   349 \noindent
       
   350 As a result we can now precisely state what the meaning, for example, of the regular expression 
       
   351 ${\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}$ is, namely 
       
   352 $L({\it h} \cdot {\it e} \cdot {\it l} \cdot {\it l} \cdot {\it o}) = \{\text{\it"hello"}\}$...as expected. Similarly if we have the 
       
   353 choice-regular-expression $a + b$, its meaning is $L(a + b) = \{\text{\it"a"}, \text{\it"b"}\}$, namely the only two strings which can possibly
       
   354 be matched by this choice. You can now also see why we do not make a difference
       
   355 between the different regular expressions $(r_1 + r_2) + r_3$ and $r_1 + (r_2 + r_3)$....they 
       
   356 are not the same regular expression, but have the same meaning. 
       
   357 
       
   358 The point of the definition of $L$ is that we can use it to precisely specify when a string $s$ is matched by a
       
   359 regular expression $r$, namely only when $s \in L(r)$. In fact we will write a program {\it match} that takes any string $s$ and
       
   360 any regular expression $r$ as argument and returns \emph{yes}, if $s \in L(r)$ and \emph{no},
       
   361 if $s \not\in L(r)$. We leave this for the next lecture.
       
   362 
       
   363 \begin{figure}[p]
       
   364 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler1.scala}}}
       
   365 \caption{Scala code for a web-crawler that can detect broken links in a web-page. It uses
       
   366 the regular expression {\tt http\_pattern} in Line~15 for recognising URL-addresses. It finds
       
   367 all links using the library function {\tt findAllIn} in Line~21.}
       
   368 \end{figure}
       
   369 
       
   370 \begin{figure}[p]
       
   371 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler2.scala}}}
       
   372 \caption{A version of the web-crawler which only follows links in ``my'' domain---since these are the
       
   373 ones I am interested in to fix. It uses the regular expression {\tt my\_urls} in Line~16.
       
   374 The main change is in Line~26 where there is a test whether URL is in ``my'' domain or not.}
       
   375 
       
   376 \end{figure}
       
   377 
       
   378 \begin{figure}[p]
       
   379 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/crawler3.scala}}}
       
   380 \caption{A small email harvester---whenever we download a web-page, we also check whether
       
   381 it contains any email addresses. For this we use the regular expression {\tt email\_pattern} in
       
   382 Line~17.  The main change is in Lines 33 and 34 where all email addresses that can be found in a page are printed.}
       
   383 \end{figure}
       
   384 
   195 
   385 \end{document}
   196 \end{document}
   386 
   197 
   387 %%% Local Variables: 
   198 %%% Local Variables: 
   388 %%% mode: latex
   199 %%% mode: latex