pep-material: cws/cw01.tex@018b9c12ee1f


\documentclass{article}
\usepackage{../style}
\usepackage{disclaimer}
\usepackage{../langs}

\begin{document}

\section*{Assignment 6 (Scala)}

\mbox{}\hfill\textit{``The most effective debugging tool is still careful thought,}\\
\mbox{}\hfill\textit{coupled with judiciously placed print statements.''}\smallskip\\
\mbox{}\hfill\textit{ --- Brian W. Kernighan, in Unix for Beginners (1979)}\bigskip


\noindent
This assignment is about Scala and worth 10\%. The first and second
part are due on 16 November at 11pm, and the third part on 21 December
at 11pm. You are asked to implement two programs about list
processing and recursion. The third part is more advanced and might
include material you have not yet seen in the first lecture.
\bigskip
 
\IMPORTANT{}

\noindent
Also note that the running time of each part will be restricted to a
maximum of 30 seconds on my laptop.

\DISCLAIMER{}

\subsection*{Reference Implementation}

Like the C++ assignments, the Scala assignments will work like this: you
push your files to GitHub and receive (after sometimes a long delay) some
automated feedback. In the end we take a snapshot of the submitted files and
apply an automated marking script to them.

In addition, the Scala assignments come with a reference implementation
in form of a \texttt{jar}-file. This allows you to run any test cases
on your own computer. For example you can call Scala on the command
line with the option \texttt{-cp collatz.jar} and then query any
function from the template file. Say you want to find out what
the functions \texttt{collatz} and \texttt{collatz\_max}
produce: for this you just need to prefix them with the object name
\texttt{CW6a} (and \texttt{CW6b} respectively for \texttt{drumb.jar}).
If you want to find out what these functions produce for the argument
\texttt{6}, you would type something like:

\begin{lstlisting}[language={},numbers=none,basicstyle=\ttfamily\small]
$ scala -cp collatz.jar
  
scala> CW6a.collatz(6)
...
scala> CW6a.collatz_max(6)
...
\end{lstlisting}%$

\subsection*{Hints}

\noindent
\textbf{For Part 1:} useful math operators: \texttt{\%} for modulo; useful
functions: \mbox{\texttt{(1\,to\,10)}} for ranges, \texttt{.toInt},
\texttt{.toList} for conversions, \texttt{List(...).max} for the
maximum of a list, \texttt{List(...).indexOf(...)} for the first index of
a value in a list.\bigskip

\noindent
\textbf{For Part 2 + 3:} useful string functions: \texttt{.startsWith(...)} for
checking whether a string has a given prefix, \texttt{\_ ++ \_} for
concatenating two strings; useful option functions: \texttt{.flatten}
flattens a list of options such that it filters way all
\texttt{None}'s, \texttt{Try(...).getOrElse ...} runs some code that
might raise an exception---if yes, then a default value can be given;
useful list functions: \texttt{.head} for obtaining the first element
in a non-empty list, \texttt{.length} for the length of a
list; \texttt{.filter(...)} for filtering out elements in a list; \texttt{.getLines.toList} for obtaining a list of lines from
a file; \texttt{.split(",").toList} for splitting strings according to
a comma.\bigskip

\noindent
Fortunately Scala supports operator overloading. But make sure you understand the difference between \texttt{100 / 3} and
\texttt{100.0 / 3}!

\newpage
\subsection*{Part 1 (3 Marks, file collatz.scala)}

This part is about recursion. You are asked to implement a Scala
program that tests examples of the \emph{$3n + 1$-conjecture}, also
called \emph{Collatz conjecture}. This conjecture can be described as
follows: Start with any positive number $n$ greater than $0$:

\begin{itemize}
\item If $n$ is even, divide it by $2$ to obtain $n / 2$.
\item If $n$ is odd, multiply it by $3$ and add $1$ to obtain $3n +
  1$.
\item Repeat this process and you will always end up with $1$.
\end{itemize}

\noindent
For example if you start with $6$, or $9$, you obtain the
series

\[
\begin{array}{@{}l@{\hspace{5mm}}l@{}}
6, 3, 10, 5, 16, 8, 4, 2, 1 & \text{(= 8 steps)}\\
9, 28, 14, 7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1  & \text{(= 19 steps)}\\
\end{array}
\]

\noindent
As you can see, the numbers go up and down like a roller-coaster, but
curiously they seem to always terminate in $1$. The conjecture is that
this will \emph{always} happen for every number greater than
0.\footnote{While it is relatively easy to test this conjecture with
  particular numbers, it is an interesting open problem to
  \emph{prove} that the conjecture is true for \emph{all} numbers ($>
  0$). Paul Erd\"o{}s, a famous mathematician you might have hard
  about, said about this conjecture: ``Mathematics may not be ready
  for such problems.'' and also offered a \$500 cash prize for its
  solution. Jeffrey Lagarias, another mathematician, claimed that
  based only on known information about this problem, ``this is an
  extraordinarily difficult problem, completely out of reach of
  present day mathematics.'' There is also a
  \href{https://xkcd.com/710/}{xkcd} cartoon about this conjecture
  (click \href{https://xkcd.com/710/}{here}). If you are able to solve
  this conjecture, you will definitely get famous.}\bigskip

\noindent
\textbf{Tasks}

\begin{itemize}
\item[(1)] You are asked to implement a recursive function that
  calculates the number of steps needed until a series ends
  with $1$. In case of starting with $6$, it takes $8$ steps and in
  case of starting with $9$, it takes $19$ (see above). In order to
  try out this function with large numbers, you should use
  \texttt{Long} as argument type, instead of \texttt{Int}.  You can
  assume this function will be called with numbers between $1$ and
  $1$ Million. \hfill[2 Marks]

\item[(2)] Write a second function that takes an upper bound as
  argument and calculates the steps for all numbers in the range from
  1 up to this bound. It returns the maximum number of steps and the
  corresponding number that needs that many steps.  More precisely
  it returns a pair where the first
  component is the number of steps and the second is the
  corresponding number. \hfill\mbox{[1 Mark]}
\end{itemize}

\noindent
\textbf{Test Data:} Some test ranges are:

\begin{itemize}
\item 1 to 10 where $9$ takes 19 steps 
\item 1 to 100 where $97$ takes 118 steps,
\item 1 to 1,000 where $871$ takes 178 steps,
\item 1 to 10,000 where $6,171$ takes 261 steps,
\item 1 to 100,000 where $77,031$ takes 350 steps, 
\item 1 to 1 Million where $837,799$ takes 524 steps
  %% runs out of stack space
  %% \item[$\bullet$] $1 - 10$ million where $8,400,511$ takes 685 steps
\end{itemize}
  




\subsection*{Part 2 (3 Marks, file drumb.scala)}

A purely fictional character named Mr T.~Drumb inherited in 1978
approximately 200 Million Dollar from his father. Mr Drumb prides
himself to be a brilliant business man because nowadays it is
estimated he is 3 Billion Dollar worth (one is not sure, of course,
because Mr Drumb refuses to make his tax records public).

Since the question about Mr Drumb's business acumen remains open,
let's do a quick back-of-the-envelope calculation in Scala whether his
claim has any merit. Let's suppose we are given \$100 in 1978 and we
follow a really dumb investment strategy, namely:

\begin{itemize}
\item We blindly choose a portfolio of stocks, say some Blue-Chip stocks
  or some Real Estate stocks.
\item If some of the stocks in our portfolio are traded in January of
  a year, we invest our money in equal amounts in each of these
  stocks.  For example if we have \$100 and there are four stocks that
  are traded in our portfolio, we buy \$25 worth of stocks
  from each. Be careful to also test cases where you trade with 3 stocks, for example. 
\item Next year in January, we look at how our stocks did, liquidate
  everything, and re-invest our (hopefully) increased money in again
  the stocks from our portfolio (there might be more stocks available,
  if companies from our portfolio got listed in that year, or less if
  some companies went bust or were de-listed).
\item We do this for 40 years until January 2018 and check what would
  have become out of our \$100.
\end{itemize}

\noindent
Until Yahoo was bought by Altaba last year, historical stock market
data for such back-of-the-envelope calculations was freely available
online. Unfortunately nowadays this kind of data is more difficult to
obtain, unless you are prepared to pay extortionate prices or be
severely rate-limited.  Therefore this assignment comes with a number
of files containing CSV-lists with the historical stock prices for the
companies in our portfolios. Use these files for the following
tasks.\bigskip

\newpage
\noindent
\textbf{Tasks}

\begin{itemize}
\item[(1)] Write a function \texttt{get\_january\_data} that takes a
  stock symbol and a year as arguments. The function reads the
  corresponding CSV-file and returns the list of strings that start
  with the given year (each line in the CSV-list is of the form
  \texttt{someyear-01-someday,someprice}).\hfill[1 Mark]

\item[(2)] Write a function \texttt{get\_first\_price} that takes
  again a stock symbol and a year as arguments. It should return the
  first January price for the stock symbol in the given year. For this
  it uses the list of strings generated by
  \texttt{get\_january\_data}.  A problem is that normally a stock
  exchange is not open on 1st of January, but depending on the day of
  the week on a later day (maybe 3rd or 4th). The easiest way to solve
  this problem is to obtain the whole January data for a stock symbol
  and then select the earliest, or first, entry in this list. The
  stock price of this entry should be converted into a double.  Such a
  price might not exist, in case the company does not exist in the given
  year. For example, if you query for Google in January of 1980, then
  clearly Google did not exist yet.  Therefore you are asked to
  return a trade price with type \texttt{Option[Double]}\ldots\texttt{None}
  will be the value for when no price exists; \texttt{Some} if  there is a
  price.\hfill[1 Mark]

\item[(3)] Write a function \texttt{get\_prices} that takes a
  portfolio (a list of stock symbols), a years range and gets all the
  first trading prices for each year in the range. You should organise
  this as a list of lists of \texttt{Option[Double]}'s. The inner
  lists are for all stock symbols from the portfolio and the outer
  list for the years.  For example for Google and Apple in years 2010
  (first line), 2011 (second line) and 2012 (third line) you obtain:

\begin{verbatim}
  List(List(Some(311.349976), Some(20.544939)), 
       List(Some(300.222351), Some(31.638695)),
       List(Some(330.555054), Some(39.478039)))
\end{verbatim}\hfill[1 Mark]
\end{itemize}

\subsection*{Advanced Part 3 (4 Marks, continue in file drumb.scala)}

\noindent
\textbf{Tasks}

\begin{itemize}  
\item[(4)] Write a function that calculates the \emph{change factor} (delta)
  for how a stock price has changed from one year to the next. This is
  only well-defined, if the corresponding company has been traded in both
  years. In this case you can calculate

  \[
  \frac{price_{new} - price_{old}}{price_{old}}
  \]

  If the change factor is defined, you should return it
  as \texttt{Some(change\_factor)}; if not, you should return
  \texttt{None}.\mbox{}\hfill\mbox{[1 Mark]}
  
\item[(5)] Write a function that calculates all change factors
  (deltas) for the prices we obtained under Part 2. For the running
  example of Google and Apple for the years 2010 to 2012 you should
  obtain 4 change factors:

\begin{verbatim}  
  List(List(Some(-0.03573992567129673), Some(0.539975124774038))
       List(Some(0.10103412653643493), Some(0.24777709700099845)))
\end{verbatim}

  That means Google did a bit badly in 2010, while Apple did very well.
  Both did OK in 2011. Make sure you handle the cases where a company is
  not listed in a year. In such cases the change factor should be \texttt{None}
  (see~(4)).
  \mbox{}\hfill\mbox{[1 Mark]}

\item[(6)] Write a function that calculates the ``yield'', or
  balance, for one year for our portfolio.  This function takes the
  change factors, the starting balance and the year as arguments. If
  no company from our portfolio existed in that year, the balance is
  unchanged. Otherwise we invest in each existing company an equal
  amount of our balance. Using the change factors computed under Task
  2, calculate the new balance. Say we had \$100 in 2010, we would have
  received in our running example involving Google and Apple:

  \begin{verbatim}
  $50 * -0.03573992567129673 + $50 * 0.539975124774038
                                       = $25.21175995513706
  \end{verbatim}

  as profit for that year, and our new balance for 2011 is \$125 when
  converted to a \texttt{Long}.\mbox{}\hfill\mbox{[1 Mark]}
  
\item[(7)] Write a function that calculates the overall balance
  for a range of years where each year the yearly profit is compounded to
  the new balances and then re-invested into our portfolio.
  For this use the function and results generated under (6).\\
  \mbox{}\hfill\mbox{[1 Mark]}
\end{itemize}\medskip  



\noindent
\textbf{Test Data:} File \texttt{drumb.scala} contains two portfolios
collected from the S\&P 500, one for blue-chip companies, including
Facebook, Amazon and Baidu; and another for listed real-estate
companies, whose names I have never heard of. Following the dumb
investment strategy from 1978 until 2018 would have turned a starting
balance of \$100 into roughly \$101,589 for real estate and a whopping
\$1,587,528 for blue chips.  Note when comparing these results with your
own calculations: there might be some small rounding errors, which
when compounded lead to moderately different values.\bigskip


\noindent
\textbf{Moral:} Reflecting on our assumptions, we are over-estimating
our yield in many ways: first, who can know in 1978 about what will
turn out to be a blue chip company.  Also, since the portfolios are
chosen from the current S\&P 500, they do not include the myriad
of companies that went bust or were de-listed over the years.
So where does this leave our fictional character Mr T.~Drumb? Well, given
his inheritance, a really dumb investment strategy would have done
equally well, if not much better.\medskip

\end{document}

\newpage

This part is about web-scraping and list-processing in Scala. It uses
online data about the per-capita alcohol consumption for each country
(per year?), and a file containing the data about the population size of
each country.  From this data you are supposed to estimate how many
litres of pure alcohol are consumed worldwide.\bigskip

\noindent
\textbf{Tasks (file alcohol.scala):}

\begin{itemize}
\item[(1)] Write a function that given an URL requests a
  comma-separated value (CSV) list.  We are interested in the list
  from the following URL

\begin{center}
  \url{https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv}
\end{center}

\noindent Your function should take a string (the URL) as input, and
produce a list of strings as output, where each string is one line in
the corresponding CSV-list.  This list from the URL above should
contain 194 lines.\medskip

\noindent
Write another function that can read the file \texttt{population.csv}
from disk (the file is distributed with the assignment). This
function should take a string as argument, the file name, and again
return a list of strings corresponding to each entry in the
CSV-list. For \texttt{population.csv}, this list should contain 216
lines.\hfill[1 Mark]


\item[(2)] Unfortunately, the CSV-lists contain a lot of ``junk'' and we
  need to extract the data that interests us.  From the header of the
  alcohol list, you can see there are 5 columns
  
  \begin{center}
    \begin{tabular}{l}
      \texttt{country (name),}\\
      \texttt{beer\_servings,}\\
      \texttt{spirit\_servings,}\\
      \texttt{wine\_servings,}\\
      \texttt{total\_litres\_of\_pure\_alcohol}
    \end{tabular}  
  \end{center}

  \noindent
  Write a function that extracts the data from the first column,
  the country name, and the data from the fifth column (converted into
  a \texttt{Double}). For this go through each line of the CSV-list
  (except the first line), use the \texttt{split(",")} function to
  divide each line into an array of 5 elements. Keep the data from the
  first and fifth element in these arrays.\medskip

  \noindent
  Write another function that processes the population size list. This
  is already of the form country name and population size.\footnote{Your
    friendly lecturer already did the messy processing for you from the
  Worldbank database, see \url{https://github.com/datasets/population/tree/master/data} for the original.} Again, split the
  strings according to the commas. However, this time generate a
  \texttt{Map} from country names to population sizes.\hfill[1 Mark]

\item[(3)] In (2) you generated the data about the alcohol consumption
  per capita for each country, and also the population size for each
  country. From this generate next a sorted(!) list of the overall
  alcohol consumption for each country. The list should be sorted from
  highest alcohol consumption to lowest. The difficulty is that the
  data is scraped off from ``random'' sources on the Internet and
  annoyingly the spelling of some country names does not always agree in both
  lists. For example the alcohol list contains
  \texttt{Bosnia-Herzegovina}, while the population writes this country as
  \texttt{Bosnia and Herzegovina}. In your sorted
  overall list include only countries from the alcohol list, whose
  exact country name is also in the population size list. This means
  you can ignore countries like Bosnia-Herzegovina from the overall
  alcohol consumption. There are 177 countries where the names
  agree. The UK is ranked 10th on this list by
  consuming 671,976,864 Litres of pure alcohol each year.\medskip
  
  \noindent
  Finally, write another function that takes an integer, say
  \texttt{n}, as argument. You can assume this integer is between 0
  and 177 (the number of countries in the sorted list above).  The
  function should return a triple, where the first component is the
  sum of the alcohol consumption in all countries (on the list); the
  second component is the sum of the \texttt{n}-highest alcohol
  consumers on the list; and the third component is the percentage the
  \texttt{n}-highest alcohol consumers drink with respect to the
  the world consumption. You will see that according to our data, 164
  countries (out of 177) gobble up 100\% of the World alcohol
  consumption.\hfill\mbox{[1 Mark]}
\end{itemize}

\noindent
\textbf{Hints:} useful list functions: \texttt{.drop(n)},
\texttt{.take(n)} for dropping or taking some elements in a list,
\texttt{.getLines} for separating lines in a string;
\texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
elements in the pairs---the sorting is done from smallest to highest;
useful \texttt{Map} functions: \texttt{.toMap} converts a list of
pairs into a \texttt{Map}, \texttt{.isDefinedAt(k)} tests whether the
map is defined at that key, that is would produce a result when
called with this key; useful data functions: \texttt{Source.fromURL},
\texttt{Source.fromFile} for obtaining a webpage and reading a file.

\newpage






%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <urbanc@in.tum.de>
	Fri, 09 Nov 2018 07:30:02 +0000
changeset 201	018b9c12ee1f
parent 199	54befaf23648
child 202	f7bcb27d1940
permissions	-rw-r--r--