cws/cw01.tex
changeset 127 b4def82f3f9f
parent 125 dcaab8068baa
child 129 b1a51285de7e
--- a/cws/cw01.tex	Sun Nov 05 12:56:55 2017 +0000
+++ b/cws/cw01.tex	Tue Nov 07 13:08:18 2017 +0000
@@ -11,24 +11,38 @@
 at 11pm. You are asked to implement three programs about list
 processing and recursion. The third part is more advanced and might
 include material you have not yet seen in the first lecture.
-Make sure the files you submit can be processed by just calling
-\texttt{scala <<filename.scala>>}.\bigskip
+\bigskip
 
 \noindent
-\textbf{Important:} Do not use any mutable data structures in your
+\textbf{Important:}
+
+\begin{itemize}
+\item Make sure the files you submit can be processed by just calling\\
+\mbox{\texttt{scala <<filename.scala>>}} on the command line.
+
+\item Do not use any mutable data structures in your
 submissions! They are not needed. This means you cannot use 
-\texttt{ListBuffer}s, for example. Do not use \texttt{return} in your
-code! It has a different meaning in Scala, than in Java.
-Do not use \texttt{var}! This declares a mutable variable. ??? Make sure the
-functions you submit are defined on the ``top-level'' of Scala, not
-inside a class or object. Also note that the running time of
-each part will be restricted to a maximum of 360 seconds on my laptop.
+\texttt{ListBuffer}s, for example. 
+
+\item Do not use \texttt{return} in your code! It has a different
+  meaning in Scala than in Java.
+
+\item Do not use \texttt{var}! This declares a mutable variable. Only
+  use \texttt{val}!
+
+\item Do not use any parallel collections! This means no \texttt{.par}!
+  Our testing and marking infrastructure is not set up for it.
+\end{itemize}
+
+\noindent
+Also note that the running time of each part will be restricted to a
+maximum of 360 seconds on my laptop.
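
The rules above (no mutable data structures, no \texttt{return}, no \texttt{var}) can be illustrated with a minimal sketch. The task, summing the squares of the even numbers in a list, and the names `sumEvenSquares`, `sumEvenSquares2` and `example` are hypothetical and not part of the coursework.

```scala
// Hypothetical example (not a coursework task): sum the squares of the
// even numbers in a list without var, return or ListBuffer.

// Recursive version: the last expression of each branch is the result,
// so no `return` is needed.
def sumEvenSquares(xs: List[Int]): Int = xs match {
  case Nil => 0
  case x :: rest =>
    if (x % 2 == 0) x * x + sumEvenSquares(rest)
    else sumEvenSquares(rest)
}

// The same computation with immutable collection operations.
def sumEvenSquares2(xs: List[Int]): Int =
  xs.filter(_ % 2 == 0).map(x => x * x).sum

// An immutable value declared with val.
val example = List(1, 2, 3, 4)
```

Both functions are defined at the top level and only use \texttt{val} and immutable lists.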
 
 
 \subsection*{Disclaimer}
 
 It should be understood that the work you submit represents
-your own effort. You have not copied from anyone else. An
+your \textbf{own} effort. You have not copied from anyone else. An
 exception is the Scala code I showed during the lectures or
 uploaded to KEATS, which you can freely use.\bigskip
 
@@ -76,7 +90,6 @@
   (click \href{https://xkcd.com/710/}{here}). If you are able to solve
   this conjecture, you will definitely get famous.}\bigskip
 
-\newpage
 \noindent
 \textbf{Tasks (file collatz.scala):}
 
@@ -88,7 +101,7 @@
   try out this function with large numbers, you should use
   \texttt{Long} as argument type, instead of \texttt{Int}.  You can
   assume this function will be called with numbers between $1$ and
-  $1$ million. \hfill[2 Marks]
+  $1$ Million. \hfill[2 Marks]
 
 \item[(2)] Write a second function that takes an upper bound as
   argument and calculates the steps for all numbers in the range from
@@ -108,91 +121,124 @@
 \item 1 to 1,000 where $871$ takes 179 steps,
 \item 1 to 10,000 where $6,171$ takes 262 steps,
 \item 1 to 100,000 where $77,031$ takes 351 steps, 
-\item 1 to 1 million where $837,799$ takes 525 steps
+\item 1 to 1 Million where $837,799$ takes 525 steps
   %%\item[$\bullet$] $1 - 10$ million where $8,400,511$ takes 686 steps
-\end{itemize}\bigskip
+\end{itemize}
   
+\noindent
+\textbf{Hints:} useful math operators: \texttt{\%} for modulo; useful
+functions: \mbox{\texttt{(1\,to\,10)}} for ranges, \texttt{.toInt},
+\texttt{.toList} for conversions, \texttt{List(...).max} for the
+maximum of a list, \texttt{List(...).indexOf(...)} for the first index of
+a value in a list.
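
The operators and functions named in the hints can be tried out on small values. The following is only a sketch on made-up data; the names `numbers`, `remainders`, `largest`, `firstIndex` and `asInt` are assumptions chosen for illustration, and it does not solve the Collatz tasks.

```scala
// A range converted into a list: List(1, 2, ..., 10)
val numbers = (1 to 10).toList

// % computes the remainder of a division: List(1, 2, 0, 1, 2, 0, 1, 2, 0, 1)
val remainders = numbers.map(_ % 3)

// .max gives the maximum of a list
val largest = remainders.max

// .indexOf gives the first index at which a value occurs
val firstIndex = remainders.indexOf(largest)

// .toInt converts a Long into an Int
val asInt = 42L.toInt
```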
 
 
-\subsection*{Part 2 (4 Marks)}
+
+\subsection*{Part 2 (3 Marks)}
 
-This part is about list processing---it's a variant of
-``buy-low-sell-high'' in Scala. It uses the online financial data
-service from Yahoo.\bigskip 
+This part is about web-scraping and list-processing in Scala. It uses
+online data about the yearly per-capita alcohol consumption for each
+country, and a file with the data about the population size of
+each country.  From this data you are supposed to estimate how many
+litres of pure alcohol are consumed worldwide.\bigskip
 
 \noindent
-\textbf{Tasks (file trade.scala):}
+\textbf{Tasks (file alcohol.scala):}
 
 \begin{itemize}
-\item[(1)] Given a list of prices for a commodity, for example
-
-\[
-\texttt{List(28.0, 18.0, 20.0, 26.0, 24.0)}
-\]
-
-\noindent
-you need to write a function that returns a pair of indices for when
-to buy and when to sell this commodity. In the example above it should
-return the pair $\texttt{(1, 3)}$ because at index $1$ the price is lowest and
-then at index $3$ the price is highest. Note the prices are given as
-lists of \texttt{Double}s.\newline \mbox{} \hfill[1 Mark]
-
-\item[(2)] Write a function that requests a comma-separated value (CSV) list
-  from the Yahoo websevice that provides historical data for stock
-  indices. For example if you query the URL
+\item[(1)] Write a function that, given a URL, requests a
+  comma-separated value (CSV) list.  We are interested in the list
+  from the following URL
 
 \begin{center}
-\url{http://ichart.yahoo.com/table.csv?s=GOOG}
+  \url{https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv}
 \end{center}
 
-\noindent where \texttt{GOOG} stands for Google's stock market symbol,
-then you will receive a CSV-list of the daily stock prices since
-Google was listed. You can also try this with other stock market
-symbols, for instance AAPL, MSFT, IBM, FB, YHOO, AMZN, BIDU and so
-on. 
-
-This function should return a List of strings, where each string
-is one line in this CVS-list (representing one day's
-data). Note that Yahoo generates its answer such that the newest data
-is at the front of this list, and the oldest data is at the end.
-\hfill[1 Mark]
-
-\item[(3)] As you can see, the financial data from Yahoo is organised in 7 columns,
-for example
-
-{\small\begin{verbatim}
-Date,Open,High,Low,Close,Volume,Adj Close
-2016-11-04,750.659973,770.359985,750.560974,762.02002,2126900,762.02002
-2016-11-03,767.25,769.950012,759.030029,762.130005,1914000,762.130005
-2016-11-02,778.200012,781.650024,763.450012,768.700012,1872400,768.700012
-2016-11-01,782.890015,789.48999,775.539978,783.609985,2404500,783.609985
-....
-\end{verbatim}}
+\noindent Your function should take a string (the URL) as input, and
+produce a list of strings as output, where each string is one line in
+the corresponding CSV-list.  This list should contain 194 lines.\medskip
 
 \noindent
-Write a function that ignores the first line (the header) and then
-extracts from each line the date (first column) and the Adjusted Close
-price (last column). The Adjusted Close price should be converted into
-a \texttt{Double}. So the result of this function is a list of pairs where the
-first components are strings (the dates) and the second are doubles
-(the adjusted close prices).\newline\mbox{}\hfill\mbox{[1 Mark]}
+Write another function that can read the file \texttt{population.csv}
+from disk (the file is distributed with the coursework). This
+function should take a string as argument, the file name, and again
+return a list of strings corresponding to each entry in the
+CSV-list. For \texttt{population.csv}, this list should contain 216
+lines.\hfill[1 Mark]
+
+
+\item[(2)] Unfortunately, the CSV-lists contain a lot of ``junk'' and we
+  need to extract the data that interests us.  From the header of the
+  alcohol list, you can see there are 5 columns
+  
+  \begin{center}
+    \begin{tabular}{l}
+      \texttt{country (name),}\\
+      \texttt{beer\_servings,}\\
+      \texttt{spirit\_servings,}\\
+      \texttt{wine\_servings,}\\
+      \texttt{total\_litres\_of\_pure\_alcohol}
+    \end{tabular}  
+  \end{center}
+
+  \noindent
+  Write a function that extracts the data from the first column,
+  the country name, and the data from the fifth column (converted into
+  a \texttt{Double}). For this, go through each line of the CSV-list
+  (except the first line) and use the \texttt{split(",")} function to
+  divide each line into an array of 5 elements. Keep the data from the
+  first and fifth elements of these arrays.\medskip
 
-\item[(4)] Write a function that takes a stock market symbol as
-  argument (you can assume it is a valid one, like GOOG or AAPL). The
-  function calculates the \underline{dates} when you should have
-  bought the corresponding shares (lowest price) and when you should
-  have sold them (highest price).\hfill\mbox{[1 Mark]}
+  \noindent
+  Write another function that processes the population size list. This
+  is already of the form country name and population size.\footnote{Your
+    friendly lecturer already did the messy processing for you from the
+  World Bank database, see \url{https://github.com/datasets/population/tree/master/data}.} Again, split the
+  strings according to the commas. However, this time generate a
+  \texttt{Map} from country names to population sizes.\hfill[1 Mark]
+
+\item[(3)] In (2) you generated the data about the alcohol consumption
+  per capita for each country, and also the population size for each
+  country. From this, generate a sorted(!) list of the overall
+  alcohol consumption for each country. The list should be sorted from
+  highest alcohol consumption to lowest. The difficulty is that the
+  data is scraped from ``random'' sources on the Internet and,
+  annoyingly, the spelling of some country names does not always agree
+  between the lists. For example the alcohol list contains
+  \texttt{Bosnia-Herzegovina}, while the population list writes this country as
+  \texttt{Bosnia and Herzegovina}. In your sorted
+  overall list include only countries from the alcohol list whose
+  exact country name is also in the population size list. This means
+  you can ignore countries like Bosnia-Herzegovina in the overall
+  alcohol consumption. There are 177 countries where the names
+  agree. The UK is ranked 10th on this list,
+  consuming 671,976,864 litres of pure alcohol each year.\medskip
+  
+  \noindent
+  Finally, write another function that takes an integer, say
+  \texttt{n}, as argument. You can assume this integer is between 0
+  and 177.  The function should use the sorted list from above.  It returns
+  a triple, where the first component is the sum of the alcohol
+  consumption in all countries (on the list); the second component is
+  the sum of the \texttt{n}-highest alcohol consumers on the list; and
+  the third component is the percentage of the world consumption that
+  the \texttt{n}-highest alcohol consumers feast on. You will
+  see that according to our data, 164 countries (out of 177) gobble up 100\%
+  of the world alcohol consumption.\hfill\mbox{[1 Mark]}
 \end{itemize}
 
 \noindent
-\textbf{Test Data:}
-In case of Google, the financial data records 3077 entries starting
-from 2004-08-19 until 2016-11-04 (which is the last entry on the day
-when I prepared the course work...namely on 6 November; remember stock
-markets are typically closed on weekends and no financial data is
-produced then; also I did not count the header line). The lowest
-shareprice for Google was on 2004-09-03 with \$49.95513 per share and the
-highest on 2016-10-24 with \$813.109985 per share.\bigskip
+\textbf{Hints:} useful list functions: \texttt{.drop(n)},
+\texttt{.take(n)} for dropping or taking some elements in a list,
+\texttt{.getLines} for separating lines in a string;
+\texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
+elements in the pairs (the sorting is done from smallest to largest);
+useful \texttt{Map} functions: \texttt{.toMap} converts a list of
+pairs into a \texttt{Map}, \texttt{.isDefinedAt(k)} tests whether the
+map is defined at that key, that is, whether it would produce a result
+when called with this key.
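
The list and \texttt{Map} functions from the hints can be combined as in the following sketch. It works on a made-up three-line CSV text rather than on the coursework data; the names `text`, `lines`, `pairs`, `ascending`, `descending`, `topTwo` and `scores` are assumptions for illustration. Here `getLines()` is called on a `scala.io.Source` built from a string.

```scala
import scala.io.Source

// Made-up CSV text standing in for downloaded data (an assumption).
val text = "name,score\nalice,3.5\nbob,1.0\ncarol,2.25"

// getLines() separates the lines of a Source.
val lines = Source.fromString(text).getLines().toList

// Drop the header, split each line at the commas and keep a pair
// consisting of the name and the score converted into a Double.
val pairs = lines.drop(1).map { line =>
  val cols = line.split(",")
  (cols(0), cols(1).toDouble)
}

// sortBy sorts from smallest to largest; reverse puts the largest first.
val ascending = pairs.sortBy(_._2)
val descending = ascending.reverse

// take(n) keeps the first n elements.
val topTwo = descending.take(2)

// toMap converts the list of pairs into a Map; isDefinedAt tests a key.
val scores = pairs.toMap
val hasBob = scores.isDefinedAt("bob")
val hasDave = scores.isDefinedAt("dave")
```

Sorting in ascending order and then reversing is only one way to obtain a list ordered from highest to lowest.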
+
+\newpage
 
 \subsection*{Advanced Part 3 (3 Marks)}
 
@@ -202,10 +248,10 @@
 estimated he is worth 3 billion dollars (one is not sure, of course,
 because Mr Drumb refuses to make his tax records public).
 
-Since the question about Mr Drumb's business acumen remains, let's do a
-quick back-of-the-envelope calculation in Scala whether his claim has
-any merit. Let's suppose we are given \$100 in 1978 and we follow a
-really dumb investment strategy, namely:
+Since the question about Mr Drumb's business acumen remains open,
+let's do a quick back-of-the-envelope calculation in Scala to see whether his
+claim has any merit. Let's suppose we are given \$100 in 1978 and we
+follow a really dumb investment strategy, namely:
 
 \begin{itemize}
 \item We blindly choose a portfolio of stocks, say some Blue-Chip stocks
@@ -220,9 +266,12 @@
   the stocks from our portfolio (there might be more stocks available,
   if companies from our portfolio got listed in that year, or less if
   some companies went bust or de-listed).
-\item We do this for 38 years until January 2016 and check what would
+\item We do this for 38 years until January 2017 and check what would
   have become of our \$100.
-\end{itemize}\medskip  
+\end{itemize}
+
+
+\medskip  
 
 \noindent
 \textbf{Tasks (file drumb.scala):}