pep-material: comparison cws/cw01.tex


-:fc3ac7b70a06
+:c50b074b3047
 \begin{document}
 \section*{Coursework 6 (Scala)}
 This coursework is about Scala and is worth 10\%. The first and second
-part are due on 15 November at 11pm, and the third part on 20 December
+parts are due on 16 November at 11pm, and the third part on 21 December
 at 11pm. You are asked to implement three programs about list
 processing and recursion. The third part is more advanced and might
-include material you have not yet seen in the lecture.
+include material you have not yet seen in the first lecture.
-Make sure the files you submit can be processed by just calling
+\bigskip
-\texttt{scala <<filename.scala>>}.\bigskip
 \IMPORTANT{}
 \noindent
 Also note that the running time of each part will be restricted to a
-maximum of 60 seconds on my laptop.
+maximum of 360 seconds on my laptop.
+\DISCLAIMER{}
-\subsection*{Disclaimer}
-It should be understood that the work you submit represents
-your own effort. You have not copied from anyone else. An
-exception is the Scala code I showed during the lectures or
-uploaded to KEATS, which you can freely use. If you want to
-look up anything from the Scala library, the link to the api
-is
-\begin{center}
-\url{http://www.scala-lang.org/api/current/}
-\end{center}
-\noindent
-I usually google for things like \texttt{scala} \texttt{library} \texttt{groupBy}
-and click on the link about the Scala Standard Library for more information.
-\bigskip
 \subsection*{Part 1 (3 Marks)}
 This part is about recursion. You are asked to implement a Scala
 As you can see, the numbers go up and down like a roller-coaster, but
 curiously they seem to always terminate in $1$. The conjecture is that
 this will \emph{always} happen for every number greater than
 0.\footnote{While it is relatively easy to test this conjecture with
 particular numbers, it is an interesting open problem to
-\emph{prove} that the conjecture is true for \emph{all} numbers
+\emph{prove} that the conjecture is true for \emph{all} numbers ($>
-($> 0$). Paul Erd\"o{}s, a famous mathematician you might have hard
+0$). Paul Erd\"o{}s, a famous mathematician you might have heard
-about, is quoted to have said about this conjecture: ``Mathematics
+about, said about this conjecture: ``Mathematics may not be ready
-is not yet ripe enough for such questions.''  and also offered a
+for such problems.'' and also offered a \$500 cash prize for its
-\$500 cash prize for its solution. Jeffrey Lagarias, another
+solution. Jeffrey Lagarias, another mathematician, claimed that
-mathematician, claimed that based only on known information about
+based only on known information about this problem, ``this is an
-this problem, ``this is an extraordinarily difficult problem,
+extraordinarily difficult problem, completely out of reach of
-completely out of reach of present day mathematics.'' There is also
+present day mathematics.'' There is also a
-a \href{https://xkcd.com/710/}{xkcd} cartoon about this conjecture
+\href{https://xkcd.com/710/}{xkcd} cartoon about this conjecture
 (click \href{https://xkcd.com/710/}{here}). If you are able to solve
 this conjecture, you will definitely get famous.}\bigskip
-\newpage
 \noindent
 \textbf{Tasks (file collatz.scala):}
 \begin{itemize}
 \item[(1)] You are asked to implement a recursive function that
 with $1$. In case of starting with $6$, it takes $9$ steps and in
 case of starting with $9$, it takes $20$ (see above). In order to
 try out this function with large numbers, you should use
 \texttt{Long} as argument type, instead of \texttt{Int}.  You can
 assume this function will be called with numbers between $1$ and
-$1$ million. \hfill[2 Marks]
+$1$ Million. \hfill[2 Marks]
 \item[(2)] Write a second function that takes an upper bound as
 argument and calculates the steps for all numbers in the range from
 1 up to this bound. It returns the maximum number of steps and the
 corresponding number that needs that many steps.  More precisely
 \item 1 to 10 where $9$ takes 20 steps
 \item 1 to 100 where $97$ takes 119 steps,
 \item 1 to 1,000 where $871$ takes 179 steps,
 \item 1 to 10,000 where $6,171$ takes 262 steps,
 \item 1 to 100,000 where $77,031$ takes 351 steps,
-\item 1 to 1 million where $837,799$ takes 525 steps
+\item 1 to 1 Million where $837,799$ takes 525 steps
 %%\item[$\bullet$] $1 - 10$ million where $8,400,511$ takes 686 steps
-\end{itemize}\bigskip
+\end{itemize}
+\noindent
+\textbf{Hints:} useful math operators: \texttt{\%} for modulo; useful
-\subsection*{Part 2 (4 Marks)}
+functions: \mbox{\texttt{(1\,to\,10)}} for ranges, \texttt{.toInt},
+\texttt{.toList} for conversions, \texttt{List(...).max} for the
-This part is about list processing---it's a variant of
+maximum of a list, \texttt{List(...).indexOf(...)} for the first index of
-``buy-low-sell-high'' in Scala. It uses the online financial data
+a value in a list.
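
\noindent
A minimal sketch of the two tasks above, not a reference solution. It assumes the usual Collatz rule (halve even numbers, map an odd $n$ to $3n + 1$) and counts steps the way the examples do, so that $6$ yields $9$ and $9$ yields $20$. The names \texttt{collatz} and \texttt{collatzMax} are illustrative only.

```scala
// Sketch only: step counting is chosen to match the stated examples,
// i.e. the starting number itself is counted (collatz(1) == 1).
def collatz(n: Long): Long =
  if (n == 1) 1
  else if (n % 2 == 0) 1 + collatz(n / 2)
  else 1 + collatz(3 * n + 1)

// Returns (maximum number of steps, number that needs that many steps)
// for all numbers from 1 up to bound (assumed to be at least 1).
def collatzMax(bound: Long): (Long, Long) =
  (1L to bound).map(n => (collatz(n), n)).maxBy(_._1)
```

For the range 1 to 10 this sketch returns the pair \texttt{(20, 9)}, in line with the first entry of the list above.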
-service from Yahoo.\bigskip
-\noindent
-\textbf{Tasks (file trade.scala):}
+\subsection*{Part 2 (3 Marks)}
-\begin{itemize}
+This part is about web-scraping and list-processing in Scala. It uses
-\item[(1)] Given a list of prices for a commodity, for example
+online data about the per-capita alcohol consumption for each country
+(per year?), and a file containing the data about the population size of
-\[
+each country.  From this data you are supposed to estimate how many
-\texttt{List(28.0, 18.0, 20.0, 26.0, 24.0)}
+litres of pure alcohol are consumed worldwide.\bigskip
-\]
+\noindent
-\noindent
+\textbf{Tasks (file alcohol.scala):}
-you need to write a function that returns a pair of indices for when
-to buy and when to sell this commodity. In the example above it should
+\begin{itemize}
-return the pair $\texttt{(1, 3)}$ because at index $1$ the price is lowest and
+\item[(1)] Write a function that given a URL requests a
-then at index $3$ the price is highest. Note the prices are given as
+comma-separated value (CSV) list.  We are interested in the list
-lists of \texttt{Double}s.\newline \mbox{} \hfill[1 Mark]
+from the following URL
-\item[(2)] Write a function that requests a comma-separated value (CSV) list
-from the Yahoo websevice that provides historical data for stock
-indices. For example if you query the URL
 \begin{center}
-\url{http://ichart.yahoo.com/table.csv?s=GOOG}
+\url{https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv}
 \end{center}
-\noindent where \texttt{GOOG} stands for Google's stock market symbol,
+\noindent Your function should take a string (the URL) as input, and
-then you will receive a CSV-list of the daily stock prices since
+produce a list of strings as output, where each string is one line in
-Google was listed. You can also try this with other stock market
+the corresponding CSV-list.  This list from the URL above should
-symbols, for instance AAPL, MSFT, IBM, FB, YHOO, AMZN, BIDU and so
+contain 194 lines.\medskip
-on.
+\noindent
-This function should return a List of strings, where each string
+Write another function that can read the file \texttt{population.csv}
-is one line in this CVS-list (representing one day's
+from disk (the file is distributed with the coursework). This
-data). Note that Yahoo generates its answer such that the newest data
+function should take a string as argument, the file name, and again
-is at the front of this list, and the oldest data is at the end.
+return a list of strings corresponding to each entry in the
-\hfill[1 Mark]
+CSV-list. For \texttt{population.csv}, this list should contain 216
+lines.\hfill[1 Mark]
-\item[(3)] As you can see, the financial data from Yahoo is organised in 7 columns,
-for example
+\item[(2)] Unfortunately, the CSV-lists contain a lot of ``junk'' and we
-{\small\begin{verbatim}
+need to extract the data that interests us.  From the header of the
-Date,Open,High,Low,Close,Volume,Adj Close
+alcohol list, you can see there are 5 columns
-2016-11-04,750.659973,770.359985,750.560974,762.02002,2126900,762.02002
-2016-11-03,767.25,769.950012,759.030029,762.130005,1914000,762.130005
+\begin{center}
-2016-11-02,778.200012,781.650024,763.450012,768.700012,1872400,768.700012
+\begin{tabular}{l}
-2016-11-01,782.890015,789.48999,775.539978,783.609985,2404500,783.609985
+\texttt{country (name),}\\
-....
+\texttt{beer\_servings,}\\
-\end{verbatim}}
+\texttt{spirit\_servings,}\\
+\texttt{wine\_servings,}\\
-\noindent
+\texttt{total\_litres\_of\_pure\_alcohol}
-Write a function that ignores the first line (the header) and then
+\end{tabular}
-extracts from each line the date (first column) and the Adjusted Close
+\end{center}
-price (last column). The Adjusted Close price should be converted into
-a \texttt{Double}. So the result of this function is a list of pairs where the
+\noindent
-first components are strings (the dates) and the second are doubles
+Write a function that extracts the data from the first column,
-(the adjusted close prices).\newline\mbox{}\hfill\mbox{[1 Mark]}
+the country name, and the data from the fifth column (converted into
+a \texttt{Double}). For this go through each line of the CSV-list
-\item[(4)] Write a function that takes a stock market symbol as
+(except the first line), use the \texttt{split(",")} function to
-argument (you can assume it is a valid one, like GOOG or AAPL). The
+divide each line into an array of 5 elements. Keep the data from the
-function calculates the \underline{dates} when you should have
+first and fifth element in these arrays.\medskip
-bought the corresponding shares (lowest price) and when you should
-have sold them (highest price).\hfill\mbox{[1 Mark]}
+\noindent
-\end{itemize}
+Write another function that processes the population size list. This
+is already of the form country name and population size.\footnote{Your
-\noindent
+friendly lecturer already did the messy processing for you from the
-\textbf{Test Data:}
+Worldbank database, see \url{https://github.com/datasets/population/tree/master/data} for the original.} Again, split the
-In case of Google, the financial data records 3077 entries starting
+strings according to the commas. However, this time generate a
-from 2004-08-19 until 2016-11-04 (which is the last entry on the day
+\texttt{Map} from country names to population sizes.\hfill[1 Mark]
-when I prepared the course work...namely on 6 November; remember stock
-markets are typically closed on weekends and no financial data is
+\item[(3)] In (2) you generated the data about the alcohol consumption
-produced then; also I did not count the header line). The lowest
+per capita for each country, and also the population size for each
-shareprice for Google was on 2004-09-03 with \$49.95513 per share and the
+country. From this generate next a sorted(!) list of the overall
-highest on 2016-10-24 with \$813.109985 per share.\bigskip
+alcohol consumption for each country. The list should be sorted from
+highest alcohol consumption to lowest. The difficulty is that the
-\subsection*{Advanced Part 3 (3 Marks)}
+data is scraped off from ``random'' sources on the Internet and
+annoyingly the spelling of some country names does not always agree in both
+lists. For example the alcohol list contains
+\texttt{Bosnia-Herzegovina}, while the population list writes this country as
+\texttt{Bosnia and Herzegovina}. In your sorted
+overall list include only countries from the alcohol list, whose
+exact country name is also in the population size list. This means
+you can ignore countries like Bosnia-Herzegovina from the overall
+alcohol consumption. There are 177 countries where the names
+agree. The UK is ranked 10th on this list by
+consuming 671,976,864 Litres of pure alcohol each year.\medskip
+\noindent
+Finally, write another function that takes an integer, say
+\texttt{n}, as argument. You can assume this integer is between 0
+and 177 (the number of countries in the sorted list above).  The
+function should return a triple, where the first component is the
+sum of the alcohol consumption in all countries (on the list); the
+second component is the sum of the \texttt{n}-highest alcohol
+consumers on the list; and the third component is the percentage the
+\texttt{n}-highest alcohol consumers drink with respect to the
+world consumption. You will see that according to our data, 164
+countries (out of 177) gobble up 100\% of the World alcohol
+consumption.\hfill\mbox{[1 Mark]}
+\end{itemize}
+\noindent
+\textbf{Hints:} useful list functions: \texttt{.drop(n)},
+\texttt{.take(n)} for dropping or taking some elements in a list,
+\texttt{.getLines} for separating lines in a string;
+\texttt{.sortBy(\_.\_2)} sorts a list of pairs according to the second
+elements in the pairs---the sorting is done from smallest to highest;
+useful \texttt{Map} functions: \texttt{.toMap} converts a list of
+pairs into a \texttt{Map}, \texttt{.isDefinedAt(k)} tests whether the
+map is defined at that key, that is, whether it would produce a result when
+called with this key; useful data functions: \texttt{Source.fromURL},
+\texttt{Source.fromFile} for obtaining a webpage and reading a file.
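
\noindent
The processing steps of Part 2 can be sketched on in-memory data, without any network or file access. The two sample lists below are made-up placeholders in the shape of the CSV-lists (they are not real figures), the presence of a header line in the population list is an assumption, and all function names are illustrative. Unlike the task description, \texttt{percentage} takes the sorted list as an explicit second argument to keep the sketch self-contained.

```scala
// Made-up sample data (NOT real figures), shaped like the two CSV-lists.
val alcoholLines = List(
  "country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol",
  "Aland,100,50,20,2.0",
  "Bland,10,5,2,0.5",
  "Cland-Dland,30,10,5,1.0")

// Assumption: the population list also starts with a header line.
val populationLines = List(
  "country,population_size",
  "Aland,1000",
  "Bland,10000",
  "Cland and Dland,500")

// Keep the first and fifth column of every line except the header.
def processAlcohol(lines: List[String]): List[(String, Double)] =
  lines.drop(1).map { line =>
    val cols = line.split(",")
    (cols(0), cols(4).toDouble)
  }

// Build a Map from country names to population sizes.
def processPopulation(lines: List[String]): Map[String, Long] =
  lines.drop(1).map { line =>
    val cols = line.split(",")
    (cols(0), cols(1).toLong)
  }.toMap

// Overall consumption per country, highest first; countries whose exact
// name is missing from the population map are skipped.
def sortedConsumption(alc: List[(String, Double)],
                      pop: Map[String, Long]): List[(String, Long)] =
  alc.collect { case (name, litres) if pop.isDefinedAt(name) =>
    (name, (litres * pop(name)).toLong)
  }.sortBy(_._2).reverse

// (total, sum of the n highest consumers, their share in percent);
// the list is assumed to be non-empty with a non-zero total.
def percentage(n: Int, sorted: List[(String, Long)]): (Long, Long, Double) = {
  val total = sorted.map(_._2).sum
  val top = sorted.take(n).map(_._2).sum
  (total, top, top.toDouble / total * 100.0)
}
```

On the placeholder data, \texttt{Cland-Dland} is dropped because its name does not match the population list exactly, which mirrors the \texttt{Bosnia-Herzegovina} case described in Task (3).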
+\newpage
+\subsection*{Advanced Part 3 (4 Marks)}
 A purely fictional character named Mr T.~Drumb inherited in 1978
 approximately 200 million dollars from his father. Mr Drumb prides
 himself on being a brilliant businessman because nowadays it is
 estimated he is worth 3 billion dollars (one is not sure, of course,
 because Mr Drumb refuses to make his tax records public).
-Since the question about Mr Drumb's business acumen remains, let's do a
+Since the question about Mr Drumb's business acumen remains open,
-quick back-of-the-envelope calculation in Scala whether his claim has
+let's do a quick back-of-the-envelope calculation in Scala to check whether his
-any merit. Let's suppose we are given \$100 in 1978 and we follow a
+claim has any merit. Let's suppose we are given \$100 in 1978 and we
-really dumb investment strategy, namely:
+follow a really dumb investment strategy, namely:
 \begin{itemize}
 \item We blindly choose a portfolio of stocks, say some Blue-Chip stocks
 or some Real Estate stocks.
 \item If some of the stocks in our portfolio are traded in January of
 from each.
 \item Next year in January, we look at how our stocks did, liquidate
 everything, and re-invest our (hopefully) increased money in again
 the stocks from our portfolio (there might be more stocks available,
 if companies from our portfolio got listed in that year, or less if
-some companies went bust or de-listed).
+some companies went bust or were de-listed).
-\item We do this for 38 years until January 2016 and check what would
+\item We do this for 39 years until January 2017 and check what would
 have become of our \$100.
-\end{itemize}\medskip
+\end{itemize}
+\noindent
+Until Yahoo was bought by Altaba this summer, historical stock market
+data for such back-of-the-envelope calculations was freely available
+online. Unfortunately, nowadays this kind of data is difficult to
+obtain, unless you are prepared to pay extortionate prices or be
+severely rate-limited.  Therefore this coursework comes with a number
+of files containing CSV-lists with the historical stock prices for the
+companies in our portfolios. Use these files for the following
+tasks.\bigskip
 \noindent
 \textbf{Tasks (file drumb.scala):}
 \begin{itemize}
-\item[(1.a)] Write a function that queries the Yahoo financial data
+\item[(1.a)] Write a function \texttt{get\_january\_data} that takes a
-service and obtains the first trade (adjusted close price) of a
+stock symbol and a year as arguments. The function reads the
-stock symbol and a year. A problem is that normally a stock exchange
+corresponding CSV-file and returns the list of strings that start
-is not open on 1st of January, but depending on the day of the week
+with the given year (each line in the CSV-list is of the form
-on a later day (maybe 3rd or 4th). The easiest way to solve this
+\texttt{year-01-someday,someprice}).
-problem is to obtain the whole January data for a stock symbol as
-CSV-list and then select the earliest entry in this list. For this
+\item[(1.b)] Write a function \texttt{get\_first\_price} that takes
-you can specify a date range with the Yahoo service. For example if
+again a stock symbol and a year as arguments. It should return the
-you want to obtain all January data for Google in 2000, you can form
+first January price for the stock symbol in the given year. For this
-the query:\mbox{}\\[-8mm]
+it uses the list of strings generated by
+\texttt{get\_january\_data}.  A problem is that normally a stock
-\begin{center}\small
+exchange is not open on 1st of January, but depending on the day of
-\mbox{\url{http://ichart.yahoo.com/table.csv?s=GOOG&a=0&b=1&c=2000&d=1&e=1&f=2000}}
+the week on a later day (maybe 3rd or 4th). The easiest way to solve
-\end{center}
+this problem is to obtain the whole January data for a stock symbol
+and then select the earliest, or first, entry in this list. The
-For other companies and years, you need to change the stock symbol
+stock price of this entry should be converted into a double.  Such a
-(\texttt{GOOG}) and the year \texttt{2000} (in the \texttt{c} and
+price might not exist, in case the company does not exist in the given
-\texttt{f} argument of the query). Such a request might fail, if the
+year. For example, if you query for Google in January of 1980, then
-company does not exist during this period. For example, if you query
+clearly Google did not exist yet.  Therefore you are asked to
-for Google in January of 1980, then clearly Google did not exists yet.
+return a trade price as \texttt{Option[Double]}\ldots\texttt{None}
-Therefore you are asked to return a trade price as
+will be the value for when no price exists.
-\texttt{Option[Double]}.
+\item[(1.c)] Write a function \texttt{get\_prices} that takes a
-\item[(1.b)] Write a function that takes a portfolio (a list of stock symbols),
+portfolio (a list of stock symbols), a range of years, and gets all the
-a years range and gets all the first trading prices for each year. You should
+first trading prices for each year in the range. You should organise
-organise this as a list of lists of \texttt{Option[Double]}'s. The inner lists
+this as a list of lists of \texttt{Option[Double]}'s. The inner
-are for all stock symbols from the portfolio and the outer list for the years.
+lists are for all stock symbols from the portfolio and the outer
-For example for Google and Apple in years 2010 (first line), 2011
+list for the years.  For example for Google and Apple in years 2010
-(second line) and 2012 (third line) you obtain:
+(first line), 2011 (second line) and 2012 (third line) you obtain:
 \begin{verbatim}
-List(List(Some(313.062468), Some(27.847252)),
+List(List(Some(311.349976), Some(27.505054)),
-List(Some(301.873641), Some(42.884065)),
+List(Some(300.222351), Some(42.357094)),
-List(Some(332.373186), Some(53.509768)))
+List(Some(330.555054), Some(52.852215)))
-\end{verbatim}\hfill[1 Mark]
+\end{verbatim}\hfill[2 Marks]
 \item[(2.a)] Write a function that calculates the \emph{change factor} (delta)
 for how a stock price has changed from one year to the next. This is
 only well-defined, if the corresponding company has been traded in both
 years. In this case you can calculate
 \[
 \frac{price_{new} - price_{old}}{price_{old}}
 \]
+If the change factor is defined, you should return it
+as \texttt{Some(change factor)}; if not, you should return
+\texttt{None}.
 \item[(2.b)] Write a function that calculates all change factors
 (deltas) for the prices we obtained under Task 1. For the running
 example of Google and Apple for the years 2010 to 2012 you should
 obtain 4 change factors:
 \begin{verbatim}
-List(List(Some(-0.03573991820699504), Some(0.5399747522663995))
+List(List(Some(-0.03573992567129673), Some(0.5399749442411563))
-List(Some(0.10103414428290529), Some(0.24777742035415723)))
+List(Some(0.10103412653643493), Some(0.2477771728154912)))
 \end{verbatim}
 That means Google did a bit badly in 2010, while Apple did very well.
-Both did OK in 2011.\hfill\mbox{[1 Mark]}
+Both did OK in 2011. Make sure you handle the cases where a company is
+not listed in a year. In such cases the change factor should be \texttt{None}
+(see 2.a).\\
+\mbox{}\hfill\mbox{[1 Mark]}
 \item[(3.a)] Write a function that calculates the ``yield'', or
 balance, for one year for our portfolio.  This function takes the
 change factors, the starting balance and the year as arguments. If
 no company from our portfolio existed in that year, the balance is
 unchanged. Otherwise we invest in each existing company an equal
 amount of our balance. Using the change factors computed under Task
 2, calculate the new balance. Say we had \$100 in 2010, we would have
-received in our running example
+received in our running example involving Google and Apple:
 \begin{verbatim}
-$50 * -0.03573991820699504 + $50 * 0.5399747522663995
+$50 * -0.03573992567129673 + $50 * 0.5399749442411563
-= $25.211741702970222
+= $25.21175092849298
 \end{verbatim}
 as profit for that year, and our new balance for 2011 is \$125 when
 converted to a \texttt{Long}.
 \item[(3.b)] Write a function that calculates the overall balance
 for a range of years where each year the yearly profit is compounded to
-the new balances and then re-invested into our portfolio.\mbox{}\hfill\mbox{[1 Mark]}
+the new balances and then re-invested into our portfolio.\\
+\mbox{}\hfill\mbox{[1 Mark]}
 \end{itemize}\medskip
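
\noindent
The change factor of Task 2.a and the one-year yield of Task 3.a can be sketched as follows. This is only one possible reading of the task text: the names \texttt{getDelta} and \texttt{yearlyYield} are illustrative, and \texttt{yearlyYield} takes just the change factors of a single year instead of the full table plus a year, which is a simplification.

```scala
// Change factor between two yearly prices; defined only if the company
// was traded in both years, otherwise None.
def getDelta(priceOld: Option[Double], priceNew: Option[Double]): Option[Double] =
  for (o <- priceOld; n <- priceNew) yield (n - o) / o

// New balance after one year: the balance is split equally among the
// companies that have a change factor; the profit is truncated to a Long.
def yearlyYield(deltas: List[Option[Double]], balance: Long): Long = {
  val existing = deltas.flatten
  if (existing.isEmpty) balance // no company existed: balance unchanged
  else {
    val share = balance.toDouble / existing.length
    balance + existing.map(_ * share).sum.toLong
  }
}
```

With the two change factors of the running example and a balance of \$100, the sketch yields \$125, as stated in Task 3.a.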
 \noindent
 \textbf{Test Data:} File \texttt{drumb.scala} contains two portfolios
 collected from the S\&P 500, one for blue-chip companies, including
-Facebook, Amazon and Baidu; and another for listed real-estate companies, whose
+Facebook, Amazon and Baidu; and another for listed real-estate
-names I have never heard of. Following the dumb investment strategy
+companies, whose names I have never heard of. Following the dumb
-from 1978 until 2016 would have turned a starting balance of \$100
+investment strategy from 1978 until 2017 would have turned a starting
-into \$23,794 for real estate and a whopping \$524,609 for blue chips.\medskip
+balance of \$100 into roughly \$30,895 for real estate and a whopping
+\$349,597 for blue chips.  Note when comparing these results with your
+own calculations: there might be some small rounding errors, which
+when compounded lead to moderately different values.\bigskip
+\noindent
+\textbf{Hints:} useful string functions: \texttt{.startsWith(...)} for
+checking whether a string has a given prefix, \texttt{\_ ++ \_} for
+concatenating two strings; useful option functions: \texttt{.flatten}
+flattens a list of options such that it filters away all
+\texttt{None}'s, \texttt{Try(...) getOrElse ...} runs some code that
+might raise an exception---if yes, then a default value can be given;
+useful list functions: \texttt{.head} for obtaining the first element
+in a non-empty list, \texttt{.length} for the length of a
+list.\bigskip
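
\noindent
The hints on \texttt{.startsWith} and \texttt{Try} combine naturally for Task 1.b. The sketch below works on an already loaded list of lines rather than reading a CSV-file, and it assumes the list is ordered with the earliest entry first; the name \texttt{firstPrice} is hypothetical. It uses \texttt{Try(...).toOption} instead of \texttt{getOrElse}, so that a missing price becomes \texttt{None}.

```scala
import scala.util.Try

// Lines are assumed to have the form "year-01-someday,someprice" and to
// be ordered with the earliest entry first.
def firstPrice(lines: List[String], year: Int): Option[Double] = {
  val january = lines.filter(_.startsWith(year.toString))
  // .head throws on an empty list and .toDouble on a malformed price;
  // Try turns either failure into None.
  Try(january.head.split(",")(1).toDouble).toOption
}
```

Calling it with an empty list, for instance for a year in which the company did not exist yet, gives \texttt{None}.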
 \noindent
 \textbf{Moral:} Reflecting on our assumptions, we are over-estimating
 our yield in many ways: first, who could have known in 1978 which
 companies would turn out to be blue chips?  Also, since the portfolios are
 chosen from the current S\&P 500, they do not include the myriad
 of companies that went bust or were de-listed over the years.
 So where does this leave our fictional character Mr T.~Drumb? Well, given
 his inheritance, a really dumb investment strategy would have done
-equally well, if not much better.
+equally well, if not much better.\medskip
-About rounding errors: \url{https://www.youtube.com/watch?v=pQs_wx8eoQ8}
-(PBS Infinity Series).
 \end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t

changeset 196	c50b074b3047
parent 195	fc3ac7b70a06
child 197	c3e39fdeea3b