diff -r 3c81352ec565 -r 6e990ae2c6a3 cws/main_cw03.tex --- a/cws/main_cw03.tex Sat Oct 08 00:30:51 2022 +0100 +++ b/cws/main_cw03.tex Tue Nov 01 15:03:48 2022 +0000 @@ -119,7 +119,7 @@ % BF IDE % https://www.microsoft.com/en-us/p/brainf-ck/9nblgggzhvq5 -\section*{Main Part 3 (Scala, 6 Marks)} +\section*{Main Part 3 (Scala, 7 Marks)} \mbox{}\hfill\textit{``Java is the most distressing thing to happen to computing since MS-DOS.''}\smallskip\\ \mbox{}\hfill\textit{ --- Alan Kay, the inventor of object-oriented programming}\bigskip\medskip @@ -206,7 +206,7 @@ {\small \begin{itemize} \item[$\bullet$] \url{https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019} -\item[$\bullet$] \url{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} +\item[$\bullet$] \texttt{\href{https://web.archive.org/web/20160801163029/https://www.stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}{https://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}} \item[$\bullet$] \url{https://vimeo.com/112065252} \item[$\bullet$] \url{https://davidvgalbraith.com/how-i-fixed-atom} \end{itemize}} @@ -218,9 +218,9 @@ \subsubsection*{Tasks (file re.scala)} The file \texttt{re.scala} has already a definition for regular -expressions and also defines some handy shorthand notation for -regular expressions. The notation in this document matches up -with the code in the file as follows: +expressions and also defines some handy shorthand notation for regular +expressions. The notation in this coursework description matches up +with the code as follows: \begin{center} \begin{tabular}{rcl@{\hspace{10mm}}l} @@ -230,6 +230,7 @@ $c$ & $\mapsto$ & \texttt{CHAR(c)}\\ $\sum rs$ & $\mapsto$ & \texttt{ALTs(rs)}\\ $r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)} & \texttt{r1 | r2}\\ + $\prod rs$ & $\mapsto$ & \texttt{SEQs(rs)}\\ $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)} & \texttt{r1 $\sim$ r2}\\ $r^*$ & $\mapsto$ & \texttt{STAR(r)} & \texttt{r.\%} \end{tabular} @@ -244,13 +245,15 @@ if it is non-empty, we sometimes write $\sum\,[r_1,..., r_n]$. The binary alternative can be seen as an abbreviation, that is $r_1 + r_2 \dn \sum\,[r_1, r_2]$. As a result we can ignore the -binary version and only implement the n-ary alternative. +binary version and only implement the n-ary alternative. Similarly +the sequence regular expression is only implemented with lists and the +binary version can be obtained by defining $r_1 \cdot r_2 \dn \prod\,[r_1, r_2]$. \begin{itemize} \item[(1)] Implement a function, called \textit{nullable}, by recursion over regular expressions. This function tests whether a regular expression can match the empty string. This means given a - regular expression it either returns true or false. The function + regular expression, it either returns true or false. The function \textit{nullable} is defined as follows: @@ -260,7 +263,7 @@ $\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ $\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ $\textit{nullable}(\sum rs)$ & $\dn$ & $\exists r \in rs.\;\textit{nullable}(r)$\\ -$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ +$\textit{nullable}(\prod rs)$ & $\dn$ & $\forall r\in rs.\;\textit{nullable}(r)$\\ $\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\ \end{tabular} \end{center}~\hfill[0.5 Marks] @@ -276,130 +279,130 @@ $\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ $\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ $\textit{der}\;c\;(\sum\;[r_1,..,r_n])$ & $\dn$ & $\sum\;[\textit{der}\;c\;r_1,..,\textit{der}\;c\;r_n]$\\ -$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ - & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ - & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ +$\textit{der}\;c\;(\prod\;[])$ & $\dn$ & $\ZERO$\\ +$\textit{der}\;c\;(\prod\;r\!::\!rs)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r)$\\ + & & $\textit{then}\;(\prod\;(\textit{der}\;c\;r)\!::\!rs) + (\textit{der}\;c\;(\prod rs))$\\ + & & $\textit{else}\;(\prod\;(\textit{der}\;c\;r)\!::\! rs)$\\ $\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\ \end{tabular} \end{center} - -For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives -w.r.t.~the characters $a$, $b$ and $c$ are - -\begin{center} - \begin{tabular}{lcll} - $\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & \quad($= r'$)\\ - $\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\ - $\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$ - \end{tabular} -\end{center} - -Let $r'$ stand for the first derivative, then taking the derivatives of $r'$ -w.r.t.~the characters $a$, $b$ and $c$ gives - -\begin{center} - \begin{tabular}{lcll} - $\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\ - $\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & \quad($= r''$)\\ - $\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ - \end{tabular} -\end{center} - -One more example: Let $r''$ stand for the second derivative above, -then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$ -and $c$ gives - -\begin{center} - \begin{tabular}{lcll} - $\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\ - $\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\ - $\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ & - (is $\textit{nullable}$) - \end{tabular} -\end{center} - -Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\ \mbox{}\hfill\mbox{[1 Mark]} \item[(3)] We next want to simplify regular expressions: essentially we want to remove $\ZERO$ in regular expressions like $r + \ZERO$ and $\ZERO + r$. However, our n-ary alternative takes a list of - regular expressions as argument, we therefore need a more general - ``flatten'' function, called \texttt{flts}. This function should - analyse a list of regular expressions, say $rs$, as follows: + regular expressions as argument, and we therefore need a more general + ``denesting'' function, which deletes $\ZERO$s and ``spills out'' nested + $\sum$s. This function, called \texttt{denest}, should analyse a + list of regular expressions, say $rs$, as follows: \begin{center} \begin{tabular}{lllll} 1) &$rs = []$ & $\dn$ & $[]$ & (empty list)\\ - 2) &$rs = \ZERO :: rest$ & $\dn$ & $\textrm{flatten}\;rest$\\ - 3) &$rs = (\sum rs_1) :: rest$ & $\dn$ & $rs_1 ::: \textrm{flatten}\;rest$\\ - 4) &$rs = r :: rest$ & $\dn$ & $r :: \textrm{flatten}\;rest$ & (otherwise)\\ + 2) &$rs = \ZERO :: rest$ & $\dn$ & $\texttt{denest}\;rest$ & (throw away $\ZERO$)\\ + 3) &$rs = (\sum rs) :: rest$ & $\dn$ & $rs ::: \texttt{denest}\;rest$ & (spill out $\sum$)\\ + 4) &$rs = r :: rest$ & $\dn$ & $r :: \texttt{denest}\;rest$ & (otherwise)\\ \end{tabular} \end{center} The first clause states that empty lists cannot be further - flattened. The second removes the first $\ZERO$ from the list and recurses. + denested. The second removes the first $\ZERO$ from the list and recurses. The third is when the first regular expression is an \texttt{ALTs}, then the content of this alternative should be spilled out and appended - with the flattened rest of the list. The last case is for all other + with the denested rest of the list. The last case is for all other cases where the head of the list is not $\ZERO$ and not an \texttt{ALTs}, - then we just keep the head of the list and flatten the rest. + then we just keep the head of the list and denest the rest.\\ \mbox{}\hfill\mbox{[1 Mark]} -\item[(4)] Implement the function \textit{simp}, which recursively +\item[(4)] Implement the function \texttt{flts} which flattens our + n-ary sequence regular expression $\prod$. Like \texttt{denest}, this + function is intended to delete $\ONE$s and spill out nested $\prod$s. + Unfortunately, there is a special case to do with $\ZERO$: If this function encounters a $\ZERO$, then + the whole ``product'' should be $\ZERO$. The problem is that the $\ZERO$ can be anywhere + inside the list. The easiest way to implement this function is therefore by using an + accumulator, which when called is set to \texttt{Nil}. This means \textit{flts} takes + two arguments (which are both lists of regular expressions) + + \[ + \texttt{flts}\;rs\;acc + \] + + This function analyses the list $rs$ as follows + + \begin{center} + \begin{tabular}{@{}lllll@{}} + 1) &$rs = []$ & $\dn$ & $acc$ & (empty list)\\ + 2) &$rs = \ZERO :: rest$ & $\dn$ & $[\ZERO]$ & (special case for $\ZERO$)\\ + 3) &$rs = \ONE :: rest$ & $\dn$ & $\texttt{flts}\,rest\,acc$ & (throw away $\ONE$)\\ + 4) &$rs = (\prod rs) :: rest$ & $\dn$ & $\texttt{flts}\;rest\,(acc ::: rs)$ & (spill out $\prod$)\\ + 5) &$rs = r :: rest$ & $\dn$ & $\texttt{flts}\;rest\,(acc ::: [r])$& (otherwise)\\ + \end{tabular} + \end{center} + + In the first case we just return whatever has accumulated in + $acc$. In the fourth case we spill out the $rs$ by appending the + $rs$ to the end of the accumulator. Similarly in the last case we + append the single regular expression $r$ to the end of the + accumulator. I let you think why the ``end'' is needed. \mbox{}\hfill\mbox{[1 Mark]} + +\item[(5)] Before we can simplify regular expressions, we need what is often called + \emph{smart constructors} for $\sum$ and $\prod$. While the ``normal'' constructors + \texttt{ALTs} and \texttt{SEQs} give us alternatives and sequences, respectively, \emph{smart} + constructors might return something different depending on what list of regular expressions + they are given as argument. + + \begin{center} + \begin{tabular}{@{}c@{\hspace{9mm}}c@{}} + \begin{tabular}{l@{\hspace{2mm}}l@{\hspace{1mm}}ll} + & $\sum^{smart}$\smallskip\\ + 1) & $rs = []$ & $\dn$ & $\ZERO$\\ + 2) & $rs = [r]$ & $\dn$ & $r$\\ \\ + 3) & otherwise & $\dn$ & $\sum\,rs$ + \end{tabular} & + \begin{tabular}{l@{\hspace{2mm}}l@{\hspace{1mm}}ll} + & $\prod^{smart}$\smallskip\\ + 1) & $rs = []$ & $\dn$ & $\ONE$\\ + 2a) & $rs = [\ZERO]$ & $\dn$ & $\ZERO$\\ + 2b) & $rs = [r]$ & $\dn$ & $r$\\ + 3) & otherwise & $\dn$ & $\prod\,rs$ + \end{tabular} + \end{tabular} + \end{center} + \mbox{}\hfill\mbox{[0.5 Marks]} + +\item[(6)] Implement the function \textit{simp}, which recursively traverses a regular expression, and on the way up simplifies every regular expression on the left (see below) to the regular expression on the right, except it does not simplify inside ${}^*$-regular - expressions. + expressions and also does not simplify $\ZERO$, $\ONE$ and characters. \begin{center} -\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll} -$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ -$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ -$r \cdot \ONE$ & $\mapsto$ & $r$\\ -$\ONE \cdot r$ & $\mapsto$ & $r$\\ -$\sum\;[r_1,..,r_n]$ & $\mapsto$ & $\sum\;(\texttt{(flts + distinct)}[simp(r_1),..,simp(r_n)])$\\ + \begin{tabular}{@{}l@{\hspace{3mm}}c@{\hspace{3mm}}ll@{}} + LHS: & & RHS:\smallskip\\ +$\sum\;[r_1,..,r_n]$ & $\mapsto$ & $\sum^{smart}\;(\texttt{(denest + distinct)}[simp(r_1),..,simp(r_n)])$\\ + $\prod\;[r_1,..,r_n]$ & $\mapsto$ & $\prod^{smart}\;(\texttt{(flts)}[simp(r_1),..,simp(r_n)])$\\ + $r$ & $\mapsto$ & $r$ \quad (all other cases) \end{tabular} \end{center} -The last case is as follows: first apply $simp$ to all regular -expressions $r_1,.. ,r_n$; then flatten the resulting list using -\texttt{flts}; finally remove all duplicates in this list (this can be +The first case is as follows: first apply $simp$ to all regular +expressions $r_1,.. ,r_n$; then denest the resulting list using +\texttt{denest}; after that remove all duplicates in this list (this can be done in Scala using the function -\texttt{\_.distinct}). When you perform these - operations, you end up with three cases, namely where the list is - empty, contains a single element and ``otherwise''. These cases - should be processed as follows -\begin{center} -\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll} -$\sum\;[]$ & $\mapsto$ & $\ZERO$\\ -$\sum\;[r]$ & $\mapsto$ & $r$\\ -$\sum\;rs$ & $\mapsto$ & $\sum\;rs$ & ``otherwise''\\ -\end{tabular} -\end{center} - +\texttt{\_.distinct}). Finally, you end up with a list of (simplified) +regular expressions; apply the smart constructor $\sum^{smart}$ to this list. +Similarly in the $\prod$ case: simplify first all regular +expressions $r_1,.. ,r_n$; then flatten the resulting list using \texttt{flts} and apply the +smart constructor $\prod^{smart}$ to the result. In all other cases, just return the +input $r$ as is. For example the regular expression \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\] - simplifies to just $r_1$. \textbf{Hint:} Regular expressions can be - seen as trees and there are several methods for traversing - trees. One of them corresponds to the inside-out traversal, which is - also sometimes called post-order tra\-versal: you traverse inside - the tree and on the way up you apply simplification rules. - \textbf{Another Hint:} Remember numerical expressions from school - times---there you had expressions like - $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$ and - simplification rules that looked very similar to rules above. You - would simplify such numerical expressions by replacing for example - the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then look whether - more rules are applicable. If you regard regular expressions as - trees and if you organise the simplification in an inside-out - fashion, it is always clear which simplification should be applied - next.\mbox{}\hfill\mbox{[1 Mark]} + simplifies to just $r_1$. \mbox{}\hfill\mbox{[1 Mark]} -\item[(5)] Implement two functions: The first, called \textit{ders}, +\item[(7)] Implement two functions: The first, called \textit{ders}, takes a list of characters and a regular expression as arguments, and builds the derivative w.r.t.~the list as follows: @@ -407,7 +410,7 @@ \begin{tabular}{lcl} $\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\ $\textit{ders}\;(c::cs)\;r$ & $\dn$ & - $\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\ + $\textit{ders}\;cs\;(\textit{simp}\,(\textit{der}\;c\;r))$\\ \end{tabular} \end{center} @@ -420,9 +423,9 @@ derivative regular expression can match the empty string (using \textit{nullable}). For example the \textit{matcher} will produce true for the regular expression $(a\cdot b)\cdot c$ and the string -$abc$, but false if you give it the string $ab$. \hfill[1 Mark] +$abc$, but false if you give it the string $ab$. \hfill[0.5 Mark] -\item[(6)] Implement a function, called \textit{size}, by recursion +\item[(8)] Implement a function, called \textit{size}, by recursion over regular expressions. If a regular expression is seen as a tree, then \textit{size} should return the number of nodes in such a tree. Therefore this function is defined as follows: @@ -433,7 +436,7 @@ $\textit{size}(\ONE)$ & $\dn$ & $1$\\ $\textit{size}(c)$ & $\dn$ & $1$\\ $\textit{size}(\sum\,[r_1,..,r_n]$ & $\dn$ & $1 + \textit{size}(r_1) + ... + \textit{size}(r_n)$\\ -$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\ +$\textit{size}(\prod\,[r_1,..,r_n]$ & $\dn$ & $1 + \textit{size}(r_1) + ... + \textit{size}(r_n)$\\ $\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\ \end{tabular} \end{center} @@ -444,7 +447,7 @@ taking the derivative, but simplify the result. The sizes are given in \texttt{re.scala}. \hfill[0.5 Marks] -\item[(7)] You do not have to implement anything specific under this +\item[(9)] You do not have to implement anything specific under this task. The purpose here is that you will be marked for some ``power'' test cases. For example can your matcher decide within 30 seconds whether the regular expression $(a^*)^*\cdot b$ matches strings of the @@ -463,7 +466,7 @@ \subsection*{Background} Although easily implementable in Scala (ok maybe the \texttt{simp} functions and -\texttt{ALTs} needs a bit more thinking), the idea behind the +the constructors \texttt{ALTs}/\texttt{SEQs} needs a bit more thinking), the idea behind the derivative function might not so easy to be seen. To understand its purpose better, assume a regular expression $r$ can match strings of the form $c\!::\!cs$ (that means strings which start with a character @@ -484,11 +487,11 @@ $\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r))$, you obtain a regular expression that can match the empty string. You can test whether this is indeed the case using the function nullable, which is -what your matcher is doing. +what the matcher you have implemented is doing. The purpose of the $\textit{simp}$ function is to keep the regular expressions small. Normally the derivative function makes the regular -expression bigger (see the SEQ case and the example in (2)) and the +expression bigger (see the \texttt{SEQs} case and the example in Task (2)) and the algorithm would be slower and slower over time. The $\textit{simp}$ function counters this increase in size and the result is that the algorithm is fast throughout. By the way, this algorithm is by Janusz @@ -501,16 +504,19 @@ If you want to see how badly the regular expression matchers do in -Java\footnote{Version 8 and below; Version 9 and above does not seem to be as - catastrophic, but still much worse than the regular expression - matcher based on derivatives.}, JavaScript and Python with the -``evil'' regular expression $(a^*)^*\cdot b$, then have a look at the -graphs below (you can try it out for yourself: have a look at the files +Java\footnote{Version 8 and below; Version 9 and above does not seem + to be as catastrophic, but still much worse than the regular + expression matcher based on derivatives. BTW, Scala uses the regular + expression matcher provided by the Java libraries. So is just as bad.}, JavaScript, +Python Swift and Dart with the ``evil'' regular expression +$(a^*)^*\cdot b$, then have a look at the graphs below (you can try it +out for yourself: have a look at the files \texttt{catastrophic9.java}, \texttt{catastrophic.js}, -\texttt{catastrophic.py} etc on KEATS). Compare this with the matcher you -have implemented. How long can a string of $a$'s be in your matcher -and still stay within the 30 seconds time limit? It should be muuuch better -than your off-the-shelf matcher in your bog-standard language. +\texttt{catastrophic.py} etc on KEATS). Compare this with the matcher +you have implemented. How long can a string of $a$'s be in your +matcher and still stay within the 30 seconds time limit? It should be +mu(uu)$^*$ch better than your off-the-shelf matcher in your +bog-standard language. \begin{center} \begin{tabular}{@{}cc@{}} @@ -574,6 +580,46 @@ \end{document} + +For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives +w.r.t.~the characters $a$, $b$ and $c$ are + +\begin{center} + \begin{tabular}{lcll} + $\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & \quad($= r'$)\\ + $\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\ + $\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$ + \end{tabular} +\end{center} + +Let $r'$ stand for the first derivative, then taking the derivatives of $r'$ +w.r.t.~the characters $a$, $b$ and $c$ gives + +\begin{center} + \begin{tabular}{lcll} + $\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\ + $\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & \quad($= r''$)\\ + $\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ + \end{tabular} +\end{center} + +One more example: Let $r''$ stand for the second derivative above, +then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$ +and $c$ gives + +\begin{center} + \begin{tabular}{lcll} + $\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\ + $\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\ + $\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ & + (is $\textit{nullable}$) + \end{tabular} +\end{center} + +Note, the last derivative can match the empty string, that is it is \textit{nullable}. + + + %%% Local Variables: %%% mode: latex %%% TeX-master: t