\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{disclaimer}\usepackage{tikz}\usepackage{pgf}\usepackage{pgfplots}\usepackage{stackengine}%% \usepackage{accents}\newcommand\barbelow[1]{\stackunder[1.2pt]{#1}{\raisebox{-4mm}{\boldmath$\uparrow$}}}\begin{filecontents}{re-python2.data}1 0.0335 0.03610 0.03415 0.03618 0.05919 0.084 20 0.14121 0.24822 0.48523 0.87824 1.7125 3.4026 7.0827 14.1228 26.69\end{filecontents}\begin{filecontents}{re-java.data}5 0.0029810 0.0041815 0.0099616 0.0171017 0.0349218 0.0330319 0.0508420 0.1017721 0.1996022 0.4115923 0.8223424 1.7025125 3.3611226 6.6399827 13.3512028 29.81185\end{filecontents}\begin{filecontents}{re-js.data}5 0.06110 0.06115 0.06120 0.07023 0.13125 0.30826 0.56428 1.99430 7.64831 15.881 32 32.190\end{filecontents}\begin{filecontents}{re-java9.data}1000 0.014102000 0.048823000 0.106094000 0.174565000 0.275306000 0.411167000 0.537418000 0.702619000 0.9398110000 0.9741911000 1.2869712000 1.5138714000 2.0707916000 2.6984620000 4.4182324000 6.4607726000 7.6437330000 9.9944634000 12.96688538000 16.28162142000 19.18022846000 21.98472150000 26.95020360000 43.0327746\end{filecontents}\begin{document}% BF IDE% https://www.microsoft.com/en-us/p/brainf-ck/9nblgggzhvq5\section*{Coursework 9 (Scala)}This coursework is worth 10\%. It is about a regular expressionmatcher and the shunting yard algorithm by Dijkstra. The first part isdue on \cwNINE{} at 11pm; the second, more advanced part, is due on\cwNINEa{} at 11pm. In the first part, you are asked to implement aregular expression matcher based on derivatives of regularexpressions. The background is that ``out-of-the-box'' regularexpression matching in mainstream languages like Java, JavaScript andPython can sometimes be excruciatingly slow. You are supposed to implementan regular expression macther that is much, much faster. The advanced part isabout the shunting yard algorithm that transforms the usual infixnotation of arithmetic expressions into the postfix notation, which isfor example used in compilers.\bigskip\IMPORTANT{}\noindentAlso note that the running time of each part will be restricted to amaximum of 30 seconds on my laptop.\DISCLAIMER{}\subsection*{Reference Implementation}This Scala assignment comes with three reference implementations in form of\texttt{jar}-files you can download from KEATS. This allows you to run anytest cases on your owncomputer. For example you can call Scala on the command line with theoption \texttt{-cp re.jar} and then query any function from the\texttt{re.scala} template file. As usual you have toprefix the calls with \texttt{CW9a}, \texttt{CW9b} and \texttt{CW9c}.Since some tasks are time sensitive, you can check the referenceimplementation as follows: if you want to know, for example, how long it takesto match strings of $a$'s using the regular expression $(a^*)^*\cdot b$you can query as follows:\begin{lstlisting}[xleftmargin=1mm,numbers=none,basicstyle=\ttfamily\small]$ scala -cp re.jarscala> import CW9a._ scala> for (i <- 0 to 5000000 by 500000) { | println(i + " " + "%.5f".format(time_needed(2, matcher(EVIL, "a" * i))) + "secs.") | }0 0.00002 secs.500000 0.10608 secs.1000000 0.22286 secs.1500000 0.35982 secs.2000000 0.45828 secs.2500000 0.59558 secs.3000000 0.73191 secs.3500000 0.83499 secs.4000000 0.99149 secs.4500000 1.15395 secs.5000000 1.29659 secs.\end{lstlisting}%$\subsection*{Part 1 (6 Marks)}The task is to implement a regular expression matcher that is based onderivatives of regular expressions. Most of the functions are defined byrecursion over regular expressions and can be elegantly implementedusing Scala's pattern-matching. The implementation should deal with thefollowing regular expressions, which have been predefined in the file\texttt{re.scala}:\begin{center}\begin{tabular}{lcll} $r$ & $::=$ & $\ZERO$ & cannot match anything\\ & $|$ & $\ONE$ & can only match the empty string\\ & $|$ & $c$ & can match a single character (in this case $c$)\\ & $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\ & $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\ & & & then the second part with $r_2$\\ & $|$ & $r^*$ & can match a string with zero or more copies of $r$\\\end{tabular}\end{center}\noindent Why? Regular expressions areone of the simplest ways to match patterns in text, andare endlessly useful for searching, editing and analysing data in allsorts of places (for example analysing network traffic in order todetect security breaches). However, you need to be fast, otherwise youwill stumble over problems such as recently reported at{\small\begin{itemize}\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}\item[$\bullet$] \url{https://vimeo.com/112065252}\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/} \end{itemize}}% Knowing how to match regular expressions and strings will let you% solve a lot of problems that vex other humans.\subsubsection*{Tasks (file re.scala)}The file \texttt{re.scala} has already a definition for regularexpressions and also defines some handy shorthand notation forregular expressions. The notation in this document matches upwith the code in the file as follows:\begin{center} \begin{tabular}{rcl@{\hspace{10mm}}l} & & code: & shorthand:\smallskip \\ $\ZERO$ & $\mapsto$ & \texttt{ZERO}\\ $\ONE$ & $\mapsto$ & \texttt{ONE}\\ $c$ & $\mapsto$ & \texttt{CHAR(c)}\\ $r_1 + r_2$ & $\mapsto$ & \texttt{ALT(r1, r2)} & \texttt{r1 | r2}\\ $r_1 \cdot r_2$ & $\mapsto$ & \texttt{SEQ(r1, r2)} & \texttt{r1 $\sim$ r2}\\ $r^*$ & $\mapsto$ & \texttt{STAR(r)} & \texttt{r.\%}\end{tabular} \end{center} \begin{itemize}\item[(1)] Implement a function, called \textit{nullable}, by recursion over regular expressions. This function tests whether a regular expression can match the empty string. This means given a regular expression it either returns true or false. The function \textit{nullable} is defined as follows:\begin{center}\begin{tabular}{lcl}$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\\end{tabular}\end{center}~\hfill[1 Mark]\item[(2)] Implement a function, called \textit{der}, by recursion over regular expressions. It takes a character and a regular expression as arguments and calculates the derivative of a xregular expression according to the rules:\begin{center}\begin{tabular}{lcl}$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\\end{tabular}\end{center}For example given the regular expression $r = (a \cdot b) \cdot c$, the derivativesw.r.t.~the characters $a$, $b$ and $c$ are\begin{center} \begin{tabular}{lcll} $\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & \quad($= r'$)\\ $\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\ $\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$ \end{tabular}\end{center}Let $r'$ stand for the first derivative, then taking the derivatives of $r'$w.r.t.~the characters $a$, $b$ and $c$ gives\begin{center} \begin{tabular}{lcll} $\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\ $\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & \quad($= r''$)\\ $\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \end{tabular}\end{center}One more example: Let $r''$ stand for the second derivative above,then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$and $c$ gives\begin{center} \begin{tabular}{lcll} $\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\ $\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\ $\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ & (is $\textit{nullable}$) \end{tabular}\end{center}Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\\mbox{}\hfill\mbox{[1 Mark]}\item[(3)] Implement the function \textit{simp}, which recursively traverses a regular expression, and on the way up simplifies every regular expression on the left (see below) to the regular expression on the right, except it does not simplify inside ${}^*$-regular expressions. \begin{center}\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ $r \cdot \ONE$ & $\mapsto$ & $r$\\ $\ONE \cdot r$ & $\mapsto$ & $r$\\ $r + \ZERO$ & $\mapsto$ & $r$\\ $\ZERO + r$ & $\mapsto$ & $r$\\ $r + r$ & $\mapsto$ & $r$\\ \end{tabular} \end{center} For example the regular expression \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\] simplifies to just $r_1$. \textbf{Hint:} Regular expressions can be seen as trees and there are several methods for traversing trees. One of them corresponds to the inside-out traversal, which is also sometimes called post-order tra\-versal: you traverse inside the tree and on the way up you apply simplification rules. \textbf{Another Hint:} Remember numerical expressions from school times---there you had expressions like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$ and simplification rules that looked very similar to rules above. You would simplify such numerical expressions by replacing for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then look whether more rules are applicable. If you organise the simplification in an inside-out fashion, it is always clear which simplification should be applied next.\hfill[1 Mark]\item[(4)] Implement two functions: The first, called \textit{ders}, takes a list of characters and a regular expression as arguments, and builds the derivative w.r.t.~the list as follows:\begin{center}\begin{tabular}{lcl}$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\ $\textit{ders}\;(c::cs)\;r$ & $\dn$ & $\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\\end{tabular}\end{center}Note that this function is different from \textit{der}, which onlytakes a single character.The second function, called \textit{matcher}, takes a string and aregular expression as arguments. It builds first the derivativesaccording to \textit{ders} and after that tests whether the resultingderivative regular expression can match the empty string (using\textit{nullable}). For example the \textit{matcher} will producetrue for the regular expression $(a\cdot b)\cdot c$ and the string$abc$, but false if you give it the string $ab$. \hfill[1 Mark]\item[(5)] Implement a function, called \textit{size}, by recursion over regular expressions. If a regular expression is seen as a tree, then \textit{size} should return the number of nodes in such a tree. Therefore this function is defined as follows:\begin{center}\begin{tabular}{lcl}$\textit{size}(\ZERO)$ & $\dn$ & $1$\\$\textit{size}(\ONE)$ & $\dn$ & $1$\\$\textit{size}(c)$ & $\dn$ & $1$\\$\textit{size}(r_1 + r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\$\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\\end{tabular}\end{center}You can use \textit{size} in order to test how much the ``evil'' regularexpression $(a^*)^* \cdot b$ grows when taking successive derivativesaccording the letter $a$ without simplification and then compare it totaking the derivative, but simplify the result. The sizesare given in \texttt{re.scala}. \hfill[1 Mark]\item[(6)] You do not have to implement anything specific under this task. The purpose here is that you will be marked for some ``power'' test cases. For example can your matcher decide within 30 seconds whether the regular expression $(a^*)^*\cdot b$ matches strings of the form $aaa\ldots{}aaaa$, for say 1 Million $a$'s. And does simplification simplify the regular expression \[ \texttt{SEQ(SEQ(SEQ(..., ONE | ONE) , ONE | ONE), ONE | ONE)} \] \noindent correctly to just \texttt{ONE}, where \texttt{SEQ} is nested 50 or more times?\\ \mbox{}\hfill[1 Mark]\end{itemize}\subsection*{Background}Although easily implementable in Scala, the idea behind the derivativefunction might not so easy to be seen. To understand its purposebetter, assume a regular expression $r$ can match strings of the form$c\!::\!cs$ (that means strings which start with a character $c$ and havesome rest, or tail, $cs$). If you take the derivative of $r$ withrespect to the character $c$, then you obtain a regular expressionthat can match all the strings $cs$. In other words, the regularexpression $\textit{der}\;c\;r$ can match the same strings $c\!::\!cs$that can be matched by $r$, except that the $c$ is chopped off.Assume now $r$ can match the string $abc$. If you take the derivativeaccording to $a$ then you obtain a regular expression that can match$bc$ (it is $abc$ where the $a$ has been chopped off). If you nowbuild the derivative $\textit{der}\;b\;(\textit{der}\;a\;r)$ youobtain a regular expression that can match the string $c$ (it is $bc$where $b$ is chopped off). If you finally build the derivative of thisaccording $c$, that is$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r))$, you obtaina regular expression that can match the empty string. You can testwhether this is indeed the case using the function nullable, which iswhat your matcher is doing.The purpose of the $\textit{simp}$ function is to keep the regularexpressions small. Normally the derivative function makes the regularexpression bigger (see the SEQ case and the example in (2)) and thealgorithm would be slower and slower over time. The $\textit{simp}$function counters this increase in size and the result is that thealgorithm is fast throughout. By the way, this algorithm is by JanuszBrzozowski who came up with the idea of derivatives in 1964 in his PhDthesis.\begin{center}\small\url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}\end{center}If you want to see how badly the regular expression matchers do inJava\footnote{Version 8 and below; Version 9 and above does not seem to be as catastrophic, but still much worse than the regular expression matcher based on derivatives.}, JavaScript and Python with the`evil' regular expression $(a^*)^*\cdot b$, then have a look at thegraphs below (you can try it out for yourself: have a look at the file\texttt{catastrophic9.java}, \texttt{catastrophic.js} and\texttt{catastrophic.py} on KEATS). Compare this with the matcher youhave implemented. How long can the string of $a$'s be in your matcherand still stay within the 30 seconds time limit?\begin{center}\begin{tabular}{@{}cc@{}}\multicolumn{2}{c}{Graph: $(a^*)^*\cdot b$ and strings $\underbrace{a\ldots a}_{n}$}\bigskip\\\begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, y label style={at={(0.06,0.5)}}, enlargelimits=false, xtick={0,5,...,30}, xmax=33, ymax=45, ytick={0,5,...,40}, scaled ticks=false, axis lines=left, width=6cm, height=5.5cm, legend entries={Python, Java 8, JavaScript}, legend pos=north west, legend cell align=left]\addplot[blue,mark=*, mark options={fill=white}] table {re-python2.data};\addplot[cyan,mark=*, mark options={fill=white}] table {re-java.data};\addplot[red,mark=*, mark options={fill=white}] table {re-js.data};\end{axis}\end{tikzpicture} & \begin{tikzpicture}\begin{axis}[ xlabel={$n$}, x label style={at={(1.05,0.0)}}, ylabel={time in secs}, y label style={at={(0.06,0.5)}}, %enlargelimits=false, %xtick={0,5000,...,30000}, xmax=65000, ymax=45, ytick={0,5,...,40}, scaled ticks=false, axis lines=left, width=6cm, height=5.5cm, legend entries={Java 9}, legend pos=north west]\addplot[cyan,mark=*, mark options={fill=white}] table {re-java9.data};\end{axis}\end{tikzpicture}\end{tabular} \end{center}\newpage\subsection*{Part 2 (4 Marks)}The \emph{Shunting Yard Algorithm} has been developed by Edsger Dijkstra,an influential computer scientist who developed many well-knownalgorithms. This algorithm transforms the usual infix notation ofarithmetic expressions into the postfix notation, sometimes alsocalled reverse Polish notation.Why on Earth do people use the postfix notation? It is much moreconvenient to work with the usual infix notation for arithmeticexpressions. Most modern calculators (as opposed to the ones used 20years ago) understand infix notation. So why on Earth? \ldots{}Well,many computers under the hood, even nowadays, use postfix notationextensively. For example if you give to the Java compiler theexpression $1 + ((2 * 3) + (4 - 3))$, it will generate the Java Bytecode\begin{lstlisting}[language=JVMIS,numbers=none]ldc 1 ldc 2 ldc 3 imul ldc 4 ldc 3 isub iadd iadd\end{lstlisting}\noindentwhere the command \texttt{ldc} loads a constant onto the stack, and \texttt{imul},\texttt{isub} and \texttt{iadd} are commands acting on the stack. Clearly thisis the arithmetic expression in postfix notation.\bigskip\noindentThe shunting yard algorithm processes an input token list using anoperator stack and an output list. The input consists of numbers,operators ($+$, $-$, $*$, $/$) and parentheses, and for the purpose ofthe assignment we assume the input is always a well-formed expressionin infix notation. The calculation in the shunting yard algorithm usesinformation about theprecedences of the operators (given in the template file). Thealgorithm processes the input token list as follows:\begin{itemize}\item If there is a number as input token, then this token is transferred directly to the output list. Then the rest of the input is processed.\item If there is an operator as input token, then you need to check what is on top of the operator stack. If there are operators with a higher or equal precedence, these operators are first popped off from the stack and moved to the output list. Then the operator from the input is pushed onto the stack and the rest of the input is processed.\item If the input is a left-parenthesis, you push it on to the stack and continue processing the input.\item If the input is a right-parenthesis, then you pop off all operators from the stack to the output list until you reach the left-parenthesis. Then you discharge the $($ and $)$ from the input and stack, and continue processing the input list.\item If the input is empty, then you move all remaining operators from the stack to the output list. \end{itemize} \subsubsection*{Tasks (file postfix.scala)}\begin{itemize}\item[(7)] Implement the shunting yard algorithm described above. The function, called \texttt{syard}, takes a list of tokens as first argument. The second and third arguments are the stack and output list represented as Scala lists. The most convenient way to implement this algorithm is to analyse what the input list, stack and output list look like in each step using pattern-matching. The algorithm transforms for example the input \[ \texttt{List(3, +, 4, *, (, 2, -, 1, ))} \] into the postfix output \[ \texttt{List(3, 4, 2, 1, -, *, +)} \] You can assume the input list is always a list representing a well-formed infix arithmetic expression.\hfill[1 Mark]\item[(8)] Implement a compute function that takes a postfix expression as argument and evaluates it generating an integer as result. It uses a stack to evaluate the postfix expression. The operators $+$, $-$, $*$ are as usual; $/$ is division on integers, for example $7 / 3 = 2$. \hfill[1 Mark]\end{itemize}\subsubsection*{Task (file postfix2.scala)}\begin{itemize}\item[(9)] Extend the code in (7) and (8) to include the power operator. This requires proper account of associativity of the operators. The power operator is right-associative, whereas the other operators are left-associative. Left-associative operators are popped off if the precedence is bigger or equal, while right-associative operators are only popped off if the precedence is bigger. The compute function in this task should use \texttt{Long}s, rather than \texttt{Int}s.\hfill[2 Marks]\end{itemize}\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: