\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
+ −
\begin{document}+ −
+ −
\section*{Coursework 8 (Scala, Regular Expressions)}+ −
+ −
This coursework is worth 10\%. It is about regular expressions,+ −
pattern matching and polymorphism. The first part is due on 30+ −
November at 11pm; the second, more advanced part, is due on 21+ −
December at 11pm. You are asked to implement a regular expression+ −
matcher based on derivatives of regular expressions. The reason is+ −
that regular expression matching in Java can be extremely slow+ −
sometimes.\bigskip+ −
+ −
\noindent+ −
\textbf{Important:}+ −
+ −
\begin{itemize}+ −
\item Make sure the files you submit can be processed by just calling\\+ −
\mbox{\texttt{scala <<filename.scala>>}} on the commandline. Use the+ −
template files provided and do not make any changes to arguments of+ −
functions or to any types. You are free to implement any auxiliary+ −
function you might need.+ −
+ −
\item Do not use any mutable data structures in your+ −
submissions! They are not needed. This means you cannot create new + −
\texttt{Array}s or \texttt{ListBuffer}s, for example. + −
+ −
\item Do not use \texttt{return} in your code! It has a different+ −
meaning in Scala, than in Java.+ −
+ −
\item Do not use \texttt{var}! This declares a mutable variable. Only+ −
use \texttt{val}!+ −
+ −
\item Do not use any parallel collections! No \texttt{.par} therefore!+ −
Our testing and marking infrastructure is not set up for it.+ −
\end{itemize}+ −
+ −
\noindent+ −
Also note that the running time of each part will be restricted to a+ −
maximum of 360 seconds on my laptop+ −
+ −
+ −
\subsection*{Disclaimer}+ −
+ −
It should be understood that the work you submit represents+ −
your own effort! You have not copied from anyone else. An+ −
exception is the Scala code I showed during the lectures or+ −
uploaded to KEATS, which you can freely use.\bigskip+ −
+ −
+ −
\subsection*{Part 1 (6 Marks)}+ −
+ −
The task is to implement a regular expression matcher that is based on+ −
derivatives of regular expressions. The implementation can deal+ −
with the following regular expressions, which have been predefined+ −
in the file re.scala:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcll}+ −
$r$ & $::=$ & $\ZERO$ & cannot match anything\\+ −
& $|$ & $\ONE$ & can only match the empty string\\+ −
& $|$ & $c$ & can match a character (in this case $c$)\\+ −
& $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\+ −
& $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\+ −
& & & then the second part with $r_2$\\+ −
& $|$ & $r^*$ & can match zero or more times $r$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent + −
Why? Knowing how to match regular expressions and strings will let you+ −
solve a lot of problems that vex other humans. Regular expressions are+ −
one of the fastest and simplest ways to match patterns in text, and+ −
are endlessly useful for searching, editing and analysing data in all+ −
sorts of places (for example analysing network traffic in order to+ −
detect security breaches). However, you need to be fast, otherwise you+ −
will stumble over problems such as recently reported at+ −
+ −
{\small+ −
\begin{itemize}+ −
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}+ −
\item[$\bullet$] \url{https://vimeo.com/112065252}+ −
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/} + −
\end{itemize}}+ −
+ −
\subsubsection*{Tasks (file re.scala)}+ −
+ −
\begin{itemize}+ −
\item[(1a)] Implement a function, called \textit{nullable}, by+ −
recursion over regular expressions. This function tests whether a+ −
regular expression can match the empty string, that is given a+ −
regular expression it either returns true or false.+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\+ −
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\+ −
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\+ −
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\+ −
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\+ −
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\+ −
\end{tabular}+ −
\end{center}\hfill[1 Mark]+ −
+ −
\item[(1b)] Implement a function, called \textit{der}, by recursion over+ −
regular expressions. It takes a character and a regular expression+ −
as arguments and calculates the derivative regular expression according+ −
to the rules:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\+ −
$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\+ −
$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\+ −
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\+ −
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\+ −
& & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\+ −
& & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\+ −
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives+ −
w.r.t.~the characters $a$, $b$ and $c$ are+ −
+ −
\begin{center}+ −
\begin{tabular}{lcll}+ −
$\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & ($= r'$)\\+ −
$\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\+ −
$\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$+ −
\end{tabular}+ −
\end{center}+ −
+ −
Let $r'$ stand for the first derivative, then taking the derivatives of $r'$+ −
w.r.t.~the characters $a$, $b$ and $c$ gives+ −
+ −
\begin{center}+ −
\begin{tabular}{lcll}+ −
$\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\+ −
$\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & ($= r''$)\\+ −
$\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$+ −
\end{tabular}+ −
\end{center}+ −
+ −
One more example: Let $r''$ stand for the second derivative above,+ −
then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$+ −
and $c$ gives+ −
+ −
\begin{center}+ −
\begin{tabular}{lcll}+ −
$\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\+ −
$\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\+ −
$\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ &+ −
(is $\textit{nullable}$) + −
\end{tabular}+ −
\end{center}+ −
+ −
Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\+ −
\mbox{}\hfill\mbox{[1 Mark]}+ −
+ −
\item[(1c)] Implement the function \textit{simp}, which recursively+ −
traverses a regular expression from the inside to the outside, and+ −
simplifies every sub-regular-expression on the left (see below) to+ −
the regular expression on the right, except it does not simplify inside+ −
${}^*$-regular expressions.+ −
+ −
\begin{center}+ −
\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}+ −
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ + −
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ + −
$r \cdot \ONE$ & $\mapsto$ & $r$\\ + −
$\ONE \cdot r$ & $\mapsto$ & $r$\\ + −
$r + \ZERO$ & $\mapsto$ & $r$\\ + −
$\ZERO + r$ & $\mapsto$ & $r$\\ + −
$r + r$ & $\mapsto$ & $r$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
For example the regular expression+ −
\[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]+ −
+ −
simplifies to just $r_1$. \textbf{Hints:} Regular expressions can be+ −
seen as trees and there are several methods for traversing+ −
trees. One of them corresponds to the inside-out traversal. Also+ −
remember numerical expressions from school: there you had expressions+ −
like $u + \ldots + (1 \cdot x) - \ldots (z + (y \cdot 0)) \ldots$+ −
and simplification rules that looked very similar to rules+ −
above. You would simplify such numerical expressions by replacing+ −
for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then+ −
look whether more rules are applicable. If you organise the+ −
simplification in an inside-out fashion, it is always clear which+ −
rule should applied next.\\\mbox{}\hfill[1 Mark]+ −
+ −
\item[(1d)] Implement two functions: The first, called \textit{ders},+ −
takes a list of characters and a regular expression as arguments, and+ −
builds the derivative w.r.t.~the list as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\+ −
$\textit{ders}\;(c::cs)\;r$ & $\dn$ &+ −
$\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
Note that this function is different from \textit{der}, which only+ −
takes a single character.+ −
+ −
The second function, called \textit{matcher}, takes a string and a+ −
regular expression as arguments. It builds first the derivatives+ −
according to \textit{ders} and after that tests whether the resulting+ −
derivative regular expression can match the empty string (using+ −
\textit{nullable}). For example the \textit{matcher} will produce+ −
true given the regular expression $(a\cdot b)\cdot c$ and the string+ −
$abc$.\\ \mbox{}\hfill[1 Mark]+ −
+ −
\item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches+ −
(from the left to + −
right) in the string $s_1$ all the non-empty substrings that match the + −
regular expression $r$---these substrings are assumed to be+ −
the longest substrings matched by the regular expression and+ −
assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$+ −
are replaced by $s_2$. For example given the regular expression+ −
+ −
\[(a \cdot a)^* + (b \cdot b)\]+ −
+ −
\noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and+ −
replacement the string $s_2 = c$ yields the string+ −
+ −
\[+ −
ccbcabcaccc+ −
\]+ −
+ −
\hfill[2 Marks]+ −
\end{itemize}+ −
+ −
+ −
+ −
+ −
\subsection*{Part 2 (4 Marks)}+ −
+ −
You need to copy all the code from \texttt{re.scala} into+ −
\texttt{re2.scala} in order to complete Part 2. Parts (2a) and (2b)+ −
give you another method and datapoints for testing the \textit{der}+ −
and \textit{simp} functions from Part~1.+ −
+ −
\subsubsection*{Tasks (file re2.scala)}+ −
+ −
\begin{itemize}+ −
\item[(2a)] Write a \textbf{polymorphic} function, called+ −
\textit{iterT}, that is \textbf{tail-recursive}(!) and takes an+ −
integer $n$, a function $f$ and an $x$ as arguments. This function+ −
should iterate $f$ $n$-times starting with the argument $x$, like+ −
+ −
\[\underbrace{f(\ldots (f}_{n\text{-times}}(x)))+ −
\]+ −
+ −
More formally that means \textit{iterT} behaves as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{iterT}(n, f, x)$ & $\dn$ &+ −
$\begin{cases}+ −
\;x & \textit{if}\;n = 0\\+ −
\;f(\textit{iterT}(n - 1, f, x)) & \textit{otherwise}+ −
\end{cases}$ + −
\end{tabular}+ −
\end{center}+ −
+ −
Make sure you write a \textbf{tail-recursive} version of+ −
\textit{iterT}. If you add the annotation \texttt{@tailrec} (see+ −
below) your code should not produce an error message.+ −
+ −
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] + −
import scala.annotation.tailrec+ −
+ −
@tailrec+ −
def iterT[A](n: Int, f: A => A, x: A): A = ...+ −
\end{lstlisting}+ −
+ −
You can assume that \textit{iterT} will only be called for positive+ −
integers $0 \le n$. Given the type variable \texttt{A}, the type of+ −
$f$ is \texttt{A => A} and the type of $x$ is \texttt{A}. This means+ −
\textit{iterT} can be used, for example, for functions from integers+ −
to integers, or strings to strings, or regular expressions to+ −
regular expressions. \\ \mbox{}\hfill[2 Marks]+ −
+ −
\item[(2b)] Implement a function, called \textit{size}, by recursion+ −
over regular expressions. If a regular expression is seen as a tree,+ −
then \textit{size} should return the number of nodes in such a+ −
tree. Therefore this function is defined as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{size}(\ZERO)$ & $\dn$ & $1$\\+ −
$\textit{size}(\ONE)$ & $\dn$ & $1$\\+ −
$\textit{size}(c)$ & $\dn$ & $1$\\+ −
$\textit{size}(r_1 + r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\+ −
$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\+ −
$\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
You can use \textit{size} and \textit{iterT} in order to test how much+ −
the 'evil' regular expression $(a^*)^* \cdot b$ grows when taking+ −
successive derivatives according the letter $a$ and then compare it to+ −
taking the derivative, but simlifying the derivative after each step.+ −
For example, the calls+ −
+ −
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] + −
size(iterT(20, (r: Rexp) => der('a', r), EVIL))+ −
size(iterT(20, (r: Rexp) => simp(der('a', r)), EVIL))+ −
\end{lstlisting}+ −
+ −
produce without simplification a regular expression of size of+ −
7340068 after 20 iterations, while the one with+ −
simplification gives+ −
just 8.\\ \mbox{}\hfill[1 Mark]+ −
+ −
+ −
\item[(2c)] Write a \textbf{polymorphic} function, called+ −
\textit{fixpT}, that takes+ −
a function $f$ and an $x$ as arguments. The purpose+ −
of \textit{fixpT} is to calculate a fixpoint of the function $f$+ −
starting from the argument $x$.+ −
A fixpoint, say $y$, is when $f(y) = y$ holds. + −
That means \textit{fixpT} behaves as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{fixpT}(f, x)$ & $\dn$ &+ −
$\begin{cases}+ −
\;x & \textit{if}\;f(x) = x\\+ −
\;\textit{fixpT}(f, f(x)) & \textit{otherwise}+ −
\end{cases}$ + −
\end{tabular}+ −
\end{center}+ −
+ −
Make sure you calculate in the code of $\textit{fixpT}$ the result+ −
of $f(x)$ only once. Given the type variable \texttt{A} in+ −
$\textit{fixpT}$, the type of $f$ is \texttt{A => A} and the type of+ −
$x$ is \texttt{A}. The file \texttt{re2.scala} gives two example+ −
function where in one the fixpoint is 1 and in the other+ −
it is the string $a$.\\ \mbox{}\hfill[1 Mark] + −
\end{itemize}\bigskip + −
+ −
+ −
\subsection*{Background}+ −
+ −
Although easily implementable in Scala, the idea behind the derivative+ −
function might not so easy to be seen. To understand its purpose+ −
better, assume a regular expression $r$ can match strings of the form+ −
$c\!::\!cs$ (that means strings which start with a character $c$ and have+ −
some rest, or tail, $cs$). If you now take the derivative of $r$ with+ −
respect to the character $c$, then you obtain a regular expressions+ −
that can match all the strings $cs$. In other words, the regular+ −
expression $\textit{der}\;c\;r$ can match the same strings $c::cs$+ −
that can be matched by $r$, except that the $c$ is chopped off.+ −
+ −
Assume now $r$ can match the string $abc$. If you take the derivative+ −
according to $a$ then you obtain a regular expression that can match+ −
$bc$ (it is $abc$ where the $a$ has been chopped off). If you now+ −
build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you+ −
obtain a regular expression that can match the string $c$ (it is $bc$+ −
where $b$ is chopped off). If you finally build the derivative of this+ −
according $c$, that is+ −
$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r)))$, you+ −
obtain a regular expression that can match the empty string. You can+ −
test this using the function nullable, which is what your matcher is+ −
doing.+ −
+ −
The purpose of the simp function is to keep the regular expression+ −
small. Normally the derivative function makes the regular expression+ −
bigger (see the SEQ case and the example in (2b)) and the algorithm+ −
would be slower and slower over time. The simp function counters this+ −
increase in size and the result is that the algorithm is fast+ −
throughout. By the way, this algorithm is by Janusz Brzozowski who+ −
came up with the idea of derivatives in 1964 in his PhD thesis.+ −
+ −
\begin{center}\small+ −
\url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}+ −
\end{center}+ −
+ −
\end{document}+ −
+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −