author | Christian Urban <christian dot urban at kcl dot ac dot uk> |
Wed, 25 Jan 2017 14:51:09 +0000 | |
changeset 106 | fae9db0d9793 |
parent 100 | 93260be6770f |
child 110 | 62389faa66e4 |
permissions | -rw-r--r-- |
6 | 1 |
\documentclass{article} |
62 | 2 |
\usepackage{../style} |
78 | 3 |
\usepackage{../langs} |
6 | 4 |
|
5 |
\begin{document} |
|
6 |
||
100 | 7 |
\section*{Coursework 8 (Scala, Regular Expressions)} |
6 | 8 |
|
79 | 9 |
This coursework is worth 10\%. It is about regular expressions, |
10 |
pattern matching and polymorphism. The first part is due on 30 |
|
11 |
November at 11pm; the second, more advanced part, is due on 7 December |
|
12 |
at 11pm. You are asked to implement a regular expression matcher. Make |
|
13 |
sure the files you submit can be processed by just calling |
|
14 |
\texttt{scala <<filename.scala>>}.\bigskip |
|
62 | 15 |
|
16 |
\noindent |
|
17 |
\textbf{Important:} Do not use any mutable data structures in your |
|
74 | 18 |
submission! They are not needed. This excludes the use of |
62 | 19 |
\texttt{ListBuffer}s, for example. Do not use \texttt{return} in your |
68 | 20 |
code! It has a different meaning in Scala, than in Java. Do not use |
21 |
\texttt{var}! This declares a mutable variable. Make sure the |
|
62 | 22 |
functions you submit are defined on the ``top-level'' of Scala, not |
86 | 23 |
inside a class or object. Also note that the running time of |
24 |
each part will be restricted to a maximum of 360 seconds. |
|
25 |
||
6 | 26 |
|
27 |
||
68 | 28 |
\subsection*{Disclaimer!!!!!!!!} |
6 | 29 |
|
30 |
It should be understood that the work you submit represents |
|
68 | 31 |
your own effort! You have not copied from anyone else. An |
6 | 32 |
exception is the Scala code I showed during the lectures or |
33 |
uploaded to KEATS, which you can freely use.\bigskip |
|
34 |
||
35 |
||
68 | 36 |
\subsection*{Part 1 (6 Marks)} |
6 | 37 |
|
69 | 38 |
The task is to implement a regular expression matcher that is based on |
68 | 39 |
derivatives of regular expressions. The implementation can deal |
40 |
with the following regular expressions, which have been predefined |
|
69 | 41 |
in the file re.scala: |
6 | 42 |
|
43 |
\begin{center} |
|
44 |
\begin{tabular}{lcll} |
|
45 |
$r$ & $::=$ & $\ZERO$ & cannot match anything\\ |
|
46 |
& $|$ & $\ONE$ & can only match the empty string\\ |
|
68 | 47 |
& $|$ & $c$ & can match a character (in this case $c$)\\ |
48 |
& $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\ |
|
49 |
& $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\ |
|
50 |
& & & then the second part with $r_2$\\ |
|
6 | 51 |
& $|$ & $r^*$ & can match zero or more times $r$\\ |
52 |
\end{tabular} |
|
53 |
\end{center} |
|
54 |
||
68 | 55 |
\noindent |
56 |
Why? Knowing how to match regular expressions and strings fast will |
|
57 |
let you solve a lot of problems that vex other humans. Regular |
|
58 |
expressions are one of the fastest and simplest ways to match patterns |
|
59 |
in text, and are endlessly useful for searching, editing and |
|
60 |
analysing text in all sorts of places. However, you need to be |
|
69 | 61 |
fast, otherwise you will stumble over problems such as recently reported at |
68 | 62 |
|
63 |
{\small |
|
64 |
\begin{itemize} |
|
65 |
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016} |
|
66 |
\item[$\bullet$] \url{https://vimeo.com/112065252} |
|
67 |
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/} |
|
68 |
\end{itemize}} |
|
69 |
||
79 | 70 |
\subsubsection*{Tasks (file re.scala)} |
68 | 71 |
|
72 |
\begin{itemize} |
|
73 |
\item[(1a)] Implement a function, called \textit{nullable}, by recursion over |
|
69 | 74 |
regular expressions. This function tests whether a regular expression can match |
68 | 75 |
the empty string. |
6 | 76 |
|
77 |
\begin{center} |
|
78 |
\begin{tabular}{lcl} |
|
79 |
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\ |
|
80 |
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\ |
|
81 |
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\ |
|
82 |
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\ |
|
83 |
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\ |
|
84 |
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\ |
|
85 |
\end{tabular} |
|
68 | 86 |
\end{center}\hfill[1 Mark] |
87 |
||
88 |
\item[(1b)] Implement a function, called \textit{der}, by recursion over |
|
89 |
regular expressions. It takes a character and a regular expression |
|
69 | 90 |
as arguments and calculates the derivative regular expression according |
91 |
to the rules: |
|
6 | 92 |
|
93 |
\begin{center} |
|
94 |
\begin{tabular}{lcl} |
|
95 |
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\ |
|
96 |
$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\ |
|
97 |
$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\ |
|
98 |
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\ |
|
99 |
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\ |
|
100 |
& & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\ |
|
101 |
& & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\ |
|
102 |
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\ |
|
103 |
\end{tabular} |
|
69 | 104 |
\end{center} |
105 |
||
106 |
For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives |
|
107 |
w.r.t.~the characters $a$, $b$ and $c$ are |
|
108 |
||
109 |
\begin{center} |
|
110 |
\begin{tabular}{lcll} |
|
111 |
$\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & ($= r'$)\\ |
|
112 |
$\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\ |
|
113 |
$\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$ |
|
114 |
\end{tabular} |
|
115 |
\end{center} |
|
116 |
||
117 |
Let $r'$ stand for the first derivative, then taking the derivatives of $r'$ |
|
118 |
w.r.t.~the characters $a$, $b$ and $c$ gives |
|
119 |
||
120 |
\begin{center} |
|
121 |
\begin{tabular}{lcll} |
|
122 |
$\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\ |
|
123 |
$\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & ($= r''$)\\ |
|
124 |
$\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ |
|
125 |
\end{tabular} |
|
126 |
\end{center} |
|
127 |
||
128 |
One more example: Let $r''$ stand for the second derivative above, |
|
129 |
then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$ |
|
130 |
and $c$ gives |
|
131 |
||
132 |
\begin{center} |
|
133 |
\begin{tabular}{lcll} |
|
134 |
$\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\ |
|
135 |
$\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\ |
|
136 |
$\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$ |
|
137 |
\end{tabular} |
|
138 |
\end{center} |
|
139 |
||
140 |
Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\ |
|
141 |
\mbox{}\hfill\mbox{[1 Mark]} |
|
6 | 142 |
|
68 | 143 |
\item[(1c)] Implement the function \textit{simp}, which recursively |
69 | 144 |
traverses a regular expression from the inside to the outside, and |
145 |
simplifies every sub-regular-expression on the left (see below) to |
|
68 | 146 |
the regular expression on the right, except it does not simplify inside |
147 |
${}^*$-regular expressions. |
|
6 | 148 |
|
68 | 149 |
\begin{center} |
69 | 150 |
\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll} |
6 | 151 |
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ |
152 |
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ |
|
153 |
$r \cdot \ONE$ & $\mapsto$ & $r$\\ |
|
154 |
$\ONE \cdot r$ & $\mapsto$ & $r$\\ |
|
155 |
$r + \ZERO$ & $\mapsto$ & $r$\\ |
|
156 |
$\ZERO + r$ & $\mapsto$ & $r$\\ |
|
157 |
$r + r$ & $\mapsto$ & $r$\\ |
|
158 |
\end{tabular} |
|
68 | 159 |
\end{center} |
160 |
||
69 | 161 |
For example the regular expression |
68 | 162 |
\[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\] |
163 |
||
79 | 164 |
simplifies to just $r_1$. \textbf{Hints:} Regular expressions can be |
165 |
seen as trees and there are several methods for traversing |
|
166 |
trees. One of them corresponds to the inside-out traversal. Also |
|
167 |
remember numerical expressions from school: there you had exprssions |
|
168 |
like $u + \ldots + (1 \cdot x) * \ldots (z + (y \cdot 0)) \ldots$ |
|
169 |
and simplification rules that looked very similar to rules |
|
170 |
above. You would simplify such numerical expressions by replacing |
|
171 |
for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then |
|
172 |
look if more rules are applicable. If you organise this |
|
173 |
simplification in an inside-out fashion, it is always clear which |
|
174 |
rule should applied next.\\\mbox{}\hfill[1 Mark] |
|
68 | 175 |
|
176 |
\item[(1d)] Implement two functions: The first, called \textit{ders}, |
|
69 | 177 |
takes a list of characters and a regular expression as arguments, and |
178 |
builds the derivative w.r.t.~the list as follows: |
|
68 | 179 |
|
180 |
\begin{center} |
|
181 |
\begin{tabular}{lcl} |
|
69 | 182 |
$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\ |
183 |
$\textit{ders}\;(c::cs)\;r$ & $\dn$ & |
|
68 | 184 |
$\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\ |
185 |
\end{tabular} |
|
6 | 186 |
\end{center} |
187 |
||
78 | 188 |
Note that this function is different from \textit{der}, which only |
189 |
takes a single character. |
|
190 |
||
191 |
The second function, called \textit{matcher}, takes a string and a |
|
192 |
regular expression as arguments. It builds first the derivatives |
|
193 |
according to \textit{ders} and after that tests whether the resulting |
|
194 |
derivative regular expression can match the empty string (using |
|
195 |
\textit{nullable}). For example the \textit{matcher} will produce |
|
196 |
true given the regular expression $(a\cdot b)\cdot c$ and the string |
|
197 |
$abc$.\\ \mbox{}\hfill[1 Mark] |
|
6 | 198 |
|
69 | 199 |
\item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches |
200 |
(from the left to |
|
201 |
right) in the string $s_1$ all the non-empty substrings that match the |
|
202 |
regular expression $r$---these substrings are assumed to be |
|
68 | 203 |
the longest substrings matched by the regular expression and |
69 | 204 |
assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$ |
205 |
are replaced by $s_2$. For example given the regular expression |
|
6 | 206 |
|
68 | 207 |
\[(a \cdot a)^* + (b \cdot b)\] |
6 | 208 |
|
69 | 209 |
\noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and |
78 | 210 |
replacement the string $s_2 = c$ yields the string |
6 | 211 |
|
68 | 212 |
\[ |
213 |
ccbcabcaccc |
|
214 |
\] |
|
6 | 215 |
|
78 | 216 |
\hfill[2 Marks] |
68 | 217 |
\end{itemize} |
6 | 218 |
|
78 | 219 |
|
220 |
||
221 |
||
222 |
\subsection*{Part 2 (4 Marks)} |
|
223 |
||
224 |
You need to copy all the code from \texttt{re.scala} into |
|
225 |
\texttt{re2.scala} in order to complete Part 2. Parts (2a) and (2b) |
|
226 |
give you another method and datapoints for testing the \textit{der} |
|
227 |
and \textit{simp} functions from Part~1. |
|
228 |
||
79 | 229 |
\subsubsection*{Tasks (file re2.scala)} |
78 | 230 |
|
231 |
\begin{itemize} |
|
232 |
\item[(2a)] Write a \textbf{polymorphic} function, called |
|
233 |
\textit{iterT}, that is \textbf{tail-recursive}(!) and takes an |
|
234 |
integer $n$, a function $f$ and an $x$ as arguments. This function |
|
235 |
should iterate $f$ $n$-times starting with the argument $x$, like |
|
236 |
||
237 |
\[\underbrace{f(\ldots (f}_{n\text{-times}}(x))) |
|
238 |
\] |
|
239 |
||
240 |
More formally that means \textit{iterT} behaves as follows: |
|
241 |
||
242 |
\begin{center} |
|
243 |
\begin{tabular}{lcl} |
|
244 |
$\textit{iterT}(n, f, x)$ & $\dn$ & |
|
245 |
$\begin{cases} |
|
246 |
\;x & \textit{if}\;n = 0\\ |
|
247 |
\;f(\textit{iterT}(n - 1, f, x)) & \textit{otherwise} |
|
248 |
\end{cases}$ |
|
249 |
\end{tabular} |
|
250 |
\end{center} |
|
251 |
||
252 |
Make sure you write a \textbf{tail-recursive} version of |
|
253 |
\textit{iterT}. If you add the annotation \texttt{@tailrec} (see |
|
79 | 254 |
below) your code should not produce an error message. |
78 | 255 |
|
256 |
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] |
|
257 |
import scala.annotation.tailrec |
|
258 |
||
259 |
@tailrec |
|
260 |
def iterT[A](n: Int, f: A => A, x: A): A = ... |
|
261 |
\end{lstlisting} |
|
262 |
||
263 |
You can assume that \textit{iterT} will only be called for positive |
|
264 |
integers $0 \le n$. Given the type variable \texttt{A}, the type of |
|
265 |
$f$ is \texttt{A => A} and the type of $x$ is \texttt{A}. This means |
|
266 |
\textit{iterT} can be used, for example, for functions from integers |
|
79 | 267 |
to integers, or strings to strings, or regular expressions to |
268 |
regular expressions. \\ \mbox{}\hfill[2 Marks] |
|
78 | 269 |
|
270 |
\item[(2b)] Implement a function, called \textit{size}, by recursion |
|
271 |
over regular expressions. If a regular expression is seen as a tree, |
|
272 |
then \textit{size} should return the number of nodes in such a |
|
273 |
tree. Therefore this function is defined as follows: |
|
274 |
||
275 |
\begin{center} |
|
276 |
\begin{tabular}{lcl} |
|
277 |
$\textit{size}(\ZERO)$ & $\dn$ & $1$\\ |
|
278 |
$\textit{size}(\ONE)$ & $\dn$ & $1$\\ |
|
279 |
$\textit{size}(c)$ & $\dn$ & $1$\\ |
|
280 |
$\textit{size}(r_1 + r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\ |
|
281 |
$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\ |
|
282 |
$\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\ |
|
283 |
\end{tabular} |
|
284 |
\end{center} |
|
285 |
||
286 |
You can use \textit{size} and \textit{iterT} in order to test how much |
|
287 |
the 'evil' regular expression $(a^*)^* \cdot b$ grows when taking |
|
79 | 288 |
successive derivatives according the letter $a$ and then compare it to |
78 | 289 |
taking the derivative, but simlifying the derivative after each step. |
290 |
For example, the calls |
|
291 |
||
292 |
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm] |
|
293 |
size(iterT(20, (r: Rexp) => der('a', r), EVIL)) |
|
294 |
size(iterT(20, (r: Rexp) => simp(der('a', r)), EVIL)) |
|
295 |
\end{lstlisting} |
|
296 |
||
297 |
produce without simplification a regular expression of size of |
|
79 | 298 |
7340068 after 20 iterations, while the one with |
299 |
simplification gives |
|
78 | 300 |
just 8.\\ \mbox{}\hfill[1 Mark] |
301 |
||
302 |
||
303 |
\item[(2c)] Write a \textbf{polymorphic} function, called |
|
304 |
\textit{fixpT}, that takes |
|
305 |
a function $f$ and an $x$ as arguments. The purpose |
|
306 |
of \textit{fixpT} is to calculate a fixpoint of the function $f$ |
|
307 |
starting from the argument $x$. |
|
308 |
A fixpoint, say $y$, is when $f(y) = y$ holds. |
|
309 |
That means \textit{fixpT} behaves as follows: |
|
310 |
||
311 |
\begin{center} |
|
312 |
\begin{tabular}{lcl} |
|
313 |
$\textit{fixpT}(f, x)$ & $\dn$ & |
|
314 |
$\begin{cases} |
|
315 |
\;x & \textit{if}\;f(x) = x\\ |
|
316 |
\;\textit{fixpT}(f, f(x)) & \textit{otherwise} |
|
317 |
\end{cases}$ |
|
318 |
\end{tabular} |
|
319 |
\end{center} |
|
320 |
||
321 |
Make sure you calculate in the code of $\textit{fixpT}$ the result |
|
322 |
of $f(x)$ only once. Given the type variable \texttt{A} in |
|
323 |
$\textit{fixpT}$, the type of $f$ is \texttt{A => A} and the type of |
|
324 |
$x$ is \texttt{A}. The file \texttt{re2.scala} gives two example |
|
325 |
function where in one the fixpoint is 1 and in the other |
|
326 |
it is the string $a$.\\ \mbox{}\hfill[1 Mark] |
|
327 |
\end{itemize}\bigskip |
|
328 |
||
329 |
||
94
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
330 |
\subsection*{Background} |
78 | 331 |
|
94
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
332 |
Although easily implementable in Scala, the idea behind the derivative |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
333 |
function might not so easy to be seen. To understand its purpose |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
334 |
better, assume a regular expression $r$ can match strings of the form |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
335 |
$c::cs$ (that means strings which start with a character $c$ and have |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
336 |
some rest, or tail, $cs$). If you now take the derivative of $r$ with |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
337 |
respect to the character $c$, then you obtain a regular expressions |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
338 |
that can match all the strings $cs$. In other words, the regular |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
339 |
expression $\textit{der}\;c\;r$ can match the same strings $c::cs$ |
ae4708c851ee
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
86
diff
changeset
|
340 |
that can be matched by $r$, except that the $c$ is chopped off. |
75 | 341 |
|
342 |
Assume now $r$ can match the string $abc$. If you take the derivative |
|
343 |
according to $a$ then you obtain a regular expression that can match |
|
344 |
$bc$ (it is $abc$ where the $a$ has been chopped off). If you now |
|
78 | 345 |
build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you |
346 |
obtain a regular expression that can match the string $c$ (it is $bc$ |
|
347 |
where $b$ is chopped off). If you finally build the derivative of this |
|
348 |
according $c$, that is |
|
349 |
$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r)))$, you |
|
350 |
obtain a regular expression that can match the empty string. You can |
|
351 |
test this using the function nullable, which is what your matcher is |
|
352 |
doing. |
|
75 | 353 |
|
78 | 354 |
The purpose of the simp function is to keep the regular expression |
355 |
small. Normally the derivative function makes the regular expression |
|
356 |
bigger (see the SEQ case and the example in (2b)) and the algorithm |
|
357 |
would be slower and slower over time. The simp function counters this |
|
358 |
increase in size and the result is that the algorithm is fast |
|
359 |
throughout. By the way, this algorithm is by Janusz Brzozowski who |
|
360 |
came up with the idea of derivatives in 1964 in his PhD thesis. |
|
75 | 361 |
|
78 | 362 |
\begin{center}\small |
363 |
\url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)} |
|
364 |
\end{center} |
|
6 | 365 |
|
366 |
\end{document} |
|
367 |
||
68 | 368 |
|
6 | 369 |
%%% Local Variables: |
370 |
%%% mode: latex |
|
371 |
%%% TeX-master: t |
|
372 |
%%% End: |