6
|
1 |
\documentclass{article}
|
62
|
2 |
\usepackage{../style}
|
78
|
3 |
\usepackage{../langs}
|
6
|
4 |
|
|
5 |
\begin{document}
|
|
6 |
|
62
|
7 |
\section*{Coursework 8 (Scala, Regular Expressions}
|
6
|
8 |
|
79
|
9 |
This coursework is worth 10\%. It is about regular expressions,
|
|
10 |
pattern matching and polymorphism. The first part is due on 30
|
|
11 |
November at 11pm; the second, more advanced part, is due on 7 December
|
|
12 |
at 11pm. You are asked to implement a regular expression matcher. Make
|
|
13 |
sure the files you submit can be processed by just calling
|
|
14 |
\texttt{scala <<filename.scala>>}.\bigskip
|
62
|
15 |
|
|
16 |
\noindent
|
|
17 |
\textbf{Important:} Do not use any mutable data structures in your
|
74
|
18 |
submission! They are not needed. This excludes the use of
|
62
|
19 |
\texttt{ListBuffer}s, for example. Do not use \texttt{return} in your
|
68
|
20 |
code! It has a different meaning in Scala, than in Java. Do not use
|
|
21 |
\texttt{var}! This declares a mutable variable. Make sure the
|
62
|
22 |
functions you submit are defined on the ``top-level'' of Scala, not
|
86
|
23 |
inside a class or object. Also note that the running time of
|
|
24 |
each part will be restricted to a maximum of 360 seconds.
|
|
25 |
|
6
|
26 |
|
|
27 |
|
68
|
28 |
\subsection*{Disclaimer!!!!!!!!}
|
6
|
29 |
|
|
30 |
It should be understood that the work you submit represents
|
68
|
31 |
your own effort! You have not copied from anyone else. An
|
6
|
32 |
exception is the Scala code I showed during the lectures or
|
|
33 |
uploaded to KEATS, which you can freely use.\bigskip
|
|
34 |
|
|
35 |
|
68
|
36 |
\subsection*{Part 1 (6 Marks)}
|
6
|
37 |
|
69
|
38 |
The task is to implement a regular expression matcher that is based on
|
68
|
39 |
derivatives of regular expressions. The implementation can deal
|
|
40 |
with the following regular expressions, which have been predefined
|
69
|
41 |
in the file re.scala:
|
6
|
42 |
|
|
43 |
\begin{center}
|
|
44 |
\begin{tabular}{lcll}
|
|
45 |
$r$ & $::=$ & $\ZERO$ & cannot match anything\\
|
|
46 |
& $|$ & $\ONE$ & can only match the empty string\\
|
68
|
47 |
& $|$ & $c$ & can match a character (in this case $c$)\\
|
|
48 |
& $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
|
|
49 |
& $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
|
|
50 |
& & & then the second part with $r_2$\\
|
6
|
51 |
& $|$ & $r^*$ & can match zero or more times $r$\\
|
|
52 |
\end{tabular}
|
|
53 |
\end{center}
|
|
54 |
|
68
|
55 |
\noindent
|
|
56 |
Why? Knowing how to match regular expressions and strings fast will
|
|
57 |
let you solve a lot of problems that vex other humans. Regular
|
|
58 |
expressions are one of the fastest and simplest ways to match patterns
|
|
59 |
in text, and are endlessly useful for searching, editing and
|
|
60 |
analysing text in all sorts of places. However, you need to be
|
69
|
61 |
fast, otherwise you will stumble over problems such as recently reported at
|
68
|
62 |
|
|
63 |
{\small
|
|
64 |
\begin{itemize}
|
|
65 |
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
|
|
66 |
\item[$\bullet$] \url{https://vimeo.com/112065252}
|
|
67 |
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}
|
|
68 |
\end{itemize}}
|
|
69 |
|
79
|
70 |
\subsubsection*{Tasks (file re.scala)}
|
68
|
71 |
|
|
72 |
\begin{itemize}
|
|
73 |
\item[(1a)] Implement a function, called \textit{nullable}, by recursion over
|
69
|
74 |
regular expressions. This function tests whether a regular expression can match
|
68
|
75 |
the empty string.
|
6
|
76 |
|
|
77 |
\begin{center}
|
|
78 |
\begin{tabular}{lcl}
|
|
79 |
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
|
|
80 |
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\
|
|
81 |
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\
|
|
82 |
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
|
|
83 |
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
|
|
84 |
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\
|
|
85 |
\end{tabular}
|
68
|
86 |
\end{center}\hfill[1 Mark]
|
|
87 |
|
|
88 |
\item[(1b)] Implement a function, called \textit{der}, by recursion over
|
|
89 |
regular expressions. It takes a character and a regular expression
|
69
|
90 |
as arguments and calculates the derivative regular expression according
|
|
91 |
to the rules:
|
6
|
92 |
|
|
93 |
\begin{center}
|
|
94 |
\begin{tabular}{lcl}
|
|
95 |
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
|
|
96 |
$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\
|
|
97 |
$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
|
|
98 |
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
|
|
99 |
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
|
|
100 |
& & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
|
|
101 |
& & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
|
|
102 |
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\
|
|
103 |
\end{tabular}
|
69
|
104 |
\end{center}
|
|
105 |
|
|
106 |
For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives
|
|
107 |
w.r.t.~the characters $a$, $b$ and $c$ are
|
|
108 |
|
|
109 |
\begin{center}
|
|
110 |
\begin{tabular}{lcll}
|
|
111 |
$\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & ($= r'$)\\
|
|
112 |
$\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\
|
|
113 |
$\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$
|
|
114 |
\end{tabular}
|
|
115 |
\end{center}
|
|
116 |
|
|
117 |
Let $r'$ stand for the first derivative, then taking the derivatives of $r'$
|
|
118 |
w.r.t.~the characters $a$, $b$ and $c$ gives
|
|
119 |
|
|
120 |
\begin{center}
|
|
121 |
\begin{tabular}{lcll}
|
|
122 |
$\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\
|
|
123 |
$\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & ($= r''$)\\
|
|
124 |
$\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$
|
|
125 |
\end{tabular}
|
|
126 |
\end{center}
|
|
127 |
|
|
128 |
One more example: Let $r''$ stand for the second derivative above,
|
|
129 |
then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$
|
|
130 |
and $c$ gives
|
|
131 |
|
|
132 |
\begin{center}
|
|
133 |
\begin{tabular}{lcll}
|
|
134 |
$\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\
|
|
135 |
$\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\
|
|
136 |
$\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$
|
|
137 |
\end{tabular}
|
|
138 |
\end{center}
|
|
139 |
|
|
140 |
Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\
|
|
141 |
\mbox{}\hfill\mbox{[1 Mark]}
|
6
|
142 |
|
68
|
143 |
\item[(1c)] Implement the function \textit{simp}, which recursively
|
69
|
144 |
traverses a regular expression from the inside to the outside, and
|
|
145 |
simplifies every sub-regular-expression on the left (see below) to
|
68
|
146 |
the regular expression on the right, except it does not simplify inside
|
|
147 |
${}^*$-regular expressions.
|
6
|
148 |
|
68
|
149 |
\begin{center}
|
69
|
150 |
\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}
|
6
|
151 |
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\
|
|
152 |
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\
|
|
153 |
$r \cdot \ONE$ & $\mapsto$ & $r$\\
|
|
154 |
$\ONE \cdot r$ & $\mapsto$ & $r$\\
|
|
155 |
$r + \ZERO$ & $\mapsto$ & $r$\\
|
|
156 |
$\ZERO + r$ & $\mapsto$ & $r$\\
|
|
157 |
$r + r$ & $\mapsto$ & $r$\\
|
|
158 |
\end{tabular}
|
68
|
159 |
\end{center}
|
|
160 |
|
69
|
161 |
For example the regular expression
|
68
|
162 |
\[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]
|
|
163 |
|
79
|
164 |
simplifies to just $r_1$. \textbf{Hints:} Regular expressions can be
|
|
165 |
seen as trees and there are several methods for traversing
|
|
166 |
trees. One of them corresponds to the inside-out traversal. Also
|
|
167 |
remember numerical expressions from school: there you had exprssions
|
|
168 |
like $u + \ldots + (1 \cdot x) * \ldots (z + (y \cdot 0)) \ldots$
|
|
169 |
and simplification rules that looked very similar to rules
|
|
170 |
above. You would simplify such numerical expressions by replacing
|
|
171 |
for example the $y \cdot 0$ by $0$, or $1\cdot x$ by $x$, and then
|
|
172 |
look if more rules are applicable. If you organise this
|
|
173 |
simplification in an inside-out fashion, it is always clear which
|
|
174 |
rule should applied next.\\\mbox{}\hfill[1 Mark]
|
68
|
175 |
|
|
176 |
\item[(1d)] Implement two functions: The first, called \textit{ders},
|
69
|
177 |
takes a list of characters and a regular expression as arguments, and
|
|
178 |
builds the derivative w.r.t.~the list as follows:
|
68
|
179 |
|
|
180 |
\begin{center}
|
|
181 |
\begin{tabular}{lcl}
|
69
|
182 |
$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\
|
|
183 |
$\textit{ders}\;(c::cs)\;r$ & $\dn$ &
|
68
|
184 |
$\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\
|
|
185 |
\end{tabular}
|
6
|
186 |
\end{center}
|
|
187 |
|
78
|
188 |
Note that this function is different from \textit{der}, which only
|
|
189 |
takes a single character.
|
|
190 |
|
|
191 |
The second function, called \textit{matcher}, takes a string and a
|
|
192 |
regular expression as arguments. It builds first the derivatives
|
|
193 |
according to \textit{ders} and after that tests whether the resulting
|
|
194 |
derivative regular expression can match the empty string (using
|
|
195 |
\textit{nullable}). For example the \textit{matcher} will produce
|
|
196 |
true given the regular expression $(a\cdot b)\cdot c$ and the string
|
|
197 |
$abc$.\\ \mbox{}\hfill[1 Mark]
|
6
|
198 |
|
69
|
199 |
\item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches
|
|
200 |
(from the left to
|
|
201 |
right) in the string $s_1$ all the non-empty substrings that match the
|
|
202 |
regular expression $r$---these substrings are assumed to be
|
68
|
203 |
the longest substrings matched by the regular expression and
|
69
|
204 |
assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$
|
|
205 |
are replaced by $s_2$. For example given the regular expression
|
6
|
206 |
|
68
|
207 |
\[(a \cdot a)^* + (b \cdot b)\]
|
6
|
208 |
|
69
|
209 |
\noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and
|
78
|
210 |
replacement the string $s_2 = c$ yields the string
|
6
|
211 |
|
68
|
212 |
\[
|
|
213 |
ccbcabcaccc
|
|
214 |
\]
|
6
|
215 |
|
78
|
216 |
\hfill[2 Marks]
|
68
|
217 |
\end{itemize}
|
6
|
218 |
|
78
|
219 |
|
|
220 |
|
|
221 |
|
|
222 |
\subsection*{Part 2 (4 Marks)}
|
|
223 |
|
|
224 |
You need to copy all the code from \texttt{re.scala} into
|
|
225 |
\texttt{re2.scala} in order to complete Part 2. Parts (2a) and (2b)
|
|
226 |
give you another method and datapoints for testing the \textit{der}
|
|
227 |
and \textit{simp} functions from Part~1.
|
|
228 |
|
79
|
229 |
\subsubsection*{Tasks (file re2.scala)}
|
78
|
230 |
|
|
231 |
\begin{itemize}
|
|
232 |
\item[(2a)] Write a \textbf{polymorphic} function, called
|
|
233 |
\textit{iterT}, that is \textbf{tail-recursive}(!) and takes an
|
|
234 |
integer $n$, a function $f$ and an $x$ as arguments. This function
|
|
235 |
should iterate $f$ $n$-times starting with the argument $x$, like
|
|
236 |
|
|
237 |
\[\underbrace{f(\ldots (f}_{n\text{-times}}(x)))
|
|
238 |
\]
|
|
239 |
|
|
240 |
More formally that means \textit{iterT} behaves as follows:
|
|
241 |
|
|
242 |
\begin{center}
|
|
243 |
\begin{tabular}{lcl}
|
|
244 |
$\textit{iterT}(n, f, x)$ & $\dn$ &
|
|
245 |
$\begin{cases}
|
|
246 |
\;x & \textit{if}\;n = 0\\
|
|
247 |
\;f(\textit{iterT}(n - 1, f, x)) & \textit{otherwise}
|
|
248 |
\end{cases}$
|
|
249 |
\end{tabular}
|
|
250 |
\end{center}
|
|
251 |
|
|
252 |
Make sure you write a \textbf{tail-recursive} version of
|
|
253 |
\textit{iterT}. If you add the annotation \texttt{@tailrec} (see
|
79
|
254 |
below) your code should not produce an error message.
|
78
|
255 |
|
|
256 |
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm]
|
|
257 |
import scala.annotation.tailrec
|
|
258 |
|
|
259 |
@tailrec
|
|
260 |
def iterT[A](n: Int, f: A => A, x: A): A = ...
|
|
261 |
\end{lstlisting}
|
|
262 |
|
|
263 |
You can assume that \textit{iterT} will only be called for positive
|
|
264 |
integers $0 \le n$. Given the type variable \texttt{A}, the type of
|
|
265 |
$f$ is \texttt{A => A} and the type of $x$ is \texttt{A}. This means
|
|
266 |
\textit{iterT} can be used, for example, for functions from integers
|
79
|
267 |
to integers, or strings to strings, or regular expressions to
|
|
268 |
regular expressions. \\ \mbox{}\hfill[2 Marks]
|
78
|
269 |
|
|
270 |
\item[(2b)] Implement a function, called \textit{size}, by recursion
|
|
271 |
over regular expressions. If a regular expression is seen as a tree,
|
|
272 |
then \textit{size} should return the number of nodes in such a
|
|
273 |
tree. Therefore this function is defined as follows:
|
|
274 |
|
|
275 |
\begin{center}
|
|
276 |
\begin{tabular}{lcl}
|
|
277 |
$\textit{size}(\ZERO)$ & $\dn$ & $1$\\
|
|
278 |
$\textit{size}(\ONE)$ & $\dn$ & $1$\\
|
|
279 |
$\textit{size}(c)$ & $\dn$ & $1$\\
|
|
280 |
$\textit{size}(r_1 + r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
|
|
281 |
$\textit{size}(r_1 \cdot r_2)$ & $\dn$ & $1 + \textit{size}(r_1) + \textit{size}(r_2)$\\
|
|
282 |
$\textit{size}(r^*)$ & $\dn$ & $1 + \textit{size}(r)$\\
|
|
283 |
\end{tabular}
|
|
284 |
\end{center}
|
|
285 |
|
|
286 |
You can use \textit{size} and \textit{iterT} in order to test how much
|
|
287 |
the 'evil' regular expression $(a^*)^* \cdot b$ grows when taking
|
79
|
288 |
successive derivatives according the letter $a$ and then compare it to
|
78
|
289 |
taking the derivative, but simlifying the derivative after each step.
|
|
290 |
For example, the calls
|
|
291 |
|
|
292 |
\begin{lstlisting}[language=Scala, numbers=none, xleftmargin=-1mm]
|
|
293 |
size(iterT(20, (r: Rexp) => der('a', r), EVIL))
|
|
294 |
size(iterT(20, (r: Rexp) => simp(der('a', r)), EVIL))
|
|
295 |
\end{lstlisting}
|
|
296 |
|
|
297 |
produce without simplification a regular expression of size of
|
79
|
298 |
7340068 after 20 iterations, while the one with
|
|
299 |
simplification gives
|
78
|
300 |
just 8.\\ \mbox{}\hfill[1 Mark]
|
|
301 |
|
|
302 |
|
|
303 |
\item[(2c)] Write a \textbf{polymorphic} function, called
|
|
304 |
\textit{fixpT}, that takes
|
|
305 |
a function $f$ and an $x$ as arguments. The purpose
|
|
306 |
of \textit{fixpT} is to calculate a fixpoint of the function $f$
|
|
307 |
starting from the argument $x$.
|
|
308 |
A fixpoint, say $y$, is when $f(y) = y$ holds.
|
|
309 |
That means \textit{fixpT} behaves as follows:
|
|
310 |
|
|
311 |
\begin{center}
|
|
312 |
\begin{tabular}{lcl}
|
|
313 |
$\textit{fixpT}(f, x)$ & $\dn$ &
|
|
314 |
$\begin{cases}
|
|
315 |
\;x & \textit{if}\;f(x) = x\\
|
|
316 |
\;\textit{fixpT}(f, f(x)) & \textit{otherwise}
|
|
317 |
\end{cases}$
|
|
318 |
\end{tabular}
|
|
319 |
\end{center}
|
|
320 |
|
|
321 |
Make sure you calculate in the code of $\textit{fixpT}$ the result
|
|
322 |
of $f(x)$ only once. Given the type variable \texttt{A} in
|
|
323 |
$\textit{fixpT}$, the type of $f$ is \texttt{A => A} and the type of
|
|
324 |
$x$ is \texttt{A}. The file \texttt{re2.scala} gives two example
|
|
325 |
function where in one the fixpoint is 1 and in the other
|
|
326 |
it is the string $a$.\\ \mbox{}\hfill[1 Mark]
|
|
327 |
\end{itemize}\bigskip
|
|
328 |
|
|
329 |
|
|
330 |
|
|
331 |
\noindent
|
75
|
332 |
\textbf{Background} Although easily implementable in Scala, the idea
|
|
333 |
behind the derivative function might not so easy to be seen. To
|
|
334 |
understand its purpose better, assume a regular expression $r$ can
|
|
335 |
match strings of the form $c::cs$ (that means strings which start with
|
78
|
336 |
a character $c$ and have some rest, or tail, $cs$). If you now take the
|
75
|
337 |
derivative of $r$ with respect to the character $c$, then you obtain a
|
|
338 |
regular expressions that can match all the strings $cs$. In other
|
78
|
339 |
words, the regular expression $\textit{der}\;c\;r$ can match the same
|
75
|
340 |
strings $c::cs$ that can be matched by $r$, except that the $c$ is
|
|
341 |
chopped off.
|
|
342 |
|
|
343 |
Assume now $r$ can match the string $abc$. If you take the derivative
|
|
344 |
according to $a$ then you obtain a regular expression that can match
|
|
345 |
$bc$ (it is $abc$ where the $a$ has been chopped off). If you now
|
78
|
346 |
build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you
|
|
347 |
obtain a regular expression that can match the string $c$ (it is $bc$
|
|
348 |
where $b$ is chopped off). If you finally build the derivative of this
|
|
349 |
according $c$, that is
|
|
350 |
$\textit{der}\;c\;(\textit{der}\;b\;(\textit{der}\;a\;r)))$, you
|
|
351 |
obtain a regular expression that can match the empty string. You can
|
|
352 |
test this using the function nullable, which is what your matcher is
|
|
353 |
doing.
|
75
|
354 |
|
78
|
355 |
The purpose of the simp function is to keep the regular expression
|
|
356 |
small. Normally the derivative function makes the regular expression
|
|
357 |
bigger (see the SEQ case and the example in (2b)) and the algorithm
|
|
358 |
would be slower and slower over time. The simp function counters this
|
|
359 |
increase in size and the result is that the algorithm is fast
|
|
360 |
throughout. By the way, this algorithm is by Janusz Brzozowski who
|
|
361 |
came up with the idea of derivatives in 1964 in his PhD thesis.
|
75
|
362 |
|
78
|
363 |
\begin{center}\small
|
|
364 |
\url{https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)}
|
|
365 |
\end{center}
|
6
|
366 |
|
|
367 |
\end{document}
|
|
368 |
|
68
|
369 |
|
6
|
370 |
%%% Local Variables:
|
|
371 |
%%% mode: latex
|
|
372 |
%%% TeX-master: t
|
|
373 |
%%% End:
|