6
+ − 1
\documentclass{article}
62
+ − 2
\usepackage{../style}
6
+ − 3
%%\usepackage{../langs}
+ − 4
+ − 5
\begin{document}
+ − 6
62
+ − 7
\section*{Coursework 8 (Scala, Regular Expressions}
6
+ − 8
68
+ − 9
This coursework is worth 10\%. It is about regular expressions and
+ − 10
pattern matching. The first part is due on 30 November at 11pm; the
+ − 11
second, more advanced part, is due on 7 December at 11pm. The
+ − 12
second part is not yet included. For the first part you are
+ − 13
asked to implement a regular expression matcher. Make sure the files
62
+ − 14
you submit can be processed by just calling \texttt{scala
68
+ − 15
<<filename.scala>>}.\bigskip
62
+ − 16
+ − 17
\noindent
+ − 18
\textbf{Important:} Do not use any mutable data structures in your
74
+ − 19
submission! They are not needed. This excludes the use of
62
+ − 20
\texttt{ListBuffer}s, for example. Do not use \texttt{return} in your
68
+ − 21
code! It has a different meaning in Scala, than in Java. Do not use
+ − 22
\texttt{var}! This declares a mutable variable. Make sure the
62
+ − 23
functions you submit are defined on the ``top-level'' of Scala, not
+ − 24
inside a class or object.
6
+ − 25
+ − 26
68
+ − 27
\subsection*{Disclaimer!!!!!!!!}
6
+ − 28
+ − 29
It should be understood that the work you submit represents
68
+ − 30
your own effort! You have not copied from anyone else. An
6
+ − 31
exception is the Scala code I showed during the lectures or
+ − 32
uploaded to KEATS, which you can freely use.\bigskip
+ − 33
+ − 34
68
+ − 35
\subsection*{Part 1 (6 Marks)}
6
+ − 36
69
+ − 37
The task is to implement a regular expression matcher that is based on
68
+ − 38
derivatives of regular expressions. The implementation can deal
+ − 39
with the following regular expressions, which have been predefined
69
+ − 40
in the file re.scala:
6
+ − 41
+ − 42
\begin{center}
+ − 43
\begin{tabular}{lcll}
+ − 44
$r$ & $::=$ & $\ZERO$ & cannot match anything\\
+ − 45
& $|$ & $\ONE$ & can only match the empty string\\
68
+ − 46
& $|$ & $c$ & can match a character (in this case $c$)\\
+ − 47
& $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
+ − 48
& $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
+ − 49
& & & then the second part with $r_2$\\
6
+ − 50
& $|$ & $r^*$ & can match zero or more times $r$\\
+ − 51
\end{tabular}
+ − 52
\end{center}
+ − 53
68
+ − 54
\noindent
+ − 55
Why? Knowing how to match regular expressions and strings fast will
+ − 56
let you solve a lot of problems that vex other humans. Regular
+ − 57
expressions are one of the fastest and simplest ways to match patterns
+ − 58
in text, and are endlessly useful for searching, editing and
+ − 59
analysing text in all sorts of places. However, you need to be
69
+ − 60
fast, otherwise you will stumble over problems such as recently reported at
68
+ − 61
+ − 62
{\small
+ − 63
\begin{itemize}
+ − 64
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
+ − 65
\item[$\bullet$] \url{https://vimeo.com/112065252}
+ − 66
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}
+ − 67
\end{itemize}}
+ − 68
+ − 69
\subsection*{Tasks (file re.scala)}
+ − 70
+ − 71
\begin{itemize}
+ − 72
\item[(1a)] Implement a function, called \textit{nullable}, by recursion over
69
+ − 73
regular expressions. This function tests whether a regular expression can match
68
+ − 74
the empty string.
6
+ − 75
+ − 76
\begin{center}
+ − 77
\begin{tabular}{lcl}
+ − 78
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
+ − 79
$\textit{nullable}(\ONE)$ & $\dn$ & $\textit{true}$\\
+ − 80
$\textit{nullable}(c)$ & $\dn$ & $\textit{false}$\\
+ − 81
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
+ − 82
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
+ − 83
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\
+ − 84
\end{tabular}
68
+ − 85
\end{center}\hfill[1 Mark]
+ − 86
+ − 87
\item[(1b)] Implement a function, called \textit{der}, by recursion over
+ − 88
regular expressions. It takes a character and a regular expression
69
+ − 89
as arguments and calculates the derivative regular expression according
+ − 90
to the rules:
6
+ − 91
+ − 92
\begin{center}
+ − 93
\begin{tabular}{lcl}
+ − 94
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
+ − 95
$\textit{der}\;c\;(\ONE)$ & $\dn$ & $\ZERO$\\
+ − 96
$\textit{der}\;c\;(d)$ & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
+ − 97
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
+ − 98
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
+ − 99
& & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
+ − 100
& & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
+ − 101
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\
+ − 102
\end{tabular}
69
+ − 103
\end{center}
+ − 104
+ − 105
For example given the regular expression $r = (a \cdot b) \cdot c$, the derivatives
+ − 106
w.r.t.~the characters $a$, $b$ and $c$ are
+ − 107
+ − 108
\begin{center}
+ − 109
\begin{tabular}{lcll}
+ − 110
$\textit{der}\;a\;r$ & $=$ & $(\ONE \cdot b)\cdot c$ & ($= r'$)\\
+ − 111
$\textit{der}\;b\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$\\
+ − 112
$\textit{der}\;c\;r$ & $=$ & $(\ZERO \cdot b)\cdot c$
+ − 113
\end{tabular}
+ − 114
\end{center}
+ − 115
+ − 116
Let $r'$ stand for the first derivative, then taking the derivatives of $r'$
+ − 117
w.r.t.~the characters $a$, $b$ and $c$ gives
+ − 118
+ − 119
\begin{center}
+ − 120
\begin{tabular}{lcll}
+ − 121
$\textit{der}\;a\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$ \\
+ − 122
$\textit{der}\;b\;r'$ & $=$ & $((\ZERO \cdot b) + \ONE)\cdot c$ & ($= r''$)\\
+ − 123
$\textit{der}\;c\;r'$ & $=$ & $((\ZERO \cdot b) + \ZERO)\cdot c$
+ − 124
\end{tabular}
+ − 125
\end{center}
+ − 126
+ − 127
One more example: Let $r''$ stand for the second derivative above,
+ − 128
then taking the derivatives of $r''$ w.r.t.~the characters $a$, $b$
+ − 129
and $c$ gives
+ − 130
+ − 131
\begin{center}
+ − 132
\begin{tabular}{lcll}
+ − 133
$\textit{der}\;a\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$ \\
+ − 134
$\textit{der}\;b\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ZERO$\\
+ − 135
$\textit{der}\;c\;r''$ & $=$ & $((\ZERO \cdot b) + \ZERO) \cdot c + \ONE$
+ − 136
\end{tabular}
+ − 137
\end{center}
+ − 138
+ − 139
Note, the last derivative can match the empty string, that is it is \textit{nullable}.\\
+ − 140
\mbox{}\hfill\mbox{[1 Mark]}
6
+ − 141
68
+ − 142
\item[(1c)] Implement the function \textit{simp}, which recursively
69
+ − 143
traverses a regular expression from the inside to the outside, and
+ − 144
simplifies every sub-regular-expression on the left (see below) to
68
+ − 145
the regular expression on the right, except it does not simplify inside
+ − 146
${}^*$-regular expressions.
6
+ − 147
68
+ − 148
\begin{center}
69
+ − 149
\begin{tabular}{l@{\hspace{4mm}}c@{\hspace{4mm}}ll}
6
+ − 150
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\
+ − 151
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\
+ − 152
$r \cdot \ONE$ & $\mapsto$ & $r$\\
+ − 153
$\ONE \cdot r$ & $\mapsto$ & $r$\\
+ − 154
$r + \ZERO$ & $\mapsto$ & $r$\\
+ − 155
$\ZERO + r$ & $\mapsto$ & $r$\\
+ − 156
$r + r$ & $\mapsto$ & $r$\\
+ − 157
\end{tabular}
68
+ − 158
\end{center}
+ − 159
69
+ − 160
For example the regular expression
68
+ − 161
\[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]
+ − 162
+ − 163
simplifies to just $r_1$.
+ − 164
\hfill[1 Mark]
+ − 165
+ − 166
\item[(1d)] Implement two functions: The first, called \textit{ders},
69
+ − 167
takes a list of characters and a regular expression as arguments, and
+ − 168
builds the derivative w.r.t.~the list as follows:
68
+ − 169
+ − 170
\begin{center}
+ − 171
\begin{tabular}{lcl}
69
+ − 172
$\textit{ders}\;(Nil)\;r$ & $\dn$ & $r$\\
+ − 173
$\textit{ders}\;(c::cs)\;r$ & $\dn$ &
68
+ − 174
$\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\
+ − 175
\end{tabular}
6
+ − 176
\end{center}
+ − 177
68
+ − 178
The second, called \textit{matcher}, takes a string and a regular expression
+ − 179
as arguments. It builds first the derivatives according to \textit{ders}
69
+ − 180
and after that tests whether the resulting derivative regular expression can match
68
+ − 181
the empty string (using \textit{nullable}).
69
+ − 182
For example the \textit{matcher} will produce true given the
+ − 183
regular expression $(a\cdot b)\cdot c$ and the string $abc$.
68
+ − 184
\hfill[1 Mark]
6
+ − 185
69
+ − 186
\item[(1e)] Implement the function $\textit{replace}\;r\;s_1\;s_2$: it searches
+ − 187
(from the left to
+ − 188
right) in the string $s_1$ all the non-empty substrings that match the
+ − 189
regular expression $r$---these substrings are assumed to be
68
+ − 190
the longest substrings matched by the regular expression and
69
+ − 191
assumed to be non-overlapping. All these substrings in $s_1$ matched by $r$
+ − 192
are replaced by $s_2$. For example given the regular expression
6
+ − 193
68
+ − 194
\[(a \cdot a)^* + (b \cdot b)\]
6
+ − 195
69
+ − 196
\noindent the string $s_1 = aabbbaaaaaaabaaaaabbaaaabb$ and
+ − 197
replacement string $s_2 = c$ yields the string
6
+ − 198
68
+ − 199
\[
+ − 200
ccbcabcaccc
+ − 201
\]
6
+ − 202
68
+ − 203
\hfill[2 Mark]
+ − 204
\end{itemize}
6
+ − 205
75
+ − 206
\textbf{Background} Although easily implementable in Scala, the idea
+ − 207
behind the derivative function might not so easy to be seen. To
+ − 208
understand its purpose better, assume a regular expression $r$ can
+ − 209
match strings of the form $c::cs$ (that means strings which start with
+ − 210
a character $c$ and have some rest $cs$). If you now take the
+ − 211
derivative of $r$ with respect to the character $c$, then you obtain a
+ − 212
regular expressions that can match all the strings $cs$. In other
+ − 213
words the regular expression $\textit{der}\;c\;r$ can match the same
+ − 214
strings $c::cs$ that can be matched by $r$, except that the $c$ is
+ − 215
chopped off.
+ − 216
+ − 217
Assume now $r$ can match the string $abc$. If you take the derivative
+ − 218
according to $a$ then you obtain a regular expression that can match
+ − 219
$bc$ (it is $abc$ where the $a$ has been chopped off). If you now
+ − 220
build the derivative $\textit{der}\;b\;(\textit{der}\;a\;r))$ you obtain a regular
+ − 221
expression that can match the string "c" (it is "bc" where 'b' is
+ − 222
chopped off). If you finally build the derivative of this according
+ − 223
'c', that is der('c', der('b', der('a', r))), you obtain a regular
+ − 224
expression that can match the empty string. You can test this using
+ − 225
the function nullable, which is what your matcher is doing.
+ − 226
+ − 227
The purpose of the simp function is to keep the regular expression small. Normally the derivative function makes the regular expression bigger (see the SEQ case) and the algorithm would be slower and slower over time. The simp function counters this increase in size and the result is that the algorithm is fast throughout.
+ − 228
By the way this whole idea is by Janusz Brzozowski who came up with this in 1964 in his PhD thesis.
+ − 229
https://en.wikipedia.org/wiki/Janusz_Brzozowski_(computer_scientist)
+ − 230
+ − 231
+ − 232
68
+ − 233
\subsection*{Part 2 (4 Marks)}
+ − 234
+ − 235
Coming soon.
6
+ − 236
+ − 237
\end{document}
+ − 238
68
+ − 239
6
+ − 240
%%% Local Variables:
+ − 241
%%% mode: latex
+ − 242
%%% TeX-master: t
+ − 243
%%% End: