cws/cw03.tex
author Christian Urban <urbanc@in.tum.de>
Thu, 24 Nov 2016 09:42:49 +0000
changeset 68 8da9e0c16194
parent 62 2151c77e1e24
child 69 f1295a0ab4ed
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
%%\usepackage{../langs}

\begin{document}

\section*{Coursework 8 (Scala, Regular Expressions}

This coursework is worth 10\%. It is about regular expressions and
pattern matching. The first part is due on 30 November at 11pm; the
second, more advanced part, is due on 7 December at 11pm. The
second part is not yet included. For the first part you are
asked to implement a regular expression matcher. Make sure the files
you submit can be processed by just calling \texttt{scala
  <<filename.scala>>}.\bigskip

\noindent
\textbf{Important:} Do not use any mutable data structures in your
submissions! They are not needed. This excluded the use of
\texttt{ListBuffer}s, for example. Do not use \texttt{return} in your
code! It has a different meaning in Scala, than in Java.  Do not use
\texttt{var}! This declares a mutable variable.  Make sure the
functions you submit are defined on the ``top-level'' of Scala, not
inside a class or object.


\subsection*{Disclaimer!!!!!!!!}

It should be understood that the work you submit represents
your own effort! You have not copied from anyone else. An
exception is the Scala code I showed during the lectures or
uploaded to KEATS, which you can freely use.\bigskip


\subsection*{Part 1 (6 Marks)}

The task is to implement a regular expression matcher based on
derivatives of regular expressions. The implementation can deal
with the following regular expressions, which have been predefined
file re.scala:

\begin{center}
\begin{tabular}{lcll}
  $r$ & $::=$ & $\ZERO$     & cannot match anything\\
      &   $|$ & $\ONE$      & can only match the empty string\\
      &   $|$ & $c$         & can match a character (in this case $c$)\\
      &   $|$ & $r_1 + r_2$ & can match a string either with $r_1$ or with $r_2$\\
  &   $|$ & $r_1\cdot r_2$ & can match the first part of a string with $r_1$ and\\
          &  & & then the second part with $r_2$\\
      &   $|$ & $r^*$       & can match zero or more times $r$\\
\end{tabular}
\end{center}

\noindent 
Why? Knowing how to match regular expressions and strings fast will
let you solve a lot of problems that vex other humans. Regular
expressions are one of the fastest and simplest ways to match patterns
in text, and are endlessly useful for searching, editing and
analysing text in all sorts of places. However, you need to be
fast, otherwise you will stumble upon problems such as recently reported at

{\small
\begin{itemize}
\item[$\bullet$] \url{http://stackstatus.net/post/147710624694/outage-postmortem-july-20-2016}
\item[$\bullet$] \url{https://vimeo.com/112065252}
\item[$\bullet$] \url{http://davidvgalbraith.com/how-i-fixed-atom/}  
\end{itemize}}

\subsection*{Tasks (file re.scala)}

\begin{itemize}
\item[(1a)] Implement a function, called \textit{nullable}, by recursion over
  regular expressions. This function test whether a regular expression can match
  the empty string.

\begin{center}
\begin{tabular}{lcl}
$\textit{nullable}(\ZERO)$ & $\dn$ & $\textit{false}$\\
$\textit{nullable}(\ONE)$  & $\dn$ & $\textit{true}$\\
$\textit{nullable}(c)$     & $\dn$ & $\textit{false}$\\
$\textit{nullable}(r_1 + r_2)$ & $\dn$ & $\textit{nullable}(r_1) \vee \textit{nullable}(r_2)$\\
$\textit{nullable}(r_1 \cdot r_2)$ & $\dn$ & $\textit{nullable}(r_1) \wedge \textit{nullable}(r_2)$\\
$\textit{nullable}(r^*)$ & $\dn$ & $\textit{true}$\\
\end{tabular}
\end{center}\hfill[1 Mark]

\item[(1b)] Implement a function, called \textit{der}, by recursion over
  regular expressions. It takes a character and a regular expression
  as arguments and calculates the derivative regular expression.

\begin{center}
\begin{tabular}{lcl}
$\textit{der}\;c\;(\ZERO)$ & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(\ONE)$  & $\dn$ & $\ZERO$\\
$\textit{der}\;c\;(d)$     & $\dn$ & $\textit{if}\; c = d\;\textit{then} \;\ONE \; \textit{else} \;\ZERO$\\
$\textit{der}\;c\;(r_1 + r_2)$ & $\dn$ & $(\textit{der}\;c\;r_1) + (\textit{der}\;c\;r_2)$\\
$\textit{der}\;c\;(r_1 \cdot r_2)$ & $\dn$ & $\textit{if}\;\textit{nullable}(r_1)$\\
      & & $\textit{then}\;((\textit{der}\;c\;r_1)\cdot r_2) + (\textit{der}\;c\;r_2)$\\
      & & $\textit{else}\;(\textit{der}\;c\;r_1)\cdot r_2$\\
$\textit{der}\;c\;(r^*)$ & $\dn$ & $(\textit{der}\;c\;r)\cdot (r^*)$\\
\end{tabular}
\end{center}\hfill[1 Mark]

\item[(1c)] Implement the function \textit{simp}, which recursively
  traverses a regular expression from inside to outside, and
  simplifies every sub-regular-expressions on the left to
  the regular expression on the right, except it does not simplify inside
  ${}^*$-regular expressions.

  \begin{center}
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}ll}
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
$r \cdot \ONE$ & $\mapsto$ & $r$\\ 
$\ONE \cdot r$ & $\mapsto$ & $r$\\ 
$r + \ZERO$ & $\mapsto$ & $r$\\ 
$\ZERO + r$ & $\mapsto$ & $r$\\ 
$r + r$ & $\mapsto$ & $r$\\ 
\end{tabular}
  \end{center}

  For example
  \[(r_1 + \ZERO) \cdot \ONE + ((\ONE + r_2) + r_3) \cdot (r_4 \cdot \ZERO)\]

  simplifies to just $r_1$. 
  \hfill[1 Mark]

\item[(1d)] Implement two functions: The first, called \textit{ders},
  takes a list of characters as arguments and a regular expression and
  buids the derivative as follows:

\begin{center}
\begin{tabular}{lcl}
$\textit{ders}\;Nil\;(r)$ & $\dn$ & $r$\\
  $\textit{ders}\;c::cs\;(r)$  & $\dn$ &
    $\textit{ders}\;cs\;(\textit{simp}(\textit{der}\;c\;r))$\\
\end{tabular}
\end{center}

The second, called \textit{matcher}, takes a string and a regular expression
as arguments. It builds first the derivatives according to \textit{ders}
and at the end tests whether the resulting redular expression can match
the empty string (using \textit{nullable}).
For example the \textit{matcher} will produce true if given the
regular expression $a\cdot b\cdot c$ and the string $abc$.
\hfill[1 Mark]

\item[(1e)] Implement the function \textit{replace}: it searches (from the left to 
right) in string $s_1$ all the non-empty substrings that match the 
regular expression---these substrings are assumed to be
the longest substrings matched by the regular expression and
assumed to be non-overlapping. All these substrings in $s_1$ are replaced
by $s_2$. For example given the regular expression

\[(a \cdot a)^* + (b \cdot b)\]

\noindent the string $aabbbaaaaaaabaaaaabbaaaabb$ and
replacement string $c$ yields the string

\[
ccbcabcaccc
\]

\hfill[2 Mark]
\end{itemize}

\subsection*{Part 2 (4 Marks)}

Coming soon.

\end{document}


%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: