Binary file coursework/cw01.pdf has changed
--- a/coursework/cw01.tex Wed May 10 17:03:21 2017 +0100
+++ b/coursework/cw01.tex Sun May 21 00:43:02 2017 +0100
@@ -2,7 +2,11 @@
\usepackage{../style}
\usepackage{../langs}
+\usepackage{array}
+
+
\begin{document}
+\newcolumntype{C}[1]{>{\centering}m{#1}}
\section*{Coursework 1 (Strand 1)}
@@ -25,7 +29,7 @@
uploaded to KEATS, which you can freely use.\bigskip
-\subsubsection*{Task}
+\subsection*{Task}
The task is to implement a regular expression matcher based on
derivatives of regular expressions. The implementation should
@@ -40,25 +44,41 @@
\begin{center}
\begin{tabular}{ll}
-$[c_1 c_2 \ldots c_n]$ & a range of characters\\
-$r^+$ & one or more times $r$\\
-$r^?$ & optional $r$\\
-$r^{\{n,m\}}$ & at least $n$-times $r$ but no more than $m$-times\\
-$\sim{}r$ & not-regular expression of $r$\\
+ $[c_1,c_2,\ldots,c_n]$ & a set of characters---for character ranges\\
+ $r^+$ & one or more times $r$\\
+ $r^?$ & optional $r$\\
+ $r^{\{n\}}$ & exactly $n$-times\\
+ $r^{\{..m\}}$ & zero or more times $r$ but no more than $m$-times\\
+ $r^{\{n..\}}$ & at least $n$-times $r$\\
+ $r^{\{n..m\}}$ & at least $n$-times $r$ but no more than $m$-times\\
+ $\sim{}r$ & not-regular-expression of $r$\\
\end{tabular}
\end{center}
-\noindent In the case of $r^{\{n,m\}}$ you can assume the
-convention that $0 \le n \le m$. The meanings of the extended
-regular expressions are
+\noindent You can assume that $n$ and $m$ are greater or equal than
+$0$. In the case of $r^{\{n,m\}}$ you can also assume $0 \le n \le m$.\bigskip
+
+\noindent {\bf Important!} Your implementation should have explicit
+cases for the basic regular expressions, but also explicit cases for
+the extended regular expressions. That means do not treat the extended
+regular expressions by just translating them into the basic ones. See
+also Question 2, where you are asked to explicitly give the rules for
+\textit{nullable} and \textit{der} for the extended regular
+expressions.\newpage
+
+\noindent
+The meanings of the extended regular expressions are
\begin{center}
\begin{tabular}{r@{\hspace{2mm}}c@{\hspace{2mm}}l}
-$L([c_1 c_2 \ldots c_n])$ & $\dn$ & $\{[c_1], [c_2], \ldots, [c_n]\}$\\
-$L(r^+)$ & $\dn$ & $\bigcup_{1\le i}. L(r)^i$\\
-$L(r^?)$ & $\dn$ & $L(r) \cup \{[]\}$\\
-$L(r^{\{n,m\}})$ & $\dn$ & $\bigcup_{n\le i \le m}. L(r)^i$\\
-$L(\sim{}r)$ & $\dn$ & $\Sigma^* - L(r)$
+ $L([c_1,c_2,\ldots,c_n])$ & $\dn$ & $\{[c_1], [c_2], \ldots, [c_n]\}$\\
+ $L(r^+)$ & $\dn$ & $\bigcup_{1\le i}.\;L(r)^i$\\
+ $L(r^?)$ & $\dn$ & $L(r) \cup \{[]\}$\\
+ $L(r^{\{n\}})$ & $\dn$ & $L(r)^n$\\
+ $L(r^{\{..m\}})$ & $\dn$ & $\bigcup_{0\le i \le m}.\;L(r)^i$\\
+ $L(r^{\{n..\}})$ & $\dn$ & $\bigcup_{n\le i}.\;L(r)^i$\\
+ $L(r^{\{n..m\}})$ & $\dn$ & $\bigcup_{n\le i \le m}.\;L(r)^i$\\
+ $L(\sim{}r)$ & $\dn$ & $\Sigma^* - L(r)$
\end{tabular}
\end{center}
@@ -66,31 +86,24 @@
for the set of \emph{all} strings over the alphabet $\Sigma$
(in the implementation the alphabet can be just what is
represented by, say, the type \pcode{Char}). So $\sim{}r$
-means `all the strings that $r$ cannot match'.
+means in effect ``all the strings that $r$ cannot match''.\medskip
+\noindent
Be careful that your implementation of \textit{nullable} and
-\textit{der} satisfies for every $r$ the following two
-properties (see also Question 2):
+\textit{der} satisfies for every regular expression $r$ the following
+two properties (see also Question 2):
\begin{itemize}
\item $\textit{nullable}(r)$ if and only if $[]\in L(r)$
\item $L(der\,c\,r) = Der\,c\,(L(r))$
\end{itemize}
-\noindent {\bf Important!} Your implementation should have
-explicit cases for the basic regular expressions, but also
-explicit cases for the extended regular expressions. That
-means do not treat the extended regular expressions by just
-translating them into the basic ones. See also Question 2,
-where you are asked to explicitly give the rules for
-\textit{nullable} and \textit{der} for the extended regular
-expressions.
\subsection*{Question 1}
What is your King's email address (you will need it in
-Question 3)?
+Question 4)?
\subsection*{Question 2}
@@ -102,16 +115,27 @@
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
-$\textit{nullable}([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\
-$\textit{nullable}(r^+)$ & $\dn$ & $?$\\
-$\textit{nullable}(r^?)$ & $\dn$ & $?$\\
-$\textit{nullable}(r^{\{n,m\}})$ & $\dn$ & $?$\\
-$\textit{nullable}(\sim{}r)$ & $\dn$ & $?$\medskip\\
-$der\, c\, ([c_1 c_2 \ldots c_n])$ & $\dn$ & $?$\\
-$der\, c\, (r^+)$ & $\dn$ & $?$\\
-$der\, c\, (r^?)$ & $\dn$ & $?$\\
-$der\, c\, (r^{\{n,m\}})$ & $\dn$ & $?$\\
-$der\, c\, (\sim{}r)$ & $\dn$ & $?$\\
+ $\textit{nullable}([c_1,c_2,\ldots,c_n])$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^+)$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^?)$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{..m\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n..\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(r^{\{n..m\}})$ & $\dn$ & $?$\\
+ $\textit{nullable}(\sim{}r)$ & $\dn$ & $?$
+\end{tabular}
+\end{center}
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $der\, c\, ([c_1,c_2,\ldots,c_n])$ & $\dn$ & $?$\\
+ $der\, c\, (r^+)$ & $\dn$ & $?$\\
+ $der\, c\, (r^?)$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{..m\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n..\}})$ & $\dn$ & $?$\\
+ $der\, c\, (r^{\{n..m\}})$ & $\dn$ & $?$\\
+ $der\, c\, (\sim{}r)$ & $\dn$ & $?$\\
\end{tabular}
\end{center}
@@ -124,14 +148,77 @@
\end{itemize}
\noindent
-Give the implementation and the text-version of the clauses above.
+Given the definitions of \textit{nullable} and \textit{der}, it is
+easy to implement a regular expression matcher. Test your regular
+expression matcher with (at least) the examples:
+
+
+\begin{center}
+\def\arraystretch{1.2}
+\begin{tabular}{r|m{12mm}|m{12mm}|m{12mm}|m{12mm}|m{12mm}|m{12mm}}
+ string & $a^{\{3\}}$ & $(a^?)^{\{3\}}$ & $a^{\{..3\}}$ &
+ $(a^?)^{\{..3\}}$ & $a^{\{3..5\}}$ & $(a^?)^{\{3..5\}}$\\\hline
+ $[]$ &&&&&& \\\hline
+ \texttt{a} &&&&&& \\\hline
+ \texttt{aa} &&&&&& \\\hline
+ \texttt{aaa} &&&&&& \\\hline
+ \texttt{aaaaa} &&&&&& \\\hline
+ \texttt{aaaaaa}&&&&&& \\
+\end{tabular}
+\end{center}
+
+\noindent
+Does your matcher produce the expected results?
\subsection*{Question 3}
-Implement the following regular expression for email addresses
+There are a number of explicit regular expressions that deal with single or several
+characters, for example:
+
+\begin{center}
+\begin{tabular}{ll}
+ $c$ & match a single character\\
+ $[c_1,c_2,\ldots,c_n]$ & match a set of characters---for character ranges\\
+ $\textit{ALL}$ & match any character
+\end{tabular}
+\end{center}
+
+\noindent
+the latter is useful for matching any string (for example
+$\textit{ALL}^*$). In order to avoid having an explicit constructor
+for each case, we can generalise all such cases and introduce a single
+constructor $\textit{CFUN}(f)$ where $f$ is a function from characters
+to a boolean. Implement
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $\textit{nullable}(\textit{CFUN}(f))$ & $\dn$ & $?$\\
+ $\textit{der}\,c\,(\textit{CFUN}(f))$ & $\dn$ & $?$
+\end{tabular}
+\end{center}
+
+\noindent in your matcher and then give definitions for
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+ $c$ & $\dn$ & $\textit{CFUN}(?)$\\
+ $[c_1,c_2,\ldots,c_n]$ & $\dn$ & $\textit{CFUN}(?)$\\
+ $\textit{ALL}$ & $\dn$ & $\textit{CFUN}(?)$
+\end{tabular}
+\end{center}
+
+
+\subsection*{Question 4}
+
+Suppose $[a\mbox{-}z0\mbox{-}9\_\,.\mbox{-}]$ stands for the regular expression
+
+\[[a,b,c,\ldots,z,0,\dots,9,\_,.,\mbox{-}]\;.\]
+
+\noindent
+Define in your code the following regular expression for email addresses
\[
-([a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot ([a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot ([a\mbox{-}z\,.]^{\{2,6\}})
+([a\mbox{-}z0\mbox{-}9\_\,.-]^+)\cdot @\cdot ([a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot ([a\mbox{-}z\,.]^{\{2,6\}})
\]
\noindent and calculate the derivative according to your email
@@ -154,13 +241,12 @@
\noindent Write down your simplified derivative in a readable
notation using parentheses where necessary. That means you
should use the infix notation $+$, $\cdot$, $^*$ and so on,
-instead of code.
+instead of code.\bigskip
-\subsection*{Question 4}
-
-Suppose \textit{[a-z]} stands for the range regular expression
-$[a,b,c,\ldots,z]$. Consider the regular expression $/ \cdot * \cdot
-(\sim{}([a\mbox{-}z]^* \cdot * \cdot / \cdot [a\mbox{-}z]^*)) \cdot *
+\noindent
+Implement the simplification rules in your regular expression matcher.
+Consider the regular expression $/ \cdot * \cdot
+(\sim{}(\textit{ALL}^* \cdot * \cdot / \cdot \textit{ALL}^*)) \cdot *
\cdot /$ and decide wether the following four strings are matched by
this regular expression. Answer yes or no.
@@ -172,24 +258,7 @@
\end{enumerate}
\noindent
-Also test your regular expression matcher with the regular
-expressions $a^{\{3,5\}}$ and $(a^?)^{\{3,5\}}$. Test whether the
-strings
-
-\begin{enumerate}
-\setcounter{enumi}{4}
-\item \texttt{aa}
-\item \texttt{aaa}
-\item \texttt{aaaaa}
-\item \texttt{aaaaaa}
-\end{enumerate}
-
-\noindent
-are matched or not. Does your matcher produce the expected results?
-
-\subsection*{Question 5}
-
-Let $r_1$ be the regular expression $a\cdot a\cdot a$ and $r_2$ be
+Also let $r_1$ be the regular expression $a\cdot a\cdot a$ and $r_2$ be
$(a^{\{19,19\}}) \cdot (a^?)$. Decide whether the following three
strings consisting of $a$s only can be matched by $(r_1^+)^+$.
Similarly test them with $(r_2^+)^+$. Again answer in all six cases
@@ -201,6 +270,7 @@
to not introducing any other character.
\begin{enumerate}
+\setcounter{enumi}{4}
\item \texttt{"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\\
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"}
@@ -213,6 +283,7 @@
\end{enumerate}
+
\end{document}
%%% Local Variables:
Binary file handouts/ho01.pdf has changed
--- a/handouts/ho01.tex Wed May 10 17:03:21 2017 +0100
+++ b/handouts/ho01.tex Sun May 21 00:43:02 2017 +0100
@@ -39,13 +39,13 @@
\section*{Handout 1}
This module is about text processing, be it for web-crawlers,
-compilers, dictionaries, DNA-data and so on. When looking for a
+compilers, dictionaries, DNA-data, ad filters and so on. When looking for a
particular string, like $abc$ in a large text we can use the
Knuth-Morris-Pratt algorithm, which is currently the most efficient
general string search algorithm. But often we do \emph{not} just look
for a particular string, but for string patterns. For example, in
program code we need to identify what are the keywords (if, then,
-while, etc), what are the identifiers (variable names). A pattern for
+while, for, etc), what are the identifiers (variable names). A pattern for
identifiers could be stated as: they start with a letter, followed by
zero or more letters, numbers and underscores. Often we also face the
problem that we are given a string (for example some user input) and
@@ -536,7 +536,7 @@
precisely specify when a string $s$ is matched by a regular
expression $r$, namely if and only if $s \in L(r)$. In fact we
will write a program \pcode{match} that takes any string $s$
-and any regular expression $r$ as argument and returns
+and any regular expression $r$ as arguments and returns
\emph{yes}, if $s \in L(r)$ and \emph{no}, if $s \not\in
L(r)$. We leave this for the next lecture.
@@ -641,9 +641,9 @@
Python, Ruby and Java in some instances and the problems in Stack
Exchange and the Atom editor). People who are not very familiar with
the mathematical background of regular expressions get them
-consistently wrong (surprising given they are a supposed to be core
-skill for computer scientists). The hope is that we can do better in
-the future---for example by proving that the algorithms actually
+consistently wrong (this is surprising given they are a supposed to be
+core skill for computer scientists). The hope is that we can do better
+in the future---for example by proving that the algorithms actually
satisfy their specification and that the corresponding implementations
do not contain any bugs. We are close, but not yet quite there.
@@ -652,15 +652,15 @@
``theoretical'' shortcomings, for example recognising strings of the
form $a^{n}b^{n}$ is not possible with regular expressions. This means
for example if we try to regognise whether parentheses are well-nested
-is impossible with (basic) regular expressions. I am not so bothered
-by these shortcomings. What I am bothered about is when regular
-expressions are in the way of practical programming. For example, it
-turns out that the regular expression for email addresses shown in
-\eqref{email} is hopelessly inadequate for recognising all of them
-(despite being touted as something every computer scientist should
-know about). The W3 Consortium (which standardises the Web) proposes
-to use the following, already more complicated regular expressions for
-email addresses:
+in an expression is impossible with (basic) regular expressions. I am
+not so bothered by these shortcomings. What I am bothered about is
+when regular expressions are in the way of practical programming. For
+example, it turns out that the regular expression for email addresses
+shown in \eqref{email} is hopelessly inadequate for recognising all of
+them (despite being touted as something every computer scientist
+should know about). The W3 Consortium (which standardises the Web)
+proposes to use the following, already more complicated regular
+expressions for email addresses:
{\small\begin{lstlisting}[language={},keywordstyle=\color{black},numbers=none]
[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
Binary file handouts/ho02.pdf has changed
--- a/handouts/ho02.tex Wed May 10 17:03:21 2017 +0100
+++ b/handouts/ho02.tex Sun May 21 00:43:02 2017 +0100
@@ -14,10 +14,10 @@
This lecture is about implementing a more efficient regular expression
matcher (the plots on the right below)---more efficient than the
matchers from regular expression libraries in Ruby, Python and Java
-(the plots on the left). The first pair of plots show the running time
+(the plots on the left). The first pair of plots shows the running time
for the regular expression $(a^*)^*\cdot b$ and strings composed of
$n$ \pcode{a}s (meaning this regular expression actually does not
-match the strings). The second pair of plots show the running time for
+match the strings). The second pair of plots shows the running time for
the regular expressions $a^?{}^{\{n\}}\cdot a^{\{n\}}$ and strings
also composed of $n$ \pcode{a}s (this time the regular expressions
match the strings). To see the substantial differences in the left
@@ -278,9 +278,9 @@
\end{center}
\noindent
-We will not use them in our algorithm, but feel free to make sure they
-hold. As an aside, there has been a lot of research about questions
-like: Can one always decide when two regular expressions are
+We will not use them in our algorithm, but feel free to convince you
+that they hold. As an aside, there has been a lot of research about
+questions like: Can one always decide when two regular expressions are
equivalent or not? What does an algorithm look like to decide this
efficiently?
Binary file handouts/ho03.pdf has changed
--- a/handouts/ho03.tex Wed May 10 17:03:21 2017 +0100
+++ b/handouts/ho03.tex Sun May 21 00:43:02 2017 +0100
@@ -416,9 +416,9 @@
\stackrel{\epsilon}{\longrightarrow}\ldots\stackrel{\epsilon}{\longrightarrow} q'
\]
-\noindent and replace them with $q \stackrel{a}{\longrightarrow}
-q'$. Doing this to the $\epsilon$NFA on the right-hand side above gives
-the NFA
+\noindent where somewhere in the ``middle'' is an $a$-transition. We
+replace them with $q \stackrel{a}{\longrightarrow} q'$. Doing this to
+the $\epsilon$NFA on the right-hand side above gives the NFA
\begin{center}
\begin{tikzpicture}[>=stealth',very thick,
@@ -624,11 +624,11 @@
\node[state, accepting] (c_2) [above=of c_1] {$\mbox{}$};
\node[state, accepting] (c_3) [below=of c_1] {$\mbox{}$};
\path[->] (t_1) edge (A_01);
-\path[->] (t_2) edge node [above] {$\epsilon$} (A_01);
+\path[->] (t_2) edge node [above] {$\epsilon$s} (A_01);
\path[->] (t_3) edge (A_01);
\path[->] (t_1) edge (A_02);
\path[->] (t_2) edge (A_02);
-\path[->] (t_3) edge node [below] {$\epsilon$} (A_02);
+\path[->] (t_3) edge node [below] {$\epsilon$s} (A_02);
\begin{pgfonlayer}{background}
\node (3) [rounded corners, inner sep=1mm, thick,
@@ -1077,14 +1077,14 @@
\noindent
You might be able to spend some quality tinkering with this code and
-time to ponder. Then you will probably notice it is actually
-silly. The whole point of translating the NFA into a DFA via the
+time to ponder about it. Then you will probably notice it is actually
+a bit silly. The whole point of translating the NFA into a DFA via the
subset construction is to make the decision of whether a string is
accepted or not faster. Given the code above, the generated DFA will
be exactly as fast, or as slow, as the NFA we started with (actually
it will even be a tiny bit slower). The reason is that we just re-use
-the \texttt{nexts} function from the NFA. This fucntion implements the
-non-deterministic breadth-first search. You might be thinking: That
+the \texttt{nexts} function from the NFA. This function implements the
+non-deterministic breadth-first search. You might be thinking: This
is cheating! \ldots{} Well, not quite as you will see later, but in
terms of speed we still need to work a bit in order to get
sometimes(!) a faster DFA. Let's do this next.
@@ -1242,20 +1242,29 @@
\end{center}
+By the way, we are not bothering with implementing the above
+minimisation algorith: while up to now all the transformations used
+some clever composition of functions, the minimisation algorithm
+cannot be implemented by just composing some functions. For this we
+would require a more concrete representation of the transition
+function (like maps). If we did this, however, then many advantages of
+the functions would be thrown away. So the compromise is to not being
+able to minimise (easily) our DFAs.
+
\subsection*{Brzozowski's Method}
I know tyhis is already a long, long rant: but after all it is a topic
that has been researched for more than 60 years. If you reflect on
-what you have read so far, the story you can take a regular
+what you have read so far, the story is that you can take a regular
expression, translate it via the Thompson Construction into an
$\epsilon$NFA, then translate it into a NFA by removing all
$\epsilon$-transitions, and then via the subset construction obtain a
DFA. In all steps we made sure the language, or which strings can be
recognised, stays the same. After the last section, we can even
-minimise the DFA. But again we made sure the same language is
-recognised. You might be wondering: Can we go into the other
-direction? Can we go from a DFA and obtain a regular expression that
-can recognise the same language as the DFA?\medskip
+minimise the DFA (maybe not in code). But again we made sure the same
+language is recognised. You might be wondering: Can we go into the
+other direction? Can we go from a DFA and obtain a regular expression
+that can recognise the same language as the DFA?\medskip
\noindent
The answer is yes. Again there are several methods for calculating a
@@ -1508,6 +1517,19 @@
that would lead us too far afield for what we want to do in this
module.
+
+\subsection*{Where Have Derivatives Gone?}
+
+By now you are probably fed up by this text. It is now way too long
+for one lecture, but there is still one aspect of the
+automata-connection I like to highlight for you. Perhaps by now you
+are asking yourself: Where have the derivatives gone? Did we just
+forget them? Well, they have a place in the picture of calculating a
+DFA from the regular expression.
+
+To be done
+
+
%\section*{Further Reading}
%Compare what a ``human expert'' would create as an automaton for the
Binary file handouts/ho04.pdf has changed
--- a/handouts/ho04.tex Wed May 10 17:03:21 2017 +0100
+++ b/handouts/ho04.tex Sun May 21 00:43:02 2017 +0100
@@ -582,6 +582,7 @@
\end{figure}
\begin{figure}[p]
+\small
\lstinputlisting[numbers=left,linebackgroundcolor=
{\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
--- a/handouts/ho08.tex Wed May 10 17:03:21 2017 +0100
+++ b/handouts/ho08.tex Sun May 21 00:43:02 2017 +0100
@@ -8,6 +8,10 @@
% modern compilers are different
% https://channel9.msdn.com/Blogs/Seth-Juarez/Anders-Hejlsberg-on-Modern-Compiler-Construction
+%halting problem
+%https://dfilaretti.github.io/2017-04-30/halting-problem
+
+
\begin{document}
\lstset{morekeywords={def,if,then,else,write,read},keywordstyle=\color{codepurple}\bfseries}
--- a/progs/cw1.scala Wed May 10 17:03:21 2017 +0100
+++ b/progs/cw1.scala Sun May 21 00:43:02 2017 +0100
@@ -1,137 +1,90 @@
+// CW 1
import scala.language.implicitConversions
import scala.language.reflectiveCalls
-abstract class Rexp {
- def simp : Rexp = this
-}
+abstract class Rexp
+case object ZERO extends Rexp // matches nothing
+case object ONE extends Rexp // matches the empty string
+case class CHAR(c: Char) extends Rexp // matches a character c
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp // alternative
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp // sequence
+case class STAR(r: Rexp) extends Rexp // star
+case class RANGE(cs: Set[Char]) extends Rexp // set of characters
+case class PLUS(r: Rexp) extends Rexp // plus
+case class OPT(r: Rexp) extends Rexp // optional
+case class NTIMES(r: Rexp, n: Int) extends Rexp // n-times
+case class UPNTIMES(r: Rexp, n: Int) extends Rexp // up n-times
+case class FROMNTIMES(r: Rexp, n: Int) extends Rexp // from n-times
+case class NMTIMES(r: Rexp, n: Int, m: Int) extends Rexp // between nm-times
+case class NOT(r: Rexp) extends Rexp // not
+case class CFUN(f: Char => Boolean) extends Rexp // subsuming CHAR and RANGE
-case object NULL extends Rexp
-case object EMPTY extends Rexp
-case class CHAR(c: Char) extends Rexp
-case class ALT(r1: Rexp, r2: Rexp) extends Rexp {
- override def toString = r1.toString + " | " + r2.toString
- override def simp = (r1.simp, r2.simp) match {
- case (NULL, r) => r
- case (r, NULL) => r
- case (r, EMPTY) => if (nullable(r)) r else ALT(r, EMPTY)
- case (EMPTY, r) => if (nullable(r)) r else ALT(r, EMPTY)
- case (r1, r2) => if (r1 == r2) r1 else ALT(r1, r2)
- }
-}
-case class RANGE(cs: List[Char]) extends Rexp {
- //override def toString = cs.mkString("[",",","]")
- override def toString = "[" + cs.head + " .. " + cs.last + "]"
-}
-object RANGE {
- def apply(s: String) : RANGE = RANGE(s.toList)
-}
-case class SEQ(r1: Rexp, r2: Rexp) extends Rexp {
- override def toString = r1.toString + " ~ " + r2.toString
- override def simp = (r1.simp, r2.simp) match {
- case (NULL, _) => NULL
- case (_, NULL) => NULL
- case (EMPTY, r) => r
- case (r, EMPTY) => r
- case (r1, r2) => SEQ(r1, r2)
- }
-}
-case class PLUS(r: Rexp) extends Rexp {
- override def simp = r.simp match {
- case NULL => EMPTY
- case EMPTY => EMPTY
- case r => PLUS(r)
- }
-}
-case class STAR(r: Rexp) extends Rexp {
- override def simp = r.simp match {
- case NULL => EMPTY
- case EMPTY => EMPTY
- case r => STAR(r)
- }
-}
-case class NTIMES(r: Rexp, n: Int) extends Rexp {
- override def simp = if (n == 0) EMPTY else
- r.simp match {
- case NULL => NULL
- case EMPTY => EMPTY
- case r => NTIMES(r, n)
- }
-}
-case class NUPTOM(r: Rexp, n: Int, m: Int) extends Rexp {
- override def simp = if (m == 0) EMPTY else
- r.simp match {
- case NULL => NULL
- case EMPTY => EMPTY
- case r => NUPTOM(r, n, m)
- }
-}
-def NMTIMES(r: Rexp, n: Int, m: Int) = {
- if(m < n) throw new IllegalArgumentException("the number m cannot be smaller than n.")
- else NUPTOM(r, n, m - n)
-}
-
-
-case class NOT(r: Rexp) extends Rexp {
- override def simp = NOT(r.simp)
-}
-case class OPT(r: Rexp) extends Rexp {
- override def simp = OPT(r.simp)
-}
// nullable function: tests whether the regular
// expression can recognise the empty string
def nullable (r: Rexp) : Boolean = r match {
- case NULL => false
- case EMPTY => true
+ case ZERO => false
+ case ONE => true
case CHAR(_) => false
case ALT(r1, r2) => nullable(r1) || nullable(r2)
case SEQ(r1, r2) => nullable(r1) && nullable(r2)
case STAR(_) => true
+ case RANGE(_) => false
case PLUS(r) => nullable(r)
- case NTIMES(r, i) => if (i == 0) true else nullable(r)
- case NUPTOM(r, i, j) => if (i == 0) true else nullable(r)
- case RANGE(_) => false
- case NOT(r) => !(nullable(r))
case OPT(_) => true
-}
-
-def zeroable (r: Rexp) : Boolean = r match {
- case NULL => true
- case EMPTY => false
- case CHAR(_) => false
- case ALT(r1, r2) => zeroable(r1) && zeroable(r2)
- case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)
- case STAR(_) => false
- case PLUS(r) => zeroable(r)
- case NTIMES(r, i) => if (i == 0) false else zeroable(r)
- case NUPTOM(r, i, j) => if (i == 0) false else zeroable(r)
- case RANGE(_) => false
- case NOT(r) => !(zeroable(r))
- case OPT(_) => false
+ case NTIMES(r, n) => if (n == 0) true else nullable(r)
+ case UPNTIMES(_, _) => true
+ case FROMNTIMES(r, n) => if (n == 0) true else nullable(r)
+ case NMTIMES(r, n, m) => if (n == 0) true else nullable(r)
+ case NOT(r) => !(nullable(r))
+ case CFUN(_) => false
}
// derivative of a regular expression w.r.t. a character
def der (c: Char, r: Rexp) : Rexp = r match {
- case NULL => NULL
- case EMPTY => NULL
- case CHAR(d) => if (c == d) EMPTY else NULL
+ case ZERO => ZERO
+ case ONE => ZERO
+ case CHAR(d) => if (c == d) ONE else ZERO
case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
case SEQ(r1, r2) =>
if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
else SEQ(der(c, r1), r2)
case STAR(r) => SEQ(der(c, r), STAR(r))
+ case RANGE(cs) => if (cs contains c) ONE else ZERO
case PLUS(r) => SEQ(der(c, r), STAR(r))
- case NTIMES(r, i) =>
- if (i == 0) NULL else der(c, SEQ(r, NTIMES(r, i - 1)))
- case NUPTOM(r, i, j) =>
- if (i == 0 && j == 0) NULL else
- if (i == 0) ALT(der(c, NTIMES(r, j)), der(c, NUPTOM(r, 0, j - 1)))
- else der(c, SEQ(r, NUPTOM(r, i - 1, j)))
- case RANGE(cs) => if (cs contains c) EMPTY else NULL
+ case OPT(r) => der(c, r)
+ case NTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), NTIMES(r, n - 1))
+ case UPNTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), UPNTIMES(r, n - 1))
+ case FROMNTIMES(r, n) =>
+ if (n == 0) ZERO else SEQ(der(c, r), FROMNTIMES(r, n - 1))
+ case NMTIMES(r, n, m) =>
+ if (m < n) ZERO else
+ if (n == 0 && m == 0) ZERO else
+ if (n == 0) SEQ(der(c, r), UPNTIMES(r, m - 1))
+ else SEQ(der(c, r), NMTIMES(r, n - 1, m - 1))
case NOT(r) => NOT(der (c, r))
- case OPT(r) => der(c, r)
+ case CFUN(f) => if (f(c)) ONE else ZERO
}
+def simp(r: Rexp) : Rexp = r match {
+ case ALT(r1, r2) => (simp(r1), simp(r2)) match {
+ case (ZERO, r2s) => r2s
+ case (r1s, ZERO) => r1s
+ case (r1s, r2s) => if (r1s == r2s) r1s else ALT (r1s, r2s)
+ }
+ case SEQ(r1, r2) => (simp(r1), simp(r2)) match {
+ case (ZERO, _) => ZERO
+ case (_, ZERO) => ZERO
+ case (ONE, r2s) => r2s
+ case (r1s, ONE) => r1s
+ case (r1s, r2s) => SEQ(r1s, r2s)
+ }
+ case r => r
+}
+
+
// derivative w.r.t. a string (iterates der)
def ders (s: List[Char], r: Rexp) : Rexp = s match {
case Nil => r
@@ -140,15 +93,18 @@
def ders_simp (s: List[Char], r: Rexp) : Rexp = s match {
case Nil => r
- case c::s => ders_simp(s, der(c, r).simp)
+ case c::s => ders_simp(s, simp(der(c, r)))
}
-// main matcher function
-def matcher(r: Rexp, s: String) : Boolean = nullable(ders_simp(s.toList, r))
+// main matcher function including simplification
+def matcher(r: Rexp, s: String) : Boolean =
+ nullable(ders_simp(s.toList, r))
+
+
// some convenience for typing in regular expressions
def charlist2rexp(s : List[Char]) : Rexp = s match {
- case Nil => EMPTY
+ case Nil => ONE
case c::Nil => CHAR(c)
case c::s => SEQ(CHAR(c), charlist2rexp(s))
}
@@ -168,12 +124,29 @@
def ~ (r: String) = SEQ(s, r)
}
+val r1 = NTIMES("a", 3)
+val r2 = NTIMES(OPT("a"), 3)
+val r3 = UPNTIMES("a", 3)
+val r4 = UPNTIMES(OPT("a"), 3)
+val r5 = NMTIMES("a", 3, 5)
+val r6 = NMTIMES(OPT("a"), 3, 5)
+
+val rs = List(r1, r2, r3, r4, r5, r6)
+
+rs.map(matcher(_, ""))
+rs.map(matcher(_, "a"))
+rs.map(matcher(_, "aa"))
+rs.map(matcher(_, "aaa"))
+rs.map(matcher(_, "aaaa"))
+rs.map(matcher(_, "aaaaa"))
+rs.map(matcher(_, "aaaaaa"))
+
println("EMAIL:")
-val LOWERCASE = "abcdefghijklmnopqrstuvwxyz"
-val DIGITS = "0123456789"
-val EMAIL = PLUS(RANGE(LOWERCASE + DIGITS + "_" + "." + "-")) ~ "@" ~
- PLUS(RANGE(LOWERCASE + DIGITS + "." + "-")) ~ "." ~
- NMTIMES(RANGE(LOWERCASE + "."), 2, 6)
+val LOWERCASE = ('a' to 'z').toSet
+val DIGITS = ('0' to '9').toSet
+val EMAIL = { PLUS(CFUN(LOWERCASE | DIGITS | ("_.-").toSet)) ~ "@" ~
+ PLUS(CFUN(LOWERCASE | DIGITS | (".-").toSet)) ~ "." ~
+ NMTIMES(CFUN(LOWERCASE | Set('.')), 2, 6) }
val my_email = "christian.urban@kcl.ac.uk"
@@ -182,7 +155,7 @@
println(ders_simp(my_email.toList, EMAIL))
println("COMMENTS:")
-val ALL = RANGE(LOWERCASE + " ")
+val ALL = CFUN((c:Char) => true)
val COMMENT = "/*" ~ (NOT(ALL.% ~ "*/" ~ ALL.%)) ~ "*/"
println(matcher(COMMENT, "/**/"))