diff -r 70c307641d05 -r 1e4da6d2490c handouts/ho02.tex --- a/handouts/ho02.tex Mon Sep 22 13:42:14 2014 +0100 +++ b/handouts/ho02.tex Fri Sep 26 14:06:55 2014 +0100 @@ -7,10 +7,9 @@ \section*{Handout 2} -Having specified what problem our matching algorithm, -\pcode{match}, is supposed to solve, namely for a given -regular expression $r$ and string $s$ answer \textit{true} if -and only if +Having specified what problem our matching algorithm, \pcode{match}, +is supposed to solve, namely for a given regular expression $r$ and +string $s$ answer \textit{true} if and only if \[ s \in L(r) @@ -21,10 +20,18 @@ because in general the set of strings $L$ returns is infinite (recall what $L(a^*)$ is). In such cases there is no way we can implement an exhaustive test for whether a string is -member of this set or not. +member of this set or not. Before we come to the matching +algorithm, lets have a closer look at what it means when +two regular expressions are equivalent. + +\subsection*{Regular Expression Equivalences} -The algorithm we will define below consists of two parts. One is the function $nullable$ which takes a -regular expression as argument and decides whether it can match the empty string (this means it returns a + +\subsection*{Matching Algorithm} + +The algorithm we will define below consists of two parts. One is the +function $nullable$ which takes a regular expression as argument and +decides whether it can match the empty string (this means it returns a boolean). This can be easily defined recursively as follows: \begin{center} @@ -46,13 +53,17 @@ \] \noindent -Note on the left-hand side we have a function we can implement; on the right we have its specification. +Note on the left-hand side we have a function we can implement; on the +right we have its specification. -The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a function -which will take a regular expression, say $r$, and a character, say $c$, as argument and return -a new regular expression. Be careful that the intuition behind this function is not so easy to grasp on first -reading. Essentially this function solves the following problem: if $r$ can match a string of the form -$c\!::\!s$, what does the regular expression look like that can match just $s$. The definition of this +The other function of our matching algorithm calculates a +\emph{derivative} of a regular expression. This is a function which +will take a regular expression, say $r$, and a character, say $c$, as +argument and return a new regular expression. Be careful that the +intuition behind this function is not so easy to grasp on first +reading. Essentially this function solves the following problem: if +$r$ can match a string of the form $c\!::\!s$, what does the regular +expression look like that can match just $s$. The definition of this function is as follows: \begin{center} @@ -69,33 +80,45 @@ \end{center} \noindent -The first two clauses can be rationalised as follows: recall that $der$ should calculate a regular -expression, if the ``input'' regular expression can match a string of the form $c\!::\!s$. Since neither -$\varnothing$ nor $\epsilon$ can match such a string we return $\varnothing$. In the third case -we have to make a case-distinction: In case the regular expression is $c$, then clearly it can recognise -a string of the form $c\!::\!s$, just that $s$ is the empty string. Therefore we return the $\epsilon$-regular -expression. In the other case we again return $\varnothing$ since no string of the $c\!::\!s$ can be matched. -The $+$-case is relatively straightforward: all strings of the form $c\!::\!s$ are either matched by the -regular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regular -expressions and compose the results again with $+$. The $\cdot$-case is more complicated: -if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part must be matched by $r_1$. -Consequently, it makes sense to construct the regular expression for $s$ by calling $der$ with $r_1$ and -``appending'' $r_2$. There is however one exception to this simple rule: if $r_1$ can match the empty -string, then all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is nullable (that is can match the -empty string) we have to allow the choice $der\,c\,r_2$ for calculating the regular expression that can match -$s$. The $*$-case is again simple: if $r^*$ matches a string of the form $c\!::\!s$, then the first part must be -``matched'' by a single copy of $r$. Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to -match the rest of $s$. +The first two clauses can be rationalised as follows: recall that +$der$ should calculate a regular expression, if the ``input'' regular +expression can match a string of the form $c\!::\!s$. Since neither +$\varnothing$ nor $\epsilon$ can match such a string we return +$\varnothing$. In the third case we have to make a case-distinction: +In case the regular expression is $c$, then clearly it can recognise a +string of the form $c\!::\!s$, just that $s$ is the empty +string. Therefore we return the $\epsilon$-regular expression. In the +other case we again return $\varnothing$ since no string of the +$c\!::\!s$ can be matched. The $+$-case is relatively +straightforward: all strings of the form $c\!::\!s$ are either matched +by the regular expression $r_1$ or $r_2$. So we just have to +recursively call $der$ with these two regular expressions and compose +the results again with $+$. The $\cdot$-case is more complicated: if +$r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first +part must be matched by $r_1$. Consequently, it makes sense to +construct the regular expression for $s$ by calling $der$ with $r_1$ +and ``appending'' $r_2$. There is however one exception to this simple +rule: if $r_1$ can match the empty string, then all of $c\!::\!s$ is +matched by $r_2$. So in case $r_1$ is nullable (that is can match the +empty string) we have to allow the choice $der\,c\,r_2$ for +calculating the regular expression that can match $s$. The $*$-case is +again simple: if $r^*$ matches a string of the form $c\!::\!s$, then +the first part must be ``matched'' by a single copy of $r$. Therefore +we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match +the rest of $s$. -Another way to rationalise the definition of $der$ is to consider the following operation on sets: +Another way to rationalise the definition of $der$ is to consider the +following operation on sets: \[ Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\} \] \noindent -which essentially transforms a set of strings $A$ by filtering out all strings that do not start with $c$ and then -strips off the $c$ from all the remaining strings. For example suppose $A = \{"f\!oo", "bar", "f\!rak"\}$ then +which essentially transforms a set of strings $A$ by filtering out all +strings that do not start with $c$ and then strips off the $c$ from +all the remaining strings. For example suppose $A = \{"f\!oo", "bar", +"f\!rak"\}$ then \[ Der\,f\,A = \{"oo", "rak"\}\quad,\quad Der\,b\,A = \{"ar"\} \quad \text{and} \quad @@ -103,20 +126,23 @@ \] \noindent -Note that in the last case $Der$ is empty, because no string in $A$ starts with $a$. With this operation we can -state the following property about $der$: +Note that in the last case $Der$ is empty, because no string in $A$ +starts with $a$. With this operation we can state the following +property about $der$: \[ L(der\,c\,r) = Der\,c\,(L(r)) \] \noindent -This property clarifies what regular expression $der$ calculates, namely take the set of strings -that $r$ can match (that is $L(r)$), filter out all strings not starting with $c$ and strip off the $c$ from the -remaining strings---this is exactly the language that $der\,c\,r$ can match. +This property clarifies what regular expression $der$ calculates, +namely take the set of strings that $r$ can match (that is $L(r)$), +filter out all strings not starting with $c$ and strip off the $c$ +from the remaining strings---this is exactly the language that +$der\,c\,r$ can match. -If we want to find out whether the string $"abc"$ is matched by the regular expression $r$ -then we can iteratively apply $Der$ as follows +If we want to find out whether the string $"abc"$ is matched by the +regular expression $r$ then we can iteratively apply $Der$ as follows \begin{enumerate} \item $Der\,a\,(L(r))$ @@ -125,10 +151,12 @@ \end{enumerate} \noindent -In the last step we need to test whether the empty string is in the set. Our matching algorithm will work similarly, -just using regular expression instead of sets. For this we need to lift the notion of derivatives from characters to strings. This can be -done using the following function, taking a string and regular expression as input and a regular expression -as output. +In the last step we need to test whether the empty string is in the +set. Our matching algorithm will work similarly, just using regular +expression instead of sets. For this we need to lift the notion of +derivatives from characters to strings. This can be done using the +following function, taking a string and regular expression as input +and a regular expression as output. \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}} @@ -152,9 +180,15 @@ \] \noindent -holds, which means our algorithm satisfies the specification. This algorithm was introduced by -Janus Brzozowski in 1964. Its main attractions are simplicity and being fast, as well as -being easily extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on. +holds, which means our algorithm satisfies the specification. This +algorithm was introduced by Janus Brzozowski in 1964. Its main +attractions are simplicity and being fast, as well as being easily +extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, +$\sim{}r$ and so on. + +\subsection*{The Matching Algorithm in Scala} + + \begin{figure}[p] {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}}