handouts/ho02.tex
changeset 258 1e4da6d2490c
parent 251 5b5a68df6d16
child 259 e5f4b8ff23b8
equal deleted inserted replaced
257:70c307641d05 258:1e4da6d2490c
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 \section*{Handout 2}
     8 \section*{Handout 2}
     9 
     9 
    10 Having specified what problem our matching algorithm,
    10 Having specified what problem our matching algorithm, \pcode{match},
    11 \pcode{match}, is supposed to solve, namely for a given
    11 is supposed to solve, namely for a given regular expression $r$ and
    12 regular expression $r$ and string $s$ answer \textit{true} if
    12 string $s$ answer \textit{true} if and only if
    13 and only if
       
    14 
    13 
    15 \[
    14 \[
    16 s \in L(r)
    15 s \in L(r)
    17 \]
    16 \]
    18 
    17 
    19 \noindent we can look at an algorithm to solve this problem.
    18 \noindent we can look at an algorithm to solve this problem.
    20 Clearly we cannot use the function $L$ directly for this,
    19 Clearly we cannot use the function $L$ directly for this,
    21 because in general the set of strings $L$ returns is infinite
    20 because in general the set of strings $L$ returns is infinite
    22 (recall what $L(a^*)$ is). In such cases there is no way we
    21 (recall what $L(a^*)$ is). In such cases there is no way we
    23 can implement an exhaustive test for whether a string is
    22 can implement an exhaustive test for whether a string is
    24 member of this set or not.
    23 member of this set or not. Before we come to the matching 
    25 
    24 algorithm, lets have a closer look at what it means when
    26 The algorithm we will define below consists of two parts. One is the function $nullable$ which takes a
    25 two regular expressions are equivalent.
    27 regular expression as argument and decides whether it can match the empty string (this means it returns a 
    26 
       
    27 \subsection*{Regular Expression Equivalences}
       
    28 
       
    29 
       
    30 \subsection*{Matching Algorithm}
       
    31 
       
    32 The algorithm we will define below consists of two parts. One is the
       
    33 function $nullable$ which takes a regular expression as argument and
       
    34 decides whether it can match the empty string (this means it returns a
    28 boolean). This can be easily defined recursively as follows:
    35 boolean). This can be easily defined recursively as follows:
    29 
    36 
    30 \begin{center}
    37 \begin{center}
    31 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
    38 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
    32 $nullable(\varnothing)$      & $\dn$ & $f\!\/alse$\\
    39 $nullable(\varnothing)$      & $\dn$ & $f\!\/alse$\\
    44 \[
    51 \[
    45 nullable(r) \;\;\text{if and only if}\;\; ""\in L(r)
    52 nullable(r) \;\;\text{if and only if}\;\; ""\in L(r)
    46 \]
    53 \]
    47 
    54 
    48 \noindent
    55 \noindent
    49 Note on the left-hand side we have a function we can implement; on the right we have its specification. 
    56 Note on the left-hand side we have a function we can implement; on the
    50 
    57 right we have its specification.
    51 The other function of our matching algorithm calculates a \emph{derivative} of a regular expression. This is a function
    58 
    52 which will take a regular expression, say $r$, and a character, say $c$, as argument and return 
    59 The other function of our matching algorithm calculates a
    53 a new regular expression. Be careful that the intuition behind this function is not so easy to grasp on first
    60 \emph{derivative} of a regular expression. This is a function which
    54 reading. Essentially this function solves the following problem: if $r$ can match a string of the form
    61 will take a regular expression, say $r$, and a character, say $c$, as
    55 $c\!::\!s$, what does the regular expression look like that can match just $s$. The definition of this
    62 argument and return a new regular expression. Be careful that the
       
    63 intuition behind this function is not so easy to grasp on first
       
    64 reading. Essentially this function solves the following problem: if
       
    65 $r$ can match a string of the form $c\!::\!s$, what does the regular
       
    66 expression look like that can match just $s$. The definition of this
    56 function is as follows:
    67 function is as follows:
    57 
    68 
    58 \begin{center}
    69 \begin{center}
    59 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
    70 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
    60   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$ & \\
    71   $der\, c\, (\varnothing)$      & $\dn$ & $\varnothing$ & \\
    67   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$ &
    78   $der\, c\, (r^*)$          & $\dn$ & $(der\,c\,r) \cdot (r^*)$ &
    68   \end{tabular}
    79   \end{tabular}
    69 \end{center}
    80 \end{center}
    70 
    81 
    71 \noindent
    82 \noindent
    72 The first two clauses can be rationalised as follows: recall that $der$ should calculate a regular
    83 The first two clauses can be rationalised as follows: recall that
    73 expression, if the ``input'' regular expression can match a string of the form $c\!::\!s$. Since neither
    84 $der$ should calculate a regular expression, if the ``input'' regular
    74 $\varnothing$ nor $\epsilon$ can match such a string we return $\varnothing$. In the third case
    85 expression can match a string of the form $c\!::\!s$. Since neither
    75 we have to make a case-distinction: In case the regular expression is $c$, then clearly it can recognise
    86 $\varnothing$ nor $\epsilon$ can match such a string we return
    76 a string of the form $c\!::\!s$, just that $s$ is the empty string. Therefore we return the $\epsilon$-regular 
    87 $\varnothing$. In the third case we have to make a case-distinction:
    77 expression. In the other case we again return $\varnothing$ since no string of the $c\!::\!s$ can be matched.
    88 In case the regular expression is $c$, then clearly it can recognise a
    78 The $+$-case is relatively straightforward: all strings of the form $c\!::\!s$ are either matched by the
    89 string of the form $c\!::\!s$, just that $s$ is the empty
    79 regular expression $r_1$ or $r_2$. So we just have to recursively call $der$ with these two regular
    90 string. Therefore we return the $\epsilon$-regular expression. In the
    80 expressions and compose the results again with $+$. The $\cdot$-case is more complicated:
    91 other case we again return $\varnothing$ since no string of the
    81 if $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first part must be matched by $r_1$.
    92 $c\!::\!s$ can be matched.  The $+$-case is relatively
    82 Consequently, it makes sense to construct the regular expression for $s$ by calling $der$ with $r_1$ and
    93 straightforward: all strings of the form $c\!::\!s$ are either matched
    83 ``appending'' $r_2$. There is however one exception to this simple rule: if $r_1$ can match the empty
    94 by the regular expression $r_1$ or $r_2$. So we just have to
    84 string, then all of $c\!::\!s$ is matched by $r_2$. So in case $r_1$ is nullable (that is can match the
    95 recursively call $der$ with these two regular expressions and compose
    85 empty string) we have to allow the choice $der\,c\,r_2$ for calculating the regular expression that can match 
    96 the results again with $+$. The $\cdot$-case is more complicated: if
    86 $s$. The $*$-case is again simple: if $r^*$ matches a string of the form $c\!::\!s$, then the first part must be
    97 $r_1\cdot r_2$ matches a string of the form $c\!::\!s$, then the first
    87 ``matched'' by a single copy of $r$. Therefore we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to 
    98 part must be matched by $r_1$.  Consequently, it makes sense to
    88 match the rest of $s$.
    99 construct the regular expression for $s$ by calling $der$ with $r_1$
    89 
   100 and ``appending'' $r_2$. There is however one exception to this simple
    90 Another way to rationalise the definition of $der$ is to consider the following operation on sets:
   101 rule: if $r_1$ can match the empty string, then all of $c\!::\!s$ is
       
   102 matched by $r_2$. So in case $r_1$ is nullable (that is can match the
       
   103 empty string) we have to allow the choice $der\,c\,r_2$ for
       
   104 calculating the regular expression that can match $s$. The $*$-case is
       
   105 again simple: if $r^*$ matches a string of the form $c\!::\!s$, then
       
   106 the first part must be ``matched'' by a single copy of $r$. Therefore
       
   107 we call recursively $der\,c\,r$ and ``append'' $r^*$ in order to match
       
   108 the rest of $s$.
       
   109 
       
   110 Another way to rationalise the definition of $der$ is to consider the
       
   111 following operation on sets:
    91 
   112 
    92 \[
   113 \[
    93 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
   114 Der\,c\,A\;\dn\;\{s\,|\,c\!::\!s \in A\}
    94 \]
   115 \]
    95 
   116 
    96 \noindent
   117 \noindent
    97 which essentially transforms a set of strings $A$ by filtering out all strings that do not start with $c$ and then
   118 which essentially transforms a set of strings $A$ by filtering out all
    98 strips off the $c$ from all the remaining strings. For example suppose $A = \{"f\!oo", "bar", "f\!rak"\}$ then
   119 strings that do not start with $c$ and then strips off the $c$ from
       
   120 all the remaining strings. For example suppose $A = \{"f\!oo", "bar",
       
   121 "f\!rak"\}$ then
    99 \[
   122 \[
   100 Der\,f\,A = \{"oo", "rak"\}\quad,\quad
   123 Der\,f\,A = \{"oo", "rak"\}\quad,\quad
   101 Der\,b\,A = \{"ar"\}  \quad \text{and} \quad
   124 Der\,b\,A = \{"ar"\}  \quad \text{and} \quad
   102 Der\,a\,A = \varnothing
   125 Der\,a\,A = \varnothing
   103 \]
   126 \]
   104  
   127  
   105 \noindent
   128 \noindent
   106 Note that in the last case $Der$ is empty, because no string in $A$ starts with $a$. With this operation we can
   129 Note that in the last case $Der$ is empty, because no string in $A$
   107 state the following property about $der$:
   130 starts with $a$. With this operation we can state the following
       
   131 property about $der$:
   108 
   132 
   109 \[
   133 \[
   110 L(der\,c\,r) = Der\,c\,(L(r))
   134 L(der\,c\,r) = Der\,c\,(L(r))
   111 \]
   135 \]
   112 
   136 
   113 \noindent
   137 \noindent
   114 This property clarifies what regular expression $der$ calculates, namely take the set of strings
   138 This property clarifies what regular expression $der$ calculates,
   115 that $r$ can match (that is $L(r)$), filter out all strings not starting with $c$ and strip off the $c$ from the
   139 namely take the set of strings that $r$ can match (that is $L(r)$),
   116 remaining strings---this is exactly the language that $der\,c\,r$ can match.
   140 filter out all strings not starting with $c$ and strip off the $c$
   117 
   141 from the remaining strings---this is exactly the language that
   118 If we want to find out whether the string $"abc"$ is matched by the regular expression $r$
   142 $der\,c\,r$ can match.
   119 then we can iteratively apply $Der$ as follows
   143 
       
   144 If we want to find out whether the string $"abc"$ is matched by the
       
   145 regular expression $r$ then we can iteratively apply $Der$ as follows
   120 
   146 
   121 \begin{enumerate}
   147 \begin{enumerate}
   122 \item $Der\,a\,(L(r))$
   148 \item $Der\,a\,(L(r))$
   123 \item $Der\,b\,(Der\,a\,(L(r)))$
   149 \item $Der\,b\,(Der\,a\,(L(r)))$
   124 \item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
   150 \item $Der\,c\,(Der\,b\,(Der\,a\,(L(r))))$
   125 \end{enumerate}
   151 \end{enumerate}
   126 
   152 
   127 \noindent
   153 \noindent
   128 In the last step we need to test whether the empty string is in the set. Our matching algorithm will work similarly, 
   154 In the last step we need to test whether the empty string is in the
   129 just using regular expression instead of sets. For this we need to lift the notion of derivatives from characters to strings. This can be
   155 set. Our matching algorithm will work similarly, just using regular
   130 done using the following function, taking a string and regular expression as input and a regular expression 
   156 expression instead of sets. For this we need to lift the notion of
   131 as output.
   157 derivatives from characters to strings. This can be done using the
       
   158 following function, taking a string and regular expression as input
       
   159 and a regular expression as output.
   132 
   160 
   133 \begin{center}
   161 \begin{center}
   134 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
   162 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {\hspace{-10mm}}l@ {}}
   135   $der\!s\, []\, r$     & $\dn$ & $r$ & \\
   163   $der\!s\, []\, r$     & $\dn$ & $r$ & \\
   136   $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\
   164   $der\!s\, (c\!::\!s)\, r$ & $\dn$ & $der\!s\,s\,(der\,c\,r)$ & \\
   150 \[
   178 \[
   151 match\,s\,r\quad\text{if and only if}\quad s\in L(r)
   179 match\,s\,r\quad\text{if and only if}\quad s\in L(r)
   152 \]
   180 \]
   153 
   181 
   154 \noindent
   182 \noindent
   155 holds, which means our algorithm satisfies the specification. This algorithm was introduced by
   183 holds, which means our algorithm satisfies the specification. This
   156 Janus Brzozowski in 1964. Its main attractions are simplicity and being fast, as well as
   184 algorithm was introduced by Janus Brzozowski in 1964. Its main
   157 being easily extendable for other regular expressions such as $r^{\{n\}}$, $r^?$, $\sim{}r$ and so on.
   185 attractions are simplicity and being fast, as well as being easily
       
   186 extendable for other regular expressions such as $r^{\{n\}}$, $r^?$,
       
   187 $\sim{}r$ and so on.
       
   188 
       
   189 \subsection*{The Matching Algorithm in Scala}
       
   190 
       
   191 
   158 
   192 
   159 \begin{figure}[p]
   193 \begin{figure}[p]
   160 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}}
   194 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app5.scala}}}
   161 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}}
   195 {\lstset{language=Scala}\texttt{\lstinputlisting{../progs/app6.scala}}}
   162 \caption{Scala implementation of the nullable and derivatives functions.}
   196 \caption{Scala implementation of the nullable and derivatives functions.}