handouts/ho04.tex
changeset 669 2f5a4d76756d
parent 643 08375ca3874e
child 671 83e38043ed78
equal deleted inserted replaced
668:9ce78065f68d 669:2f5a4d76756d
       
     1 % !TEX program = xelatex
     1 \documentclass{article}
     2 \documentclass{article}
     2 \usepackage{../style}
     3 \usepackage{../style}
     3 \usepackage{../langs}
     4 \usepackage{../langs}
     4 \usepackage{../graphics}
     5 \usepackage{../graphics}
     5 
     6 
     6 
     7 
     7 \begin{document}
     8 \begin{document}
     8 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017}
     9 \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
     9 
    10 
    10 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
    11 \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
    11 
    12 
    12 So far our algorithm based on derivatives was only able to say
    13 So far our algorithm based on derivatives was only able to say
    13 yes or no depending on whether a string was matched by a regular
    14 yes or no depending on whether a string was matched by a regular
    21 provided on KEATS, in case you are interested.\footnote{In my
    22 provided on KEATS, in case you are interested.\footnote{In my
    22 humble opinion this is an interesting instance of the research
    23 humble opinion this is an interesting instance of the research
    23 literature: it contains a very neat idea, but its presentation
    24 literature: it contains a very neat idea, but its presentation
    24 is rather sloppy. In earlier versions of this paper, a King's
    25 is rather sloppy. In earlier versions of this paper, a King's
    25 undergraduate student and I found several rather annoying typos in the
    26 undergraduate student and I found several rather annoying typos in the
    26 examples and definitions.} My PhD student Fahad Ausaf and I even more recently 
    27 examples and definitions.} My former PhD student Fahad Ausaf and I even more recently 
    27 wrote a paper where we build on their result: we provided a 
    28 wrote a paper where we build on their result: we provided a 
    28 mathematical proof that their algorithm is really correct---the proof 
    29 mathematical proof that their algorithm is really correct---the proof 
    29 Sulzmann \& Lu had originally given contained major flaws. Such correctness
    30 Sulzmann \& Lu had originally given contained major flaws. Such correctness
    30 proofs are important: Kuklewicz maintains a unit-test library
    31 proofs are important: Kuklewicz maintains a unit-test library
    31 for the kind of algorithms we are interested in here and he showed 
    32 for the kind of algorithms we are interested in here and he showed 
    35 \begin{center}
    36 \begin{center}
    36 \url{http://www.haskell.org/haskellwiki/Regex_Posix}
    37 \url{http://www.haskell.org/haskellwiki/Regex_Posix}
    37 \end{center}
    38 \end{center}
    38 
    39 
    39 
    40 
    40 In order to give an answer for
    41 Coming back to the algorithm by Sulzmann \& Lu, their idea is to
    41 \emph{how} a regular expression matches a string, Sulzmann and
    42 introduce \emph{values} for producing an answer for \emph{how} a regular
    42 Lu use \emph{values}. A value will be the output of the
    43 expression matches a string. So rather than a boolean like in the
    43 algorithm whenever the regular expression matches the string.
    44 Brzozowski algorithm, a value will be the output of the Sulzman \& Lu
    44 If the string does not match the string, an error will be
    45 algorithm whenever the regular expression matches the string. If the
    45 raised. 
    46 string does not match the string, an error will be raised. The
    46 The definition for values is given below. They are shown 
    47 definition for values is given below. They are shown together with the
    47 together with the regular expressions $r$ to which
    48 regular expressions $r$ to which they correspond:
    48 they correspond:
       
    49 
    49 
    50 \begin{center}
    50 \begin{center}
    51 \begin{tabular}{cc}
    51 \begin{tabular}{cc}
    52 \begin{tabular}{@{}rrl@{}}
    52 \begin{tabular}{@{}rrl@{}}
    53 \multicolumn{3}{c}{regular expressions}\medskip\\
    53 \multicolumn{3}{c}{regular expressions}\medskip\\
    79 each regular expression with the exception of $r_1 + r_2$
    79 each regular expression with the exception of $r_1 + r_2$
    80 where there are two values, namely $\Left(v)$ and $Right(v)$
    80 where there are two values, namely $\Left(v)$ and $Right(v)$
    81 corresponding to the two alternatives. Note that $r^*$ is
    81 corresponding to the two alternatives. Note that $r^*$ is
    82 associated with a list of values, one for each copy of $r$
    82 associated with a list of values, one for each copy of $r$
    83 that was needed to match the string. This means we might also
    83 that was needed to match the string. This means we might also
    84 return the empty list $Stars []$, if no copy was needed
    84 return the empty list $Stars\,[]$, if no copy was needed
    85 for $r^*$. For sequence, there is exactly one value, composed 
    85 for $r^*$. For sequence, there is exactly one value, composed 
    86 of two component values ($v_1$ and $v_2$).
    86 of two component values ($v_1$ and $v_2$).
    87 
    87 
    88 My implementation of regular expressions and values in Scala is
    88 My implementation of regular expressions and values in Scala is
    89 shown below. I have in my implementation the convention that
    89 shown below. I have in my implementation the convention that
   101 {../progs/app02.scala}}
   101 {../progs/app02.scala}}
   102 
   102 
   103 
   103 
   104 Graphically the algorithm by Sulzmann \& Lu can be illustrated
   104 Graphically the algorithm by Sulzmann \& Lu can be illustrated
   105 by the picture in Figure~\ref{Sulz} where the path from the
   105 by the picture in Figure~\ref{Sulz} where the path from the
   106 left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
   106 left to the right involving $\der/\nullable$ is the first phase
   107 of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
   107 of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
   108 the second phase. This picture shows the steps required when a
   108 the second phase. This picture shows the steps required when a
   109 regular expression, say $r_1$, matches the string $abc$. We
   109 regular expression, say $r_1$, matches the string $abc$. We
   110 first build the three derivatives (according to $a$, $b$ and
   110 first build the three derivatives (according to $a$, $b$ and
   111 $c$). We then use $\textit{nullable}$ to find out whether the resulting
   111 $c$). We then use $\nullable$ to find out whether the resulting
   112 regular expression can match the empty string. If yes, we call
   112 regular expression can match the empty string. If yes, we call
   113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
   113 the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
   114 expression has matched the empty string. Its definition
   114 expression has matched the empty string. Its definition
   115 is as follows:
   115 is as follows:
   116 
   116 
   118 \begin{center}
   118 \begin{center}
   119 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   119 \begin{tikzpicture}[scale=2,node distance=1.2cm,
   120                     every node/.style={minimum size=7mm}]
   120                     every node/.style={minimum size=7mm}]
   121 \node (r1)  {$r_1$};
   121 \node (r1)  {$r_1$};
   122 \node (r2) [right=of r1]{$r_2$};
   122 \node (r2) [right=of r1]{$r_2$};
   123 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
   123 \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
   124 \node (r3) [right=of r2]{$r_3$};
   124 \node (r3) [right=of r2]{$r_3$};
   125 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
   125 \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
   126 \node (r4) [right=of r3]{$r_4$};
   126 \node (r4) [right=of r3]{$r_4$};
   127 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
   127 \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
   128 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
   128 \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
   129 \node (v4) [below=of r4]{$v_4$};
   129 \node (v4) [below=of r4]{$v_4$};
   130 \draw[->,line width=1mm](r4) -- (v4);
   130 \draw[->,line width=1mm](r4) -- (v4);
   131 \node (v3) [left=of v4] {$v_3$};
   131 \node (v3) [left=of v4] {$v_3$};
   132 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
   132 \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
   133 \node (v2) [left=of v3]{$v_2$};
   133 \node (v2) [left=of v3]{$v_2$};
   134 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
   134 \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
   135 \node (v1) [left=of v2] {$v_1$};
   135 \node (v1) [left=of v2] {$v_1$};
   136 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
   136 \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
   137 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
   137 \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
   138 \end{tikzpicture}
   138 \end{tikzpicture}
   139 \end{center}
   139 \end{center}
   140 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   140 \caption{The two phases of the algorithm by Sulzmann \& Lu.
   141 \label{Sulz}}
   141 \label{Sulz}}
   144 
   144 
   145 
   145 
   146 \begin{center}
   146 \begin{center}
   147 \begin{tabular}{lcl}
   147 \begin{tabular}{lcl}
   148   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
   148   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
   149   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
   149   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
   150                           &       & then $\Left(\textit{mkeps}(r_1))$\\
   150                           &       & then $\Left(\textit{mkeps}(r_1))$\\
   151                           &       & else $Right(\textit{mkeps}(r_2))$\\
   151                           &       & else $Right(\textit{mkeps}(r_2))$\\
   152   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
   152   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
   153   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
   153   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
   154 \end{tabular}
   154 \end{tabular}
   163 The second phase of the algorithm is organised so that it will
   163 The second phase of the algorithm is organised so that it will
   164 calculate a value for how the derivative regular expression
   164 calculate a value for how the derivative regular expression
   165 has matched a string. For this we need a function that
   165 has matched a string. For this we need a function that
   166 reverses this ``chopping off'' for values which we did in the
   166 reverses this ``chopping off'' for values which we did in the
   167 first phase for derivatives. The corresponding function is
   167 first phase for derivatives. The corresponding function is
   168 called $\textit{inj}$ (short for injection). This function takes three
   168 called $\inj$ (short for injection). This function takes three
   169 arguments: the first one is a regular expression for which we
   169 arguments: the first one is a regular expression for which we
   170 want to calculate the value, the second is the character we
   170 want to calculate the value, the second is the character we
   171 want to inject and the third argument is the value where we
   171 want to inject and the third argument is the value where we
   172 will inject the character into. The result of this function is a
   172 will inject the character into. The result of this function is a
   173 new value. The definition of $\textit{inj}$ is as follows: 
   173 new value. The definition of $\inj$ is as follows: 
   174 
   174 
   175 \begin{center}
   175 \begin{center}
   176 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   176 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
   177   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   177   $\inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
   178   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
   178   $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\
   179   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
   179   $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\
   180   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   180   $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
   181   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
   181   $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
   182   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
   182   $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
   183   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
   183   $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
   184 \end{tabular}
   184 \end{tabular}
   185 \end{center}
   185 \end{center}
   186 
   186 
   187 \noindent This definition is by recursion on the regular
   187 \noindent This definition is by recursion on the regular
   188 expression and by analysing the shape of the values. Therefore
   188 expression and by analysing the shape of the values. Therefore
   189 there are three cases for sequence regular
   189 there are three cases for sequence regular
   190 expressions (for all possible shapes of the value we can encounter). The last
   190 expressions (for all possible shapes of the value we can encounter). The last
   191 clause for the star regular expression returns a list where
   191 clause for the star regular expression returns a list where
   192 the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
   192 the first element is $\inj\,r\,c\,v$ and the other elements are
   193 $vs$. That means $\_\,::\,\_$ should be read as list cons.
   193 $vs$. That means $\_\,::\,\_$ should be read as list cons.
   194 
   194 
   195 To understand what is going on, it might be best to do some
   195 To understand what is going on, it might be best to do some
   196 example calculations and compare them with Figure~\ref{Sulz}.
   196 example calculations and compare them with Figure~\ref{Sulz}.
   197 For this note that we have not yet dealt with the need of
   197 For this note that we have not yet dealt with the need of
   233 \noindent If there had been a $\ONE$ on the left, then
   233 \noindent If there had been a $\ONE$ on the left, then
   234 $\textit{mkeps}$ would have returned something of the form
   234 $\textit{mkeps}$ would have returned something of the form
   235 $\Left(\ldots)$. The point is that from this value we can
   235 $\Left(\ldots)$. The point is that from this value we can
   236 directly read off which part of $r_4$ matched the empty
   236 directly read off which part of $r_4$ matched the empty
   237 string: take the right-alternative first, and then the
   237 string: take the right-alternative first, and then the
   238 right-alternative again. 
   238 right-alternative again, then you got to the $\ONE$.
   239 
   239 
   240 Next we have to ``inject'' the last character, that is $c$ in
   240 Next we have to ``inject'' the last character, that is $c$ in
   241 the running example, into this value $v_4$ in order to
   241 the running example, into this value $v_4$ in order to
   242 calculate how $r_3$ could have matched the string $c$.
   242 calculate how $r_3$ could have matched the string $c$.
   243 For this we call injection with $\textit{inj}(r_3, c, v_4)$.
   243 For this we call injection with $\inj(r_3, c, v_4)$.
   244 According to the definition of $\textit{inj}$ we obtain
   244 According to the definition of $\inj$ we obtain
   245 
   245 
   246 \begin{center}
   246 \begin{center}
   247 $v_3:\;Right(Seq(Empty, Char(c)))$
   247 $v_3:\;Right(Seq(Empty, Char(c)))$
   248 \end{center}
   248 \end{center}
   249 
   249 
   250 \noindent This is the correct result, because $r_3$ needs
   250 \noindent This is the correct result, because $r_3$ needs
   251 to use the right-hand alternative, and then $\ONE$ needs
   251 to use the right-hand alternative, and then $\ONE$ needs
   252 to match the empty string and $c$ needs to match $c$.
   252 to match the empty string and $c$ needs to match $c$.
   253 Next we need to inject back the letter $b$ into $v_3$.
   253 Next we need to inject back the letter $b$ into $v_3$.
   254 For this we call injection with $\textit{inj}(r_2, b, v_3)$.
   254 For this we call injection with $\inj(r_2, b, v_3)$.
   255 This gives
   255 This gives
   256 
   256 
   257 \begin{center}
   257 \begin{center}
   258 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
   258 $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
   259 \end{center}
   259 \end{center}
   260 
   260 
   261 \noindent which is again the correct result for how $r_2$
   261 \noindent which is again the correct result for how $r_2$
   262 matched the string $bc$. Finally we need to inject back the
   262 matched the string $bc$. Finally we need to inject back the
   263 letter $a$ into $v_2$ giving the final result.
   263 letter $a$ into $v_2$ giving the final result.
   264 For this we call injection with $\textit{inj}(r_1, a, v_2)$
   264 For this we call injection with $\inj(r_1, a, v_2)$
   265 and obtain
   265 and obtain
   266 
   266 
   267 \begin{center}
   267 \begin{center}
   268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   268 $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
   269 \end{center}
   269 \end{center}
   270 
   270 
   271 \noindent This value corresponds to how the regular expression $r_1$,
   271 \noindent This value corresponds to how the regular expression $r_1$,
   272 namely $a \cdot (b \cdot c)$, matched the string $abc$.
   272 namely $a \cdot (b \cdot c)$, matched the string $abc$.
       
   273 
   273 
   274 
   274 There are a few auxiliary functions that are of interest
   275 There are a few auxiliary functions that are of interest
   275 when analysing this algorithm. One is called \emph{flatten},
   276 when analysing this algorithm. One is called \emph{flatten},
   276 written $|\_|$, which extracts the string ``underlying'' a 
   277 written $|\_|$, which extracts the string ``underlying'' a 
   277 value. It is defined recursively as
   278 value. It is defined recursively as
   286   $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   287   $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
   287 \end{tabular}
   288 \end{tabular}
   288 \end{center}
   289 \end{center}
   289 
   290 
   290 \noindent Using flatten we can see what are the strings behind 
   291 \noindent Using flatten we can see what are the strings behind 
   291 the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running 
   292 the values calculated by $\textit{mkeps}$ and $\inj$. In our running 
   292 example:
   293 example:
   293 
   294 
   294 \begin{center}
   295 \begin{center}
   295 \begin{tabular}{ll}
   296 \begin{tabular}{ll}
   296 $|v_4|$: & $[]$\\
   297 $|v_4|$: & $[]$\\
   298 $|v_2|$: & $bc$\\
   299 $|v_2|$: & $bc$\\
   299 $|v_1|$: & $abc$
   300 $|v_1|$: & $abc$
   300 \end{tabular}
   301 \end{tabular}
   301 \end{center}
   302 \end{center}
   302 
   303 
   303 \noindent This indicates that $\textit{inj}$ indeed is injecting, or
   304 \noindent This indicates that $\inj$ indeed is injecting, or
   304 adding, back a character into the value. 
   305 adding, back a character into the value. 
   305 
   306 
   306 There is a problem, however, with the described algorithm
   307 The definition of $\inj$ might still be very puzzling and each clause
   307 so far: it is very slow. We need to include the simplification 
   308 might look arbitrary, but there is in fact a relatively simple idea
   308 from Lecture 2. This is what we shall do next.
   309 behind it. Ultimately the $\inj$-functions is determined by the
       
   310 derivative functions. For this consider one of the ``squares'' from
       
   311 Figure~\ref{Sulz}:
       
   312 
       
   313 
       
   314   \begin{center}
       
   315   \begin{tikzpicture}[scale=2,node distance=1.2cm,
       
   316                       every node/.style={minimum size=7mm}]
       
   317   \node (r)  {$r$};
       
   318   \node (rd) [right=of r]{$r_{der}$};
       
   319   \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
       
   320   \node (vd) [below=of r2]{$v_{der}$};
       
   321   \draw[->,line width=1mm](rd) -- (vd);
       
   322   \node (v) [left=of vd] {$v$};
       
   323   \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
       
   324   \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
       
   325   \end{tikzpicture}
       
   326   \end{center}
       
   327 
       
   328 \noindent
       
   329 The input to the $\inj$-function is $r$ (on the top left), $c$ (the
       
   330 character to be injected) and $v_{der}$ (bottom right). The output is
       
   331 the value $v$ for how the regular expression $r$ matched the
       
   332 corresponding string. How does $\inj$ calculate this value? Well, it has
       
   333 to analyse the value $v_{der}$ and transform it into the value $v$ for
       
   334 the regular expression $r$. The value $v_{der}$ corresponds to the
       
   335 $r_{der}$-regular expression which is the derivative of $r$. Remember
       
   336 that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
       
   337 where $c$ has been chopped off.
       
   338 
       
   339 Let $r$ be $r_1 + r_2$. Then $r_{der}$
       
   340 is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible
       
   341 values corresponding to $r_{der}$? Well, they can be only of the form
       
   342 $\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two
       
   343 cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.
       
   344 
       
   345 \begin{center}
       
   346 \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
       
   347     $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\
       
   348     $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\
       
   349 \end{tabular}
       
   350 \end{center}
       
   351 
       
   352 \noindent
       
   353 Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the
       
   354 derivative of $r$ look like? According to the definition it is: 
       
   355 
       
   356 \begin{center}
       
   357   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   358     $\der\, c\, (r_1 \cdot r_2)$  & $\dn$  &  if $\nullable (r_1)$\\
       
   359     & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ 
       
   360     & & else $(\der\, c\, r_1) \cdot r_2$\\
       
   361   \end{tabular}  
       
   362 \end{center}  
       
   363 
       
   364 \noindent As you can see there is a derivative in the if-branch and another
       
   365 in the else-branch. In the if-branch we have an alternative of the form 
       
   366 $\_ + \_$. We already know what the possible values are for such a regular 
       
   367 expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the
       
   368 two cases:
       
   369 
       
   370 \begin{center}
       
   371   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
       
   372     $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$  & $\ldots$\\
       
   373     $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$  & $\ldots$\\
       
   374 \end{tabular}
       
   375 \end{center}
       
   376 
       
   377 \noindent
       
   378 In the first case we even know that it is not just a $\Left$-value, but 
       
   379 $\Left(Seq(\ldots))$, because the corresponding regular expression 
       
   380 in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know 
       
   381 it must be a $Seq$-value enclosed inside $\Left$. 
       
   382 The third clause for $r_1\cdot r_2$ in the $\inj$-function
       
   383 
       
   384 \begin{center}
       
   385   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   386     $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$  & $\ldots$\\
       
   387   \end{tabular}
       
   388   \end{center}
       
   389 
       
   390 \noindent corresponds to the else-branch in the derivative. In this
       
   391 case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and
       
   392 therefore the value must be of the form $Seq(\ldots)$.
       
   393 
       
   394 Hopefully the $inj$-function makes now more sense. I let you explain
       
   395 the $r^*$ case. What do the derivative of $r^*$  and
       
   396 the corresponding value look like? Does this explain the shape of 
       
   397 the clause?
       
   398 
       
   399 \begin{center}
       
   400   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
       
   401     $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
       
   402   \end{tabular}
       
   403 \end{center}
       
   404 
       
   405 \noindent If yes, you made sense of the left-hand sides of
       
   406 the $\inj$-definition. 
       
   407 
       
   408 How about the right-hand sides? Well, in the
       
   409 $r^*$ case for example we have according to the square in the picture 
       
   410 above a value $v_{der}$ which says how the derivative $r_{der}$
       
   411 matched the string. Since the derivative is of the form
       
   412 $(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the
       
   413 form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value
       
   414 for the original (top left) regular expression --- so it cannot
       
   415 be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the
       
   416 shape for sequence regular expressions. What we need is a value
       
   417 of the form $Stars \ldots$ (remember the correspondence between 
       
   418 values and regular expressions). This suggests to define the right hand 
       
   419 side as
       
   420 
       
   421 \begin{center}
       
   422 $\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$
       
   423 \end{center}  
       
   424 
       
   425 \noindent It is a value of the right shape, namely $Stars$. It injected 
       
   426 $c$ into the first-value, which is in fact the value where we need to 
       
   427 undo the derivative. Remember again the shape of the derivative of $r^*$.
       
   428 In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.
       
   429 Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with
       
   430 the other clauses in $\inj$. 
       
   431 
       
   432 
       
   433 Phew\ldotsSweat\ldots Unfortunately, there is one more problem with the
       
   434 described algorithm so far: it is very slow. We need to include in all
       
   435 this the simplification from Lecture 2. This is what we shall do next.
   309 
   436 
   310 
   437 
   311 \subsubsection*{Simplification}
   438 \subsubsection*{Simplification}
   312 
   439 
   313 Generally the matching algorithms based on derivatives do
   440 Generally the matching algorithms based on derivatives do
   318 require that you do some example calculations yourself. As a
   445 require that you do some example calculations yourself. As a
   319 first example consider the last derivation step in our earlier
   446 first example consider the last derivation step in our earlier
   320 example:
   447 example:
   321 
   448 
   322 \begin{equation}\label{regexfour}
   449 \begin{equation}\label{regexfour}
   323 r_4 = \textit{der}\,c\,r_3 = 
   450 r_4 = \der\,c\,r_3 = 
   324 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
   451 (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
   325 \end{equation}
   452 \end{equation}
   326 
   453 
   327 \noindent Simplifying this regular expression would just give
   454 \noindent Simplifying this regular expression would just give
   328 us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
   455 us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
   599 
   726 
   600 We are now in the position to define a \emph{lexing function} as follows:
   727 We are now in the position to define a \emph{lexing function} as follows:
   601 
   728 
   602 \begin{center}
   729 \begin{center}
   603 \begin{tabular}{lcl}
   730 \begin{tabular}{lcl}
   604 $lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
   731 $lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
   605              &       & else $error$\\
   732              &       & else $error$\\
   606 $lex\,r\,c\!::\!s$ & $\dn$ & let 
   733 $lex\,r\,c\!::\!s$ & $\dn$ & let 
   607    $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
   734    $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
   608 & & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   735 & & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
   609 \end{tabular}
   736 \end{tabular}
   610 \end{center}
   737 \end{center}
   611 
   738 
   612 \noindent This corresponds to the $matches$ function we have
   739 \noindent This corresponds to the $matches$ function we have
   613 seen in earlier lectures. In the first clause we are given an
   740 seen in earlier lectures. In the first clause we are given an
   650 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   777 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   651                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   778                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   652 {../progs/app03.scala}}
   779 {../progs/app03.scala}}
   653 
   780 
   654 \noindent Since we regard records as regular expressions we
   781 \noindent Since we regard records as regular expressions we
   655 need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
   782 need to extend the functions $\nullable$ and $\der$. Similarly
   656 $\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
   783 $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
   657 to extend the definition of values, which in Scala looks as
   784 to extend the definition of values, which in Scala looks as
   658 follows:
   785 follows:
   659 
   786 
   660 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   787 {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
   661                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
   788                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]