handouts/ho04.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Thu, 16 Oct 2014 17:30:05 +0100
changeset 283 c14e5ebf0c3b
parent 282 3e3b927a85cf
child 284 0afe43616b6a
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     1
\documentclass{article}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\usepackage{../style}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     3
\usepackage{../langs}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     4
\usepackage{../graphics}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     5
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
\begin{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     9
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    10
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    11
So far our algorithm based on derivatives was only able to say
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    12
yes or no depending on whether a string was matched by regular
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    13
expression or not. Often a more interesting question is to
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    14
find out \emph{how} a regular expression matched a string?
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    15
Answering this question will help us with the problem we are
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    16
after, namely tokenising an input string, that is splitting it
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    17
up into its ``word'' components. The algorithm we will be
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    18
looking at was designed by Sulzmann \& Lu in a rather recent
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    19
paper. A link to it is provided on KEATS, in case you are
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    20
interested.\footnote{In my humble opinion this is an
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    21
interesting instance of the research literature: it contains a
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    22
very neat idea, but its presentation is rather sloppy. In
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    23
earlier versions of their paper, students and I found several
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    24
rather annoying typos in their examples and definitions.}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    25
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    26
In order to give an answer for how a regular expression matched
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    27
a string, Sulzmann and Lu introduce \emph{values}. A value 
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    28
will be the output of the algorithm whenever the regular 
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    29
expression matches the string. If not, an error will be raised.
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    30
Since the first phase of the algorithm is identical to the
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    31
derivative based matcher from the first coursework, the 
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    32
function $nullable$ will be used to decide whether as string
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    33
is matched by a regular expression. If $nullable$ says
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    34
yes, then values are constructed that reflect how the regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    35
expression matched the string. The definitions for regular 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    36
expressions $r$ and values $v$ is shown below:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    37
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    38
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    39
\begin{tabular}{cc}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    40
\begin{tabular}{@{}rrl@{}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    41
  $r$ & $::=$  & $\varnothing$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    42
      & $\mid$ & $\epsilon$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    43
      & $\mid$ & $c$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    44
      & $\mid$ & $r_1 \cdot r_2$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    45
      & $\mid$ & $r_1 + r_2$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    46
  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    47
      & $\mid$ & $r^*$         \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    48
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    49
&
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    50
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    51
  $v$ & $::=$  & \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    52
      &        & $Empty$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    53
      & $\mid$ & $Char(c)$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    54
      & $\mid$ & $Seq(v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    55
      & $\mid$ & $Left(v)$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    56
      & $\mid$ & $Right(v)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    57
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    58
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    59
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    60
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    61
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    62
\noindent As you can see there is a very strong correspondence
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    63
between regular expressions and values. There is no value for
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    64
the $\varnothing$ regular expression (since it does not match
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    65
any string). Then there is exactly one value corresponding to 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    66
each regular expression. with the exception of $r_1 + r_2$ 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    67
where there are two values $Left(v)$ and $Right(v)$ 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    68
corresponding  to the two alternatives. Note that $r^*$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    69
is associated with a list of values, one for each copy of $r$ 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    70
that was needed to match the string. This mean we might also 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    71
return the empty list $[]$, if no copy was needed.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    72
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    73
Graphically the algorithm can be represneted by the 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    74
picture in Figure~\ref{Sulz} where the path involving $der/nullable$ is the first
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    75
phase of the algorithm and $mkeps/inj$ the second phase.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    76
This picture shows the steps required when a regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    77
expression, say $r_1$, matches the string $abc$. We first
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    78
build the three derivatives (according to $a$, $b$ and $c$).
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    79
We then use $nullable$ to find out whether the resulting
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    80
regular expression can match the empty string. If yes
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    81
we call the function $mkeps$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    82
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    83
\begin{figure}[t]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    84
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    85
\begin{tikzpicture}[scale=2,node distance=1.2cm,
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    86
                    every node/.style={minimum size=7mm}]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    87
\node (r1)  {$r_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    88
\node (r2) [right=of r1]{$r_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    89
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    90
\node (r3) [right=of r2]{$r_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    91
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    92
\node (r4) [right=of r3]{$r_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    93
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    94
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    95
\node (v4) [below=of r4]{$v_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    96
\draw[->,line width=1mm](r4) -- (v4);
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    97
\node (v3) [left=of v4] {$v_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    98
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    99
\node (v2) [left=of v3]{$v_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   100
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   101
\node (v1) [left=of v2] {$v_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   102
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   103
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   104
\end{tikzpicture}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   105
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   106
\caption{The two phases of the algorithm by Sulzmann \& Lu.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   107
\label{Sulz}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   108
\end{figure}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   109
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   110
The $mkeps$ function calculates a value for how a regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   111
expression could have matched the empty string. Its definition
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   112
is as follows:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   113
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   114
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   115
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   116
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   117
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   118
                          &       & then $Left(mkeps(r_1))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   119
                          &       & else $Right(mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   120
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   121
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   122
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   123
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   124
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   125
\noindent There are no cases for $\epsilon$ and $c$, since
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   126
these regular expression cannot match the empty string. 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   127
Note that in case of alternatives we give preference to the
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   128
regular expression on the left-hand side. This will become
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   129
important later on.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   130
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   131
The algorithm is organised recursively such that it will 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   132
calculate a value for how the derivative regular expression
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   133
has matched a string where the first character has been
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   134
chopped off. Now we need a function that reverses this
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   135
``chopping off'' for values. The corresponding function
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   136
is called $inj$ for injection. This function takes
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   137
three arguments: the first one is a regular expression
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   138
for which we want to calculate the value, the second is the 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   139
character we want to inject and the third argument is the
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   140
value where we will inject the character. The result
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   141
of this function is a new value. The definition of $inj$ 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   142
is as follows: 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   143
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   144
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   145
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   146
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   147
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   148
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   149
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   150
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   151
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   152
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   153
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   154
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   155
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   156
\noindent This definition is by recursion on the
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   157
regular expression and by analysing the shape of the values. 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   158
Therefore there are, for example, three cases for sequnece
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   159
regular expressions. The last clause returns a list where 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   160
the first element is $inj\,r\,c\,v$ and the other elements 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   161
are $vs$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   162
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   163
To understand what is going on, it might be best to do some
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   164
example calculations and compare with Figure~\ref{Sulz}. For
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   165
this note that we have not yet dealt with the need of
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   166
simplifying regular expreesions (this will be a topic on its
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   167
own later). Suppose the regular expression is $a \cdot (b
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   168
\cdot c)$ and the input string is $abc$. The derivatives are
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   169
as follows:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   170
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   171
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   172
\begin{tabular}{ll}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   173
$r_1$: & $a \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   174
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   175
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   176
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   177
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   178
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   179
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   180
\noindent According to the simple algorithm. we would test
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   181
whether $r_4$ is nullable, which it is. This means we can
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   182
use the function $mkeps$ to calculate a value for how $r_4$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   183
was able to match the empty string. Remember that this 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   184
function gives preference for alternatives on the left-hand
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   185
side. However there is only $\epsilon$ on the very right-hand
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   186
side of $r_4$ that matches the empty string. Therefore
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   187
$mkeps$ returns the value
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   188
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   189
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   190
$v_4:\;Right(Right(Empty))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   191
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   192
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   193
\noindent The point is that from this value we can directly 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   194
read off which part of $r_4$ matched the empty string. Next we 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   195
have to ``inject'' the last character, that is $c$ in the
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   196
running example, into this value in order to calculate how
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   197
$r_3$ could have matched the string $c$. According to the 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   198
definition of $inj$ we obtain
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   199
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   200
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   201
$v_3:\;Right(Seq(Empty, Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   202
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   203
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   204
\noindent This is the correct result, because $r_3$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   205
to use the right-hand alternative, and then $\epsilon$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   206
to match the empty string and $c$ needs to match $c$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   207
Next we need to inject back the letter $b$ into $v_3$. This
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   208
gives
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   209
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   210
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   211
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   212
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   213
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   214
\noindent Finally we need to inject back the letter $a$ into
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   215
$v_2$ giving the final result
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   216
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   217
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   218
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   219
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   220
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   221
\noindent This now corresponds to how the regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   222
expression $a \cdot (b \cdot c)$ matched the string $abc$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   223
This is the expected result. So at least in this case the
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   224
algorithm seems to calculate what it is supposed to.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   225
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   226
There are a few auxiliary function that are of interest
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   227
in analysing this algorithm. One is called \emph{flatten},
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   228
written $|\_|$, which extracts the string ``underlying'' a 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   229
value. It is defined as
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   230
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   231
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   232
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   233
  $|Empty|$     & $\dn$ & $[]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   234
  $|Char(c)|$   & $\dn$ & $[c]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   235
  $|Left(v)|$   & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   236
  $|Right(v)|$  & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   237
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   238
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   239
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   240
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   241
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   242
\noindent Using flatten we can see what is the string behind 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   243
the values calculated by $mkeps$ and $inj$ in our running 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   244
example:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   245
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   246
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   247
\begin{tabular}{ll}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   248
$v_4$: & $[]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   249
$v_3$: & $c$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   250
$v_2$: & $bc$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   251
$v_1$: & $abc$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   252
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   253
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   254
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   255
\noindent 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   256
This indicates that $inj$ indeed is injecting, or adding, back
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   257
a character into the value.
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   258
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   259
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   260
\subsubsection*{Simplification}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   261
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   262
Generally the matching algorithms based on derivatives do
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   263
poorly unless the regular expressions are simplified after
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   264
each derivatives step.
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   265
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   266
\newpage
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   267
Algorithm by Sulzmann, Lexing
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   268
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   269
\end{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   270
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   271
%%% Local Variables: 
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   272
%%% mode: latex
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   273
%%% TeX-master: t
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   274
%%% End: