handouts/ho04.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 17 Oct 2014 13:44:50 +0100
changeset 285 8a222559278f
parent 284 0afe43616b6a
child 286 19020b75d75e
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     1
\documentclass{article}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\usepackage{../style}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     3
\usepackage{../langs}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     4
\usepackage{../graphics}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     5
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
\begin{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     9
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    10
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    11
So far our algorithm based on derivatives was only able to say
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    12
yes or no depending on whether a string was matched by regular
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    13
expression or not. Often a more interesting question is to
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    14
find out \emph{how} a regular expression matched a string?
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    15
Answering this question will help us with the problem we are
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    16
after, namely tokenising an input string. The algorithm we
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    17
will be looking at for this was designed by Sulzmann \& Lu in
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    18
a rather recent paper. A link to it is provided on KEATS, in
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    19
case you are interested.\footnote{In my humble opinion this is
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    20
an interesting instance of the research literature: it
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    21
contains a very neat idea, but its presentation is rather
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    22
sloppy. In earlier versions of their paper, students and I
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    23
found several rather annoying typos in their examples and
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    24
definitions.}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    25
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    26
In order to give an answer for how a regular expression
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    27
matched a string, Sulzmann and Lu introduce \emph{values}. A
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    28
value will be the output of the algorithm whenever the regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    29
expression matches the string. If not, an error will be
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    30
raised. Since the first phase of the algorithm by Sulzmann \&
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    31
Lu is identical to the derivative based matcher from the first
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    32
coursework, the function $nullable$ will be used to decide
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    33
whether as string is matched by a regular expression. If
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    34
$nullable$ says yes, then values are constructed that reflect
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    35
how the regular expression matched the string. The definitions
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    36
for regular expressions $r$ and values $v$ is shown next to
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    37
each other below:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    38
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    39
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    40
\begin{tabular}{cc}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    41
\begin{tabular}{@{}rrl@{}}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    42
\multicolumn{3}{c}{regular expressions}\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    43
  $r$ & $::=$  & $\varnothing$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    44
      & $\mid$ & $\epsilon$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    45
      & $\mid$ & $c$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    46
      & $\mid$ & $r_1 \cdot r_2$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    47
      & $\mid$ & $r_1 + r_2$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    48
  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    49
      & $\mid$ & $r^*$         \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    50
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    51
&
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    52
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    53
\multicolumn{3}{c}{values}\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    54
   $v$ & $::=$  & \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    55
      &        & $Empty$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    56
      & $\mid$ & $Char(c)$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    57
      & $\mid$ & $Seq(v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    58
      & $\mid$ & $Left(v)$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    59
      & $\mid$ & $Right(v)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    60
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    61
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    62
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    63
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    64
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    65
\noindent The point is that there is a very strong
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    66
correspondence between them. There is no value for the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    67
$\varnothing$ regular expression, since it does not match any
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    68
string. Otherwise there is exactly one value corresponding to
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    69
each regular expression with the exception of $r_1 + r_2$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    70
where there are two values, namely $Left(v)$ and $Right(v)$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    71
corresponding to the two alternatives. Note that $r^*$ is
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    72
associated with a list of values, one for each copy of $r$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    73
that was needed to match the string. This means we might also
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    74
return the empty list $[]$, if no copy was needed.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    75
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    76
Graphically the algorithm by Sulzmann \& Lu can be represneted
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    77
by the picture in Figure~\ref{Sulz} where the path from the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    78
left to the right involving $der/nullable$ is the first phase
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    79
of the algorithm and $mkeps/inj$, the path from right to left,
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    80
the second phase. This picture shows the steps required when a
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    81
regular expression, say $r_1$, matches the string $abc$. We
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    82
first build the three derivatives (according to $a$, $b$ and
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    83
$c$). We then use $nullable$ to find out whether the resulting
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    84
regular expression can match the empty string. If yes we call
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    85
the function $mkeps$.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    86
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    87
\begin{figure}[t]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    88
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    89
\begin{tikzpicture}[scale=2,node distance=1.2cm,
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    90
                    every node/.style={minimum size=7mm}]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    91
\node (r1)  {$r_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    92
\node (r2) [right=of r1]{$r_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    93
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    94
\node (r3) [right=of r2]{$r_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    95
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    96
\node (r4) [right=of r3]{$r_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    97
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    98
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    99
\node (v4) [below=of r4]{$v_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   100
\draw[->,line width=1mm](r4) -- (v4);
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   101
\node (v3) [left=of v4] {$v_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   102
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   103
\node (v2) [left=of v3]{$v_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   104
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   105
\node (v1) [left=of v2] {$v_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   106
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   107
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   108
\end{tikzpicture}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   109
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   110
\caption{The two phases of the algorithm by Sulzmann \& Lu.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   111
\label{Sulz}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   112
\end{figure}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   113
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   114
The $mkeps$ function calculates a value for how a regular
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   115
expression has matched the empty string. Its definition
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   116
is as follows:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   117
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   118
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   119
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   120
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   121
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   122
                          &       & then $Left(mkeps(r_1))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   123
                          &       & else $Right(mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   124
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   125
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   126
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   127
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   128
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   129
\noindent There are no cases for $\epsilon$ and $c$, since
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   130
these regular expression cannot match the empty string. Note
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   131
also that in case of alternatives we give preference to the
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   132
regular expression on the left-hand side. This will become
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   133
important later on.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   134
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   135
The second phase of the algorithm is organised recursively
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   136
such that it will calculate a value for how the derivative
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   137
regular expression has matched a string whose first character
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   138
has been chopped off. Now we need a function that reverses
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   139
this ``chopping off'' for values. The corresponding function
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   140
is called $inj$ for injection. This function takes three
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   141
arguments: the first one is a regular expression for which we
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   142
want to calculate the value, the second is the character we
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   143
want to inject and the third argument is the value where we
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   144
will inject the character. The result of this function is a
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   145
new value. The definition of $inj$ is as follows: 
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   146
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   147
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   148
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   149
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   150
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   151
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   152
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   153
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   154
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   155
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   156
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   157
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   158
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   159
\noindent This definition is by recursion on the regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   160
expression and by analysing the shape of the values. Therefore
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   161
there are, for example, three cases for sequnece regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   162
expressions. The last clause for the star regular expression
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   163
returns a list where the first element is $inj\,r\,c\,v$ and
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   164
the other elements are $vs$. That mean $\_\,::\,\_$ should be 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   165
read as list cons.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   166
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   167
To understand what is going on, it might be best to do some
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   168
example calculations and compare with Figure~\ref{Sulz}. For
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   169
this note that we have not yet dealt with the need of
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   170
simplifying regular expressions (this will be a topic on its
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   171
own later). Suppose the regular expression is $a \cdot (b
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   172
\cdot c)$ and the input string is $abc$. The derivatives from
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   173
the first phase are as follows:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   174
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   175
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   176
\begin{tabular}{ll}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   177
$r_1$: & $a \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   178
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   179
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   180
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   181
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   182
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   183
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   184
\noindent According to the simple algorithm, we would test
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   185
whether $r_4$ is nullable, which in this case it is. This
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   186
means we can use the function $mkeps$ to calculate a value for
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   187
how $r_4$ was able to match the empty string. Remember that
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   188
this function gives preference for alternatives on the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   189
left-hand side. However there is only $\epsilon$ on the very
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   190
right-hand side of $r_4$ that matches the empty string.
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   191
Therefore $mkeps$ returns the value
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   192
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   193
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   194
$v_4:\;Right(Right(Empty))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   195
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   196
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   197
\noindent The point is that from this value we can directly
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   198
read off which part of $r_4$ matched the empty string. Next we
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   199
have to ``inject'' the last character, that is $c$ in the
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   200
running example, into this value $v_4$ in order to calculate
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   201
how $r_3$ could have matched the string $c$. According to the
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   202
definition of $inj$ we obtain
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   203
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   204
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   205
$v_3:\;Right(Seq(Empty, Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   206
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   207
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   208
\noindent This is the correct result, because $r_3$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   209
to use the right-hand alternative, and then $\epsilon$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   210
to match the empty string and $c$ needs to match $c$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   211
Next we need to inject back the letter $b$ into $v_3$. This
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   212
gives
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   213
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   214
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   215
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   216
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   217
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   218
\noindent Finally we need to inject back the letter $a$ into
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   219
$v_2$ giving the final result
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   220
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   221
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   222
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   223
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   224
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   225
\noindent This now corresponds to how the regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   226
expression $a \cdot (b \cdot c)$ matched the string $abc$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   227
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   228
There are a few auxiliary functions that are of interest
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   229
in analysing this algorithm. One is called \emph{flatten},
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   230
written $|\_|$, which extracts the string ``underlying'' a 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   231
value. It is defined recursively as
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   232
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   233
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   234
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   235
  $|Empty|$     & $\dn$ & $[]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   236
  $|Char(c)|$   & $\dn$ & $[c]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   237
  $|Left(v)|$   & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   238
  $|Right(v)|$  & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   239
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   240
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   241
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   242
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   243
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   244
\noindent Using flatten we can see what is the string behind 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   245
the values calculated by $mkeps$ and $inj$ in our running 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   246
example are:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   247
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   248
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   249
\begin{tabular}{ll}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   250
$|v_4|$: & $[]$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   251
$|v_3|$: & $c$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   252
$|v_2|$: & $bc$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   253
$|v_1|$: & $abc$
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   254
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   255
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   256
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   257
\noindent 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   258
This indicates that $inj$ indeed is injecting, or adding, back
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   259
a character into the value.
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   260
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   261
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   262
\subsubsection*{Simplification}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   263
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   264
Generally the matching algorithms based on derivatives do
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   265
poorly unless the regular expressions are simplified after
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   266
each derivatives step. But this is a bit more involved in 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   267
algorithm of Sulzmann \& Lu. Consider the last derivation
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   268
step in our running example
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   269
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   270
\begin{center}
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   271
$r_4 = der\,c\,r_3 = (\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   272
\end{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   273
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   274
\noindent Simplifying the result would just give us
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   275
$\epsilon$. Running $mkeps$ on this regular expression would
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   276
then provide us with $Empty$ instead of $Right(Right(Empty))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   277
that was obtained without the simplification. The problem is
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   278
we need to recreate this more complicated value, rather than 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   279
just $Empty$.
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   280
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   281
This requires what I call \emph{rectification functions}. They
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   282
need to be calculated whenever a regular expression gets
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   283
simplified. Rectification functions take a value as argument
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   284
and return a (rectified) value. Our simplification rules so
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   285
far are
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   286
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   287
\begin{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   288
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   289
$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   290
$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   291
$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   292
$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   293
$r + \varnothing$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   294
$\varnothing + r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   295
$r + r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   296
\end{tabular}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   297
\end{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   298
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   299
\noindent Applying them to $r_4$ will require several nested
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   300
simplifications in order end up with just $\epsilon$. 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   301
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   302
We can implement this by letting simp return not just a
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   303
(simplified) regular expression, but also a rectification
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   304
function. Let us consider the alternative case, say $r_1 +
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   305
r_2$, first. We would first simplify the component regular
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   306
expressions $r_1$ and $r_2.$ This will return simplified
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   307
versions (if they can be simplified), say $r_{1s}$ and
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   308
$r_{2s}$, but also two rectification functions $f_{1s}$ and
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   309
$f_{2s}$. We need to assemble them in order to obtain a
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   310
rectified value for $r_1 + r_2$. In case $r_{1s}$ simplified
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   311
to $\varnothing$, we would continue the derivative calculation
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   312
with $r_{2s}$. The Sulzmann \& Lu algorithm would return a
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   313
corresponding value, say $v_{2s}$. But now this needs to
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   314
be ``rectified'' to the value 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   315
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   316
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   317
$Right(v_{2s})$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   318
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   319
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   320
\noindent Unfortunately, this is not enough because there
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   321
might be some simplifications that happened inside $r_2$ 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   322
and for which the simplification function retuned also
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   323
a rectification function $f_{2s}$. So in fact we need to
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   324
apply this one too which gives
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   325
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   326
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   327
$Right(f_{2s}(v_{2s}))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   328
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   329
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   330
\noindent So if we want to return this as function,
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   331
we would need to return
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   332
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   333
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   334
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   335
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   336
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   337
\noindent which is the lambda-calculus notation for
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   338
a function that expects a value $v$ and returns everything
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   339
after the dot where $v$ is replaced by whatever value is 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   340
given.
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   341
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   342
Let us package these ideas into a single function (still only
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   343
considering the alternative case):
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   344
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   345
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   346
\begin{tabular}{l}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   347
$simp(r)$:\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   348
\quad case $r = r_1 + r_2$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   349
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   350
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   351
\qquad case $r_{1s} = \varnothing$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   352
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   353
\qquad case $r_{2s} = \varnothing$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   354
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   355
\qquad otherwise: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   356
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   357
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   358
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   359
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   360
\noindent We first recursively call the simlification with $r_1$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   361
and $r_2$. This gives simplified regular expressions, $r_{1s}$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   362
and $r_{2s}$, as well as two rectification functions $f_{1s}$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   363
and $f_{2s}$. We next need to test whether the simplified 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   364
regular expressions are $\varnothing$ so as to make further 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   365
simplifications. In case $r_{1s}$ is $\varnothing$ then 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   366
we can return $r_{2s}$ (the other alternative). However
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   367
we need to now build a rectification function, which as
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   368
said above is
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   369
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   370
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   371
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   372
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   373
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   374
\noindent The case where $r_{2s} = \varnothing$ is similar.
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   375
We return $r_{1s}$ but now have to rectify such that we return
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   376
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   377
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   378
$\lambda v.\,Left(f_{1s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   379
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   380
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   381
\noindent Note that in this case we have to apply $f_{1s}$,
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   382
not $f_{2s}$, which is responsible to rectify the inner parts
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   383
of $v$. The otherwise-case is slightly interesting. In this
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   384
case neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and no
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   385
further simplification can be applied. Accordingly, we return
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   386
$r_{1s} + r_{2s}$ as the simplified regular expression. In
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   387
principle we also do not have to do any rectification, because
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   388
no simplification was done in this case. But this is actually
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   389
not true: There might have been simplifications inside
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   390
$r_{1s}$ and $r_2s$. We therefore need to take into account
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   391
the calculated rectification functions $f_{1s}$ and $f_{2s}$.
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   392
We can do this by defining a rectification function $f_{alt}$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   393
which takes two rectification functions as arguments
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   394
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   395
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   396
\begin{tabular}{l@{\hspace{1mm}}l}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   397
$f_{alt}(f_1, f_2) \dn$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   398
\qquad $\lambda v.\,$ case $v = Left(v')$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   399
      & return $Left(f_1(v'))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   400
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   401
      & return $Right(f_2(v'))$\\      
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   402
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   403
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   404
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   405
\noindent In essence we need to apply in this case 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   406
the appropriate rectification function to the inner part
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   407
of the value $v$, wherevy $v$ can only be of the form 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   408
$Right(\_)$ or $Left(\_)$.
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   409
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   410
\subsubsection*{Records and Tokenisation}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   411
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   412
\newpage
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   413
Algorithm by Sulzmann, Lexing
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   414
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   415
\end{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   416
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   417
%%% Local Variables: 
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   418
%%% mode: latex
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   419
%%% TeX-master: t
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   420
%%% End: