handouts/ho04.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Sat, 18 Oct 2014 02:08:17 +0100
changeset 288 39aeca14af8c
parent 287 2c50b8b5886c
child 296 796b9b81ac8d
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     1
\documentclass{article}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\usepackage{../style}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     3
\usepackage{../langs}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     4
\usepackage{../graphics}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     5
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
\begin{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     9
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    10
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    11
So far our algorithm based on derivatives was only able to say
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    12
yes or no depending on whether a string was matched by regular
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    13
expression or not. Often a more interesting question is to
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    14
find out \emph{how} a regular expression matched a string?
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    15
Answering this question will also help us with the problem we
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    16
are after, namely tokenising an input string. The algorithm we
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    17
will be looking at for this was designed by Sulzmann \& Lu in
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    18
a rather recent paper. A link to it is provided on KEATS, in
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    19
case you are interested.\footnote{In my humble opinion this is
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    20
an interesting instance of the research literature: it
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    21
contains a very neat idea, but its presentation is rather
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    22
sloppy. In earlier versions of their paper, students and I
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    23
found several rather annoying typos in their examples and
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    24
definitions.}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    25
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    26
In order to give an answer for how a regular expression
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    27
matched a string, Sulzmann and Lu introduce \emph{values}. A
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    28
value will be the output of the algorithm whenever the regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    29
expression matches the string. If not, an error will be
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    30
raised. Since the first phase of the algorithm by Sulzmann \&
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    31
Lu is identical to the derivative based matcher from the first
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    32
coursework, the function $nullable$ will be used to decide
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    33
whether as string is matched by a regular expression. If
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    34
$nullable$ says yes, then values are constructed that reflect
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    35
how the regular expression matched the string. The definitions
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    36
for regular expressions $r$ and values $v$ is shown next to
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    37
each other below:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    38
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    39
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    40
\begin{tabular}{cc}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    41
\begin{tabular}{@{}rrl@{}}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    42
\multicolumn{3}{c}{regular expressions}\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    43
  $r$ & $::=$  & $\varnothing$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    44
      & $\mid$ & $\epsilon$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    45
      & $\mid$ & $c$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    46
      & $\mid$ & $r_1 \cdot r_2$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    47
      & $\mid$ & $r_1 + r_2$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    48
  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    49
      & $\mid$ & $r^*$         \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    50
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    51
&
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    52
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    53
\multicolumn{3}{c}{values}\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    54
   $v$ & $::=$  & \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    55
      &        & $Empty$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    56
      & $\mid$ & $Char(c)$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    57
      & $\mid$ & $Seq(v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    58
      & $\mid$ & $Left(v)$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    59
      & $\mid$ & $Right(v)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    60
      & $\mid$ & $[v_1,\ldots\,v_n]$ \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    61
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    62
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    63
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    64
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    65
\noindent The point is that there is a very strong
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    66
correspondence between them. There is no value for the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    67
$\varnothing$ regular expression, since it does not match any
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    68
string. Otherwise there is exactly one value corresponding to
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    69
each regular expression with the exception of $r_1 + r_2$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    70
where there are two values, namely $Left(v)$ and $Right(v)$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    71
corresponding to the two alternatives. Note that $r^*$ is
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    72
associated with a list of values, one for each copy of $r$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    73
that was needed to match the string. This means we might also
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    74
return the empty list $[]$, if no copy was needed.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    75
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    76
To emphasise the connection between regular expressions and
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    77
values, I have in my implementation the convention that
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    78
regular expressions are written entirely with upper-case
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    79
letters, while values just start with a single upper-case
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    80
character. My definition of values in Scala is below. I use 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    81
this in the REPL of Scala; when I use the Scala compiler
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    82
I need to rename some constructors, because Scala on Macs
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    83
does not like classes that are called \pcode{EMPTY} and
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    84
\pcode{Empty}.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    85
 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    86
{\small\lstinputlisting[language=Scala,numbers=none]
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    87
{../progs/app02.scala}}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    88
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    89
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    90
Graphically the algorithm by Sulzmann \& Lu can be illustrated
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    91
by the picture in Figure~\ref{Sulz} where the path from the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    92
left to the right involving $der/nullable$ is the first phase
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    93
of the algorithm and $mkeps/inj$, the path from right to left,
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    94
the second phase. This picture shows the steps required when a
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    95
regular expression, say $r_1$, matches the string $abc$. We
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    96
first build the three derivatives (according to $a$, $b$ and
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    97
$c$). We then use $nullable$ to find out whether the resulting
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    98
regular expression can match the empty string. If yes we call
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    99
the function $mkeps$.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   100
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   101
\begin{figure}[t]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   102
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   103
\begin{tikzpicture}[scale=2,node distance=1.2cm,
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   104
                    every node/.style={minimum size=7mm}]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   105
\node (r1)  {$r_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   106
\node (r2) [right=of r1]{$r_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   107
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$der\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   108
\node (r3) [right=of r2]{$r_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   109
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$der\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   110
\node (r4) [right=of r3]{$r_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   111
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$der\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   112
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$nullable$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   113
\node (v4) [below=of r4]{$v_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   114
\draw[->,line width=1mm](r4) -- (v4);
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   115
\node (v3) [left=of v4] {$v_3$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   116
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$inj\,c$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   117
\node (v2) [left=of v3]{$v_2$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   118
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$inj\,b$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   119
\node (v1) [left=of v2] {$v_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   120
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$inj\,a$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   121
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$mkeps$}};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   122
\end{tikzpicture}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   123
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   124
\caption{The two phases of the algorithm by Sulzmann \& Lu.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   125
\label{Sulz}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   126
\end{figure}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   127
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   128
The $mkeps$ function calculates a value for how a regular
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   129
expression has matched the empty string. Its definition
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   130
is as follows:
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   131
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   132
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   133
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   134
  $mkeps(\epsilon)$       & $\dn$ & $Empty$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   135
  $mkeps(r_1 + r_2)$      & $\dn$ & if $nullable(r_1)$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   136
                          &       & then $Left(mkeps(r_1))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   137
                          &       & else $Right(mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   138
  $mkeps(r_1 \cdot r_2)$  & $\dn$ & $Seq(mkeps(r_1),mkeps(r_2))$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   139
  $mkeps(r^*)$            & $\dn$ & $[]$  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   140
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   141
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   142
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   143
\noindent There are no cases for $\epsilon$ and $c$, since
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   144
these regular expression cannot match the empty string. Note
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   145
also that in case of alternatives we give preference to the
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   146
regular expression on the left-hand side. This will become
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   147
important later on.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   148
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   149
The second phase of the algorithm is organised so that it will
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   150
calculate a value for how the derivative regular expression
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   151
has matched a string whose first character has been chopped
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   152
off. Now we need a function that reverses this ``chopping
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   153
off'' for values. The corresponding function is called $inj$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   154
for injection. This function takes three arguments: the first
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   155
one is a regular expression for which we want to calculate the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   156
value, the second is the character we want to inject and the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   157
third argument is the value where we will inject the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   158
character. The result of this function is a new value. The
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   159
definition of $inj$ is as follows: 
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   160
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   161
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   162
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   163
  $inj\,(c)\,c\,Empty$            & $\dn$  & $Char\,c$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   164
  $inj\,(r_1 + r_2)\,c\,Left(v)$  & $\dn$  & $Left(inj\,r_1\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   165
  $inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$  & $Right(inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   166
  $inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   167
  $inj\,(r_1 \cdot r_2)\,c\,Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(inj\,r_1\,c\,v_1,v_2)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   168
  $inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(mkeps(r_1),inj\,r_2\,c\,v)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   169
  $inj\,(r^*)\,c\,Seq(v,vs)$         & $\dn$  & $inj\,r\,c\,v\,::\,vs$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   170
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   171
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   172
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   173
\noindent This definition is by recursion on the regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   174
expression and by analysing the shape of the values. Therefore
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   175
there are, for example, three cases for sequnece regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   176
expressions. The last clause for the star regular expression
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   177
returns a list where the first element is $inj\,r\,c\,v$ and
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   178
the other elements are $vs$. That means $\_\,::\,\_$ should be 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   179
read as list cons.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   180
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   181
To understand what is going on, it might be best to do some
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   182
example calculations and compare them with Figure~\ref{Sulz}.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   183
For this note that we have not yet dealt with the need of
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   184
simplifying regular expressions (this will be a topic on its
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   185
own later). Suppose the regular expression is $a \cdot (b
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   186
\cdot c)$ and the input string is $abc$. The derivatives from
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   187
the first phase are as follows:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   188
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   189
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   190
\begin{tabular}{ll}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   191
$r_1$: & $a \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   192
$r_2$: & $\epsilon \cdot (b \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   193
$r_3$: & $(\varnothing \cdot (b \cdot c)) + (\epsilon \cdot c)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   194
$r_4$: & $(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   195
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   196
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   197
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   198
\noindent According to the simple algorithm, we would test
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   199
whether $r_4$ is nullable, which in this case it is. This
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   200
means we can use the function $mkeps$ to calculate a value for
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   201
how $r_4$ was able to match the empty string. Remember that
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   202
this function gives preference for alternatives on the
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   203
left-hand side. However there is only $\epsilon$ on the very
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   204
right-hand side of $r_4$ that matches the empty string.
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   205
Therefore $mkeps$ returns the value
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   206
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   207
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   208
$v_4:\;Right(Right(Empty))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   209
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   210
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   211
\noindent The point is that from this value we can directly
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   212
read off which part of $r_4$ matched the empty string. Next we
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   213
have to ``inject'' the last character, that is $c$ in the
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   214
running example, into this value $v_4$ in order to calculate
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   215
how $r_3$ could have matched the string $c$. According to the
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   216
definition of $inj$ we obtain
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   217
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   218
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   219
$v_3:\;Right(Seq(Empty, Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   220
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   221
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   222
\noindent This is the correct result, because $r_3$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   223
to use the right-hand alternative, and then $\epsilon$ needs
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   224
to match the empty string and $c$ needs to match $c$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   225
Next we need to inject back the letter $b$ into $v_3$. This
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   226
gives
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   227
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   228
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   229
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   230
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   231
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   232
\noindent which is again the correct result for how $r_2$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   233
matched the string $bc$. Finally we need to inject back the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   234
letter $a$ into $v_2$ giving the final result
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   235
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   236
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   237
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   238
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   239
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   240
\noindent This now corresponds to how the regular
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   241
expression $a \cdot (b \cdot c)$ matched the string $abc$.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   242
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   243
There are a few auxiliary functions that are of interest
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   244
when analysing this algorithm. One is called \emph{flatten},
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   245
written $|\_|$, which extracts the string ``underlying'' a 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   246
value. It is defined recursively as
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   247
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   248
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   249
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   250
  $|Empty|$     & $\dn$ & $[]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   251
  $|Char(c)|$   & $\dn$ & $[c]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   252
  $|Left(v)|$   & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   253
  $|Right(v)|$  & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   254
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   255
  $|[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   256
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   257
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   258
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   259
\noindent Using flatten we can see what is the string behind 
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   260
the values calculated by $mkeps$ and $inj$ in our running 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   261
example are:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   262
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   263
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   264
\begin{tabular}{ll}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   265
$|v_4|$: & $[]$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   266
$|v_3|$: & $c$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   267
$|v_2|$: & $bc$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   268
$|v_1|$: & $abc$
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   269
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   270
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   271
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   272
\noindent This indicates that $inj$ indeed is injecting, or
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   273
adding, back a character into the value. Next we look at how
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   274
simplification can be included into this algorithm.
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   275
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   276
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   277
\subsubsection*{Simplification}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   278
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   279
Generally the matching algorithms based on derivatives do
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   280
poorly unless the regular expressions are simplified after
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   281
each derivative step. But this is a bit more involved in the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   282
algorithm of Sulzmann \& Lu. So what follows might require you
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   283
to read several times before it makes sense and also might
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   284
require that you do some example calculations. As a first
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   285
example consider the last derivation step in our earlier
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   286
example:
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   287
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   288
\begin{center}
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   289
$r_4 = der\,c\,r_3 = 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   290
(\varnothing \cdot (b \cdot c)) + ((\varnothing \cdot c) + \epsilon)$
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   291
\end{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   292
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   293
\noindent Simplifying this regular expression would just give
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   294
us $\epsilon$. Running $mkeps$ with this regular expression as
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   295
input, however, would then provide us with $Empty$ instead of
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   296
$Right(Right(Empty))$ that was obtained without the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   297
simplification. The problem is we need to recreate this more
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   298
complicated value, rather than just $Empty$.
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   299
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   300
This will require what I call \emph{rectification functions}.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   301
They need to be calculated whenever a regular expression gets
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   302
simplified. Rectification functions take a value as argument
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   303
and return a (rectified) value. Let us first take a look again
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   304
at our simplification rules:
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   305
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   306
\begin{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   307
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   308
$r \cdot \varnothing$ & $\mapsto$ & $\varnothing$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   309
$\varnothing \cdot r$ & $\mapsto$ & $\varnothing$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   310
$r \cdot \epsilon$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   311
$\epsilon \cdot r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   312
$r + \varnothing$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   313
$\varnothing + r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   314
$r + r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   315
\end{tabular}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   316
\end{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   317
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   318
\noindent Applying them to $r_4$ will require several nested
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   319
simplifications in order end up with just $\epsilon$. However,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   320
it is possible to apply them in a depth-first, or inside-out,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   321
manner in order to calculate this simplified regular
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   322
expression.
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   323
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   324
The rectification we can implement this by letting simp return
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   325
not just a (simplified) regular expression, but also a
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   326
rectification function. Let us consider the alternative case,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   327
$r_1 + r_2$, first. By going depth-first, we first simplify
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   328
the component regular expressions $r_1$ and $r_2.$ This will
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   329
return simplified versions (if they can be simplified), say
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   330
$r_{1s}$ and $r_{2s}$, but also two rectification functions
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   331
$f_{1s}$ and $f_{2s}$. We need to assemble them in order to
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   332
obtain a rectified value for $r_1 + r_2$. In case $r_{1s}$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   333
simplified to $\varnothing$, we continue the derivative
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   334
calculation with $r_{2s}$. The Sulzmann \& Lu algorithm would
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   335
return a corresponding value, say $v_{2s}$. But now this value
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   336
needs to be ``rectified'' to the value 
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   337
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   338
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   339
$Right(v_{2s})$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   340
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   341
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   342
\noindent The reason is that we look for the value that tells
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   343
us how $r_1 + r_2$ could have matched the string, not just
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   344
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   345
value in general because there might be some simplifications
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   346
that happened inside $r_2$ and for which the simplification
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   347
function retuned also a rectification function $f_{2s}$. So in
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   348
fact we need to apply this one too which gives
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   349
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   350
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   351
$Right(f_{2s}(v_{2s}))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   352
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   353
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   354
\noindent This is now the correct, or rectified, value. Since
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   355
the simplification will be done in the first phase of the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   356
algorithm, but the rectification needs to be done to the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   357
values in the second phase, it is advantageous to calculate
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   358
the rectification as a function, remember this function and
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   359
then apply the value to this function during the second phase.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   360
So if we want to implement the rectification as function, we 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   361
would need to return
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   362
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   363
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   364
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   365
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   366
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   367
\noindent which is the lambda-calculus notation for
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   368
a function that expects a value $v$ and returns everything
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   369
after the dot where $v$ is replaced by whatever value is 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   370
given.
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   371
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   372
Let us package this idea with rectification functions
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   373
into a single function (still only considering the alternative
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   374
case):
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   375
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   376
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   377
\begin{tabular}{l}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   378
$simp(r)$:\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   379
\quad case $r = r_1 + r_2$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   380
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   381
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   382
\qquad case $r_{1s} = \varnothing$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   383
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   384
\qquad case $r_{2s} = \varnothing$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   385
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   386
\qquad case $r_{1s} = r_{2s}$:
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   387
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   388
\qquad otherwise: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   389
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   390
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   391
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   392
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   393
\noindent We first recursively call the simplification with
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   394
$r_1$ and $r_2$. This gives simplified regular expressions,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   395
$r_{1s}$ and $r_{2s}$, as well as two rectification functions
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   396
$f_{1s}$ and $f_{2s}$. We next need to test whether the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   397
simplified regular expressions are $\varnothing$ so as to make
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   398
further simplifications. In case $r_{1s}$ is $\varnothing$,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   399
then we can return $r_{2s}$ (the other alternative). However
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   400
for this we need to build a corresponding rectification 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   401
function, which as said above is
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   402
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   403
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   404
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   405
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   406
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   407
\noindent The case where $r_{2s} = \varnothing$ is similar:
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   408
We return $r_{1s}$ and rectify with $Left(\_)$ and the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   409
other calculated rectification function $f_{1s}$. This gives
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   410
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   411
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   412
$\lambda v.\,Left(f_{1s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   413
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   414
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   415
\noindent The next case where $r_{1s} = r_{2s}$ can be treated
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   416
like the one where $r_{2s} = \varnothing$. We return $r_{1s}$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   417
and rectify with $Left(\_)$ and so on.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   418
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   419
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   420
The otherwise-case is slightly more complicated. In this case
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   421
neither $r_{1s}$ nor $r_{2s}$ are $\varnothing$ and also
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   422
$r_{1s} \not= r_{2s}$, which means no further simplification
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   423
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   424
the simplified regular expression. In principle we also do not
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   425
have to do any rectification, because no simplification was
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   426
done in this case. But this is actually not true: There might
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   427
have been simplifications inside $r_{1s}$ and $r_{2s}$. We
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   428
therefore need to take into account the calculated
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   429
rectification functions $f_{1s}$ and $f_{2s}$. We can do this
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   430
by defining a rectification function $f_{alt}$ which takes two
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   431
rectification functions as arguments and applies them
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   432
according to whether the value is of the form $Left(\_)$ or
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   433
$Right(\_)$:
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   434
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   435
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   436
\begin{tabular}{l@{\hspace{1mm}}l}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   437
$f_{alt}(f_1, f_2) \dn$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   438
\qquad $\lambda v.\,$ case $v = Left(v')$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   439
      & return $Left(f_1(v'))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   440
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   441
      & return $Right(f_2(v'))$\\      
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   442
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   443
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   444
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   445
The other interesting case with simplification is the sequence
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   446
case. In this case the main simplification function is as
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   447
follows
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   448
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   449
\begin{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   450
\begin{tabular}{l}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   451
$simp(r)$:\qquad\qquad (continued)\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   452
\quad case $r = r_1 \cdot r_2$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   453
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   454
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   455
\qquad case $r_{1s} = \varnothing$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   456
       return $(\varnothing, f_{error})$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   457
\qquad case $r_{2s} = \varnothing$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   458
       return $(\varnothing, f_{error})$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   459
\qquad case $r_{1s} = \epsilon$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   460
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   461
\qquad case $r_{2s} = \epsilon$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   462
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   463
\qquad otherwise: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   464
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   465
\end{tabular}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   466
\end{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   467
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   468
\noindent whereby in the last line $f_{seq}$ is again pushing
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   469
the two rectification functions into the two components of the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   470
Seq-value:
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   471
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   472
\begin{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   473
\begin{tabular}{l@{\hspace{1mm}}l}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   474
$f_{seq}(f_1, f_2) \dn$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   475
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   476
      & return $Seq(f_1(v_1), f_2(v_2))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   477
\end{tabular}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   478
\end{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   479
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   480
\noindent Note that in the case of $r_{1s} = \varnothing$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   481
(similarly $r_{2s}$) we use the function $f_{error}$ for
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   482
rectification. If you think carefully, then you will realise
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   483
that this function will actually never been called. This is
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   484
because a sequence with $\varnothing$ will never recognise any
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   485
string and therefore the second phase of the algorithm would
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   486
never been called. The simplification function still expects
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   487
us to give a function. So in my own implementation I just
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   488
returned a function which raises an error. In the case
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   489
where $r_{1s} = \epsilon$ (similarly $r_{2s}$) we have
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   490
to create a sequence where the first component is a rectified
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   491
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   492
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   493
Since we only simplify regular expressions of the form $r_1 +
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   494
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   495
in the remaining cases. The rectification function will
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   496
be just the identity, which in lambda-calculus terms is
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   497
just
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   498
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   499
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   500
$\lambda v.\,v$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   501
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   502
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   503
\noindent This completes the high-level version of the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   504
simplification function, which is also shown again in 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   505
Figure~\ref{simp}. This can now be used in a \emph{lexing
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   506
function} as follows:
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   507
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   508
\begin{figure}[t]
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   509
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   510
\begin{tabular}{l}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   511
$simp(r)$:\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   512
\quad case $r = r_1 + r_2$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   513
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   514
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   515
\qquad case $r_{1s} = \varnothing$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   516
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   517
\qquad case $r_{2s} = \varnothing$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   518
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   519
\qquad case $r_{1s} = r_{2s}$:
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   520
       return $(r_{1s}, \lambda v. \,Left(f_{1s}(v)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   521
\qquad otherwise: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   522
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   523
       \medskip\\
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   524
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   525
\quad case $r = r_1 \cdot r_2$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   526
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   527
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   528
\qquad case $r_{1s} = \varnothing$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   529
       return $(\varnothing, f_{error})$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   530
\qquad case $r_{2s} = \varnothing$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   531
       return $(\varnothing, f_{error})$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   532
\qquad case $r_{1s} = \epsilon$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   533
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   534
\qquad case $r_{2s} = \epsilon$: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   535
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   536
\qquad otherwise: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   537
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   538
       \medskip\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   539
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   540
\quad otherwise:\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   541
\qquad return $(r, \lambda v.\,v)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   542
\end{tabular}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   543
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   544
\caption{The simplification function that returns a simplified 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   545
regular expression and a rectification function.\label{simp}}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   546
\end{figure}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   547
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   548
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   549
\begin{tabular}{lcl}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   550
$lex\,r\,[]$ & $\dn$ & if $nullable(r)$ then $mkeps(r)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   551
             &       & else $error$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   552
$lex\,r\,c\!::\!s$ & $\dn$ & let 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   553
   $(r_{simp}, f_{rect}) = simp(der(c, r))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   554
& & $inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   555
\end{tabular}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   556
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   557
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   558
\noindent This corresponds to the $matches$ function we
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   559
have seen in earlier lectures. In the first clause we are
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   560
given an empty string, $[]$, and need to test wether the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   561
regular expression is $nullable$. If yes we can proceed
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   562
normally and just return the value calculated by $mkeps$.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   563
The second clause is for strings where the first character is
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   564
$c$ and the rest of the string is $s$. We first build the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   565
derivative of $r$ with respect to $c$; simplify the resulting 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   566
regulare expression. We continue lexing with the simplified
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   567
regular expression and the string $s$. Whatever will be
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   568
returned as value, we sill rectify using the $f_{rect}$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   569
from the simplification and finally inject $c$ back into
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   570
the (rectified) value.
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   571
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   572
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   573
\subsubsection*{Records}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   574
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   575
Remember we want to tokenize input strings, that means
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   576
splitting strings into their ``word'' components. But
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   577
furthermore we want to classify each token as being a keyword
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   578
or identifier and so on. For this one more feature will be
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   579
required, which I call \emph{record}. While values record
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   580
precisely how a regular expression matches a string, 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   581
records can be used to focus on some particular
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   582
parts of the regular expression and forget about others.
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   583
Let us look at an example. 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   584
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   585
Suppose you have the regular expression $ab + ac$. Clearly
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   586
this regular expression can only recognise two strings. But
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   587
suppose you are not interested whether it can recognise $ab$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   588
or $ac$, but rather if it matched, then what was the last
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   589
character of the matched string\ldots either $b$ or $c$.
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   590
You can do this by annotating the regular expression with
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   591
a record, written $(x:r)$, where $x$ is just an identifier
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   592
(in my implementation a plain string) and $r$ is a regular
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   593
expression. A record will be regarded as a regular expression.
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   594
The extended definition in Scala looks as follows:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   595
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   596
{\small\lstinputlisting[language=Scala]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   597
{../progs/app03.scala}}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   598
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   599
\noindent Since we regard records as regular expressions
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   600
we need to extend the functions $nullable$ and $der$. 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   601
Similarly $mkeps$ and $inj$ need to be extended and they 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   602
sometimes can return a particular value for records. 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   603
This means we also need to extend the definition of values.
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   604
The extended definition in Scala looks as follows:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   605
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   606
{\small\lstinputlisting[language=Scala]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   607
{../progs/app04.scala}}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   608
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   609
\noindent Let us now look at the purpose of records more
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   610
closely and lets return to our question whether the string
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   611
terminated in a $b$ or $c$. We can do this as follows: we
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   612
annotate the regular expression $ab + ac$ with a record
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   613
as follows
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   614
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   615
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   616
$a(x:b) + a(x:c)$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   617
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   618
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   619
\noindent This regular expression can still only recognise
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   620
the strings $ab$ and $ac$, but we can now use a function
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   621
that takes a value and returns all records. I call this
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   622
function \emph{env} for environment\ldots it builds a list
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   623
of identifiers associated with their string. This function
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   624
can be defined as follows:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   625
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   626
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   627
\begin{tabular}{lcl}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   628
  $env(Empty)$     & $\dn$ & $[]$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   629
  $env(Char(c))$   & $\dn$ & $[]$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   630
  $env(Left(v))$   & $\dn$ & $env(v)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   631
  $env(Right(v))$  & $\dn$ & $env(v)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   632
  $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   633
  $env([v_1,\ldots ,v_n])$ & $\dn$ & 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   634
     $env(v_1) \,@\ldots @\, env(v_n)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   635
  $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   636
\end{tabular}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   637
\end{center}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   638
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   639
\noindent where in the last clause we use the flatten function 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   640
defined earlier. The function $env$ ``picks'' out all 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   641
underlying strings where a record is given. Since there can be 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   642
more than one, the environment will potentially contain
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   643
many ``recordings''. If we now postprocess the value 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   644
calculated by $lex$ extracting all recordings using $env$, 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   645
we can answer the question whether the last element in the
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   646
string was an $b$ or a $c$. Lets see this in action: if
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   647
we use $ab + ac$ and $ac$ the calculated value will be
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   648
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   649
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   650
$Right(Seq(Char(a), Char(c)))$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   651
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   652
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   653
\noindent If we use instead $a(x:b) + a(x:c)$ and
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   654
use the $env$ function to extract the recording for 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   655
$x$ we obtain
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   656
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   657
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   658
$[(x:c)]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   659
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   660
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   661
\noindent If we had given the string $ab$ instead, then the
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   662
record would have been $[(x:b)]$. The fun starts if we 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   663
iterate this. Consider the regular expression 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   664
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   665
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   666
$(a(x:b) + a(y:c))^*$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   667
\end{center}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   668
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   669
\noindent and the string $ababacabacab$. This string is 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   670
clearly matched by the regular expression, but we are only
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   671
interested in the sequence of $b$s and $c$s. Using $env$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   672
we obtain
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   673
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   674
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   675
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   676
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   677
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   678
\noindent While this feature might look silly, it is in fact
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   679
quite useful. For example if we want to match the name of
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   680
an email we might use the regular expression
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   681
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   682
\[
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   683
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   684
(domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   685
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   686
\]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   687
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   688
\noindent Then if we match the email address
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   689
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   690
\[
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   691
\texttt{christian.urban@kcl.ac.uk}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   692
\]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   693
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   694
\noindent we can use the $env$ function and find out
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   695
what the name, domain and top-level part of the email
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   696
address are:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   697
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   698
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   699
$[(name:\texttt{christian.urban}), 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   700
  (domain:\texttt{kcl}), 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   701
  (top\_level:\texttt{ac.uk})]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   702
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   703
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   704
\noindent As you will see in the next lecture, this is now all
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   705
we need to tokenise an input string and classify each token.
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   706
\end{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   707
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   708
%%% Local Variables: 
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   709
%%% mode: latex
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   710
%%% TeX-master: t
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   711
%%% End: