handouts/ho04.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 30 Sep 2022 06:03:46 +0100
changeset 879 ad9d4a01e072
parent 874 ffe02fd574a5
child 936 0b5f06539a84
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
     1
% !TEX program = xelatex
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\documentclass{article}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     3
\usepackage{../style}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     4
\usepackage{../langs}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
     5
\usepackage{../graphics}
671
83e38043ed78 updated
Christian Urban <urbanc@in.tum.de>
parents: 669
diff changeset
     6
\usepackage{skull}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
\begin{document}
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
     9
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    10
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    11
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    12
716
df7d47a507f8 updated
Christian Urban <urbanc@in.tum.de>
parents: 671
diff changeset
    13
So far our algorithm based on derivatives is only able to say
df7d47a507f8 updated
Christian Urban <urbanc@in.tum.de>
parents: 671
diff changeset
    14
yes or no depending on whether a string is matched by a regular
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    15
expression or not. Often a more interesting question is to
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    16
find out \emph{how} a regular expression matched a string?
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    17
Answering this question will also help us with the problem we
319
e7b110f93697 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 296
diff changeset
    18
are after, namely tokenising an input string. 
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
    19
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    20
The algorithm we will be looking at in this lecture was designed by Sulzmann
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    21
\& Lu in a rather recent research paper (from 2014). A link to it is
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    22
provided on KEATS, in case you are interested.\footnote{In my
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    23
humble opinion this is an interesting instance of the research
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    24
literature: it contains a very neat idea, but its presentation
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    25
is rather sloppy. In earlier versions of this paper, a King's
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    26
undergraduate student and I found several rather annoying typos in the
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    27
examples and definitions.} My former PhD student Fahad Ausaf and I even more recently 
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    28
wrote a paper where we build on their result: we provided a 
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    29
mathematical proof that their algorithm is really correct---the proof 
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    30
Sulzmann \& Lu had originally given contained major flaws. Such correctness
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    31
proofs are important: Kuklewicz maintains a unit-test library
520
2849c305b12d updated
cu
parents: 492
diff changeset
    32
for the kind of algorithms we are interested in here and he showed 
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    33
that many implementations in the ``wild'' are buggy, that is not
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    34
satisfy his unit tests:
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    35
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    36
\begin{center}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    37
\url{http://www.haskell.org/haskellwiki/Regex_Posix}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    38
\end{center}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    39
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    40
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    41
Coming back to the algorithm by Sulzmann \& Lu, their idea is to
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    42
introduce \emph{values} for producing an answer for \emph{how} a regular
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    43
expression matches a string. So rather than a boolean like in the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    44
Brzozowski algorithm, a value will be the output of the Sulzman \& Lu
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    45
algorithm whenever the regular expression matches the string. If the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    46
string does not match the string, an error will be raised. The
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    47
definition for values is given below. They are shown together with the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    48
regular expressions $r$ to which they correspond:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    49
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    50
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    51
\begin{tabular}{cc}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    52
\begin{tabular}{@{}rrl@{}}
319
e7b110f93697 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 296
diff changeset
    53
\multicolumn{3}{c}{regular expressions}\medskip\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    54
  $r$ & $::=$  & $\ZERO$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    55
      & $\mid$ & $\ONE$   \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    56
      & $\mid$ & $c$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    57
      & $\mid$ & $r_1 \cdot r_2$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    58
      & $\mid$ & $r_1 + r_2$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    59
  \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    60
      & $\mid$ & $r^*$         \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    61
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    62
&
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    63
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
319
e7b110f93697 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 296
diff changeset
    64
\multicolumn{3}{c}{values}\medskip\\
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    65
   $v$ & $::=$  & \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    66
      &        & $Empty$   \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    67
      & $\mid$ & $Char(c)$          \\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    68
      & $\mid$ & $Seq(v_1,v_2)$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    69
      & $\mid$ & $\Left(v)$   \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    70
      & $\mid$ & $Right(v)$  \\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    71
      & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    72
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    73
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    74
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    75
417
e74c696821a2 updated
Christian Urban <urbanc@in.tum.de>
parents: 400
diff changeset
    76
\noindent There is no value for the
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    77
$\ZERO$ regular expression, since it does not match any
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    78
string. Otherwise there is exactly one value corresponding to
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    79
each regular expression with the exception of $r_1 + r_2$
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    80
where there are two values, namely $\Left(v)$ and $Right(v)$
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    81
corresponding to the two alternatives. Note that $r^*$ is
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    82
associated with a list of values, one for each copy of $r$
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
    83
that was needed to match the string. This means we might also
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
    84
return the empty list $Stars\,[]$, if no copy was needed
520
2849c305b12d updated
cu
parents: 492
diff changeset
    85
for $r^*$. For sequence, there is exactly one value, composed 
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    86
of two component values ($v_1$ and $v_2$).
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
    87
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
    88
My implementation of regular expressions and values in Scala is
716
df7d47a507f8 updated
Christian Urban <urbanc@in.tum.de>
parents: 671
diff changeset
    89
shown below. I use the convention that
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    90
regular expressions are written entirely with upper-case
716
df7d47a507f8 updated
Christian Urban <urbanc@in.tum.de>
parents: 671
diff changeset
    91
letters, whereas values start with a single upper-case
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
    92
character and the rest are lower-case letters.
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
    93
 
490
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
    94
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
    95
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
352
1e1b0fe66107 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 350
diff changeset
    96
{../progs/app01.scala}}
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    97
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
    98
 
490
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
    99
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   100
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   101
{../progs/app02.scala}}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   102
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   103
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   104
Graphically the algorithm by Sulzmann \& Lu can be illustrated
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   105
by the picture in Figure~\ref{Sulz} where the path from the
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   106
left to the right involving $\der/\nullable$ is the first phase
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   107
of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   108
the second phase. This picture shows the steps required when a
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   109
regular expression, say $r_1$, matches the string $abc$. We
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   110
first build the three derivatives (according to $a$, $b$ and
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   111
$c$). We then use $\nullable$ to find out whether the resulting
296
796b9b81ac8d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 288
diff changeset
   112
regular expression can match the empty string. If yes, we call
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   113
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   114
expression has matched the empty string. Its definition
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   115
is as follows:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   116
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   117
\begin{figure}[t]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   118
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   119
\begin{tikzpicture}[scale=2,node distance=1.2cm,
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   120
                    every node/.style={minimum size=7mm}]
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   121
\node (r1)  {$r_1$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   122
\node (r2) [right=of r1]{$r_2$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   123
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   124
\node (r3) [right=of r2]{$r_3$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   125
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   126
\node (r4) [right=of r3]{$r_4$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   127
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   128
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   129
\node (v4) [below=of r4]{$v_4$};
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   130
\draw[->,line width=1mm](r4) -- (v4);
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   131
\node (v3) [left=of v4] {$v_3$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   132
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   133
\node (v2) [left=of v3]{$v_2$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   134
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   135
\node (v1) [left=of v2] {$v_1$};
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   136
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   137
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   138
\end{tikzpicture}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   139
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   140
\caption{The two phases of the algorithm by Sulzmann \& Lu.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   141
\label{Sulz}}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   142
\end{figure}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   143
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   144
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   145
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   146
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   147
\begin{tabular}{lcl}
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   148
  $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   149
  $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   150
                          &       & then $\Left(\textit{mkeps}(r_1))$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   151
                          &       & else $Right(\textit{mkeps}(r_2))$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   152
  $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   153
  $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   154
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   155
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   156
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   157
\noindent There are no cases for $\ZERO$ and $c$, since
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   158
these regular expression cannot match the empty string. Note
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   159
also that in case of alternatives we give preference to the
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   160
regular expression on the left-hand side. This will become
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   161
important later on.
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   162
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   163
The second phase of the algorithm is organised so that it will
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   164
calculate a value for how the derivative regular expression
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   165
has matched a string. For this we need a function that
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   166
reverses this ``chopping off'' for values which we did in the
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   167
first phase for derivatives. The corresponding function is
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   168
called $\inj$ (short for injection). This function takes three
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   169
arguments: the first one is a regular expression for which we
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   170
want to calculate the value, the second is the character we
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   171
want to inject and the third argument is the value where we
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   172
will inject the character into. The result of this function is a
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   173
new value. The definition of $\inj$ is as follows: 
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   174
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   175
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   176
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   177
  $\inj\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   178
  $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   179
  $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   180
  $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   181
  $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   182
  $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   183
  $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   184
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   185
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   186
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   187
\noindent This definition is by recursion on the regular
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   188
expression and by analysing the shape of the values. Therefore
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   189
there are three cases for sequence regular
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   190
expressions (for all possible shapes of the value we can encounter). The last
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   191
clause for the star regular expression returns a list where
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   192
the first element is $\inj\,r\,c\,v$ and the other elements are
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   193
$vs$. That means $\_\,::\,\_$ should be read as list cons.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   194
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   195
To understand what is going on, it might be best to do some
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   196
example calculations and compare them with Figure~\ref{Sulz}.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   197
For this note that we have not yet dealt with the need of
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   198
simplifying regular expressions (this will be a topic on its
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   199
own later). Suppose the regular expression is $a \cdot (b
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   200
\cdot c)$ and the input string is $abc$. The derivatives from
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   201
the first phase are as follows:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   202
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   203
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   204
\begin{tabular}{ll}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   205
$r_1$: & $a \cdot (b \cdot c)$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   206
$r_2$: & $\ONE \cdot (b \cdot c)$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   207
$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   208
$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   209
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   210
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   211
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   212
\noindent According to the simple algorithm, we would test
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   213
whether $r_4$ is nullable, which in this case it indeed is.
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   214
This means we can use the function $\textit{mkeps}$ to calculate a
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   215
value for how $r_4$ was able to match the empty string.
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   216
Remember that this function gives preference for alternatives
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   217
on the left-hand side. However there is only $\ONE$ on the
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   218
very right-hand side of $r_4$ (underlined)
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   219
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   220
\begin{center}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   221
$r_4$:\;$(\ZERO \cdot (b \cdot c)) + 
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   222
         ((\ZERO \cdot c) + \underline{\ONE})$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   223
\end{center}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   224
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   225
\noindent
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   226
that matches the empty string.
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   227
Therefore $\textit{mkeps}$ returns the value
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   228
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   229
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   230
$v_4:\;Right(Right(Empty))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   231
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   232
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   233
\noindent If there had been a $\ONE$ on the left, then
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   234
$\textit{mkeps}$ would have returned something of the form
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   235
$\Left(\ldots)$. The point is that from this value we can
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   236
directly read off which part of $r_4$ matched the empty
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   237
string: take the right-alternative first, and then the
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   238
right-alternative again, then you got to the $\ONE$.
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   239
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   240
Next we have to ``inject'' the last character, that is $c$ in
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   241
the running example, into this value $v_4$ in order to
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   242
calculate how $r_3$ could have matched the string $c$.
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   243
For this we call injection with $\inj(r_3, c, v_4)$.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   244
According to the definition of $\inj$ we obtain
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   245
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   246
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   247
$v_3:\;Right(Seq(Empty, Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   248
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   249
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   250
\noindent This is the correct result, because $r_3$ needs
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   251
to use the right-hand alternative, and then $\ONE$ needs
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   252
to match the empty string and $c$ needs to match $c$.
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   253
Next we need to inject back the letter $b$ into $v_3$.
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   254
For this we call injection with $\inj(r_2, b, v_3)$.
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   255
This gives
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   256
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   257
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   258
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   259
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   260
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   261
\noindent which is again the correct result for how $r_2$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   262
matched the string $bc$. Finally we need to inject back the
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   263
letter $a$ into $v_2$ giving the final result.
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   264
For this we call injection with $\inj(r_1, a, v_2)$
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   265
and obtain
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   266
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   267
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   268
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   269
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   270
596
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   271
\noindent This value corresponds to how the regular expression $r_1$,
6d6e79f95933 updated
Christian Urban <urbanc@in.tum.de>
parents: 551
diff changeset
   272
namely $a \cdot (b \cdot c)$, matched the string $abc$.
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   273
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   274
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   275
There are a few auxiliary functions that are of interest
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   276
when analysing this algorithm. One is called \emph{flatten},
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   277
written $|\_|$, which extracts the string ``underlying'' a 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   278
value. It is defined recursively as
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   279
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   280
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   281
\begin{tabular}{lcl}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   282
  $|Empty|$     & $\dn$ & $[]$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   283
  $|Char(c)|$   & $\dn$ & $[c]$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   284
  $|\Left(v)|$   & $\dn$ & $|v|$\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   285
  $|Right(v)|$  & $\dn$ & $|v|$\\
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   286
  $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   287
  $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   288
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   289
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   290
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   291
\noindent Using flatten we can see what are the strings behind 
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   292
the values calculated by $\textit{mkeps}$ and $\inj$. In our running 
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   293
example:
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   294
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   295
\begin{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   296
\begin{tabular}{ll}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   297
$|v_4|$: & $[]$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   298
$|v_3|$: & $c$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   299
$|v_2|$: & $bc$\\
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   300
$|v_1|$: & $abc$
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   301
\end{tabular}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   302
\end{center}
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   303
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   304
\noindent This indicates that $\inj$ indeed is injecting, or
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   305
adding, back a character into the value. 
350
c4e7caa06c74 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 326
diff changeset
   306
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   307
The definition of $\inj$ might still be very puzzling and each clause
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   308
might look arbitrary, but there is in fact a relatively simple idea
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   309
behind it. Ultimately the $\inj$-functions is determined by the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   310
derivative functions. For this consider one of the ``squares'' from
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   311
Figure~\ref{Sulz}:
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   312
720
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   313
\begin{center}
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   314
\begin{tikzpicture}[scale=2,node distance=1.2cm,
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   315
                    every node/.style={minimum size=7mm}]
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   316
\node (r)  {$r$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   317
\node (rd) [right=of r]{$r_{der}$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   318
\draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   319
\node (vd) [below=of r2]{$v_{der}$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   320
\draw[->,line width=1mm](rd) -- (vd);
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   321
\node (v) [left=of vd] {$v$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   322
\draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   323
\draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   324
\end{tikzpicture}
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   325
\end{center}
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   326
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   327
\noindent
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   328
The input to the $\inj$-function is $r$ (on the top left), $c$ (the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   329
character to be injected) and $v_{der}$ (bottom right). The output is
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   330
the value $v$ for how the regular expression $r$ matched the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   331
corresponding string. How does $\inj$ calculate this value? Well, it has
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   332
to analyse the value $v_{der}$ and transform it into the value $v$ for
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   333
the regular expression $r$. The value $v_{der}$ corresponds to the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   334
$r_{der}$-regular expression which is the derivative of $r$. Remember
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   335
that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   336
where $c$ has been chopped off.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   337
720
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   338
For a concrete example, let $r$ be $r_1 + r_2$. Then $r_{der}$
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   339
is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   340
values corresponding to $r_{der}$? Well, they can be only of the form
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   341
$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   342
cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   343
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   344
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   345
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   346
    $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   347
    $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   348
\end{tabular}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   349
\end{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   350
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   351
\noindent
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   352
Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   353
derivative of $r$ look like? According to the definition it is: 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   354
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   355
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   356
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   357
    $\der\, c\, (r_1 \cdot r_2)$  & $\dn$  &  if $\nullable (r_1)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   358
    & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   359
    & & else $(\der\, c\, r_1) \cdot r_2$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   360
  \end{tabular}  
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   361
\end{center}  
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   362
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   363
\noindent As you can see there is a derivative in the if-branch and another
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   364
in the else-branch. In the if-branch we have an alternative of the form 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   365
$\_ + \_$. We already know what the possible values are for such a regular 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   366
expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   367
two cases:
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   368
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   369
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   370
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   371
    $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$  & $\ldots$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   372
    $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$  & $\ldots$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   373
\end{tabular}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   374
\end{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   375
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   376
\noindent
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   377
In the first case we even know that it is not just a $\Left$-value, but 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   378
$\Left(Seq(\ldots))$, because the corresponding regular expression 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   379
in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   380
it must be a $Seq$-value enclosed inside $\Left$. 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   381
The third clause for $r_1\cdot r_2$ in the $\inj$-function
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   382
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   383
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   384
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   385
    $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$  & $\ldots$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   386
  \end{tabular}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   387
  \end{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   388
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   389
\noindent corresponds to the else-branch in the derivative. In this
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   390
case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   391
therefore the value must be of the form $Seq(\ldots)$.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   392
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   393
Hopefully the $inj$-function makes now more sense. I let you explain
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   394
the $r^*$ case. What do the derivative of $r^*$  and
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   395
the corresponding value look like? Does this explain the shape of 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   396
the clause?
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   397
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   398
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   399
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   400
    $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$  & $Stars(\inj\,r\,c\,v\,::\,vs)$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   401
  \end{tabular}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   402
\end{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   403
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   404
\noindent If yes, you made sense of the left-hand sides of
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   405
the $\inj$-definition. 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   406
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   407
How about the right-hand sides? Well, in the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   408
$r^*$ case for example we have according to the square in the picture 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   409
above a value $v_{der}$ which says how the derivative $r_{der}$
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   410
matched the string. Since the derivative is of the form
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   411
$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   412
form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   413
for the original (top left) regular expression --- so it cannot
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   414
be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   415
shape for sequence regular expressions. What we need is a value
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   416
of the form $Stars \ldots$ (remember the correspondence between 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   417
values and regular expressions). This suggests to define the right hand 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   418
side as
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   419
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   420
\begin{center}
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   421
$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   422
\end{center}  
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   423
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   424
\noindent It is a value of the right shape, namely $Stars$. It injected 
874
ffe02fd574a5 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 792
diff changeset
   425
$c$ into the first-value, which is in fact the value where we need in order to 
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   426
undo the derivative. Remember again the shape of the derivative of $r^*$.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   427
In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   428
Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   429
the other clauses in $\inj$. 
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   430
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   431
671
83e38043ed78 updated
Christian Urban <urbanc@in.tum.de>
parents: 669
diff changeset
   432
Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is
83e38043ed78 updated
Christian Urban <urbanc@in.tum.de>
parents: 669
diff changeset
   433
a gigantic problem with the described algorithm so far: it is very
720
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   434
slow. To make it faster, we have to include in all this the simplification 
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   435
from Lecture 2\ldots{}and what rotten luck: simplification messes things 
ecbed0155f72 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 716
diff changeset
   436
up and we need to rectify the mess. This is what we shall do next.
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   437
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   438
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   439
\subsubsection*{Simplification}
282
3e3b927a85cf added ho04
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 251
diff changeset
   440
283
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   441
Generally the matching algorithms based on derivatives do
c14e5ebf0c3b updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 282
diff changeset
   442
poorly unless the regular expressions are simplified after
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   443
each derivative step. But this is a bit more involved in the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   444
algorithm of Sulzmann \& Lu. So what follows might require you
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   445
to read several times before it makes sense and also might
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   446
require that you do some example calculations yourself. As a
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   447
first example consider the last derivation step in our earlier
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   448
example:
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   449
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   450
\begin{equation}\label{regexfour}
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   451
r_4 = \der\,c\,r_3 = 
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   452
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   453
\end{equation}
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   454
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   455
\noindent Simplifying this regular expression would just give
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   456
us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   457
input, however, would give us with the value $Empty$ instead of
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   458
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   459
\[Right(Right(Empty))\]
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   460
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   461
\noindent
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   462
that was obtained  without the simplification. The problem  is we need
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   463
to  recreate this  more  complicated value,  rather  than just  return
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   464
$Empty$. This will require what I call \emph{rectification functions}.
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   465
They  need  to  be  calculated  whenever  a  regular  expression  gets
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   466
simplified.
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   467
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   468
Rectification functions take a value as argument and return a
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   469
(rectified) value. In the example above the argument would be $Empty$
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   470
and the output $Right(Right(Empty))$.  Before we define these
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   471
rectification functions in general, let us first take a look again at
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   472
our simplification rules:
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   473
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   474
\begin{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   475
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   476
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   477
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   478
$r \cdot \ONE$ & $\mapsto$ & $r$\\ 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   479
$\ONE \cdot r$ & $\mapsto$ & $r$\\ 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   480
$r + \ZERO$ & $\mapsto$ & $r$\\ 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   481
$\ZERO + r$ & $\mapsto$ & $r$\\ 
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   482
$r + r$ & $\mapsto$ & $r$\\ 
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   483
\end{tabular}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   484
\end{center}
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   485
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   486
\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   487
simplifications in order end up with just $\ONE$. However,
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   488
it is possible to apply them in a depth-first, or inside-out,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   489
manner in order to calculate this simplified regular
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   490
expression.
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   491
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   492
The rectification we can implement by letting simp return not
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   493
just a (simplified) regular expression, but also a
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   494
rectification function. Let us consider the alternative case,
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   495
$r_1 + r_2$, first. By going depth-first, we first simplify
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   496
the component regular expressions $r_1$ and $r_2.$ This will
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   497
return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   498
component regular expressions (if they can be simplified) but
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   499
also two rectification functions $f_{1s}$ and $f_{2s}$. We
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   500
need to assemble them in order to obtain a rectified value for
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   501
$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   502
continue the derivative calculation with $r_{2s}$. The
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   503
Sulzmann \& Lu algorithm would return a corresponding value,
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   504
say $v_{2s}$. But now this value needs to be ``rectified'' to
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   505
the value 
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   506
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   507
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   508
$Right(v_{2s})$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   509
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   510
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   511
\noindent The reason is that we look for the value that tells
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   512
us how $r_1 + r_2$ could have matched the string, not just
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   513
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   514
value in general because there might be some simplifications
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   515
that happened inside $r_2$ and for which the simplification
520
2849c305b12d updated
cu
parents: 492
diff changeset
   516
function returned also a rectification function $f_{2s}$. So in
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   517
fact we need to apply this one too which gives
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   518
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   519
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   520
$Right(f_{2s}(v_{2s}))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   521
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   522
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   523
\noindent This is now the correct, or rectified, value. Since
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   524
the simplification will be done in the first phase of the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   525
algorithm, but the rectification needs to be done to the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   526
values in the second phase, it is advantageous to calculate
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   527
the rectification as a function, remember this function and
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   528
then apply the value to this function during the second phase.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   529
So if we want to implement the rectification as function, we 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   530
would need to return
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   531
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   532
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   533
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   534
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   535
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   536
\noindent which is the lambda-calculus notation for
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   537
a function that expects a value $v$ and returns everything
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   538
after the dot where $v$ is replaced by whatever value is 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   539
given.
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   540
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   541
Let us package this idea with rectification functions
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   542
into a single function (still only considering the alternative
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   543
case):
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   544
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   545
\begin{center}
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   546
\begin{tabular}{ll}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   547
$_1$ & $simp(r)$:\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   548
$_2$ & \quad case $r = r_1 + r_2$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   549
$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   550
$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   551
$_5$ & \qquad case $r_{1s} = \ZERO$: 
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   552
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   553
$_6$ & \qquad case $r_{2s} = \ZERO$: 
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   554
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   555
$_7$ & \qquad case $r_{1s} = r_{2s}$:
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   556
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   557
$_8$ & \qquad otherwise: 
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   558
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   559
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   560
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   561
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   562
\noindent We first recursively call the simplification with
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   563
$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   564
$r_{1s}$ and $r_{2s}$, as well as two rectification functions
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   565
$f_{1s}$ and $f_{2s}$. We next need to test whether the
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   566
simplified regular expressions are $\ZERO$ so as to make
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   567
further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   568
then we can return $r_{2s}$ (the other alternative). However
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   569
for this we need to build a corresponding rectification 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   570
function, which as said above is
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   571
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   572
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   573
$\lambda v.\,Right(f_{2s}(v))$
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   574
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   575
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   576
\noindent The case where $r_{2s} = \ZERO$ is similar:
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   577
We return $r_{1s}$ and rectify with $\Left(\_)$ and the
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   578
other calculated rectification function $f_{1s}$. This gives
284
0afe43616b6a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 283
diff changeset
   579
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   580
\begin{center}
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   581
$\lambda v.\,\Left(f_{1s}(v))$
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   582
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   583
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   584
\noindent The next case where $r_{1s} = r_{2s}$ can be treated
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   585
like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   586
and rectify with $\Left(\_)$ and so on.
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   587
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   588
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   589
The otherwise-case is slightly more complicated. In this case
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   590
neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   591
$r_{1s} \not= r_{2s}$, which means no further simplification
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   592
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   593
the simplified regular expression. In principle we also do not
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   594
have to do any rectification, because no simplification was
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   595
done in this case. But this is actually not true: There might
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   596
have been simplifications inside $r_{1s}$ and $r_{2s}$. We
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   597
therefore need to take into account the calculated
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   598
rectification functions $f_{1s}$ and $f_{2s}$. We can do this
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   599
by defining a rectification function $f_{alt}$ which takes two
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   600
rectification functions as arguments and applies them
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   601
according to whether the value is of the form $\Left(\_)$ or
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   602
$Right(\_)$:
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   603
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   604
\begin{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   605
\begin{tabular}{l@{\hspace{1mm}}l}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   606
$f_{alt}(f_1, f_2) \dn$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   607
\qquad $\lambda v.\,$ case $v = \Left(v')$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   608
      & return $\Left(f_1(v'))$\\
285
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   609
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   610
      & return $Right(f_2(v'))$\\      
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   611
\end{tabular}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   612
\end{center}
8a222559278f updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 284
diff changeset
   613
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   614
The other interesting case with simplification is the sequence
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   615
case. In this case the main simplification function is as
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   616
follows
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   617
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   618
\begin{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   619
\begin{tabular}{l}
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   620
$simp(r)$:\hspace{5cm} (continued)\\
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   621
\quad case $r = r_1 \cdot r_2$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   622
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   623
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   624
\qquad case $r_{1s} = \ZERO$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   625
       return $(\ZERO, f_{error})$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   626
\qquad case $r_{2s} = \ZERO$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   627
       return $(\ZERO, f_{error})$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   628
\qquad case $r_{1s} = \ONE$: 
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   629
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   630
\qquad case $r_{2s} = \ONE$: 
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   631
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   632
\qquad otherwise: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   633
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   634
\end{tabular}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   635
\end{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   636
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   637
\noindent whereby in the last line $f_{seq}$ is again pushing
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   638
the two rectification functions into the two components of the
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   639
Seq-value:
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   640
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   641
\begin{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   642
\begin{tabular}{l@{\hspace{1mm}}l}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   643
$f_{seq}(f_1, f_2) \dn$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   644
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: 
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   645
      & return $Seq(f_1(v_1), f_2(v_2))$\\
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   646
\end{tabular}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   647
\end{center}
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   648
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   649
\noindent Note that in the case of $r_{1s} = \ZERO$
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   650
(similarly $r_{2s}$) we use the function $f_{error}$ for
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   651
rectification. If you think carefully, then you will realise
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   652
that this function will actually never been called. This is
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   653
because a sequence with $\ZERO$ will never recognise any
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   654
string and therefore the second phase of the algorithm would
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   655
never been called. The simplification function still expects
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   656
us to give a function. So in my own implementation I just
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   657
returned a function that raises an error. In the case
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   658
where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   659
to create a sequence where the first component is a rectified
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   660
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   661
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   662
Since we only simplify regular expressions of the form $r_1 +
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   663
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   664
in the remaining cases. The rectification function will
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   665
be just the identity, which in lambda-calculus terms is
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   666
just
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   667
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   668
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   669
$\lambda v.\,v$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   670
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   671
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   672
\noindent This completes the high-level version of the simplification
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   673
function, which is shown again in Figure~\ref{simp}. The Scala code
444
3056a4c071b0 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 422
diff changeset
   674
for the simplification function is in Figure~\ref{simprect}.
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   675
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   676
\begin{figure}[t]
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   677
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   678
\begin{tabular}{l}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   679
$simp(r)$:\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   680
\quad case $r = r_1 + r_2$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   681
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   682
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   683
\qquad case $r_{1s} = \ZERO$: 
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   684
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   685
\qquad case $r_{2s} = \ZERO$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   686
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   687
\qquad case $r_{1s} = r_{2s}$:
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   688
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   689
\qquad otherwise: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   690
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   691
       \medskip\\
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   692
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   693
\quad case $r = r_1 \cdot r_2$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   694
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   695
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   696
\qquad case $r_{1s} = \ZERO$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   697
       return $(\ZERO, f_{error})$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   698
\qquad case $r_{2s} = \ZERO$: 
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   699
       return $(\ZERO, f_{error})$\\
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   700
\qquad case $r_{1s} = \ONE$: 
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   701
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   702
\qquad case $r_{2s} = \ONE$: 
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   703
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   704
\qquad otherwise: 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   705
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   706
       \medskip\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   707
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   708
\quad otherwise:\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   709
\qquad return $(r, \lambda v.\,v)$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   710
\end{tabular}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   711
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   712
\caption{The simplification function that returns a simplified 
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   713
regular expression and a rectification function.\label{simp}}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   714
\end{figure}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   715
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   716
\begin{figure}[p]
492
39b7ff2cf1bc updated
Christian Urban <urbanc@in.tum.de>
parents: 490
diff changeset
   717
\small  
490
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   718
\lstinputlisting[numbers=left,linebackgroundcolor=
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   719
   {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   720
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   721
\caption{The Scala code for the simplification function. The
520
2849c305b12d updated
cu
parents: 492
diff changeset
   722
first part defines some auxiliary functions for the rectification.
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   723
The second part give the simplification function.
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   724
\label{simprect}}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   725
\end{figure}
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   726
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   727
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   728
We are now in the position to define a \emph{lexing function} as follows:
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   729
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   730
\begin{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   731
\begin{tabular}{lcl}
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   732
$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   733
             &       & else $error$\\
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   734
$lex\,r\,c\!::\!s$ & $\dn$ & let 
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   735
   $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   736
& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
287
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   737
\end{tabular}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   738
\end{center}
2c50b8b5886c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 286
diff changeset
   739
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   740
\noindent This corresponds to the $matches$ function we have
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   741
seen in earlier lectures. In the first clause we are given an
520
2849c305b12d updated
cu
parents: 492
diff changeset
   742
empty string, $[]$, and need to test whether the regular
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   743
expression is $nullable$. If yes, we can proceed normally and
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   744
just return the value calculated by $\textit{mkeps}$. The second clause
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   745
is for strings where the first character is $c$, say, and the
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   746
rest of the string is $s$. We first build the derivative of
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   747
$r$ with respect to $c$; simplify the resulting regular
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   748
expression. We continue lexing with the simplified regular
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   749
expression and the string $s$. Whatever will be returned as
874
ffe02fd574a5 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 792
diff changeset
   750
value, we still need to rectify using the $f_{rect}$ from the
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   751
simplification and finally inject $c$ back into the
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   752
(rectified) value.
286
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   753
19020b75d75e updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 285
diff changeset
   754
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   755
\subsubsection*{Records}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   756
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   757
Remember we wanted to tokenize input strings, that means
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   758
splitting strings into their ``word'' components. Furthermore
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   759
we want to classify each token as being a keyword or
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   760
identifier and so on. For this one more feature will be
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   761
required, which I call a \emph{record} regular expression.
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   762
While values encode how a regular expression matches a string,
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   763
records can be used to ``focus'' on some particular parts of
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   764
the regular expression and ``forget'' about others. 
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   765
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   766
Let us look at an example. Suppose you have the regular
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   767
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   768
recognise two strings. But suppose you are not interested
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   769
whether it can recognise $ab$ or $ac$, but rather if it
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   770
matched, then what was the last character of the matched
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   771
string\ldots either $b$ or $c$. You can do this by annotating
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   772
the regular expression with a record, written in general
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   773
$(x:r)$, where $x$ is just an identifier (in my implementation
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   774
a plain string) and $r$ is a regular expression. A record will
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   775
be regarded as a regular expression. The extended definition
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   776
in Scala therefore looks as follows:
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   777
490
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   778
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   779
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   780
{../progs/app03.scala}}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   781
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   782
\noindent Since we regard records as regular expressions we
669
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   783
need to extend the functions $\nullable$ and $\der$. Similarly
2f5a4d76756d updated
Christian Urban <urbanc@in.tum.de>
parents: 643
diff changeset
   784
$\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   785
to extend the definition of values, which in Scala looks as
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   786
follows:
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   787
490
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   788
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
4fee50f38305 updated
Christian Urban <urbanc@in.tum.de>
parents: 444
diff changeset
   789
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   790
{../progs/app04.scala}}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   791
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   792
\noindent Let us now look at the purpose of records more
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   793
closely and let us return to our question whether the string
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   794
terminated in a $b$ or $c$. We can do this as follows: we
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   795
annotate the regular expression $ab + ac$ with a record
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   796
as follows
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   797
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   798
\begin{center}
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   799
$a\cdot (x:b) + a\cdot (x:c)$
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   800
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   801
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   802
\noindent This regular expression can still only recognise
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   803
the strings $ab$ and $ac$, but we can now use a function
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   804
that takes a value and returns all records. I call this
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   805
function \emph{env} for environment\ldots it builds a list
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   806
of identifiers associated with a string. This function
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   807
can be defined as follows:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   808
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   809
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   810
\begin{tabular}{lcl}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   811
  $env(Empty)$     & $\dn$ & $[]$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   812
  $env(Char(c))$   & $\dn$ & $[]$\\
400
e4afe3f46c29 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 394
diff changeset
   813
  $env(\Left(v))$   & $\dn$ & $env(v)$\\
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   814
  $env(Right(v))$  & $\dn$ & $env(v)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   815
  $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\
422
5deefcc8cffa updated programs
Christian Urban <urbanc@in.tum.de>
parents: 417
diff changeset
   816
  $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & 
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   817
     $env(v_1) \,@\ldots @\, env(v_n)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   818
  $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   819
\end{tabular}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   820
\end{center}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   821
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   822
\noindent where in the last clause we use the flatten function
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   823
defined earlier. As can be seen, the function $env$ ``picks''
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   824
out all underlying strings where a record is given. Since
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   825
there can be more than one, the environment will potentially
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   826
contain many ``records''. If we now postprocess the value
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   827
calculated by $lex$ extracting all records using $env$, we can
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   828
answer the question whether the last element in the string was
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   829
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   830
+ a\cdot c$ and $ac$ the calculated value will be
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   831
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   832
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   833
$Right(Seq(Char(a), Char(c)))$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   834
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   835
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   836
\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   837
use the $env$ function to extract the recording for 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   838
$x$ we obtain
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   839
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   840
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   841
$[(x:c)]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   842
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   843
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   844
\noindent If we had given the string $ab$ instead, then the
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   845
record would have been $[(x:b)]$. The fun starts if we 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   846
iterate this. Consider the regular expression 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   847
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   848
\begin{center}
326
94700593a2d5 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 319
diff changeset
   849
$(a\cdot (x:b) + a\cdot (y:c))^*$
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   850
\end{center}
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   851
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   852
\noindent and the string $ababacabacab$. This string is 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   853
clearly matched by the regular expression, but we are only
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   854
interested in the sequence of $b$s and $c$s. Using $env$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   855
we obtain
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   856
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   857
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   858
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   859
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   860
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   861
\noindent While this feature might look silly, it is in fact
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   862
quite useful. For example if we want to match the name of
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   863
an email we might use the regular expression
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   864
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   865
\[
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   866
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
792
34132a854d03 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 778
diff changeset
   867
(domain: [a\mbox{-}z0\mbox{-}9\,-]^+)\cdot .\cdot 
288
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   868
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   869
\]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   870
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   871
\noindent Then if we match the email address
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   872
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   873
\[
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   874
\texttt{christian.urban@kcl.ac.uk}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   875
\]
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   876
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   877
\noindent we can use the $env$ function and find out
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   878
what the name, domain and top-level part of the email
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   879
address are:
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   880
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   881
\begin{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   882
$[(name:\texttt{christian.urban}), 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   883
  (domain:\texttt{kcl}), 
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   884
  (top\_level:\texttt{ac.uk})]$
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   885
\end{center}
39aeca14af8c updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 287
diff changeset
   886
721
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   887
Recall that we want to lex a little programming language, called the
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   888
\emph{While}-language. A simple program in this language is shown in
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   889
Figure~\ref{while}. The main keywords in this language are
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   890
\pcode{while}, \pcode{if}, \pcode{then} and
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   891
\pcode{else}.\footnote{Contrast this with the COBOL programming
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   892
  language, which was developed around 1960 and thought to be dead for
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   893
  many decades---even your friendly lecturer is not old enough to have
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   894
  been taught this language. Anyway, this language had over 600
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   895
  keywords (or what they called \emph{reserved words}). Interestingly
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   896
  though this language is still used in 2020: during the height of
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   897
  Corona crisis the State of New Jewrsey in the US was looking for
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   898
  COBOL programers who could fix the state's national insurance
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   899
  webpage. You were probably paid in gold and diamonds, if you were
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   900
  able to program in COBOL. If you fixed their webpage, surely you
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   901
  were allowed to marry the governer's son/daughter. } As usual we
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   902
have syntactic categories for identifiers, operators, numbers and so
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   903
on. For this we would need to design the corresponding regular
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   904
expressions to recognise these syntactic categories. I let you do this
e3c64f22dd31 added slides from Rochester
Christian Urban <christian.urban@kcl.ac.uk>
parents: 720
diff changeset
   905
design task. Having these regular expressions at our disposal, we can
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   906
form the regular expression
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   907
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   908
\begin{figure}[t]
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   909
\begin{center}
778
3e5f5d19f514 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 721
diff changeset
   910
\mbox{\lstinputlisting[language=while]{../progs/while-tests/fib.while}}
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   911
\end{center}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   912
\caption{The Fibonacci program in the While-language.\label{while}}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   913
\end{figure}
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   914
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   915
\begin{center}
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   916
$\textit{WhileRegs} \;\dn\; 
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   917
\left(\begin{array}{l}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   918
       (k, KeyWords)\; +\\
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   919
       (i, Ids)\;+\\
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   920
       (o, Ops)\;+ \\
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   921
       (n, Nums)\;+ \\
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   922
       (s, Semis)\;+ \\
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   923
       (p, (LParens + RParens))\;+\\ 
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   924
       (b, (Begin + End))\;+ \\
551
bd551ede2be6 updated
Christian Urban <urbanc@in.tum.de>
parents: 520
diff changeset
   925
       (w, WhiteSpaces)
394
2f9fe225ecc8 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 357
diff changeset
   926
      \end{array}\right)^{\mbox{\LARGE{}*}}$
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   927
\end{center}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   928
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   929
\noindent and ask the algorithm by Sulzmann \& Lu to lex, say
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   930
the following string
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   931
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   932
\begin{center}\large
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   933
\code{"if true then then 42 else +"}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   934
\end{center}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   935
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   936
\noindent
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   937
By using the records and extracting the environment, the 
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   938
result is the following list:
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   939
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   940
\begin{center}\tt
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   941
\begin{tabular}{l}
643
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   942
(k, if),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   943
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   944
(i, true),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   945
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   946
(k, then),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   947
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   948
(k, then),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   949
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   950
(n, 42),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   951
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   952
(k, else),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   953
(w, " "),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   954
(o, +)
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   955
\end{tabular}
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   956
\end{center}
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   957
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   958
\noindent Typically we are not interested in the whitespaces 
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   959
and comments and would filter them out: this gives
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   960
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   961
\begin{center}\tt
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   962
\begin{tabular}{l}
643
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   963
  (k, if),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   964
  (i, true),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   965
  (k, then),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   966
  (k, then),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   967
  (n, 42),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   968
  (k, else),\\ 
08375ca3874e updated
Christian Urban <urbanc@in.tum.de>
parents: 596
diff changeset
   969
  (o, +)
357
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   970
\end{tabular}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   971
\end{center}
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   972
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   973
\noindent
603e171a7b48 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 356
diff changeset
   974
which will be the input for the next phase of our compiler.
356
d9c784c71305 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 352
diff changeset
   975
251
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   976
\end{document}
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   977
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   978
%%% Local Variables: 
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   979
%%% mode: latex
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   980
%%% TeX-master: t
5b5a68df6d16 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   981
%%% End: