| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Tue, 03 Oct 2023 23:23:57 +0100 | |
| changeset 937 | b3a237a5f4ad | 
| parent 935 | 3fb9b05465dd | 
| child 940 | 1c1fbf45a03c | 
| permissions | -rw-r--r-- | 
| 669 | 1  | 
% !TEX program = xelatex  | 
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
2  | 
\documentclass{article}
 | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
3  | 
\usepackage{../style}
 | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
4  | 
\usepackage{../langs}
 | 
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
5  | 
\usepackage{../graphics}
 | 
| 671 | 6  | 
\usepackage{skull}
 | 
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
7  | 
|
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
8  | 
\begin{document}
 | 
| 669 | 9  | 
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
 | 
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
10  | 
|
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
11  | 
\section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 | 
| 
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
12  | 
|
| 716 | 13  | 
So far our algorithm based on derivatives is only able to say  | 
14  | 
yes or no depending on whether a string is matched by a regular  | 
|
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
15  | 
expression or not. Often a more interesting question is to  | 
| 
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
16  | 
find out \emph{how} a regular expression matched a string?
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
17  | 
Answering this question will also help us with the problem we  | 
| 
319
 
e7b110f93697
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
296 
diff
changeset
 | 
18  | 
are after, namely tokenising an input string.  | 
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
19  | 
|
| 935 | 20  | 
The algorithm we will be looking at in this lecture was designed by  | 
21  | 
Sulzmann \& Lu in a rather recent research paper (from 2014). A link  | 
|
22  | 
to it is provided on KEATS, in case you are interested.\footnote{In my
 | 
|
23  | 
humble opinion this is an interesting instance of the research  | 
|
24  | 
literature: it contains very clever ideas, but its presentation is  | 
|
25  | 
rather sloppy. In earlier versions of this paper, students and I  | 
|
26  | 
found several rather annoying typos in the examples and definitions;  | 
|
27  | 
we even found downright errors in their work.} Together with my former PhD  | 
|
28  | 
students Fahad Ausaf and Chengsong Tan we wrote several papers where  | 
|
29  | 
we build on their result: we provided a mathematical proof that their  | 
|
30  | 
algorithm is really correct---the proof Sulzmann \& Lu had originally  | 
|
31  | 
given contained major flaws. Such correctness proofs are important:  | 
|
32  | 
Kuklewicz maintains a unit-test library for the kind of algorithms we  | 
|
33  | 
are interested in here and he showed that many implementations in the  | 
|
34  | 
``wild'' are buggy, that is not satisfy his unit tests:  | 
|
| 422 | 35  | 
|
36  | 
\begin{center}
 | 
|
37  | 
\url{http://www.haskell.org/haskellwiki/Regex_Posix}
 | 
|
38  | 
\end{center}
 | 
|
39  | 
||
40  | 
||
| 669 | 41  | 
Coming back to the algorithm by Sulzmann \& Lu, their idea is to  | 
42  | 
introduce \emph{values} for producing an answer for \emph{how} a regular
 | 
|
43  | 
expression matches a string. So rather than a boolean like in the  | 
|
44  | 
Brzozowski algorithm, a value will be the output of the Sulzman \& Lu  | 
|
45  | 
algorithm whenever the regular expression matches the string. If the  | 
|
46  | 
string does not match the string, an error will be raised. The  | 
|
47  | 
definition for values is given below. They are shown together with the  | 
|
48  | 
regular expressions $r$ to which they correspond:  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
49  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
50  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
51  | 
\begin{tabular}{cc}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
52  | 
\begin{tabular}{@{}rrl@{}}
 | 
| 
319
 
e7b110f93697
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
296 
diff
changeset
 | 
53  | 
\multicolumn{3}{c}{regular expressions}\medskip\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
54  | 
$r$ & $::=$ & $\ZERO$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
55  | 
& $\mid$ & $\ONE$ \\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
56  | 
& $\mid$ & $c$ \\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
57  | 
& $\mid$ & $r_1 \cdot r_2$\\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
58  | 
& $\mid$ & $r_1 + r_2$ \\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
59  | 
\\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
60  | 
& $\mid$ & $r^*$ \\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
61  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
62  | 
&  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
63  | 
\begin{tabular}{@{\hspace{0mm}}rrl@{}}
 | 
| 
319
 
e7b110f93697
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
296 
diff
changeset
 | 
64  | 
\multicolumn{3}{c}{values}\medskip\\
 | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
65  | 
$v$ & $::=$ & \\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
66  | 
& & $Empty$ \\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
67  | 
& $\mid$ & $Char(c)$ \\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
68  | 
& $\mid$ & $Seq(v_1,v_2)$\\  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
69  | 
& $\mid$ & $\Left(v)$ \\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
70  | 
& $\mid$ & $Right(v)$ \\  | 
| 422 | 71  | 
& $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
72  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
73  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
74  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
75  | 
|
| 417 | 76  | 
\noindent There is no value for the  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
77  | 
$\ZERO$ regular expression, since it does not match any  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
78  | 
string. Otherwise there is exactly one value corresponding to  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
79  | 
each regular expression with the exception of $r_1 + r_2$  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
80  | 
where there are two values, namely $\Left(v)$ and $Right(v)$  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
81  | 
corresponding to the two alternatives. Note that $r^*$ is  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
82  | 
associated with a list of values, one for each copy of $r$  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
83  | 
that was needed to match the string. This means we might also  | 
| 669 | 84  | 
return the empty list $Stars\,[]$, if no copy was needed  | 
| 520 | 85  | 
for $r^*$. For sequence, there is exactly one value, composed  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
86  | 
of two component values ($v_1$ and $v_2$).  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
87  | 
|
| 422 | 88  | 
My implementation of regular expressions and values in Scala is  | 
| 716 | 89  | 
shown below. I use the convention that  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
90  | 
regular expressions are written entirely with upper-case  | 
| 716 | 91  | 
letters, whereas values start with a single upper-case  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
92  | 
character and the rest are lower-case letters.  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
93  | 
|
| 490 | 94  | 
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
95  | 
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | 
|
| 
352
 
1e1b0fe66107
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
350 
diff
changeset
 | 
96  | 
{../progs/app01.scala}}
 | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
97  | 
|
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
98  | 
|
| 490 | 99  | 
{\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
100  | 
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
101  | 
{../progs/app02.scala}}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
102  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
103  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
104  | 
Graphically the algorithm by Sulzmann \& Lu can be illustrated  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
105  | 
by the picture in Figure~\ref{Sulz} where the path from the
 | 
| 669 | 106  | 
left to the right involving $\der/\nullable$ is the first phase  | 
107  | 
of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
 | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
108  | 
the second phase. This picture shows the steps required when a  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
109  | 
regular expression, say $r_1$, matches the string $abc$. We  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
110  | 
first build the three derivatives (according to $a$, $b$ and  | 
| 669 | 111  | 
$c$). We then use $\nullable$ to find out whether the resulting  | 
| 
296
 
796b9b81ac8d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
288 
diff
changeset
 | 
112  | 
regular expression can match the empty string. If yes, we call  | 
| 422 | 113  | 
the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
 | 
114  | 
expression has matched the empty string. Its definition  | 
|
115  | 
is as follows:  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
116  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
117  | 
\begin{figure}[t]
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
118  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
119  | 
\begin{tikzpicture}[scale=2,node distance=1.2cm,
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
120  | 
                    every node/.style={minimum size=7mm}]
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
121  | 
\node (r1)  {$r_1$};
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
122  | 
\node (r2) [right=of r1]{$r_2$};
 | 
| 669 | 123  | 
\draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
124  | 
\node (r3) [right=of r2]{$r_3$};
 | 
| 669 | 125  | 
\draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
126  | 
\node (r4) [right=of r3]{$r_4$};
 | 
| 669 | 127  | 
\draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
 | 
128  | 
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
 | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
129  | 
\node (v4) [below=of r4]{$v_4$};
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
130  | 
\draw[->,line width=1mm](r4) -- (v4);  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
131  | 
\node (v3) [left=of v4] {$v_3$};
 | 
| 669 | 132  | 
\draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
133  | 
\node (v2) [left=of v3]{$v_2$};
 | 
| 669 | 134  | 
\draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
135  | 
\node (v1) [left=of v2] {$v_1$};
 | 
| 669 | 136  | 
\draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
 | 
| 422 | 137  | 
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
138  | 
\end{tikzpicture}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
139  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
140  | 
\caption{The two phases of the algorithm by Sulzmann \& Lu.
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
141  | 
\label{Sulz}}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
142  | 
\end{figure}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
143  | 
|
| 422 | 144  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
145  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
146  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
147  | 
\begin{tabular}{lcl}
 | 
| 422 | 148  | 
  $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
 | 
| 669 | 149  | 
  $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
 | 
| 422 | 150  | 
                          &       & then $\Left(\textit{mkeps}(r_1))$\\
 | 
151  | 
                          &       & else $Right(\textit{mkeps}(r_2))$\\
 | 
|
152  | 
  $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
 | 
|
153  | 
  $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
154  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
155  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
156  | 
|
| 596 | 157  | 
\noindent There are no cases for $\ZERO$ and $c$, since  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
158  | 
these regular expression cannot match the empty string. Note  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
159  | 
also that in case of alternatives we give preference to the  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
160  | 
regular expression on the left-hand side. This will become  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
161  | 
important later on.  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
162  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
163  | 
The second phase of the algorithm is organised so that it will  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
164  | 
calculate a value for how the derivative regular expression  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
165  | 
has matched a string. For this we need a function that  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
166  | 
reverses this ``chopping off'' for values which we did in the  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
167  | 
first phase for derivatives. The corresponding function is  | 
| 669 | 168  | 
called $\inj$ (short for injection). This function takes three  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
169  | 
arguments: the first one is a regular expression for which we  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
170  | 
want to calculate the value, the second is the character we  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
171  | 
want to inject and the third argument is the value where we  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
172  | 
will inject the character into. The result of this function is a  | 
| 669 | 173  | 
new value. The definition of $\inj$ is as follows:  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
174  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
175  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
176  | 
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 669 | 177  | 
$\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\  | 
178  | 
$\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\  | 
|
179  | 
$\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\  | 
|
180  | 
$\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\  | 
|
181  | 
$\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\  | 
|
182  | 
  $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
 | 
|
183  | 
$\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
184  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
185  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
186  | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
187  | 
\noindent This definition is by recursion on the regular  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
188  | 
expression and by analysing the shape of the values. Therefore  | 
| 422 | 189  | 
there are three cases for sequence regular  | 
190  | 
expressions (for all possible shapes of the value we can encounter). The last  | 
|
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
191  | 
clause for the star regular expression returns a list where  | 
| 669 | 192  | 
the first element is $\inj\,r\,c\,v$ and the other elements are  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
193  | 
$vs$. That means $\_\,::\,\_$ should be read as list cons.  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
194  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
195  | 
To understand what is going on, it might be best to do some  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
196  | 
example calculations and compare them with Figure~\ref{Sulz}.
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
197  | 
For this note that we have not yet dealt with the need of  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
198  | 
simplifying regular expressions (this will be a topic on its  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
199  | 
own later). Suppose the regular expression is $a \cdot (b  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
200  | 
\cdot c)$ and the input string is $abc$. The derivatives from  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
201  | 
the first phase are as follows:  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
202  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
203  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
204  | 
\begin{tabular}{ll}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
205  | 
$r_1$: & $a \cdot (b \cdot c)$\\  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
206  | 
$r_2$: & $\ONE \cdot (b \cdot c)$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
207  | 
$r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
208  | 
$r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
209  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
210  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
211  | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
212  | 
\noindent According to the simple algorithm, we would test  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
213  | 
whether $r_4$ is nullable, which in this case it indeed is.  | 
| 422 | 214  | 
This means we can use the function $\textit{mkeps}$ to calculate a
 | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
215  | 
value for how $r_4$ was able to match the empty string.  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
216  | 
Remember that this function gives preference for alternatives  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
217  | 
on the left-hand side. However there is only $\ONE$ on the  | 
| 422 | 218  | 
very right-hand side of $r_4$ (underlined)  | 
219  | 
||
220  | 
\begin{center}
 | 
|
221  | 
$r_4$:\;$(\ZERO \cdot (b \cdot c)) +  | 
|
222  | 
         ((\ZERO \cdot c) + \underline{\ONE})$\\
 | 
|
223  | 
\end{center}
 | 
|
224  | 
||
225  | 
\noindent  | 
|
226  | 
that matches the empty string.  | 
|
227  | 
Therefore $\textit{mkeps}$ returns the value
 | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
228  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
229  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
230  | 
$v_4:\;Right(Right(Empty))$  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
231  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
232  | 
|
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
233  | 
\noindent If there had been a $\ONE$ on the left, then  | 
| 422 | 234  | 
$\textit{mkeps}$ would have returned something of the form
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
235  | 
$\Left(\ldots)$. The point is that from this value we can  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
236  | 
directly read off which part of $r_4$ matched the empty  | 
| 
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
237  | 
string: take the right-alternative first, and then the  | 
| 669 | 238  | 
right-alternative again, then you got to the $\ONE$.  | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
239  | 
|
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
240  | 
Next we have to ``inject'' the last character, that is $c$ in  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
241  | 
the running example, into this value $v_4$ in order to  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
242  | 
calculate how $r_3$ could have matched the string $c$.  | 
| 669 | 243  | 
For this we call injection with $\inj(r_3, c, v_4)$.  | 
244  | 
According to the definition of $\inj$ we obtain  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
245  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
246  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
247  | 
$v_3:\;Right(Seq(Empty, Char(c)))$  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
248  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
249  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
250  | 
\noindent This is the correct result, because $r_3$ needs  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
251  | 
to use the right-hand alternative, and then $\ONE$ needs  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
252  | 
to match the empty string and $c$ needs to match $c$.  | 
| 596 | 253  | 
Next we need to inject back the letter $b$ into $v_3$.  | 
| 669 | 254  | 
For this we call injection with $\inj(r_2, b, v_3)$.  | 
| 596 | 255  | 
This gives  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
256  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
257  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
258  | 
$v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
259  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
260  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
261  | 
\noindent which is again the correct result for how $r_2$  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
262  | 
matched the string $bc$. Finally we need to inject back the  | 
| 596 | 263  | 
letter $a$ into $v_2$ giving the final result.  | 
| 669 | 264  | 
For this we call injection with $\inj(r_1, a, v_2)$  | 
| 596 | 265  | 
and obtain  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
266  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
267  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
268  | 
$v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
269  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
270  | 
|
| 596 | 271  | 
\noindent This value corresponds to how the regular expression $r_1$,  | 
272  | 
namely $a \cdot (b \cdot c)$, matched the string $abc$.  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
273  | 
|
| 669 | 274  | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
275  | 
There are a few auxiliary functions that are of interest  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
276  | 
when analysing this algorithm. One is called \emph{flatten},
 | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
277  | 
written $|\_|$, which extracts the string ``underlying'' a  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
278  | 
value. It is defined recursively as  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
279  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
280  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
281  | 
\begin{tabular}{lcl}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
282  | 
$|Empty|$ & $\dn$ & $[]$\\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
283  | 
$|Char(c)|$ & $\dn$ & $[c]$\\  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
284  | 
$|\Left(v)|$ & $\dn$ & $|v|$\\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
285  | 
$|Right(v)|$ & $\dn$ & $|v|$\\  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
286  | 
$|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\  | 
| 422 | 287  | 
$|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
288  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
289  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
290  | 
|
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
291  | 
\noindent Using flatten we can see what are the strings behind  | 
| 669 | 292  | 
the values calculated by $\textit{mkeps}$ and $\inj$. In our running 
 | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
293  | 
example:  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
294  | 
|
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
295  | 
\begin{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
296  | 
\begin{tabular}{ll}
 | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
297  | 
$|v_4|$: & $[]$\\  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
298  | 
$|v_3|$: & $c$\\  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
299  | 
$|v_2|$: & $bc$\\  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
300  | 
$|v_1|$: & $abc$  | 
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
301  | 
\end{tabular}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
302  | 
\end{center}
 | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
303  | 
|
| 669 | 304  | 
\noindent This indicates that $\inj$ indeed is injecting, or  | 
| 422 | 305  | 
adding, back a character into the value.  | 
| 
350
 
c4e7caa06c74
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
326 
diff
changeset
 | 
306  | 
|
| 669 | 307  | 
The definition of $\inj$ might still be very puzzling and each clause  | 
308  | 
might look arbitrary, but there is in fact a relatively simple idea  | 
|
309  | 
behind it. Ultimately the $\inj$-functions is determined by the  | 
|
310  | 
derivative functions. For this consider one of the ``squares'' from  | 
|
311  | 
Figure~\ref{Sulz}:
 | 
|
312  | 
||
| 720 | 313  | 
\begin{center}
 | 
314  | 
\begin{tikzpicture}[scale=2,node distance=1.2cm,
 | 
|
315  | 
                    every node/.style={minimum size=7mm}]
 | 
|
316  | 
\node (r)  {$r$};
 | 
|
317  | 
\node (rd) [right=of r]{$r_{der}$};
 | 
|
318  | 
\draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
 | 
|
319  | 
\node (vd) [below=of r2]{$v_{der}$};
 | 
|
320  | 
\draw[->,line width=1mm](rd) -- (vd);  | 
|
321  | 
\node (v) [left=of vd] {$v$};
 | 
|
322  | 
\draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
 | 
|
323  | 
\draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
 | 
|
324  | 
\end{tikzpicture}
 | 
|
325  | 
\end{center}
 | 
|
| 669 | 326  | 
|
327  | 
\noindent  | 
|
328  | 
The input to the $\inj$-function is $r$ (on the top left), $c$ (the  | 
|
329  | 
character to be injected) and $v_{der}$ (bottom right). The output is
 | 
|
330  | 
the value $v$ for how the regular expression $r$ matched the  | 
|
331  | 
corresponding string. How does $\inj$ calculate this value? Well, it has  | 
|
332  | 
to analyse the value $v_{der}$ and transform it into the value $v$ for
 | 
|
333  | 
the regular expression $r$. The value $v_{der}$ corresponds to the
 | 
|
334  | 
$r_{der}$-regular expression which is the derivative of $r$. Remember
 | 
|
335  | 
that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
 | 
|
336  | 
where $c$ has been chopped off.  | 
|
337  | 
||
| 720 | 338  | 
For a concrete example, let $r$ be $r_1 + r_2$. Then $r_{der}$
 | 
| 669 | 339  | 
is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible  | 
340  | 
values corresponding to $r_{der}$? Well, they can be only of the form
 | 
|
341  | 
$\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two  | 
|
342  | 
cases in the $\inj$ function -- one for $\Left$ and one for $\Right$.  | 
|
343  | 
||
344  | 
\begin{center}
 | 
|
345  | 
\begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | 
|
346  | 
$\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\  | 
|
347  | 
$\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\  | 
|
348  | 
\end{tabular}
 | 
|
349  | 
\end{center}
 | 
|
350  | 
||
351  | 
\noindent  | 
|
352  | 
Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the  | 
|
353  | 
derivative of $r$ look like? According to the definition it is:  | 
|
354  | 
||
355  | 
\begin{center}
 | 
|
356  | 
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|
357  | 
$\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\  | 
|
358  | 
& & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\  | 
|
359  | 
& & else $(\der\, c\, r_1) \cdot r_2$\\  | 
|
360  | 
  \end{tabular}  
 | 
|
361  | 
\end{center}  
 | 
|
362  | 
||
363  | 
\noindent As you can see there is a derivative in the if-branch and another  | 
|
364  | 
in the else-branch. In the if-branch we have an alternative of the form  | 
|
365  | 
$\_ + \_$. We already know what the possible values are for such a regular  | 
|
366  | 
expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the  | 
|
367  | 
two cases:  | 
|
368  | 
||
369  | 
\begin{center}
 | 
|
370  | 
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | 
|
371  | 
$\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\  | 
|
372  | 
$\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\  | 
|
373  | 
\end{tabular}
 | 
|
374  | 
\end{center}
 | 
|
375  | 
||
376  | 
\noindent  | 
|
377  | 
In the first case we even know that it is not just a $\Left$-value, but  | 
|
378  | 
$\Left(Seq(\ldots))$, because the corresponding regular expression  | 
|
379  | 
in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know 
 | 
|
380  | 
it must be a $Seq$-value enclosed inside $\Left$.  | 
|
381  | 
The third clause for $r_1\cdot r_2$ in the $\inj$-function  | 
|
382  | 
||
383  | 
\begin{center}
 | 
|
384  | 
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|
385  | 
$\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\  | 
|
386  | 
  \end{tabular}
 | 
|
387  | 
  \end{center}
 | 
|
388  | 
||
389  | 
\noindent corresponds to the else-branch in the derivative. In this  | 
|
390  | 
case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and  | 
|
391  | 
therefore the value must be of the form $Seq(\ldots)$.  | 
|
392  | 
||
393  | 
Hopefully the $inj$-function makes now more sense. I let you explain  | 
|
394  | 
the $r^*$ case. What do the derivative of $r^*$ and  | 
|
395  | 
the corresponding value look like? Does this explain the shape of  | 
|
396  | 
the clause?  | 
|
397  | 
||
398  | 
\begin{center}
 | 
|
399  | 
  \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
|
400  | 
$\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\  | 
|
401  | 
  \end{tabular}
 | 
|
402  | 
\end{center}
 | 
|
403  | 
||
404  | 
\noindent If yes, you made sense of the left-hand sides of  | 
|
405  | 
the $\inj$-definition.  | 
|
406  | 
||
407  | 
How about the right-hand sides? Well, in the  | 
|
408  | 
$r^*$ case for example we have according to the square in the picture  | 
|
409  | 
above a value $v_{der}$ which says how the derivative $r_{der}$
 | 
|
410  | 
matched the string. Since the derivative is of the form  | 
|
411  | 
$(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the  | 
|
412  | 
form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value  | 
|
413  | 
for the original (top left) regular expression --- so it cannot  | 
|
414  | 
be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the  | 
|
415  | 
shape for sequence regular expressions. What we need is a value  | 
|
416  | 
of the form $Stars \ldots$ (remember the correspondence between  | 
|
417  | 
values and regular expressions). This suggests to define the right hand  | 
|
418  | 
side as  | 
|
419  | 
||
420  | 
\begin{center}
 | 
|
421  | 
$\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$  | 
|
422  | 
\end{center}  
 | 
|
423  | 
||
424  | 
\noindent It is a value of the right shape, namely $Stars$. It injected  | 
|
| 874 | 425  | 
$c$ into the first-value, which is in fact the value where we need in order to  | 
| 669 | 426  | 
undo the derivative. Remember again the shape of the derivative of $r^*$.  | 
427  | 
In place where we chopped off the $c$, we now need to do the $\inj$ of $c$.  | 
|
428  | 
Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with  | 
|
429  | 
the other clauses in $\inj$.  | 
|
430  | 
||
431  | 
||
| 671 | 432  | 
Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is
 | 
433  | 
a gigantic problem with the described algorithm so far: it is very  | 
|
| 720 | 434  | 
slow. To make it faster, we have to include in all this the simplification  | 
435  | 
from Lecture 2\ldots{}and what rotten luck: simplification messes things 
 | 
|
| 935 | 436  | 
up and we need to \emph{rectify} the mess. This is what we shall do next.
 | 
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
437  | 
|
| 
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
438  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
439  | 
\subsubsection*{Simplification}
 | 
| 
282
 
3e3b927a85cf
added ho04
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
251 
diff
changeset
 | 
440  | 
|
| 
283
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
441  | 
Generally the matching algorithms based on derivatives do  | 
| 
 
c14e5ebf0c3b
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
282 
diff
changeset
 | 
442  | 
poorly unless the regular expressions are simplified after  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
443  | 
each derivative step. But this is a bit more involved in the  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
444  | 
algorithm of Sulzmann \& Lu. So what follows might require you  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
445  | 
to read several times before it makes sense and also might  | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
446  | 
require that you do some example calculations yourself. As a  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
447  | 
first example consider the last derivation step in our earlier  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
448  | 
example:  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
449  | 
|
| 422 | 450  | 
\begin{equation}\label{regexfour}
 | 
| 669 | 451  | 
r_4 = \der\,c\,r_3 =  | 
| 422 | 452  | 
(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)  | 
453  | 
\end{equation}
 | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
454  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
455  | 
\noindent Simplifying this regular expression would just give  | 
| 422 | 456  | 
us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
 | 
457  | 
input, however, would give us with the value $Empty$ instead of  | 
|
458  | 
||
459  | 
\[Right(Right(Empty))\]  | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
460  | 
|
| 422 | 461  | 
\noindent  | 
462  | 
that was obtained without the simplification. The problem is we need  | 
|
463  | 
to recreate this more complicated value, rather than just return  | 
|
464  | 
$Empty$. This will require what I call \emph{rectification functions}.
 | 
|
465  | 
They need to be calculated whenever a regular expression gets  | 
|
466  | 
simplified.  | 
|
467  | 
||
468  | 
Rectification functions take a value as argument and return a  | 
|
469  | 
(rectified) value. In the example above the argument would be $Empty$  | 
|
470  | 
and the output $Right(Right(Empty))$. Before we define these  | 
|
471  | 
rectification functions in general, let us first take a look again at  | 
|
472  | 
our simplification rules:  | 
|
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
473  | 
|
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
474  | 
\begin{center}
 | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
475  | 
\begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
476  | 
$r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
477  | 
$\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
478  | 
$r \cdot \ONE$ & $\mapsto$ & $r$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
479  | 
$\ONE \cdot r$ & $\mapsto$ & $r$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
480  | 
$r + \ZERO$ & $\mapsto$ & $r$\\  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
481  | 
$\ZERO + r$ & $\mapsto$ & $r$\\  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
482  | 
$r + r$ & $\mapsto$ & $r$\\  | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
483  | 
\end{tabular}
 | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
484  | 
\end{center}
 | 
| 
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
485  | 
|
| 422 | 486  | 
\noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
487  | 
simplifications in order end up with just $\ONE$. However,  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
488  | 
it is possible to apply them in a depth-first, or inside-out,  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
489  | 
manner in order to calculate this simplified regular  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
490  | 
expression.  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
491  | 
|
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
492  | 
The rectification we can implement by letting simp return not  | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
493  | 
just a (simplified) regular expression, but also a  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
494  | 
rectification function. Let us consider the alternative case,  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
495  | 
$r_1 + r_2$, first. By going depth-first, we first simplify  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
496  | 
the component regular expressions $r_1$ and $r_2.$ This will  | 
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
497  | 
return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
498  | 
component regular expressions (if they can be simplified) but  | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
499  | 
also two rectification functions $f_{1s}$ and $f_{2s}$. We
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
500  | 
need to assemble them in order to obtain a rectified value for  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
501  | 
$r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
 | 
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
502  | 
continue the derivative calculation with $r_{2s}$. The
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
503  | 
Sulzmann \& Lu algorithm would return a corresponding value,  | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
504  | 
say $v_{2s}$. But now this value needs to be ``rectified'' to
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
505  | 
the value  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
506  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
507  | 
\begin{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
508  | 
$Right(v_{2s})$
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
509  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
510  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
511  | 
\noindent The reason is that we look for the value that tells  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
512  | 
us how $r_1 + r_2$ could have matched the string, not just  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
513  | 
$r_2$ or $r_{2s}$. Unfortunately, this is still not the right
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
514  | 
value in general because there might be some simplifications  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
515  | 
that happened inside $r_2$ and for which the simplification  | 
| 520 | 516  | 
function returned also a rectification function $f_{2s}$. So in
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
517  | 
fact we need to apply this one too which gives  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
518  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
519  | 
\begin{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
520  | 
$Right(f_{2s}(v_{2s}))$
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
521  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
522  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
523  | 
\noindent This is now the correct, or rectified, value. Since  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
524  | 
the simplification will be done in the first phase of the  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
525  | 
algorithm, but the rectification needs to be done to the  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
526  | 
values in the second phase, it is advantageous to calculate  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
527  | 
the rectification as a function, remember this function and  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
528  | 
then apply the value to this function during the second phase.  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
529  | 
So if we want to implement the rectification as function, we  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
530  | 
would need to return  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
531  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
532  | 
\begin{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
533  | 
$\lambda v.\,Right(f_{2s}(v))$
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
534  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
535  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
536  | 
\noindent which is the lambda-calculus notation for  | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
537  | 
a function that expects a value $v$ and returns everything  | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
538  | 
after the dot where $v$ is replaced by whatever value is  | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
539  | 
given.  | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
540  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
541  | 
Let us package this idea with rectification functions  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
542  | 
into a single function (still only considering the alternative  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
543  | 
case):  | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
544  | 
|
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
545  | 
\begin{center}
 | 
| 422 | 546  | 
\begin{tabular}{ll}
 | 
547  | 
$_1$ & $simp(r)$:\\  | 
|
548  | 
$_2$ & \quad case $r = r_1 + r_2$\\  | 
|
549  | 
$_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
|
550  | 
$_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
|
551  | 
$_5$ & \qquad case $r_{1s} = \ZERO$: 
 | 
|
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
552  | 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 422 | 553  | 
$_6$ & \qquad case $r_{2s} = \ZERO$: 
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
554  | 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 555  | 
$_7$ & \qquad case $r_{1s} = r_{2s}$:
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
556  | 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 557  | 
$_8$ & \qquad otherwise:  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
558  | 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
559  | 
\end{tabular}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
560  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
561  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
562  | 
\noindent We first recursively call the simplification with  | 
| 422 | 563  | 
$r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions,  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
564  | 
$r_{1s}$ and $r_{2s}$, as well as two rectification functions
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
565  | 
$f_{1s}$ and $f_{2s}$. We next need to test whether the
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
566  | 
simplified regular expressions are $\ZERO$ so as to make  | 
| 422 | 567  | 
further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
568  | 
then we can return $r_{2s}$ (the other alternative). However
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
569  | 
for this we need to build a corresponding rectification  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
570  | 
function, which as said above is  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
571  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
572  | 
\begin{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
573  | 
$\lambda v.\,Right(f_{2s}(v))$
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
574  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
575  | 
|
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
576  | 
\noindent The case where $r_{2s} = \ZERO$ is similar:
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
577  | 
We return $r_{1s}$ and rectify with $\Left(\_)$ and the
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
578  | 
other calculated rectification function $f_{1s}$. This gives
 | 
| 
284
 
0afe43616b6a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
283 
diff
changeset
 | 
579  | 
|
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
580  | 
\begin{center}
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
581  | 
$\lambda v.\,\Left(f_{1s}(v))$
 | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
582  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
583  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
584  | 
\noindent The next case where $r_{1s} = r_{2s}$ can be treated
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
585  | 
like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
586  | 
and rectify with $\Left(\_)$ and so on.  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
587  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
588  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
589  | 
The otherwise-case is slightly more complicated. In this case  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
590  | 
neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
591  | 
$r_{1s} \not= r_{2s}$, which means no further simplification
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
592  | 
can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
593  | 
the simplified regular expression. In principle we also do not  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
594  | 
have to do any rectification, because no simplification was  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
595  | 
done in this case. But this is actually not true: There might  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
596  | 
have been simplifications inside $r_{1s}$ and $r_{2s}$. We
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
597  | 
therefore need to take into account the calculated  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
598  | 
rectification functions $f_{1s}$ and $f_{2s}$. We can do this
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
599  | 
by defining a rectification function $f_{alt}$ which takes two
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
600  | 
rectification functions as arguments and applies them  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
601  | 
according to whether the value is of the form $\Left(\_)$ or  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
602  | 
$Right(\_)$:  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
603  | 
|
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
604  | 
\begin{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
605  | 
\begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
606  | 
$f_{alt}(f_1, f_2) \dn$\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
607  | 
\qquad $\lambda v.\,$ case $v = \Left(v')$:  | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
608  | 
& return $\Left(f_1(v'))$\\  | 
| 
285
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
609  | 
\qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
610  | 
& return $Right(f_2(v'))$\\  | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
611  | 
\end{tabular}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
612  | 
\end{center}
 | 
| 
 
8a222559278f
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
284 
diff
changeset
 | 
613  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
614  | 
The other interesting case with simplification is the sequence  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
615  | 
case. In this case the main simplification function is as  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
616  | 
follows  | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
617  | 
|
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
618  | 
\begin{center}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
619  | 
\begin{tabular}{l}
 | 
| 422 | 620  | 
$simp(r)$:\hspace{5cm} (continued)\\
 | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
621  | 
\quad case $r = r_1 \cdot r_2$\\  | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
622  | 
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
623  | 
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
624  | 
\qquad case $r_{1s} = \ZERO$: 
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
625  | 
       return $(\ZERO, f_{error})$\\
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
626  | 
\qquad case $r_{2s} = \ZERO$: 
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
627  | 
       return $(\ZERO, f_{error})$\\
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
628  | 
\qquad case $r_{1s} = \ONE$: 
 | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
629  | 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
630  | 
\qquad case $r_{2s} = \ONE$: 
 | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
631  | 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
632  | 
\qquad otherwise:  | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
633  | 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
634  | 
\end{tabular}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
635  | 
\end{center}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
636  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
637  | 
\noindent whereby in the last line $f_{seq}$ is again pushing
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
638  | 
the two rectification functions into the two components of the  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
639  | 
Seq-value:  | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
640  | 
|
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
641  | 
\begin{center}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
642  | 
\begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
643  | 
$f_{seq}(f_1, f_2) \dn$\\
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
644  | 
\qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$:  | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
645  | 
& return $Seq(f_1(v_1), f_2(v_2))$\\  | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
646  | 
\end{tabular}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
647  | 
\end{center}
 | 
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
648  | 
|
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
649  | 
\noindent Note that in the case of $r_{1s} = \ZERO$
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
650  | 
(similarly $r_{2s}$) we use the function $f_{error}$ for
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
651  | 
rectification. If you think carefully, then you will realise  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
652  | 
that this function will actually never been called. This is  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
653  | 
because a sequence with $\ZERO$ will never recognise any  | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
654  | 
string and therefore the second phase of the algorithm would  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
655  | 
never been called. The simplification function still expects  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
656  | 
us to give a function. So in my own implementation I just  | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
657  | 
returned a function that raises an error. In the case  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
658  | 
where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
659  | 
to create a sequence where the first component is a rectified  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
660  | 
version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
661  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
662  | 
Since we only simplify regular expressions of the form $r_1 +  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
663  | 
r_2$ and $r_1 \cdot r_2$ we do not have to do anything else  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
664  | 
in the remaining cases. The rectification function will  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
665  | 
be just the identity, which in lambda-calculus terms is  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
666  | 
just  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
667  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
668  | 
\begin{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
669  | 
$\lambda v.\,v$  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
670  | 
\end{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
671  | 
|
| 422 | 672  | 
\noindent This completes the high-level version of the simplification  | 
673  | 
function, which is shown again in Figure~\ref{simp}. The Scala code
 | 
|
| 
444
 
3056a4c071b0
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
422 
diff
changeset
 | 
674  | 
for the simplification function is in Figure~\ref{simprect}.
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
675  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
676  | 
\begin{figure}[t]
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
677  | 
\begin{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
678  | 
\begin{tabular}{l}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
679  | 
$simp(r)$:\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
680  | 
\quad case $r = r_1 + r_2$\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
681  | 
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
682  | 
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
683  | 
\qquad case $r_{1s} = \ZERO$: 
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
684  | 
       return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
685  | 
\qquad case $r_{2s} = \ZERO$: 
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
686  | 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
687  | 
\qquad case $r_{1s} = r_{2s}$:
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
688  | 
       return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
689  | 
\qquad otherwise:  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
690  | 
       return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
691  | 
\medskip\\  | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
692  | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
693  | 
\quad case $r = r_1 \cdot r_2$\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
694  | 
\qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
695  | 
\qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
696  | 
\qquad case $r_{1s} = \ZERO$: 
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
697  | 
       return $(\ZERO, f_{error})$\\
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
698  | 
\qquad case $r_{2s} = \ZERO$: 
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
699  | 
       return $(\ZERO, f_{error})$\\
 | 
| 
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
700  | 
\qquad case $r_{1s} = \ONE$: 
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
701  | 
return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
702  | 
\qquad case $r_{2s} = \ONE$: 
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
703  | 
return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
704  | 
\qquad otherwise:  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
705  | 
       return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
706  | 
\medskip\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
707  | 
|
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
708  | 
\quad otherwise:\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
709  | 
\qquad return $(r, \lambda v.\,v)$\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
710  | 
\end{tabular}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
711  | 
\end{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
712  | 
\caption{The simplification function that returns a simplified 
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
713  | 
regular expression and a rectification function.\label{simp}}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
714  | 
\end{figure}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
715  | 
|
| 422 | 716  | 
\begin{figure}[p]
 | 
| 492 | 717  | 
\small  | 
| 490 | 718  | 
\lstinputlisting[numbers=left,linebackgroundcolor=  | 
719  | 
   {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
 | 
|
| 422 | 720  | 
|
721  | 
\caption{The Scala code for the simplification function. The
 | 
|
| 520 | 722  | 
first part defines some auxiliary functions for the rectification.  | 
| 422 | 723  | 
The second part give the simplification function.  | 
724  | 
\label{simprect}}
 | 
|
725  | 
\end{figure}
 | 
|
726  | 
||
727  | 
||
728  | 
We are now in the position to define a \emph{lexing function} as follows:
 | 
|
729  | 
||
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
730  | 
\begin{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
731  | 
\begin{tabular}{lcl}
 | 
| 669 | 732  | 
$lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
 | 
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
733  | 
& & else $error$\\  | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
734  | 
$lex\,r\,c\!::\!s$ & $\dn$ & let  | 
| 669 | 735  | 
   $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
 | 
736  | 
& & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
 | 
|
| 
287
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
737  | 
\end{tabular}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
738  | 
\end{center}
 | 
| 
 
2c50b8b5886c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
286 
diff
changeset
 | 
739  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
740  | 
\noindent This corresponds to the $matches$ function we have  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
741  | 
seen in earlier lectures. In the first clause we are given an  | 
| 520 | 742  | 
empty string, $[]$, and need to test whether the regular  | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
743  | 
expression is $nullable$. If yes, we can proceed normally and  | 
| 422 | 744  | 
just return the value calculated by $\textit{mkeps}$. The second clause
 | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
745  | 
is for strings where the first character is $c$, say, and the  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
746  | 
rest of the string is $s$. We first build the derivative of  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
747  | 
$r$ with respect to $c$; simplify the resulting regular  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
748  | 
expression. We continue lexing with the simplified regular  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
749  | 
expression and the string $s$. Whatever will be returned as  | 
| 874 | 750  | 
value, we still need to rectify using the $f_{rect}$ from the
 | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
751  | 
simplification and finally inject $c$ back into the  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
752  | 
(rectified) value.  | 
| 
286
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
753  | 
|
| 
 
19020b75d75e
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
285 
diff
changeset
 | 
754  | 
|
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
755  | 
\subsubsection*{Records}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
756  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
757  | 
Remember we wanted to tokenize input strings, that means  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
758  | 
splitting strings into their ``word'' components. Furthermore  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
759  | 
we want to classify each token as being a keyword or  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
760  | 
identifier and so on. For this one more feature will be  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
761  | 
required, which I call a \emph{record} regular expression.
 | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
762  | 
While values encode how a regular expression matches a string,  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
763  | 
records can be used to ``focus'' on some particular parts of  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
764  | 
the regular expression and ``forget'' about others.  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
765  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
766  | 
Let us look at an example. Suppose you have the regular  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
767  | 
expression $a\cdot b + a\cdot c$. Clearly this regular expression can only  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
768  | 
recognise two strings. But suppose you are not interested  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
769  | 
whether it can recognise $ab$ or $ac$, but rather if it  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
770  | 
matched, then what was the last character of the matched  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
771  | 
string\ldots either $b$ or $c$. You can do this by annotating  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
772  | 
the regular expression with a record, written in general  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
773  | 
$(x:r)$, where $x$ is just an identifier (in my implementation  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
774  | 
a plain string) and $r$ is a regular expression. A record will  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
775  | 
be regarded as a regular expression. The extended definition  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
776  | 
in Scala therefore looks as follows:  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
777  | 
|
| 490 | 778  | 
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
779  | 
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | 
|
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
780  | 
{../progs/app03.scala}}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
781  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
782  | 
\noindent Since we regard records as regular expressions we  | 
| 669 | 783  | 
need to extend the functions $\nullable$ and $\der$. Similarly  | 
784  | 
$\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
 | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
785  | 
to extend the definition of values, which in Scala looks as  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
786  | 
follows:  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
787  | 
|
| 490 | 788  | 
{\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
789  | 
                  {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | 
|
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
790  | 
{../progs/app04.scala}}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
791  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
792  | 
\noindent Let us now look at the purpose of records more  | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
793  | 
closely and let us return to our question whether the string  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
794  | 
terminated in a $b$ or $c$. We can do this as follows: we  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
795  | 
annotate the regular expression $ab + ac$ with a record  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
796  | 
as follows  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
797  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
798  | 
\begin{center}
 | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
799  | 
$a\cdot (x:b) + a\cdot (x:c)$  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
800  | 
\end{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
801  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
802  | 
\noindent This regular expression can still only recognise  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
803  | 
the strings $ab$ and $ac$, but we can now use a function  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
804  | 
that takes a value and returns all records. I call this  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
805  | 
function \emph{env} for environment\ldots it builds a list
 | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
806  | 
of identifiers associated with a string. This function  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
807  | 
can be defined as follows:  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
808  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
809  | 
\begin{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
810  | 
\begin{tabular}{lcl}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
811  | 
$env(Empty)$ & $\dn$ & $[]$\\  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
812  | 
$env(Char(c))$ & $\dn$ & $[]$\\  | 
| 
400
 
e4afe3f46c29
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
394 
diff
changeset
 | 
813  | 
$env(\Left(v))$ & $\dn$ & $env(v)$\\  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
814  | 
$env(Right(v))$ & $\dn$ & $env(v)$\\  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
815  | 
$env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\  | 
| 422 | 816  | 
$env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ &  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
817  | 
$env(v_1) \,@\ldots @\, env(v_n)$\\  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
818  | 
$env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
819  | 
\end{tabular}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
820  | 
\end{center}
 | 
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
821  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
822  | 
\noindent where in the last clause we use the flatten function  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
823  | 
defined earlier. As can be seen, the function $env$ ``picks''  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
824  | 
out all underlying strings where a record is given. Since  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
825  | 
there can be more than one, the environment will potentially  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
826  | 
contain many ``records''. If we now postprocess the value  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
827  | 
calculated by $lex$ extracting all records using $env$, we can  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
828  | 
answer the question whether the last element in the string was  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
829  | 
an $b$ or a $c$. Lets see this in action: if we use $a\cdot b  | 
| 
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
830  | 
+ a\cdot c$ and $ac$ the calculated value will be  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
831  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
832  | 
\begin{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
833  | 
$Right(Seq(Char(a), Char(c)))$  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
834  | 
\end{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
835  | 
|
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
836  | 
\noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
837  | 
use the $env$ function to extract the recording for  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
838  | 
$x$ we obtain  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
839  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
840  | 
\begin{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
841  | 
$[(x:c)]$  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
842  | 
\end{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
843  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
844  | 
\noindent If we had given the string $ab$ instead, then the  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
845  | 
record would have been $[(x:b)]$. The fun starts if we  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
846  | 
iterate this. Consider the regular expression  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
847  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
848  | 
\begin{center}
 | 
| 
326
 
94700593a2d5
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
319 
diff
changeset
 | 
849  | 
$(a\cdot (x:b) + a\cdot (y:c))^*$  | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
850  | 
\end{center}
 | 
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
851  | 
|
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
852  | 
\noindent and the string $ababacabacab$. This string is  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
853  | 
clearly matched by the regular expression, but we are only  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
854  | 
interested in the sequence of $b$s and $c$s. Using $env$  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
855  | 
we obtain  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
856  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
857  | 
\begin{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
858  | 
$[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
859  | 
\end{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
860  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
861  | 
\noindent While this feature might look silly, it is in fact  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
862  | 
quite useful. For example if we want to match the name of  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
863  | 
an email we might use the regular expression  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
864  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
865  | 
\[  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
866  | 
(name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
 | 
| 792 | 867  | 
(domain: [a\mbox{-}z0\mbox{-}9\,-]^+)\cdot .\cdot 
 | 
| 
288
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
868  | 
(top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
869  | 
\]  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
870  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
871  | 
\noindent Then if we match the email address  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
872  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
873  | 
\[  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
874  | 
\texttt{christian.urban@kcl.ac.uk}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
875  | 
\]  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
876  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
877  | 
\noindent we can use the $env$ function and find out  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
878  | 
what the name, domain and top-level part of the email  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
879  | 
address are:  | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
880  | 
|
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
881  | 
\begin{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
882  | 
$[(name:\texttt{christian.urban}), 
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
883  | 
  (domain:\texttt{kcl}), 
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
884  | 
  (top\_level:\texttt{ac.uk})]$
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
885  | 
\end{center}
 | 
| 
 
39aeca14af8c
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
287 
diff
changeset
 | 
886  | 
|
| 
721
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
887  | 
Recall that we want to lex a little programming language, called the  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
888  | 
\emph{While}-language. A simple program in this language is shown in
 | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
889  | 
Figure~\ref{while}. The main keywords in this language are
 | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
890  | 
\pcode{while}, \pcode{if}, \pcode{then} and
 | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
891  | 
\pcode{else}.\footnote{Contrast this with the COBOL programming
 | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
892  | 
language, which was developed around 1960 and thought to be dead for  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
893  | 
many decades---even your friendly lecturer is not old enough to have  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
894  | 
been taught this language. Anyway, this language had over 600  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
895  | 
  keywords (or what they called \emph{reserved words}). Interestingly
 | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
896  | 
though this language is still used in 2020: during the height of  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
897  | 
Corona crisis the State of New Jewrsey in the US was looking for  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
898  | 
COBOL programers who could fix the state's national insurance  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
899  | 
webpage. You were probably paid in gold and diamonds, if you were  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
900  | 
able to program in COBOL. If you fixed their webpage, surely you  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
901  | 
were allowed to marry the governer's son/daughter. } As usual we  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
902  | 
have syntactic categories for identifiers, operators, numbers and so  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
903  | 
on. For this we would need to design the corresponding regular  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
904  | 
expressions to recognise these syntactic categories. I let you do this  | 
| 
 
e712943cff71
added slides from Rochester
 
Christian Urban <christian.urban@kcl.ac.uk> 
parents: 
720 
diff
changeset
 | 
905  | 
design task. Having these regular expressions at our disposal, we can  | 
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
906  | 
form the regular expression  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
907  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
908  | 
\begin{figure}[t]
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
909  | 
\begin{center}
 | 
| 778 | 910  | 
\mbox{\lstinputlisting[language=while]{../progs/while-tests/fib.while}}
 | 
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
911  | 
\end{center}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
912  | 
\caption{The Fibonacci program in the While-language.\label{while}}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
913  | 
\end{figure}
 | 
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
914  | 
|
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
915  | 
\begin{center}
 | 
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
916  | 
$\textit{WhileRegs} \;\dn\; 
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
917  | 
\left(\begin{array}{l}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
918  | 
(k, KeyWords)\; +\\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
919  | 
(i, Ids)\;+\\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
920  | 
(o, Ops)\;+ \\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
921  | 
(n, Nums)\;+ \\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
922  | 
(s, Semis)\;+ \\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
923  | 
(p, (LParens + RParens))\;+\\  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
924  | 
(b, (Begin + End))\;+ \\  | 
| 551 | 925  | 
(w, WhiteSpaces)  | 
| 
394
 
2f9fe225ecc8
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
357 
diff
changeset
 | 
926  | 
      \end{array}\right)^{\mbox{\LARGE{}*}}$
 | 
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
927  | 
\end{center}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
928  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
929  | 
\noindent and ask the algorithm by Sulzmann \& Lu to lex, say  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
930  | 
the following string  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
931  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
932  | 
\begin{center}\large
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
933  | 
\code{"if true then then 42 else +"}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
934  | 
\end{center}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
935  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
936  | 
\noindent  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
937  | 
By using the records and extracting the environment, the  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
938  | 
result is the following list:  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
939  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
940  | 
\begin{center}\tt
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
941  | 
\begin{tabular}{l}
 | 
| 643 | 942  | 
(k, if),\\  | 
943  | 
(w, " "),\\  | 
|
944  | 
(i, true),\\  | 
|
945  | 
(w, " "),\\  | 
|
946  | 
(k, then),\\  | 
|
947  | 
(w, " "),\\  | 
|
948  | 
(k, then),\\  | 
|
949  | 
(w, " "),\\  | 
|
950  | 
(n, 42),\\  | 
|
951  | 
(w, " "),\\  | 
|
952  | 
(k, else),\\  | 
|
953  | 
(w, " "),\\  | 
|
954  | 
(o, +)  | 
|
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
955  | 
\end{tabular}
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
956  | 
\end{center}
 | 
| 
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
957  | 
|
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
958  | 
\noindent Typically we are not interested in the whitespaces  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
959  | 
and comments and would filter them out: this gives  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
960  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
961  | 
\begin{center}\tt
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
962  | 
\begin{tabular}{l}
 | 
| 643 | 963  | 
(k, if),\\  | 
964  | 
(i, true),\\  | 
|
965  | 
(k, then),\\  | 
|
966  | 
(k, then),\\  | 
|
967  | 
(n, 42),\\  | 
|
968  | 
(k, else),\\  | 
|
969  | 
(o, +)  | 
|
| 
357
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
970  | 
\end{tabular}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
971  | 
\end{center}
 | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
972  | 
|
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
973  | 
\noindent  | 
| 
 
603e171a7b48
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
356 
diff
changeset
 | 
974  | 
which will be the input for the next phase of our compiler.  | 
| 
356
 
d9c784c71305
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
352 
diff
changeset
 | 
975  | 
|
| 
251
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
976  | 
\end{document}
 | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
977  | 
|
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
978  | 
%%% Local Variables:  | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
979  | 
%%% mode: latex  | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
980  | 
%%% TeX-master: t  | 
| 
 
5b5a68df6d16
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
981  | 
%%% End:  |