| author | Christian Urban <christian dot urban at kcl dot ac dot uk> | 
| Tue, 26 Sep 2017 14:07:29 +0100 | |
| changeset 503 | 3b9496db3fb9 | 
| parent 444 | 3056a4c071b0 | 
| child 490 | 8a07f7256f2a | 
| permissions | -rw-r--r-- | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 1 | \documentclass{article}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \usepackage{../style}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 3 | \usepackage{../langs}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 4 | \usepackage{../graphics}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 5 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 6 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 7 | \begin{document}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 8 | \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 9 | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 10 | \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 11 | |
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 12 | So far our algorithm based on derivatives was only able to say | 
| 422 | 13 | yes or no depending on whether a string was matched by a regular | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 14 | expression or not. Often a more interesting question is to | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 15 | find out \emph{how} a regular expression matched a string?
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 16 | Answering this question will also help us with the problem we | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 17 | are after, namely tokenising an input string. | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 18 | |
| 422 | 19 | The algorithm we will be looking at in this lecture was designed by Sulzmann | 
| 20 | \& Lu in a rather recent research paper (from 2014). A link to it is | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 21 | provided on KEATS, in case you are interested.\footnote{In my
 | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 22 | humble opinion this is an interesting instance of the research | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 23 | literature: it contains a very neat idea, but its presentation | 
| 422 | 24 | is rather sloppy. In earlier versions of this paper, a King's | 
| 25 | undergraduate student and I found several rather annoying typos in the | |
| 26 | examples and definitions.} My PhD student Fahad Ausaf and I even more recently | |
| 27 | wrote a paper where we build on their result: we provided a | |
| 28 | mathematical proof that their algorithm is really correct---the proof | |
| 29 | Sulzmann \& Lu had originally given contained major flaws. Such correctness | |
| 30 | proofs are important: Kuklewicz maintains a unit-test library | |
| 31 | for the kind of algorithma we are interested in here and he showed | |
| 32 | that many implementations in the ``wild'' are buggy, that is not | |
| 33 | satisfy his unit tests: | |
| 34 | ||
| 35 | \begin{center}
 | |
| 36 | \url{http://www.haskell.org/haskellwiki/Regex_Posix}
 | |
| 37 | \end{center}
 | |
| 38 | ||
| 39 | ||
| 40 | In order to give an answer for | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 41 | \emph{how} a regular expression matches a string, Sulzmann and
 | 
| 417 | 42 | Lu use \emph{values}. A value will be the output of the
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 43 | algorithm whenever the regular expression matches the string. | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 44 | If the string does not match the string, an error will be | 
| 417 | 45 | raised. | 
| 46 | The definition for values is given below. They are shown | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 47 | together with the regular expressions $r$ to which | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 48 | they correspond: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 49 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 50 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 51 | \begin{tabular}{cc}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 52 | \begin{tabular}{@{}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 53 | \multicolumn{3}{c}{regular expressions}\medskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 54 | $r$ & $::=$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 55 | & $\mid$ & $\ONE$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 56 | & $\mid$ & $c$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 57 | & $\mid$ & $r_1 \cdot r_2$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 58 | & $\mid$ & $r_1 + r_2$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 59 | \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 60 | & $\mid$ & $r^*$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 61 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 62 | & | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 63 | \begin{tabular}{@{\hspace{0mm}}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 64 | \multicolumn{3}{c}{values}\medskip\\
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 65 | $v$ & $::=$ & \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 66 | & & $Empty$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 67 | & $\mid$ & $Char(c)$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 68 | & $\mid$ & $Seq(v_1,v_2)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 69 | & $\mid$ & $\Left(v)$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 70 | & $\mid$ & $Right(v)$ \\ | 
| 422 | 71 | & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 72 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 73 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 74 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 75 | |
| 417 | 76 | \noindent There is no value for the | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 77 | $\ZERO$ regular expression, since it does not match any | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 78 | string. Otherwise there is exactly one value corresponding to | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 79 | each regular expression with the exception of $r_1 + r_2$ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 80 | where there are two values, namely $\Left(v)$ and $Right(v)$ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 81 | corresponding to the two alternatives. Note that $r^*$ is | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 82 | associated with a list of values, one for each copy of $r$ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 83 | that was needed to match the string. This means we might also | 
| 417 | 84 | return the empty list $Stars []$, if no copy was needed in case | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 85 | of $r^*$. For sequence, there is exactly one value, composed | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 86 | of two component values ($v_1$ and $v_2$). | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 87 | |
| 422 | 88 | My implementation of regular expressions and values in Scala is | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 89 | shown below. I have in my implementation the convention that | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 90 | regular expressions are written entirely with upper-case | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 91 | letters, while values just start with a single upper-case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 92 | character and the rest are lower-case letters. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 93 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 94 | {\small\lstinputlisting[language=Scala,numbers=none]
 | 
| 352 
1e1b0fe66107
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
350diff
changeset | 95 | {../progs/app01.scala}}
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 96 | |
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 97 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 98 | {\small\lstinputlisting[language=Scala,numbers=none]
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 99 | {../progs/app02.scala}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 100 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 101 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 102 | Graphically the algorithm by Sulzmann \& Lu can be illustrated | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 103 | by the picture in Figure~\ref{Sulz} where the path from the
 | 
| 422 | 104 | left to the right involving $\textit{der}/\textit{nullable}$ is the first phase
 | 
| 105 | of the algorithm and $\textit{mkeps}/\textit{inj}$, the path from right to left,
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 106 | the second phase. This picture shows the steps required when a | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 107 | regular expression, say $r_1$, matches the string $abc$. We | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 108 | first build the three derivatives (according to $a$, $b$ and | 
| 422 | 109 | $c$). We then use $\textit{nullable}$ to find out whether the resulting
 | 
| 296 
796b9b81ac8d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
288diff
changeset | 110 | regular expression can match the empty string. If yes, we call | 
| 422 | 111 | the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
 | 
| 112 | expression has matched the empty string. Its definition | |
| 113 | is as follows: | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 114 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 115 | \begin{figure}[t]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 116 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 117 | \begin{tikzpicture}[scale=2,node distance=1.2cm,
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 118 |                     every node/.style={minimum size=7mm}]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 119 | \node (r1)  {$r_1$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 120 | \node (r2) [right=of r1]{$r_2$};
 | 
| 422 | 121 | \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\textit{der}\,a$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 122 | \node (r3) [right=of r2]{$r_3$};
 | 
| 422 | 123 | \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\textit{der}\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 124 | \node (r4) [right=of r3]{$r_4$};
 | 
| 422 | 125 | \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\textit{der}\,c$};
 | 
| 126 | \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\textit{nullable}$}};
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 127 | \node (v4) [below=of r4]{$v_4$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 128 | \draw[->,line width=1mm](r4) -- (v4); | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 129 | \node (v3) [left=of v4] {$v_3$};
 | 
| 422 | 130 | \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\textit{inj}\,c$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 131 | \node (v2) [left=of v3]{$v_2$};
 | 
| 422 | 132 | \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\textit{inj}\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 133 | \node (v1) [left=of v2] {$v_1$};
 | 
| 422 | 134 | \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\textit{inj}\,a$};
 | 
| 135 | \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 136 | \end{tikzpicture}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 137 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 138 | \caption{The two phases of the algorithm by Sulzmann \& Lu.
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 139 | \label{Sulz}}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 140 | \end{figure}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 141 | |
| 422 | 142 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 143 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 144 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 145 | \begin{tabular}{lcl}
 | 
| 422 | 146 |   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
 | 
| 147 |   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\textit{nullable}(r_1)$  \\
 | |
| 148 |                           &       & then $\Left(\textit{mkeps}(r_1))$\\
 | |
| 149 |                           &       & else $Right(\textit{mkeps}(r_2))$\\
 | |
| 150 |   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
 | |
| 151 |   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 152 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 153 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 154 | |
| 422 | 155 | \noindent There is are no cases for $\ZERO$ and $c$, since | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 156 | these regular expression cannot match the empty string. Note | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 157 | also that in case of alternatives we give preference to the | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 158 | regular expression on the left-hand side. This will become | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 159 | important later on. | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 160 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 161 | The second phase of the algorithm is organised so that it will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 162 | calculate a value for how the derivative regular expression | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 163 | has matched a string. For this we need a function that | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 164 | reverses this ``chopping off'' for values which we did in the | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 165 | first phase for derivatives. The corresponding function is | 
| 422 | 166 | called $\textit{inj}$ (short for injection). This function takes three
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 167 | arguments: the first one is a regular expression for which we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 168 | want to calculate the value, the second is the character we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 169 | want to inject and the third argument is the value where we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 170 | will inject the character into. The result of this function is a | 
| 422 | 171 | new value. The definition of $\textit{inj}$ is as follows: 
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 172 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 173 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 174 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 422 | 175 |   $\textit{inj}\,(c)\,c\,Empty$            & $\dn$ & $Char\,c$\\
 | 
| 176 |   $\textit{inj}\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\textit{inj}\,r_1\,c\,v)$\\
 | |
| 177 |   $\textit{inj}\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\textit{inj}\,r_2\,c\,v)$\\
 | |
| 178 |   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
 | |
| 179 |   $\textit{inj}\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$  & $Seq(\textit{inj}\,r_1\,c\,v_1,v_2)$\\
 | |
| 180 |   $\textit{inj}\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\textit{inj}\,r_2\,c\,v)$\\
 | |
| 181 |   $\textit{inj}\,(r^*)\,c\,Seq(v,Stars\,vs)$         & $\dn$  & $Stars(\textit{inj}\,r\,c\,v\,::\,vs)$\\
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 182 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 183 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 184 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 185 | \noindent This definition is by recursion on the regular | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 186 | expression and by analysing the shape of the values. Therefore | 
| 422 | 187 | there are three cases for sequence regular | 
| 188 | expressions (for all possible shapes of the value we can encounter). The last | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 189 | clause for the star regular expression returns a list where | 
| 422 | 190 | the first element is $\textit{inj}\,r\,c\,v$ and the other elements are
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 191 | $vs$. That means $\_\,::\,\_$ should be read as list cons. | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 192 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 193 | To understand what is going on, it might be best to do some | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 194 | example calculations and compare them with Figure~\ref{Sulz}.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 195 | For this note that we have not yet dealt with the need of | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 196 | simplifying regular expressions (this will be a topic on its | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 197 | own later). Suppose the regular expression is $a \cdot (b | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 198 | \cdot c)$ and the input string is $abc$. The derivatives from | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 199 | the first phase are as follows: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 200 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 201 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 202 | \begin{tabular}{ll}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 203 | $r_1$: & $a \cdot (b \cdot c)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 204 | $r_2$: & $\ONE \cdot (b \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 205 | $r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 206 | $r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 207 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 208 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 209 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 210 | \noindent According to the simple algorithm, we would test | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 211 | whether $r_4$ is nullable, which in this case it indeed is. | 
| 422 | 212 | This means we can use the function $\textit{mkeps}$ to calculate a
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 213 | value for how $r_4$ was able to match the empty string. | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 214 | Remember that this function gives preference for alternatives | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 215 | on the left-hand side. However there is only $\ONE$ on the | 
| 422 | 216 | very right-hand side of $r_4$ (underlined) | 
| 217 | ||
| 218 | \begin{center}
 | |
| 219 | $r_4$:\;$(\ZERO \cdot (b \cdot c)) + | |
| 220 |          ((\ZERO \cdot c) + \underline{\ONE})$\\
 | |
| 221 | \end{center}
 | |
| 222 | ||
| 223 | \noindent | |
| 224 | that matches the empty string. | |
| 225 | Therefore $\textit{mkeps}$ returns the value
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 226 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 227 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 228 | $v_4:\;Right(Right(Empty))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 229 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 230 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 231 | \noindent If there had been a $\ONE$ on the left, then | 
| 422 | 232 | $\textit{mkeps}$ would have returned something of the form
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 233 | $\Left(\ldots)$. The point is that from this value we can | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 234 | directly read off which part of $r_4$ matched the empty | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 235 | string: take the right-alternative first, and then the | 
| 422 | 236 | right-alternative again. | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 237 | |
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 238 | Next we have to ``inject'' the last character, that is $c$ in | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 239 | the running example, into this value $v_4$ in order to | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 240 | calculate how $r_3$ could have matched the string $c$. | 
| 422 | 241 | According to the definition of $\textit{inj}$ we obtain
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 242 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 243 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 244 | $v_3:\;Right(Seq(Empty, Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 245 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 246 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 247 | \noindent This is the correct result, because $r_3$ needs | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 248 | to use the right-hand alternative, and then $\ONE$ needs | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 249 | to match the empty string and $c$ needs to match $c$. | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 250 | Next we need to inject back the letter $b$ into $v_3$. This | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 251 | gives | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 252 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 253 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 254 | $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 255 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 256 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 257 | \noindent which is again the correct result for how $r_2$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 258 | matched the string $bc$. Finally we need to inject back the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 259 | letter $a$ into $v_2$ giving the final result | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 260 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 261 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 262 | $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 263 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 264 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 265 | \noindent This now corresponds to how the regular | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 266 | expression $a \cdot (b \cdot c)$ matched the string $abc$. | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 267 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 268 | There are a few auxiliary functions that are of interest | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 269 | when analysing this algorithm. One is called \emph{flatten},
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 270 | written $|\_|$, which extracts the string ``underlying'' a | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 271 | value. It is defined recursively as | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 272 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 273 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 274 | \begin{tabular}{lcl}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 275 | $|Empty|$ & $\dn$ & $[]$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 276 | $|Char(c)|$ & $\dn$ & $[c]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 277 | $|\Left(v)|$ & $\dn$ & $|v|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 278 | $|Right(v)|$ & $\dn$ & $|v|$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 279 | $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ | 
| 422 | 280 | $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 281 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 282 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 283 | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 284 | \noindent Using flatten we can see what are the strings behind | 
| 422 | 285 | the values calculated by $\textit{mkeps}$ and $\textit{inj}$. In our running 
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 286 | example: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 287 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 288 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 289 | \begin{tabular}{ll}
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 290 | $|v_4|$: & $[]$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 291 | $|v_3|$: & $c$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 292 | $|v_2|$: & $bc$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 293 | $|v_1|$: & $abc$ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 294 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 295 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 296 | |
| 422 | 297 | \noindent This indicates that $\textit{inj}$ indeed is injecting, or
 | 
| 298 | adding, back a character into the value. | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 299 | |
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 300 | There is a problem, however, with the described algorithm | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 301 | so far: it is very slow. We need to include the simplification | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 302 | from Lecture 2. This is what we shall do next. | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 303 | |
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 304 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 305 | \subsubsection*{Simplification}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 306 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 307 | Generally the matching algorithms based on derivatives do | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 308 | poorly unless the regular expressions are simplified after | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 309 | each derivative step. But this is a bit more involved in the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 310 | algorithm of Sulzmann \& Lu. So what follows might require you | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 311 | to read several times before it makes sense and also might | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 312 | require that you do some example calculations yourself. As a | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 313 | first example consider the last derivation step in our earlier | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 314 | example: | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 315 | |
| 422 | 316 | \begin{equation}\label{regexfour}
 | 
| 317 | r_4 = \textit{der}\,c\,r_3 = 
 | |
| 318 | (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE) | |
| 319 | \end{equation}
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 320 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 321 | \noindent Simplifying this regular expression would just give | 
| 422 | 322 | us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
 | 
| 323 | input, however, would give us with the value $Empty$ instead of | |
| 324 | ||
| 325 | \[Right(Right(Empty))\] | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 326 | |
| 422 | 327 | \noindent | 
| 328 | that was obtained without the simplification. The problem is we need | |
| 329 | to recreate this more complicated value, rather than just return | |
| 330 | $Empty$. This will require what I call \emph{rectification functions}.
 | |
| 331 | They need to be calculated whenever a regular expression gets | |
| 332 | simplified. | |
| 333 | ||
| 334 | Rectification functions take a value as argument and return a | |
| 335 | (rectified) value. In the example above the argument would be $Empty$ | |
| 336 | and the output $Right(Right(Empty))$. Before we define these | |
| 337 | rectification functions in general, let us first take a look again at | |
| 338 | our simplification rules: | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 339 | |
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 340 | \begin{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 341 | \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 342 | $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 343 | $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 344 | $r \cdot \ONE$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 345 | $\ONE \cdot r$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 346 | $r + \ZERO$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 347 | $\ZERO + r$ & $\mapsto$ & $r$\\ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 348 | $r + r$ & $\mapsto$ & $r$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 349 | \end{tabular}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 350 | \end{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 351 | |
| 422 | 352 | \noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 353 | simplifications in order end up with just $\ONE$. However, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 354 | it is possible to apply them in a depth-first, or inside-out, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 355 | manner in order to calculate this simplified regular | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 356 | expression. | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 357 | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 358 | The rectification we can implement by letting simp return not | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 359 | just a (simplified) regular expression, but also a | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 360 | rectification function. Let us consider the alternative case, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 361 | $r_1 + r_2$, first. By going depth-first, we first simplify | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 362 | the component regular expressions $r_1$ and $r_2.$ This will | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 363 | return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 364 | component regular expressions (if they can be simplified) but | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 365 | also two rectification functions $f_{1s}$ and $f_{2s}$. We
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 366 | need to assemble them in order to obtain a rectified value for | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 367 | $r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 368 | continue the derivative calculation with $r_{2s}$. The
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 369 | Sulzmann \& Lu algorithm would return a corresponding value, | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 370 | say $v_{2s}$. But now this value needs to be ``rectified'' to
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 371 | the value | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 372 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 373 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 374 | $Right(v_{2s})$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 375 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 376 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 377 | \noindent The reason is that we look for the value that tells | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 378 | us how $r_1 + r_2$ could have matched the string, not just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 379 | $r_2$ or $r_{2s}$. Unfortunately, this is still not the right
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 380 | value in general because there might be some simplifications | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 381 | that happened inside $r_2$ and for which the simplification | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 382 | function retuned also a rectification function $f_{2s}$. So in
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 383 | fact we need to apply this one too which gives | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 384 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 385 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 386 | $Right(f_{2s}(v_{2s}))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 387 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 388 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 389 | \noindent This is now the correct, or rectified, value. Since | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 390 | the simplification will be done in the first phase of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 391 | algorithm, but the rectification needs to be done to the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 392 | values in the second phase, it is advantageous to calculate | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 393 | the rectification as a function, remember this function and | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 394 | then apply the value to this function during the second phase. | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 395 | So if we want to implement the rectification as function, we | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 396 | would need to return | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 397 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 398 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 399 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 400 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 401 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 402 | \noindent which is the lambda-calculus notation for | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 403 | a function that expects a value $v$ and returns everything | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 404 | after the dot where $v$ is replaced by whatever value is | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 405 | given. | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 406 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 407 | Let us package this idea with rectification functions | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 408 | into a single function (still only considering the alternative | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 409 | case): | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 410 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 411 | \begin{center}
 | 
| 422 | 412 | \begin{tabular}{ll}
 | 
| 413 | $_1$ & $simp(r)$:\\ | |
| 414 | $_2$ & \quad case $r = r_1 + r_2$\\ | |
| 415 | $_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | |
| 416 | $_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | |
| 417 | $_5$ & \qquad case $r_{1s} = \ZERO$: 
 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 418 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 422 | 419 | $_6$ & \qquad case $r_{2s} = \ZERO$: 
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 420 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 421 | $_7$ & \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 422 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 423 | $_8$ & \qquad otherwise: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 424 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 425 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 426 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 427 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 428 | \noindent We first recursively call the simplification with | 
| 422 | 429 | $r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 430 | $r_{1s}$ and $r_{2s}$, as well as two rectification functions
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 431 | $f_{1s}$ and $f_{2s}$. We next need to test whether the
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 432 | simplified regular expressions are $\ZERO$ so as to make | 
| 422 | 433 | further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 434 | then we can return $r_{2s}$ (the other alternative). However
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 435 | for this we need to build a corresponding rectification | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 436 | function, which as said above is | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 437 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 438 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 439 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 440 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 441 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 442 | \noindent The case where $r_{2s} = \ZERO$ is similar:
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 443 | We return $r_{1s}$ and rectify with $\Left(\_)$ and the
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 444 | other calculated rectification function $f_{1s}$. This gives
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 445 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 446 | \begin{center}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 447 | $\lambda v.\,\Left(f_{1s}(v))$
 | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 448 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 449 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 450 | \noindent The next case where $r_{1s} = r_{2s}$ can be treated
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 451 | like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 452 | and rectify with $\Left(\_)$ and so on. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 453 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 454 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 455 | The otherwise-case is slightly more complicated. In this case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 456 | neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 457 | $r_{1s} \not= r_{2s}$, which means no further simplification
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 458 | can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 459 | the simplified regular expression. In principle we also do not | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 460 | have to do any rectification, because no simplification was | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 461 | done in this case. But this is actually not true: There might | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 462 | have been simplifications inside $r_{1s}$ and $r_{2s}$. We
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 463 | therefore need to take into account the calculated | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 464 | rectification functions $f_{1s}$ and $f_{2s}$. We can do this
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 465 | by defining a rectification function $f_{alt}$ which takes two
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 466 | rectification functions as arguments and applies them | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 467 | according to whether the value is of the form $\Left(\_)$ or | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 468 | $Right(\_)$: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 469 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 470 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 471 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 472 | $f_{alt}(f_1, f_2) \dn$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 473 | \qquad $\lambda v.\,$ case $v = \Left(v')$: | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 474 | & return $\Left(f_1(v'))$\\ | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 475 | \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 476 | & return $Right(f_2(v'))$\\ | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 477 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 478 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 479 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 480 | The other interesting case with simplification is the sequence | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 481 | case. In this case the main simplification function is as | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 482 | follows | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 483 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 484 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 485 | \begin{tabular}{l}
 | 
| 422 | 486 | $simp(r)$:\hspace{5cm} (continued)\\
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 487 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 488 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 489 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 490 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 491 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 492 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 493 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 494 | \qquad case $r_{1s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 495 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 496 | \qquad case $r_{2s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 497 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 498 | \qquad otherwise: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 499 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 500 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 501 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 502 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 503 | \noindent whereby in the last line $f_{seq}$ is again pushing
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 504 | the two rectification functions into the two components of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 505 | Seq-value: | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 506 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 507 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 508 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 509 | $f_{seq}(f_1, f_2) \dn$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 510 | \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 511 | & return $Seq(f_1(v_1), f_2(v_2))$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 512 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 513 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 514 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 515 | \noindent Note that in the case of $r_{1s} = \ZERO$
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 516 | (similarly $r_{2s}$) we use the function $f_{error}$ for
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 517 | rectification. If you think carefully, then you will realise | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 518 | that this function will actually never been called. This is | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 519 | because a sequence with $\ZERO$ will never recognise any | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 520 | string and therefore the second phase of the algorithm would | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 521 | never been called. The simplification function still expects | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 522 | us to give a function. So in my own implementation I just | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 523 | returned a function that raises an error. In the case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 524 | where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 525 | to create a sequence where the first component is a rectified | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 526 | version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 527 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 528 | Since we only simplify regular expressions of the form $r_1 + | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 529 | r_2$ and $r_1 \cdot r_2$ we do not have to do anything else | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 530 | in the remaining cases. The rectification function will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 531 | be just the identity, which in lambda-calculus terms is | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 532 | just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 533 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 534 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 535 | $\lambda v.\,v$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 536 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 537 | |
| 422 | 538 | \noindent This completes the high-level version of the simplification | 
| 539 | function, which is shown again in Figure~\ref{simp}. The Scala code
 | |
| 444 
3056a4c071b0
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
422diff
changeset | 540 | for the simplification function is in Figure~\ref{simprect}.
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 541 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 542 | \begin{figure}[t]
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 543 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 544 | \begin{tabular}{l}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 545 | $simp(r)$:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 546 | \quad case $r = r_1 + r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 547 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 548 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 549 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 550 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 551 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 552 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 553 | \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 554 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 555 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 556 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 557 | \medskip\\ | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 558 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 559 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 560 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 561 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 562 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 563 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 564 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 565 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 566 | \qquad case $r_{1s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 567 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 568 | \qquad case $r_{2s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 569 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 570 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 571 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 572 | \medskip\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 573 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 574 | \quad otherwise:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 575 | \qquad return $(r, \lambda v.\,v)$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 576 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 577 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 578 | \caption{The simplification function that returns a simplified 
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 579 | regular expression and a rectification function.\label{simp}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 580 | \end{figure}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 581 | |
| 422 | 582 | \begin{figure}[p]
 | 
| 583 | \lstinputlisting{../progs/app61.scala}
 | |
| 584 | ||
| 585 | \caption{The Scala code for the simplification function. The
 | |
| 586 | first part defines some auxillary functions for the rectification. | |
| 587 | The second part give the simplification function. | |
| 588 | \label{simprect}}
 | |
| 589 | \end{figure}
 | |
| 590 | ||
| 591 | ||
| 592 | We are now in the position to define a \emph{lexing function} as follows:
 | |
| 593 | ||
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 594 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 595 | \begin{tabular}{lcl}
 | 
| 422 | 596 | $lex\,r\,[]$ & $\dn$ & if $\textit{nullable}(r)$ then $\textit{mkeps}(r)$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 597 | & & else $error$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 598 | $lex\,r\,c\!::\!s$ & $\dn$ & let | 
| 422 | 599 |    $(r_{simp}, f_{rect}) = simp(\textit{der}(c, r))$\\
 | 
| 600 | & & $\textit{inj}\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 601 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 602 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 603 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 604 | \noindent This corresponds to the $matches$ function we have | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 605 | seen in earlier lectures. In the first clause we are given an | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 606 | empty string, $[]$, and need to test wether the regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 607 | expression is $nullable$. If yes, we can proceed normally and | 
| 422 | 608 | just return the value calculated by $\textit{mkeps}$. The second clause
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 609 | is for strings where the first character is $c$, say, and the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 610 | rest of the string is $s$. We first build the derivative of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 611 | $r$ with respect to $c$; simplify the resulting regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 612 | expression. We continue lexing with the simplified regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 613 | expression and the string $s$. Whatever will be returned as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 614 | value, we sill need to rectify using the $f_{rect}$ from the
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 615 | simplification and finally inject $c$ back into the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 616 | (rectified) value. | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 617 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 618 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 619 | \subsubsection*{Records}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 620 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 621 | Remember we wanted to tokenize input strings, that means | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 622 | splitting strings into their ``word'' components. Furthermore | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 623 | we want to classify each token as being a keyword or | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 624 | identifier and so on. For this one more feature will be | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 625 | required, which I call a \emph{record} regular expression.
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 626 | While values encode how a regular expression matches a string, | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 627 | records can be used to ``focus'' on some particular parts of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 628 | the regular expression and ``forget'' about others. | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 629 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 630 | Let us look at an example. Suppose you have the regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 631 | expression $a\cdot b + a\cdot c$. Clearly this regular expression can only | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 632 | recognise two strings. But suppose you are not interested | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 633 | whether it can recognise $ab$ or $ac$, but rather if it | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 634 | matched, then what was the last character of the matched | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 635 | string\ldots either $b$ or $c$. You can do this by annotating | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 636 | the regular expression with a record, written in general | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 637 | $(x:r)$, where $x$ is just an identifier (in my implementation | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 638 | a plain string) and $r$ is a regular expression. A record will | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 639 | be regarded as a regular expression. The extended definition | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 640 | in Scala therefore looks as follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 641 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 642 | {\small\lstinputlisting[language=Scala]
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 643 | {../progs/app03.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 644 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 645 | \noindent Since we regard records as regular expressions we | 
| 422 | 646 | need to extend the functions $\textit{nullable}$ and $\textit{der}$. Similarly
 | 
| 647 | $\textit{mkeps}$ and $\textit{inj}$ need to be extended. This means we also need
 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 648 | to extend the definition of values, which in Scala looks as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 649 | follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 650 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 651 | {\small\lstinputlisting[language=Scala]
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 652 | {../progs/app04.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 653 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 654 | \noindent Let us now look at the purpose of records more | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 655 | closely and let us return to our question whether the string | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 656 | terminated in a $b$ or $c$. We can do this as follows: we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 657 | annotate the regular expression $ab + ac$ with a record | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 658 | as follows | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 659 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 660 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 661 | $a\cdot (x:b) + a\cdot (x:c)$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 662 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 663 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 664 | \noindent This regular expression can still only recognise | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 665 | the strings $ab$ and $ac$, but we can now use a function | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 666 | that takes a value and returns all records. I call this | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 667 | function \emph{env} for environment\ldots it builds a list
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 668 | of identifiers associated with a string. This function | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 669 | can be defined as follows: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 670 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 671 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 672 | \begin{tabular}{lcl}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 673 | $env(Empty)$ & $\dn$ & $[]$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 674 | $env(Char(c))$ & $\dn$ & $[]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 675 | $env(\Left(v))$ & $\dn$ & $env(v)$\\ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 676 | $env(Right(v))$ & $\dn$ & $env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 677 | $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ | 
| 422 | 678 | $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 679 | $env(v_1) \,@\ldots @\, env(v_n)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 680 | $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 681 | \end{tabular}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 682 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 683 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 684 | \noindent where in the last clause we use the flatten function | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 685 | defined earlier. As can be seen, the function $env$ ``picks'' | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 686 | out all underlying strings where a record is given. Since | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 687 | there can be more than one, the environment will potentially | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 688 | contain many ``records''. If we now postprocess the value | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 689 | calculated by $lex$ extracting all records using $env$, we can | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 690 | answer the question whether the last element in the string was | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 691 | an $b$ or a $c$. Lets see this in action: if we use $a\cdot b | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 692 | + a\cdot c$ and $ac$ the calculated value will be | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 693 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 694 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 695 | $Right(Seq(Char(a), Char(c)))$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 696 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 697 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 698 | \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 699 | use the $env$ function to extract the recording for | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 700 | $x$ we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 701 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 702 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 703 | $[(x:c)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 704 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 705 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 706 | \noindent If we had given the string $ab$ instead, then the | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 707 | record would have been $[(x:b)]$. The fun starts if we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 708 | iterate this. Consider the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 709 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 710 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 711 | $(a\cdot (x:b) + a\cdot (y:c))^*$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 712 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 713 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 714 | \noindent and the string $ababacabacab$. This string is | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 715 | clearly matched by the regular expression, but we are only | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 716 | interested in the sequence of $b$s and $c$s. Using $env$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 717 | we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 718 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 719 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 720 | $[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 721 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 722 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 723 | \noindent While this feature might look silly, it is in fact | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 724 | quite useful. For example if we want to match the name of | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 725 | an email we might use the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 726 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 727 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 728 | (name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 729 | (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 730 | (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 731 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 732 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 733 | \noindent Then if we match the email address | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 734 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 735 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 736 | \texttt{christian.urban@kcl.ac.uk}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 737 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 738 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 739 | \noindent we can use the $env$ function and find out | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 740 | what the name, domain and top-level part of the email | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 741 | address are: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 742 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 743 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 744 | $[(name:\texttt{christian.urban}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 745 |   (domain:\texttt{kcl}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 746 |   (top\_level:\texttt{ac.uk})]$
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 747 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 748 | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 749 | Recall that we want to lex a little programming language, | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 750 | called the \emph{While}-language. A simple program in this
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 751 | language is shown in Figure~\ref{while}. The main keywords in
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 752 | this language are \pcode{while}, \pcode{if}, \pcode{then} and
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 753 | \pcode{else}. As usual we have syntactic categories for
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 754 | identifiers, operators, numbers and so on. For this we would | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 755 | need to design the corresponding regular expressions to | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 756 | recognise these syntactic categories. I let you do this design | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 757 | task. Having these regular expressions at our disposal, we can | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 758 | form the regular expression | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 759 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 760 | \begin{figure}[t]
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 761 | \begin{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 762 | \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 763 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 764 | \caption{The Fibonacci program in the While-language.\label{while}}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 765 | \end{figure}
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 766 | |
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 767 | \begin{center}
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 768 | $\textit{WhileRegs} \;\dn\; 
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 769 | \left(\begin{array}{l}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 770 | (k, KeyWords)\; +\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 771 | (i, Ids)\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 772 | (o, Ops)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 773 | (n, Nums)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 774 | (s, Semis)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 775 | (p, (LParens + RParens))\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 776 | (b, (Begin + End))\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 777 | (w, WhiteSpacess) | 
| 394 
2f9fe225ecc8
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
357diff
changeset | 778 |       \end{array}\right)^{\mbox{\LARGE{}*}}$
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 779 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 780 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 781 | \noindent and ask the algorithm by Sulzmann \& Lu to lex, say | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 782 | the following string | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 783 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 784 | \begin{center}\large
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 785 | \code{"if true then then 42 else +"}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 786 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 787 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 788 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 789 | By using the records and extracting the environment, the | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 790 | result is the following list: | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 791 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 792 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 793 | \begin{tabular}{l}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 794 | KEYWORD(if),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 795 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 796 | IDENT(true),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 797 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 798 | KEYWORD(then),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 799 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 800 | KEYWORD(then),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 801 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 802 | NUM(42),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 803 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 804 | KEYWORD(else),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 805 | WHITESPACE,\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 806 | OP(+) | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 807 | \end{tabular}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 808 | \end{center}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 809 | |
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 810 | \noindent Typically we are not interested in the whitespaces | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 811 | and comments and would filter them out: this gives | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 812 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 813 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 814 | \begin{tabular}{l}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 815 | KEYWORD(if),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 816 | IDENT(true),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 817 | KEYWORD(then),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 818 | KEYWORD(then),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 819 | NUM(42),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 820 | KEYWORD(else),\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 821 | OP(+) | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 822 | \end{tabular}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 823 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 824 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 825 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 826 | which will be the input for the next phase of our compiler. | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 827 | |
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 828 | \end{document}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 829 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 830 | %%% Local Variables: | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 831 | %%% mode: latex | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 832 | %%% TeX-master: t | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 833 | %%% End: |