| author | Christian Urban <urbanc@in.tum.de> | 
| Fri, 10 Apr 2020 12:11:01 +0100 | |
| changeset 716 | 6a2edc950109 | 
| parent 671 | 7457eabb8c76 | 
| child 720 | 6d5e88493bf3 | 
| permissions | -rw-r--r-- | 
| 669 | 1 | % !TEX program = xelatex | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \documentclass{article}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 3 | \usepackage{../style}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 4 | \usepackage{../langs}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 5 | \usepackage{../graphics}
 | 
| 671 | 6 | \usepackage{skull}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 7 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 8 | \begin{document}
 | 
| 669 | 9 | \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 10 | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 11 | \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 12 | |
| 716 | 13 | So far our algorithm based on derivatives is only able to say | 
| 14 | yes or no depending on whether a string is matched by a regular | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 15 | expression or not. Often a more interesting question is to | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 16 | find out \emph{how} a regular expression matched a string?
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 17 | Answering this question will also help us with the problem we | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 18 | are after, namely tokenising an input string. | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 19 | |
| 422 | 20 | The algorithm we will be looking at in this lecture was designed by Sulzmann | 
| 21 | \& Lu in a rather recent research paper (from 2014). A link to it is | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 22 | provided on KEATS, in case you are interested.\footnote{In my
 | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 23 | humble opinion this is an interesting instance of the research | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 24 | literature: it contains a very neat idea, but its presentation | 
| 422 | 25 | is rather sloppy. In earlier versions of this paper, a King's | 
| 26 | undergraduate student and I found several rather annoying typos in the | |
| 669 | 27 | examples and definitions.} My former PhD student Fahad Ausaf and I even more recently | 
| 422 | 28 | wrote a paper where we build on their result: we provided a | 
| 29 | mathematical proof that their algorithm is really correct---the proof | |
| 30 | Sulzmann \& Lu had originally given contained major flaws. Such correctness | |
| 31 | proofs are important: Kuklewicz maintains a unit-test library | |
| 520 | 32 | for the kind of algorithms we are interested in here and he showed | 
| 422 | 33 | that many implementations in the ``wild'' are buggy, that is not | 
| 34 | satisfy his unit tests: | |
| 35 | ||
| 36 | \begin{center}
 | |
| 37 | \url{http://www.haskell.org/haskellwiki/Regex_Posix}
 | |
| 38 | \end{center}
 | |
| 39 | ||
| 40 | ||
| 669 | 41 | Coming back to the algorithm by Sulzmann \& Lu, their idea is to | 
| 42 | introduce \emph{values} for producing an answer for \emph{how} a regular
 | |
| 43 | expression matches a string. So rather than a boolean like in the | |
| 44 | Brzozowski algorithm, a value will be the output of the Sulzman \& Lu | |
| 45 | algorithm whenever the regular expression matches the string. If the | |
| 46 | string does not match the string, an error will be raised. The | |
| 47 | definition for values is given below. They are shown together with the | |
| 48 | regular expressions $r$ to which they correspond: | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 49 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 50 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 51 | \begin{tabular}{cc}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 52 | \begin{tabular}{@{}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 53 | \multicolumn{3}{c}{regular expressions}\medskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 54 | $r$ & $::=$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 55 | & $\mid$ & $\ONE$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 56 | & $\mid$ & $c$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 57 | & $\mid$ & $r_1 \cdot r_2$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 58 | & $\mid$ & $r_1 + r_2$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 59 | \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 60 | & $\mid$ & $r^*$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 61 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 62 | & | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 63 | \begin{tabular}{@{\hspace{0mm}}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 64 | \multicolumn{3}{c}{values}\medskip\\
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 65 | $v$ & $::=$ & \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 66 | & & $Empty$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 67 | & $\mid$ & $Char(c)$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 68 | & $\mid$ & $Seq(v_1,v_2)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 69 | & $\mid$ & $\Left(v)$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 70 | & $\mid$ & $Right(v)$ \\ | 
| 422 | 71 | & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 72 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 73 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 74 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 75 | |
| 417 | 76 | \noindent There is no value for the | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 77 | $\ZERO$ regular expression, since it does not match any | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 78 | string. Otherwise there is exactly one value corresponding to | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 79 | each regular expression with the exception of $r_1 + r_2$ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 80 | where there are two values, namely $\Left(v)$ and $Right(v)$ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 81 | corresponding to the two alternatives. Note that $r^*$ is | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 82 | associated with a list of values, one for each copy of $r$ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 83 | that was needed to match the string. This means we might also | 
| 669 | 84 | return the empty list $Stars\,[]$, if no copy was needed | 
| 520 | 85 | for $r^*$. For sequence, there is exactly one value, composed | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 86 | of two component values ($v_1$ and $v_2$). | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 87 | |
| 422 | 88 | My implementation of regular expressions and values in Scala is | 
| 716 | 89 | shown below. I use the convention that | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 90 | regular expressions are written entirely with upper-case | 
| 716 | 91 | letters, whereas values start with a single upper-case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 92 | character and the rest are lower-case letters. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 93 | |
| 490 | 94 | {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
| 95 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 352 
1e1b0fe66107
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
350diff
changeset | 96 | {../progs/app01.scala}}
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 97 | |
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 98 | |
| 490 | 99 | {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
| 100 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 101 | {../progs/app02.scala}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 102 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 103 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 104 | Graphically the algorithm by Sulzmann \& Lu can be illustrated | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 105 | by the picture in Figure~\ref{Sulz} where the path from the
 | 
| 669 | 106 | left to the right involving $\der/\nullable$ is the first phase | 
| 107 | of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 108 | the second phase. This picture shows the steps required when a | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 109 | regular expression, say $r_1$, matches the string $abc$. We | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 110 | first build the three derivatives (according to $a$, $b$ and | 
| 669 | 111 | $c$). We then use $\nullable$ to find out whether the resulting | 
| 296 
796b9b81ac8d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
288diff
changeset | 112 | regular expression can match the empty string. If yes, we call | 
| 422 | 113 | the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
 | 
| 114 | expression has matched the empty string. Its definition | |
| 115 | is as follows: | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 116 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 117 | \begin{figure}[t]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 118 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 119 | \begin{tikzpicture}[scale=2,node distance=1.2cm,
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 120 |                     every node/.style={minimum size=7mm}]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 121 | \node (r1)  {$r_1$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 122 | \node (r2) [right=of r1]{$r_2$};
 | 
| 669 | 123 | \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 124 | \node (r3) [right=of r2]{$r_3$};
 | 
| 669 | 125 | \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 126 | \node (r4) [right=of r3]{$r_4$};
 | 
| 669 | 127 | \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
 | 
| 128 | \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 129 | \node (v4) [below=of r4]{$v_4$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 130 | \draw[->,line width=1mm](r4) -- (v4); | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 131 | \node (v3) [left=of v4] {$v_3$};
 | 
| 669 | 132 | \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 133 | \node (v2) [left=of v3]{$v_2$};
 | 
| 669 | 134 | \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 135 | \node (v1) [left=of v2] {$v_1$};
 | 
| 669 | 136 | \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
 | 
| 422 | 137 | \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 138 | \end{tikzpicture}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 139 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 140 | \caption{The two phases of the algorithm by Sulzmann \& Lu.
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 141 | \label{Sulz}}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 142 | \end{figure}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 143 | |
| 422 | 144 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 145 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 146 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 147 | \begin{tabular}{lcl}
 | 
| 422 | 148 |   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
 | 
| 669 | 149 |   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
 | 
| 422 | 150 |                           &       & then $\Left(\textit{mkeps}(r_1))$\\
 | 
| 151 |                           &       & else $Right(\textit{mkeps}(r_2))$\\
 | |
| 152 |   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
 | |
| 153 |   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 154 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 155 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 156 | |
| 596 | 157 | \noindent There are no cases for $\ZERO$ and $c$, since | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 158 | these regular expression cannot match the empty string. Note | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 159 | also that in case of alternatives we give preference to the | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 160 | regular expression on the left-hand side. This will become | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 161 | important later on. | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 162 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 163 | The second phase of the algorithm is organised so that it will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 164 | calculate a value for how the derivative regular expression | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 165 | has matched a string. For this we need a function that | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 166 | reverses this ``chopping off'' for values which we did in the | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 167 | first phase for derivatives. The corresponding function is | 
| 669 | 168 | called $\inj$ (short for injection). This function takes three | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 169 | arguments: the first one is a regular expression for which we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 170 | want to calculate the value, the second is the character we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 171 | want to inject and the third argument is the value where we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 172 | will inject the character into. The result of this function is a | 
| 669 | 173 | new value. The definition of $\inj$ is as follows: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 174 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 175 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 176 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 669 | 177 | $\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ | 
| 178 | $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\ | |
| 179 | $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\ | |
| 180 | $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ | |
| 181 | $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ | |
| 182 |   $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
 | |
| 183 | $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 184 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 185 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 186 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 187 | \noindent This definition is by recursion on the regular | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 188 | expression and by analysing the shape of the values. Therefore | 
| 422 | 189 | there are three cases for sequence regular | 
| 190 | expressions (for all possible shapes of the value we can encounter). The last | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 191 | clause for the star regular expression returns a list where | 
| 669 | 192 | the first element is $\inj\,r\,c\,v$ and the other elements are | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 193 | $vs$. That means $\_\,::\,\_$ should be read as list cons. | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 194 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 195 | To understand what is going on, it might be best to do some | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 196 | example calculations and compare them with Figure~\ref{Sulz}.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 197 | For this note that we have not yet dealt with the need of | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 198 | simplifying regular expressions (this will be a topic on its | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 199 | own later). Suppose the regular expression is $a \cdot (b | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 200 | \cdot c)$ and the input string is $abc$. The derivatives from | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 201 | the first phase are as follows: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 202 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 203 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 204 | \begin{tabular}{ll}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 205 | $r_1$: & $a \cdot (b \cdot c)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 206 | $r_2$: & $\ONE \cdot (b \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 207 | $r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 208 | $r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 209 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 210 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 211 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 212 | \noindent According to the simple algorithm, we would test | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 213 | whether $r_4$ is nullable, which in this case it indeed is. | 
| 422 | 214 | This means we can use the function $\textit{mkeps}$ to calculate a
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 215 | value for how $r_4$ was able to match the empty string. | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 216 | Remember that this function gives preference for alternatives | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 217 | on the left-hand side. However there is only $\ONE$ on the | 
| 422 | 218 | very right-hand side of $r_4$ (underlined) | 
| 219 | ||
| 220 | \begin{center}
 | |
| 221 | $r_4$:\;$(\ZERO \cdot (b \cdot c)) + | |
| 222 |          ((\ZERO \cdot c) + \underline{\ONE})$\\
 | |
| 223 | \end{center}
 | |
| 224 | ||
| 225 | \noindent | |
| 226 | that matches the empty string. | |
| 227 | Therefore $\textit{mkeps}$ returns the value
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 228 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 229 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 230 | $v_4:\;Right(Right(Empty))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 231 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 232 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 233 | \noindent If there had been a $\ONE$ on the left, then | 
| 422 | 234 | $\textit{mkeps}$ would have returned something of the form
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 235 | $\Left(\ldots)$. The point is that from this value we can | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 236 | directly read off which part of $r_4$ matched the empty | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 237 | string: take the right-alternative first, and then the | 
| 669 | 238 | right-alternative again, then you got to the $\ONE$. | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 239 | |
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 240 | Next we have to ``inject'' the last character, that is $c$ in | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 241 | the running example, into this value $v_4$ in order to | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 242 | calculate how $r_3$ could have matched the string $c$. | 
| 669 | 243 | For this we call injection with $\inj(r_3, c, v_4)$. | 
| 244 | According to the definition of $\inj$ we obtain | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 245 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 246 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 247 | $v_3:\;Right(Seq(Empty, Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 248 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 249 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 250 | \noindent This is the correct result, because $r_3$ needs | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 251 | to use the right-hand alternative, and then $\ONE$ needs | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 252 | to match the empty string and $c$ needs to match $c$. | 
| 596 | 253 | Next we need to inject back the letter $b$ into $v_3$. | 
| 669 | 254 | For this we call injection with $\inj(r_2, b, v_3)$. | 
| 596 | 255 | This gives | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 256 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 257 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 258 | $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 259 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 260 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 261 | \noindent which is again the correct result for how $r_2$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 262 | matched the string $bc$. Finally we need to inject back the | 
| 596 | 263 | letter $a$ into $v_2$ giving the final result. | 
| 669 | 264 | For this we call injection with $\inj(r_1, a, v_2)$ | 
| 596 | 265 | and obtain | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 266 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 267 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 268 | $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 269 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 270 | |
| 596 | 271 | \noindent This value corresponds to how the regular expression $r_1$, | 
| 272 | namely $a \cdot (b \cdot c)$, matched the string $abc$. | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 273 | |
| 669 | 274 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 275 | There are a few auxiliary functions that are of interest | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 276 | when analysing this algorithm. One is called \emph{flatten},
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 277 | written $|\_|$, which extracts the string ``underlying'' a | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 278 | value. It is defined recursively as | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 279 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 280 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 281 | \begin{tabular}{lcl}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 282 | $|Empty|$ & $\dn$ & $[]$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 283 | $|Char(c)|$ & $\dn$ & $[c]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 284 | $|\Left(v)|$ & $\dn$ & $|v|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 285 | $|Right(v)|$ & $\dn$ & $|v|$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 286 | $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ | 
| 422 | 287 | $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 288 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 289 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 290 | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 291 | \noindent Using flatten we can see what are the strings behind | 
| 669 | 292 | the values calculated by $\textit{mkeps}$ and $\inj$. In our running 
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 293 | example: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 294 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 295 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 296 | \begin{tabular}{ll}
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 297 | $|v_4|$: & $[]$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 298 | $|v_3|$: & $c$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 299 | $|v_2|$: & $bc$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 300 | $|v_1|$: & $abc$ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 301 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 302 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 303 | |
| 669 | 304 | \noindent This indicates that $\inj$ indeed is injecting, or | 
| 422 | 305 | adding, back a character into the value. | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 306 | |
| 669 | 307 | The definition of $\inj$ might still be very puzzling and each clause | 
| 308 | might look arbitrary, but there is in fact a relatively simple idea | |
| 309 | behind it. Ultimately the $\inj$-functions is determined by the | |
| 310 | derivative functions. For this consider one of the ``squares'' from | |
| 311 | Figure~\ref{Sulz}:
 | |
| 312 | ||
| 313 | ||
| 314 |   \begin{center}
 | |
| 315 |   \begin{tikzpicture}[scale=2,node distance=1.2cm,
 | |
| 316 |                       every node/.style={minimum size=7mm}]
 | |
| 317 |   \node (r)  {$r$};
 | |
| 318 |   \node (rd) [right=of r]{$r_{der}$};
 | |
| 319 |   \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
 | |
| 320 |   \node (vd) [below=of r2]{$v_{der}$};
 | |
| 321 | \draw[->,line width=1mm](rd) -- (vd); | |
| 322 |   \node (v) [left=of vd] {$v$};
 | |
| 323 |   \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
 | |
| 324 |   \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
 | |
| 325 |   \end{tikzpicture}
 | |
| 326 |   \end{center}
 | |
| 327 | ||
| 328 | \noindent | |
| 329 | The input to the $\inj$-function is $r$ (on the top left), $c$ (the | |
| 330 | character to be injected) and $v_{der}$ (bottom right). The output is
 | |
| 331 | the value $v$ for how the regular expression $r$ matched the | |
| 332 | corresponding string. How does $\inj$ calculate this value? Well, it has | |
| 333 | to analyse the value $v_{der}$ and transform it into the value $v$ for
 | |
| 334 | the regular expression $r$. The value $v_{der}$ corresponds to the
 | |
| 335 | $r_{der}$-regular expression which is the derivative of $r$. Remember
 | |
| 336 | that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
 | |
| 337 | where $c$ has been chopped off. | |
| 338 | ||
| 339 | Let $r$ be $r_1 + r_2$. Then $r_{der}$
 | |
| 340 | is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible | |
| 341 | values corresponding to $r_{der}$? Well, they can be only of the form
 | |
| 342 | $\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two | |
| 343 | cases in the $\inj$ function -- one for $\Left$ and one for $\Right$. | |
| 344 | ||
| 345 | \begin{center}
 | |
| 346 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | |
| 347 | $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\ | |
| 348 | $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\ | |
| 349 | \end{tabular}
 | |
| 350 | \end{center}
 | |
| 351 | ||
| 352 | \noindent | |
| 353 | Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the | |
| 354 | derivative of $r$ look like? According to the definition it is: | |
| 355 | ||
| 356 | \begin{center}
 | |
| 357 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 358 | $\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\ | |
| 359 | & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ | |
| 360 | & & else $(\der\, c\, r_1) \cdot r_2$\\ | |
| 361 |   \end{tabular}  
 | |
| 362 | \end{center}  
 | |
| 363 | ||
| 364 | \noindent As you can see there is a derivative in the if-branch and another | |
| 365 | in the else-branch. In the if-branch we have an alternative of the form | |
| 366 | $\_ + \_$. We already know what the possible values are for such a regular | |
| 367 | expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the | |
| 368 | two cases: | |
| 369 | ||
| 370 | \begin{center}
 | |
| 371 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | |
| 372 | $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\ | |
| 373 | $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\ | |
| 374 | \end{tabular}
 | |
| 375 | \end{center}
 | |
| 376 | ||
| 377 | \noindent | |
| 378 | In the first case we even know that it is not just a $\Left$-value, but | |
| 379 | $\Left(Seq(\ldots))$, because the corresponding regular expression | |
| 380 | in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know 
 | |
| 381 | it must be a $Seq$-value enclosed inside $\Left$. | |
| 382 | The third clause for $r_1\cdot r_2$ in the $\inj$-function | |
| 383 | ||
| 384 | \begin{center}
 | |
| 385 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 386 | $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\ | |
| 387 |   \end{tabular}
 | |
| 388 |   \end{center}
 | |
| 389 | ||
| 390 | \noindent corresponds to the else-branch in the derivative. In this | |
| 391 | case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and | |
| 392 | therefore the value must be of the form $Seq(\ldots)$. | |
| 393 | ||
| 394 | Hopefully the $inj$-function makes now more sense. I let you explain | |
| 395 | the $r^*$ case. What do the derivative of $r^*$ and | |
| 396 | the corresponding value look like? Does this explain the shape of | |
| 397 | the clause? | |
| 398 | ||
| 399 | \begin{center}
 | |
| 400 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 401 | $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ | |
| 402 |   \end{tabular}
 | |
| 403 | \end{center}
 | |
| 404 | ||
| 405 | \noindent If yes, you made sense of the left-hand sides of | |
| 406 | the $\inj$-definition. | |
| 407 | ||
| 408 | How about the right-hand sides? Well, in the | |
| 409 | $r^*$ case for example we have according to the square in the picture | |
| 410 | above a value $v_{der}$ which says how the derivative $r_{der}$
 | |
| 411 | matched the string. Since the derivative is of the form | |
| 412 | $(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the | |
| 413 | form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value | |
| 414 | for the original (top left) regular expression --- so it cannot | |
| 415 | be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the | |
| 416 | shape for sequence regular expressions. What we need is a value | |
| 417 | of the form $Stars \ldots$ (remember the correspondence between | |
| 418 | values and regular expressions). This suggests to define the right hand | |
| 419 | side as | |
| 420 | ||
| 421 | \begin{center}
 | |
| 422 | $\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$ | |
| 423 | \end{center}  
 | |
| 424 | ||
| 425 | \noindent It is a value of the right shape, namely $Stars$. It injected | |
| 426 | $c$ into the first-value, which is in fact the value where we need to | |
| 427 | undo the derivative. Remember again the shape of the derivative of $r^*$. | |
| 428 | In place where we chopped off the $c$, we now need to do the $\inj$ of $c$. | |
| 429 | Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with | |
| 430 | the other clauses in $\inj$. | |
| 431 | ||
| 432 | ||
| 671 | 433 | Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is
 | 
| 434 | a gigantic problem with the described algorithm so far: it is very | |
| 435 | slow. We need to include in all this the simplification from Lecture | |
| 436 | 2. And what rotten luck: simplification messes things up and we need | |
| 437 | to rectify the mess. This is what we shall do next. | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 438 | |
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 439 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 440 | \subsubsection*{Simplification}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 441 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 442 | Generally the matching algorithms based on derivatives do | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 443 | poorly unless the regular expressions are simplified after | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 444 | each derivative step. But this is a bit more involved in the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 445 | algorithm of Sulzmann \& Lu. So what follows might require you | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 446 | to read several times before it makes sense and also might | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 447 | require that you do some example calculations yourself. As a | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 448 | first example consider the last derivation step in our earlier | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 449 | example: | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 450 | |
| 422 | 451 | \begin{equation}\label{regexfour}
 | 
| 669 | 452 | r_4 = \der\,c\,r_3 = | 
| 422 | 453 | (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE) | 
| 454 | \end{equation}
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 455 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 456 | \noindent Simplifying this regular expression would just give | 
| 422 | 457 | us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
 | 
| 458 | input, however, would give us with the value $Empty$ instead of | |
| 459 | ||
| 460 | \[Right(Right(Empty))\] | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 461 | |
| 422 | 462 | \noindent | 
| 463 | that was obtained without the simplification. The problem is we need | |
| 464 | to recreate this more complicated value, rather than just return | |
| 465 | $Empty$. This will require what I call \emph{rectification functions}.
 | |
| 466 | They need to be calculated whenever a regular expression gets | |
| 467 | simplified. | |
| 468 | ||
| 469 | Rectification functions take a value as argument and return a | |
| 470 | (rectified) value. In the example above the argument would be $Empty$ | |
| 471 | and the output $Right(Right(Empty))$. Before we define these | |
| 472 | rectification functions in general, let us first take a look again at | |
| 473 | our simplification rules: | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 474 | |
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 475 | \begin{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 476 | \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 477 | $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 478 | $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 479 | $r \cdot \ONE$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 480 | $\ONE \cdot r$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 481 | $r + \ZERO$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 482 | $\ZERO + r$ & $\mapsto$ & $r$\\ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 483 | $r + r$ & $\mapsto$ & $r$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 484 | \end{tabular}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 485 | \end{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 486 | |
| 422 | 487 | \noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 488 | simplifications in order end up with just $\ONE$. However, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 489 | it is possible to apply them in a depth-first, or inside-out, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 490 | manner in order to calculate this simplified regular | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 491 | expression. | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 492 | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 493 | The rectification we can implement by letting simp return not | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 494 | just a (simplified) regular expression, but also a | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 495 | rectification function. Let us consider the alternative case, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 496 | $r_1 + r_2$, first. By going depth-first, we first simplify | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 497 | the component regular expressions $r_1$ and $r_2.$ This will | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 498 | return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 499 | component regular expressions (if they can be simplified) but | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 500 | also two rectification functions $f_{1s}$ and $f_{2s}$. We
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 501 | need to assemble them in order to obtain a rectified value for | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 502 | $r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 503 | continue the derivative calculation with $r_{2s}$. The
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 504 | Sulzmann \& Lu algorithm would return a corresponding value, | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 505 | say $v_{2s}$. But now this value needs to be ``rectified'' to
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 506 | the value | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 507 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 508 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 509 | $Right(v_{2s})$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 510 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 511 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 512 | \noindent The reason is that we look for the value that tells | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 513 | us how $r_1 + r_2$ could have matched the string, not just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 514 | $r_2$ or $r_{2s}$. Unfortunately, this is still not the right
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 515 | value in general because there might be some simplifications | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 516 | that happened inside $r_2$ and for which the simplification | 
| 520 | 517 | function returned also a rectification function $f_{2s}$. So in
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 518 | fact we need to apply this one too which gives | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 519 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 520 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 521 | $Right(f_{2s}(v_{2s}))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 522 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 523 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 524 | \noindent This is now the correct, or rectified, value. Since | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 525 | the simplification will be done in the first phase of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 526 | algorithm, but the rectification needs to be done to the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 527 | values in the second phase, it is advantageous to calculate | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 528 | the rectification as a function, remember this function and | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 529 | then apply the value to this function during the second phase. | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 530 | So if we want to implement the rectification as function, we | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 531 | would need to return | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 532 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 533 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 534 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 535 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 536 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 537 | \noindent which is the lambda-calculus notation for | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 538 | a function that expects a value $v$ and returns everything | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 539 | after the dot where $v$ is replaced by whatever value is | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 540 | given. | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 541 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 542 | Let us package this idea with rectification functions | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 543 | into a single function (still only considering the alternative | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 544 | case): | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 545 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 546 | \begin{center}
 | 
| 422 | 547 | \begin{tabular}{ll}
 | 
| 548 | $_1$ & $simp(r)$:\\ | |
| 549 | $_2$ & \quad case $r = r_1 + r_2$\\ | |
| 550 | $_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | |
| 551 | $_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | |
| 552 | $_5$ & \qquad case $r_{1s} = \ZERO$: 
 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 553 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 422 | 554 | $_6$ & \qquad case $r_{2s} = \ZERO$: 
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 555 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 556 | $_7$ & \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 557 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 558 | $_8$ & \qquad otherwise: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 559 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 560 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 561 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 562 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 563 | \noindent We first recursively call the simplification with | 
| 422 | 564 | $r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 565 | $r_{1s}$ and $r_{2s}$, as well as two rectification functions
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 566 | $f_{1s}$ and $f_{2s}$. We next need to test whether the
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 567 | simplified regular expressions are $\ZERO$ so as to make | 
| 422 | 568 | further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 569 | then we can return $r_{2s}$ (the other alternative). However
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 570 | for this we need to build a corresponding rectification | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 571 | function, which as said above is | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 572 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 573 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 574 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 575 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 576 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 577 | \noindent The case where $r_{2s} = \ZERO$ is similar:
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 578 | We return $r_{1s}$ and rectify with $\Left(\_)$ and the
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 579 | other calculated rectification function $f_{1s}$. This gives
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 580 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 581 | \begin{center}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 582 | $\lambda v.\,\Left(f_{1s}(v))$
 | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 583 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 584 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 585 | \noindent The next case where $r_{1s} = r_{2s}$ can be treated
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 586 | like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 587 | and rectify with $\Left(\_)$ and so on. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 588 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 589 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 590 | The otherwise-case is slightly more complicated. In this case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 591 | neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 592 | $r_{1s} \not= r_{2s}$, which means no further simplification
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 593 | can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 594 | the simplified regular expression. In principle we also do not | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 595 | have to do any rectification, because no simplification was | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 596 | done in this case. But this is actually not true: There might | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 597 | have been simplifications inside $r_{1s}$ and $r_{2s}$. We
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 598 | therefore need to take into account the calculated | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 599 | rectification functions $f_{1s}$ and $f_{2s}$. We can do this
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 600 | by defining a rectification function $f_{alt}$ which takes two
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 601 | rectification functions as arguments and applies them | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 602 | according to whether the value is of the form $\Left(\_)$ or | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 603 | $Right(\_)$: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 604 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 605 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 606 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 607 | $f_{alt}(f_1, f_2) \dn$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 608 | \qquad $\lambda v.\,$ case $v = \Left(v')$: | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 609 | & return $\Left(f_1(v'))$\\ | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 610 | \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 611 | & return $Right(f_2(v'))$\\ | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 612 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 613 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 614 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 615 | The other interesting case with simplification is the sequence | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 616 | case. In this case the main simplification function is as | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 617 | follows | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 618 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 619 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 620 | \begin{tabular}{l}
 | 
| 422 | 621 | $simp(r)$:\hspace{5cm} (continued)\\
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 622 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 623 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 624 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 625 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 626 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 627 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 628 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 629 | \qquad case $r_{1s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 630 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 631 | \qquad case $r_{2s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 632 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 633 | \qquad otherwise: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 634 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 635 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 636 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 637 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 638 | \noindent whereby in the last line $f_{seq}$ is again pushing
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 639 | the two rectification functions into the two components of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 640 | Seq-value: | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 641 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 642 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 643 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 644 | $f_{seq}(f_1, f_2) \dn$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 645 | \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 646 | & return $Seq(f_1(v_1), f_2(v_2))$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 647 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 648 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 649 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 650 | \noindent Note that in the case of $r_{1s} = \ZERO$
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 651 | (similarly $r_{2s}$) we use the function $f_{error}$ for
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 652 | rectification. If you think carefully, then you will realise | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 653 | that this function will actually never been called. This is | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 654 | because a sequence with $\ZERO$ will never recognise any | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 655 | string and therefore the second phase of the algorithm would | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 656 | never been called. The simplification function still expects | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 657 | us to give a function. So in my own implementation I just | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 658 | returned a function that raises an error. In the case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 659 | where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 660 | to create a sequence where the first component is a rectified | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 661 | version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 662 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 663 | Since we only simplify regular expressions of the form $r_1 + | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 664 | r_2$ and $r_1 \cdot r_2$ we do not have to do anything else | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 665 | in the remaining cases. The rectification function will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 666 | be just the identity, which in lambda-calculus terms is | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 667 | just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 668 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 669 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 670 | $\lambda v.\,v$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 671 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 672 | |
| 422 | 673 | \noindent This completes the high-level version of the simplification | 
| 674 | function, which is shown again in Figure~\ref{simp}. The Scala code
 | |
| 444 
3056a4c071b0
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
422diff
changeset | 675 | for the simplification function is in Figure~\ref{simprect}.
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 676 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 677 | \begin{figure}[t]
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 678 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 679 | \begin{tabular}{l}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 680 | $simp(r)$:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 681 | \quad case $r = r_1 + r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 682 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 683 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 684 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 685 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 686 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 687 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 688 | \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 689 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 690 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 691 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 692 | \medskip\\ | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 693 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 694 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 695 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 696 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 697 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 698 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 699 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 700 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 701 | \qquad case $r_{1s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 702 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 703 | \qquad case $r_{2s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 704 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 705 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 706 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 707 | \medskip\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 708 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 709 | \quad otherwise:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 710 | \qquad return $(r, \lambda v.\,v)$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 711 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 712 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 713 | \caption{The simplification function that returns a simplified 
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 714 | regular expression and a rectification function.\label{simp}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 715 | \end{figure}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 716 | |
| 422 | 717 | \begin{figure}[p]
 | 
| 492 | 718 | \small | 
| 490 | 719 | \lstinputlisting[numbers=left,linebackgroundcolor= | 
| 720 |    {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
 | |
| 422 | 721 | |
| 722 | \caption{The Scala code for the simplification function. The
 | |
| 520 | 723 | first part defines some auxiliary functions for the rectification. | 
| 422 | 724 | The second part give the simplification function. | 
| 725 | \label{simprect}}
 | |
| 726 | \end{figure}
 | |
| 727 | ||
| 728 | ||
| 729 | We are now in the position to define a \emph{lexing function} as follows:
 | |
| 730 | ||
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 731 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 732 | \begin{tabular}{lcl}
 | 
| 669 | 733 | $lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 734 | & & else $error$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 735 | $lex\,r\,c\!::\!s$ & $\dn$ & let | 
| 669 | 736 |    $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
 | 
| 737 | & & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 738 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 739 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 740 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 741 | \noindent This corresponds to the $matches$ function we have | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 742 | seen in earlier lectures. In the first clause we are given an | 
| 520 | 743 | empty string, $[]$, and need to test whether the regular | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 744 | expression is $nullable$. If yes, we can proceed normally and | 
| 422 | 745 | just return the value calculated by $\textit{mkeps}$. The second clause
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 746 | is for strings where the first character is $c$, say, and the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 747 | rest of the string is $s$. We first build the derivative of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 748 | $r$ with respect to $c$; simplify the resulting regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 749 | expression. We continue lexing with the simplified regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 750 | expression and the string $s$. Whatever will be returned as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 751 | value, we sill need to rectify using the $f_{rect}$ from the
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 752 | simplification and finally inject $c$ back into the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 753 | (rectified) value. | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 754 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 755 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 756 | \subsubsection*{Records}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 757 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 758 | Remember we wanted to tokenize input strings, that means | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 759 | splitting strings into their ``word'' components. Furthermore | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 760 | we want to classify each token as being a keyword or | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 761 | identifier and so on. For this one more feature will be | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 762 | required, which I call a \emph{record} regular expression.
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 763 | While values encode how a regular expression matches a string, | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 764 | records can be used to ``focus'' on some particular parts of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 765 | the regular expression and ``forget'' about others. | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 766 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 767 | Let us look at an example. Suppose you have the regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 768 | expression $a\cdot b + a\cdot c$. Clearly this regular expression can only | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 769 | recognise two strings. But suppose you are not interested | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 770 | whether it can recognise $ab$ or $ac$, but rather if it | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 771 | matched, then what was the last character of the matched | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 772 | string\ldots either $b$ or $c$. You can do this by annotating | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 773 | the regular expression with a record, written in general | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 774 | $(x:r)$, where $x$ is just an identifier (in my implementation | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 775 | a plain string) and $r$ is a regular expression. A record will | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 776 | be regarded as a regular expression. The extended definition | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 777 | in Scala therefore looks as follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 778 | |
| 490 | 779 | {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
| 780 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 781 | {../progs/app03.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 782 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 783 | \noindent Since we regard records as regular expressions we | 
| 669 | 784 | need to extend the functions $\nullable$ and $\der$. Similarly | 
| 785 | $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 786 | to extend the definition of values, which in Scala looks as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 787 | follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 788 | |
| 490 | 789 | {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
| 790 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 791 | {../progs/app04.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 792 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 793 | \noindent Let us now look at the purpose of records more | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 794 | closely and let us return to our question whether the string | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 795 | terminated in a $b$ or $c$. We can do this as follows: we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 796 | annotate the regular expression $ab + ac$ with a record | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 797 | as follows | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 798 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 799 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 800 | $a\cdot (x:b) + a\cdot (x:c)$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 801 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 802 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 803 | \noindent This regular expression can still only recognise | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 804 | the strings $ab$ and $ac$, but we can now use a function | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 805 | that takes a value and returns all records. I call this | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 806 | function \emph{env} for environment\ldots it builds a list
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 807 | of identifiers associated with a string. This function | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 808 | can be defined as follows: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 809 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 810 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 811 | \begin{tabular}{lcl}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 812 | $env(Empty)$ & $\dn$ & $[]$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 813 | $env(Char(c))$ & $\dn$ & $[]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 814 | $env(\Left(v))$ & $\dn$ & $env(v)$\\ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 815 | $env(Right(v))$ & $\dn$ & $env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 816 | $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ | 
| 422 | 817 | $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 818 | $env(v_1) \,@\ldots @\, env(v_n)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 819 | $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 820 | \end{tabular}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 821 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 822 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 823 | \noindent where in the last clause we use the flatten function | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 824 | defined earlier. As can be seen, the function $env$ ``picks'' | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 825 | out all underlying strings where a record is given. Since | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 826 | there can be more than one, the environment will potentially | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 827 | contain many ``records''. If we now postprocess the value | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 828 | calculated by $lex$ extracting all records using $env$, we can | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 829 | answer the question whether the last element in the string was | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 830 | an $b$ or a $c$. Lets see this in action: if we use $a\cdot b | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 831 | + a\cdot c$ and $ac$ the calculated value will be | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 832 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 833 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 834 | $Right(Seq(Char(a), Char(c)))$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 835 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 836 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 837 | \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 838 | use the $env$ function to extract the recording for | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 839 | $x$ we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 840 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 841 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 842 | $[(x:c)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 843 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 844 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 845 | \noindent If we had given the string $ab$ instead, then the | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 846 | record would have been $[(x:b)]$. The fun starts if we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 847 | iterate this. Consider the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 848 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 849 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 850 | $(a\cdot (x:b) + a\cdot (y:c))^*$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 851 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 852 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 853 | \noindent and the string $ababacabacab$. This string is | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 854 | clearly matched by the regular expression, but we are only | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 855 | interested in the sequence of $b$s and $c$s. Using $env$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 856 | we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 857 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 858 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 859 | $[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 860 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 861 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 862 | \noindent While this feature might look silly, it is in fact | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 863 | quite useful. For example if we want to match the name of | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 864 | an email we might use the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 865 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 866 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 867 | (name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 868 | (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 869 | (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 870 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 871 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 872 | \noindent Then if we match the email address | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 873 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 874 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 875 | \texttt{christian.urban@kcl.ac.uk}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 876 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 877 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 878 | \noindent we can use the $env$ function and find out | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 879 | what the name, domain and top-level part of the email | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 880 | address are: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 881 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 882 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 883 | $[(name:\texttt{christian.urban}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 884 |   (domain:\texttt{kcl}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 885 |   (top\_level:\texttt{ac.uk})]$
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 886 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 887 | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 888 | Recall that we want to lex a little programming language, | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 889 | called the \emph{While}-language. A simple program in this
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 890 | language is shown in Figure~\ref{while}. The main keywords in
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 891 | this language are \pcode{while}, \pcode{if}, \pcode{then} and
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 892 | \pcode{else}. As usual we have syntactic categories for
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 893 | identifiers, operators, numbers and so on. For this we would | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 894 | need to design the corresponding regular expressions to | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 895 | recognise these syntactic categories. I let you do this design | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 896 | task. Having these regular expressions at our disposal, we can | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 897 | form the regular expression | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 898 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 899 | \begin{figure}[t]
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 900 | \begin{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 901 | \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 902 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 903 | \caption{The Fibonacci program in the While-language.\label{while}}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 904 | \end{figure}
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 905 | |
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 906 | \begin{center}
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 907 | $\textit{WhileRegs} \;\dn\; 
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 908 | \left(\begin{array}{l}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 909 | (k, KeyWords)\; +\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 910 | (i, Ids)\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 911 | (o, Ops)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 912 | (n, Nums)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 913 | (s, Semis)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 914 | (p, (LParens + RParens))\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 915 | (b, (Begin + End))\;+ \\ | 
| 551 | 916 | (w, WhiteSpaces) | 
| 394 
2f9fe225ecc8
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
357diff
changeset | 917 |       \end{array}\right)^{\mbox{\LARGE{}*}}$
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 918 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 919 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 920 | \noindent and ask the algorithm by Sulzmann \& Lu to lex, say | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 921 | the following string | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 922 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 923 | \begin{center}\large
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 924 | \code{"if true then then 42 else +"}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 925 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 926 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 927 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 928 | By using the records and extracting the environment, the | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 929 | result is the following list: | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 930 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 931 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 932 | \begin{tabular}{l}
 | 
| 643 | 933 | (k, if),\\ | 
| 934 | (w, " "),\\ | |
| 935 | (i, true),\\ | |
| 936 | (w, " "),\\ | |
| 937 | (k, then),\\ | |
| 938 | (w, " "),\\ | |
| 939 | (k, then),\\ | |
| 940 | (w, " "),\\ | |
| 941 | (n, 42),\\ | |
| 942 | (w, " "),\\ | |
| 943 | (k, else),\\ | |
| 944 | (w, " "),\\ | |
| 945 | (o, +) | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 946 | \end{tabular}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 947 | \end{center}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 948 | |
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 949 | \noindent Typically we are not interested in the whitespaces | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 950 | and comments and would filter them out: this gives | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 951 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 952 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 953 | \begin{tabular}{l}
 | 
| 643 | 954 | (k, if),\\ | 
| 955 | (i, true),\\ | |
| 956 | (k, then),\\ | |
| 957 | (k, then),\\ | |
| 958 | (n, 42),\\ | |
| 959 | (k, else),\\ | |
| 960 | (o, +) | |
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 961 | \end{tabular}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 962 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 963 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 964 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 965 | which will be the input for the next phase of our compiler. | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 966 | |
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 967 | \end{document}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 968 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 969 | %%% Local Variables: | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 970 | %%% mode: latex | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 971 | %%% TeX-master: t | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 972 | %%% End: |