| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Sun, 11 Oct 2020 09:10:08 +0100 | |
| changeset 778 | 3e5f5d19f514 | 
| parent 721 | e3c64f22dd31 | 
| child 792 | 34132a854d03 | 
| permissions | -rw-r--r-- | 
| 669 | 1 | % !TEX program = xelatex | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \documentclass{article}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 3 | \usepackage{../style}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 4 | \usepackage{../langs}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 5 | \usepackage{../graphics}
 | 
| 671 | 6 | \usepackage{skull}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 7 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 8 | \begin{document}
 | 
| 669 | 9 | \fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2019}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 10 | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 11 | \section*{Handout 4 (Sulzmann \& Lu Algorithm)}
 | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 12 | |
| 716 | 13 | So far our algorithm based on derivatives is only able to say | 
| 14 | yes or no depending on whether a string is matched by a regular | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 15 | expression or not. Often a more interesting question is to | 
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 16 | find out \emph{how} a regular expression matched a string?
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 17 | Answering this question will also help us with the problem we | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 18 | are after, namely tokenising an input string. | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 19 | |
| 422 | 20 | The algorithm we will be looking at in this lecture was designed by Sulzmann | 
| 21 | \& Lu in a rather recent research paper (from 2014). A link to it is | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 22 | provided on KEATS, in case you are interested.\footnote{In my
 | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 23 | humble opinion this is an interesting instance of the research | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 24 | literature: it contains a very neat idea, but its presentation | 
| 422 | 25 | is rather sloppy. In earlier versions of this paper, a King's | 
| 26 | undergraduate student and I found several rather annoying typos in the | |
| 669 | 27 | examples and definitions.} My former PhD student Fahad Ausaf and I even more recently | 
| 422 | 28 | wrote a paper where we build on their result: we provided a | 
| 29 | mathematical proof that their algorithm is really correct---the proof | |
| 30 | Sulzmann \& Lu had originally given contained major flaws. Such correctness | |
| 31 | proofs are important: Kuklewicz maintains a unit-test library | |
| 520 | 32 | for the kind of algorithms we are interested in here and he showed | 
| 422 | 33 | that many implementations in the ``wild'' are buggy, that is not | 
| 34 | satisfy his unit tests: | |
| 35 | ||
| 36 | \begin{center}
 | |
| 37 | \url{http://www.haskell.org/haskellwiki/Regex_Posix}
 | |
| 38 | \end{center}
 | |
| 39 | ||
| 40 | ||
| 669 | 41 | Coming back to the algorithm by Sulzmann \& Lu, their idea is to | 
| 42 | introduce \emph{values} for producing an answer for \emph{how} a regular
 | |
| 43 | expression matches a string. So rather than a boolean like in the | |
| 44 | Brzozowski algorithm, a value will be the output of the Sulzman \& Lu | |
| 45 | algorithm whenever the regular expression matches the string. If the | |
| 46 | string does not match the string, an error will be raised. The | |
| 47 | definition for values is given below. They are shown together with the | |
| 48 | regular expressions $r$ to which they correspond: | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 49 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 50 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 51 | \begin{tabular}{cc}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 52 | \begin{tabular}{@{}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 53 | \multicolumn{3}{c}{regular expressions}\medskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 54 | $r$ & $::=$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 55 | & $\mid$ & $\ONE$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 56 | & $\mid$ & $c$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 57 | & $\mid$ & $r_1 \cdot r_2$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 58 | & $\mid$ & $r_1 + r_2$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 59 | \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 60 | & $\mid$ & $r^*$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 61 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 62 | & | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 63 | \begin{tabular}{@{\hspace{0mm}}rrl@{}}
 | 
| 319 
e7b110f93697
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
296diff
changeset | 64 | \multicolumn{3}{c}{values}\medskip\\
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 65 | $v$ & $::=$ & \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 66 | & & $Empty$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 67 | & $\mid$ & $Char(c)$ \\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 68 | & $\mid$ & $Seq(v_1,v_2)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 69 | & $\mid$ & $\Left(v)$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 70 | & $\mid$ & $Right(v)$ \\ | 
| 422 | 71 | & $\mid$ & $Stars\,[v_1,\ldots\,v_n]$ \\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 72 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 73 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 74 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 75 | |
| 417 | 76 | \noindent There is no value for the | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 77 | $\ZERO$ regular expression, since it does not match any | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 78 | string. Otherwise there is exactly one value corresponding to | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 79 | each regular expression with the exception of $r_1 + r_2$ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 80 | where there are two values, namely $\Left(v)$ and $Right(v)$ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 81 | corresponding to the two alternatives. Note that $r^*$ is | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 82 | associated with a list of values, one for each copy of $r$ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 83 | that was needed to match the string. This means we might also | 
| 669 | 84 | return the empty list $Stars\,[]$, if no copy was needed | 
| 520 | 85 | for $r^*$. For sequence, there is exactly one value, composed | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 86 | of two component values ($v_1$ and $v_2$). | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 87 | |
| 422 | 88 | My implementation of regular expressions and values in Scala is | 
| 716 | 89 | shown below. I use the convention that | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 90 | regular expressions are written entirely with upper-case | 
| 716 | 91 | letters, whereas values start with a single upper-case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 92 | character and the rest are lower-case letters. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 93 | |
| 490 | 94 | {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
| 95 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 352 
1e1b0fe66107
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
350diff
changeset | 96 | {../progs/app01.scala}}
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 97 | |
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 98 | |
| 490 | 99 | {\small\lstinputlisting[language=Scala,numbers=none,linebackgroundcolor=
 | 
| 100 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 101 | {../progs/app02.scala}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 102 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 103 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 104 | Graphically the algorithm by Sulzmann \& Lu can be illustrated | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 105 | by the picture in Figure~\ref{Sulz} where the path from the
 | 
| 669 | 106 | left to the right involving $\der/\nullable$ is the first phase | 
| 107 | of the algorithm and $\textit{mkeps}/\inj$, the path from right to left,
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 108 | the second phase. This picture shows the steps required when a | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 109 | regular expression, say $r_1$, matches the string $abc$. We | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 110 | first build the three derivatives (according to $a$, $b$ and | 
| 669 | 111 | $c$). We then use $\nullable$ to find out whether the resulting | 
| 296 
796b9b81ac8d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
288diff
changeset | 112 | regular expression can match the empty string. If yes, we call | 
| 422 | 113 | the function $\textit{mkeps}$. The $\textit{mkeps}$ function calculates a value for how a regular
 | 
| 114 | expression has matched the empty string. Its definition | |
| 115 | is as follows: | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 116 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 117 | \begin{figure}[t]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 118 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 119 | \begin{tikzpicture}[scale=2,node distance=1.2cm,
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 120 |                     every node/.style={minimum size=7mm}]
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 121 | \node (r1)  {$r_1$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 122 | \node (r2) [right=of r1]{$r_2$};
 | 
| 669 | 123 | \draw[->,line width=1mm](r1)--(r2) node[above,midway] {$\der\,a$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 124 | \node (r3) [right=of r2]{$r_3$};
 | 
| 669 | 125 | \draw[->,line width=1mm](r2)--(r3) node[above,midway] {$\der\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 126 | \node (r4) [right=of r3]{$r_4$};
 | 
| 669 | 127 | \draw[->,line width=1mm](r3)--(r4) node[above,midway] {$\der\,c$};
 | 
| 128 | \draw (r4) node[anchor=west] {\;\raisebox{3mm}{$\nullable$}};
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 129 | \node (v4) [below=of r4]{$v_4$};
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 130 | \draw[->,line width=1mm](r4) -- (v4); | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 131 | \node (v3) [left=of v4] {$v_3$};
 | 
| 669 | 132 | \draw[->,line width=1mm](v4)--(v3) node[below,midway] {$\inj\,c$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 133 | \node (v2) [left=of v3]{$v_2$};
 | 
| 669 | 134 | \draw[->,line width=1mm](v3)--(v2) node[below,midway] {$\inj\,b$};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 135 | \node (v1) [left=of v2] {$v_1$};
 | 
| 669 | 136 | \draw[->,line width=1mm](v2)--(v1) node[below,midway] {$\inj\,a$};
 | 
| 422 | 137 | \draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{$\textit{mkeps}$}};
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 138 | \end{tikzpicture}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 139 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 140 | \caption{The two phases of the algorithm by Sulzmann \& Lu.
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 141 | \label{Sulz}}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 142 | \end{figure}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 143 | |
| 422 | 144 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 145 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 146 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 147 | \begin{tabular}{lcl}
 | 
| 422 | 148 |   $\textit{mkeps}(\ONE)$       & $\dn$ & $Empty$\\
 | 
| 669 | 149 |   $\textit{mkeps}(r_1 + r_2)$      & $\dn$ & if $\nullable(r_1)$  \\
 | 
| 422 | 150 |                           &       & then $\Left(\textit{mkeps}(r_1))$\\
 | 
| 151 |                           &       & else $Right(\textit{mkeps}(r_2))$\\
 | |
| 152 |   $\textit{mkeps}(r_1 \cdot r_2)$  & $\dn$ & $Seq(\textit{mkeps}(r_1),\textit{mkeps}(r_2))$\\
 | |
| 153 |   $\textit{mkeps}(r^*)$            & $\dn$ & $Stars\,[]$  \\
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 154 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 155 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 156 | |
| 596 | 157 | \noindent There are no cases for $\ZERO$ and $c$, since | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 158 | these regular expression cannot match the empty string. Note | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 159 | also that in case of alternatives we give preference to the | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 160 | regular expression on the left-hand side. This will become | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 161 | important later on. | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 162 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 163 | The second phase of the algorithm is organised so that it will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 164 | calculate a value for how the derivative regular expression | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 165 | has matched a string. For this we need a function that | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 166 | reverses this ``chopping off'' for values which we did in the | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 167 | first phase for derivatives. The corresponding function is | 
| 669 | 168 | called $\inj$ (short for injection). This function takes three | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 169 | arguments: the first one is a regular expression for which we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 170 | want to calculate the value, the second is the character we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 171 | want to inject and the third argument is the value where we | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 172 | will inject the character into. The result of this function is a | 
| 669 | 173 | new value. The definition of $\inj$ is as follows: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 174 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 175 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 176 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | 
| 669 | 177 | $\inj\,(c)\,c\,Empty$ & $\dn$ & $Char\,c$\\ | 
| 178 | $\inj\,(r_1 + r_2)\,c\,\Left(v)$ & $\dn$ & $\Left(\inj\,r_1\,c\,v)$\\ | |
| 179 | $\inj\,(r_1 + r_2)\,c\,Right(v)$ & $\dn$ & $Right(\inj\,r_2\,c\,v)$\\ | |
| 180 | $\inj\,(r_1 \cdot r_2)\,c\,Seq(v_1,v_2)$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ | |
| 181 | $\inj\,(r_1 \cdot r_2)\,c\,\Left(Seq(v_1,v_2))$ & $\dn$ & $Seq(\inj\,r_1\,c\,v_1,v_2)$\\ | |
| 182 |   $\inj\,(r_1 \cdot r_2)\,c\,Right(v)$ & $\dn$  & $Seq(\textit{mkeps}(r_1),\inj\,r_2\,c\,v)$\\
 | |
| 183 | $\inj\,(r^*)\,c\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 184 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 185 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 186 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 187 | \noindent This definition is by recursion on the regular | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 188 | expression and by analysing the shape of the values. Therefore | 
| 422 | 189 | there are three cases for sequence regular | 
| 190 | expressions (for all possible shapes of the value we can encounter). The last | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 191 | clause for the star regular expression returns a list where | 
| 669 | 192 | the first element is $\inj\,r\,c\,v$ and the other elements are | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 193 | $vs$. That means $\_\,::\,\_$ should be read as list cons. | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 194 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 195 | To understand what is going on, it might be best to do some | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 196 | example calculations and compare them with Figure~\ref{Sulz}.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 197 | For this note that we have not yet dealt with the need of | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 198 | simplifying regular expressions (this will be a topic on its | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 199 | own later). Suppose the regular expression is $a \cdot (b | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 200 | \cdot c)$ and the input string is $abc$. The derivatives from | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 201 | the first phase are as follows: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 202 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 203 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 204 | \begin{tabular}{ll}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 205 | $r_1$: & $a \cdot (b \cdot c)$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 206 | $r_2$: & $\ONE \cdot (b \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 207 | $r_3$: & $(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 208 | $r_4$: & $(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 209 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 210 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 211 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 212 | \noindent According to the simple algorithm, we would test | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 213 | whether $r_4$ is nullable, which in this case it indeed is. | 
| 422 | 214 | This means we can use the function $\textit{mkeps}$ to calculate a
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 215 | value for how $r_4$ was able to match the empty string. | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 216 | Remember that this function gives preference for alternatives | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 217 | on the left-hand side. However there is only $\ONE$ on the | 
| 422 | 218 | very right-hand side of $r_4$ (underlined) | 
| 219 | ||
| 220 | \begin{center}
 | |
| 221 | $r_4$:\;$(\ZERO \cdot (b \cdot c)) + | |
| 222 |          ((\ZERO \cdot c) + \underline{\ONE})$\\
 | |
| 223 | \end{center}
 | |
| 224 | ||
| 225 | \noindent | |
| 226 | that matches the empty string. | |
| 227 | Therefore $\textit{mkeps}$ returns the value
 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 228 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 229 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 230 | $v_4:\;Right(Right(Empty))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 231 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 232 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 233 | \noindent If there had been a $\ONE$ on the left, then | 
| 422 | 234 | $\textit{mkeps}$ would have returned something of the form
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 235 | $\Left(\ldots)$. The point is that from this value we can | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 236 | directly read off which part of $r_4$ matched the empty | 
| 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 237 | string: take the right-alternative first, and then the | 
| 669 | 238 | right-alternative again, then you got to the $\ONE$. | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 239 | |
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 240 | Next we have to ``inject'' the last character, that is $c$ in | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 241 | the running example, into this value $v_4$ in order to | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 242 | calculate how $r_3$ could have matched the string $c$. | 
| 669 | 243 | For this we call injection with $\inj(r_3, c, v_4)$. | 
| 244 | According to the definition of $\inj$ we obtain | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 245 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 246 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 247 | $v_3:\;Right(Seq(Empty, Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 248 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 249 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 250 | \noindent This is the correct result, because $r_3$ needs | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 251 | to use the right-hand alternative, and then $\ONE$ needs | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 252 | to match the empty string and $c$ needs to match $c$. | 
| 596 | 253 | Next we need to inject back the letter $b$ into $v_3$. | 
| 669 | 254 | For this we call injection with $\inj(r_2, b, v_3)$. | 
| 596 | 255 | This gives | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 256 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 257 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 258 | $v_2:\;Seq(Empty, Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 259 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 260 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 261 | \noindent which is again the correct result for how $r_2$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 262 | matched the string $bc$. Finally we need to inject back the | 
| 596 | 263 | letter $a$ into $v_2$ giving the final result. | 
| 669 | 264 | For this we call injection with $\inj(r_1, a, v_2)$ | 
| 596 | 265 | and obtain | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 266 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 267 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 268 | $v_1:\;Seq(Char(a), Seq(Char(b), Char(c)))$ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 269 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 270 | |
| 596 | 271 | \noindent This value corresponds to how the regular expression $r_1$, | 
| 272 | namely $a \cdot (b \cdot c)$, matched the string $abc$. | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 273 | |
| 669 | 274 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 275 | There are a few auxiliary functions that are of interest | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 276 | when analysing this algorithm. One is called \emph{flatten},
 | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 277 | written $|\_|$, which extracts the string ``underlying'' a | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 278 | value. It is defined recursively as | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 279 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 280 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 281 | \begin{tabular}{lcl}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 282 | $|Empty|$ & $\dn$ & $[]$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 283 | $|Char(c)|$ & $\dn$ & $[c]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 284 | $|\Left(v)|$ & $\dn$ & $|v|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 285 | $|Right(v)|$ & $\dn$ & $|v|$\\ | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 286 | $|Seq(v_1,v_2)|$& $\dn$ & $|v_1| \,@\, |v_2|$\\ | 
| 422 | 287 | $|Stars\,[v_1,\ldots ,v_n]|$ & $\dn$ & $|v_1| \,@\ldots @\, |v_n|$\\ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 288 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 289 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 290 | |
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 291 | \noindent Using flatten we can see what are the strings behind | 
| 669 | 292 | the values calculated by $\textit{mkeps}$ and $\inj$. In our running 
 | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 293 | example: | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 294 | |
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 295 | \begin{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 296 | \begin{tabular}{ll}
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 297 | $|v_4|$: & $[]$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 298 | $|v_3|$: & $c$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 299 | $|v_2|$: & $bc$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 300 | $|v_1|$: & $abc$ | 
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 301 | \end{tabular}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 302 | \end{center}
 | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 303 | |
| 669 | 304 | \noindent This indicates that $\inj$ indeed is injecting, or | 
| 422 | 305 | adding, back a character into the value. | 
| 350 
c4e7caa06c74
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
326diff
changeset | 306 | |
| 669 | 307 | The definition of $\inj$ might still be very puzzling and each clause | 
| 308 | might look arbitrary, but there is in fact a relatively simple idea | |
| 309 | behind it. Ultimately the $\inj$-functions is determined by the | |
| 310 | derivative functions. For this consider one of the ``squares'' from | |
| 311 | Figure~\ref{Sulz}:
 | |
| 312 | ||
| 720 | 313 | \begin{center}
 | 
| 314 | \begin{tikzpicture}[scale=2,node distance=1.2cm,
 | |
| 315 |                     every node/.style={minimum size=7mm}]
 | |
| 316 | \node (r)  {$r$};
 | |
| 317 | \node (rd) [right=of r]{$r_{der}$};
 | |
| 318 | \draw[->,line width=1mm](r)--(rd) node[above,midway] {$\der\,c$};
 | |
| 319 | \node (vd) [below=of r2]{$v_{der}$};
 | |
| 320 | \draw[->,line width=1mm](rd) -- (vd); | |
| 321 | \node (v) [left=of vd] {$v$};
 | |
| 322 | \draw[->,line width=1mm](vd)--(v) node[below,midway] {$\inj\,c$};
 | |
| 323 | \draw[->,line width=0.5mm,dotted](r) -- (v) node[left,midway,red] {\bf ?};
 | |
| 324 | \end{tikzpicture}
 | |
| 325 | \end{center}
 | |
| 669 | 326 | |
| 327 | \noindent | |
| 328 | The input to the $\inj$-function is $r$ (on the top left), $c$ (the | |
| 329 | character to be injected) and $v_{der}$ (bottom right). The output is
 | |
| 330 | the value $v$ for how the regular expression $r$ matched the | |
| 331 | corresponding string. How does $\inj$ calculate this value? Well, it has | |
| 332 | to analyse the value $v_{der}$ and transform it into the value $v$ for
 | |
| 333 | the regular expression $r$. The value $v_{der}$ corresponds to the
 | |
| 334 | $r_{der}$-regular expression which is the derivative of $r$. Remember
 | |
| 335 | that $v_{der}$ is the value for how $r_{der}$ matches the corresponding string
 | |
| 336 | where $c$ has been chopped off. | |
| 337 | ||
| 720 | 338 | For a concrete example, let $r$ be $r_1 + r_2$. Then $r_{der}$
 | 
| 669 | 339 | is of the form $(\der\,c\,r_1) + (\der\,c\,r_2)$. What are the possible | 
| 340 | values corresponding to $r_{der}$? Well, they can be only of the form
 | |
| 341 | $\Left(\ldots)$ and $\Right(\ldots)$. Therefore you have two | |
| 342 | cases in the $\inj$ function -- one for $\Left$ and one for $\Right$. | |
| 343 | ||
| 344 | \begin{center}
 | |
| 345 | \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | |
| 346 | $\inj\,(r_1 + r_2)\,c\,\,\Left(v)$ & $\dn$ & $\ldots$\\ | |
| 347 | $\inj\,(r_1 + r_2)\,c\,\,\Right(v)$ & $\dn$ & $\ldots$\\ | |
| 348 | \end{tabular}
 | |
| 349 | \end{center}
 | |
| 350 | ||
| 351 | \noindent | |
| 352 | Let's look next at the sequence case where $r = r_1 \cdot r_2$. What does the | |
| 353 | derivative of $r$ look like? According to the definition it is: | |
| 354 | ||
| 355 | \begin{center}
 | |
| 356 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 357 | $\der\, c\, (r_1 \cdot r_2)$ & $\dn$ & if $\nullable (r_1)$\\ | |
| 358 | & & then $(\der\,c\,r_1) \cdot r_2 + \der\, c\, r_2$\\ | |
| 359 | & & else $(\der\, c\, r_1) \cdot r_2$\\ | |
| 360 |   \end{tabular}  
 | |
| 361 | \end{center}  
 | |
| 362 | ||
| 363 | \noindent As you can see there is a derivative in the if-branch and another | |
| 364 | in the else-branch. In the if-branch we have an alternative of the form | |
| 365 | $\_ + \_$. We already know what the possible values are for such a regular | |
| 366 | expression, namely $\Left$ or $\Right$. Therefore we have in $\inj$ the | |
| 367 | two cases: | |
| 368 | ||
| 369 | \begin{center}
 | |
| 370 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{2mm}}l}
 | |
| 371 | $\inj\,(r_1 \cdot r_2)\,c\,\,\Left(Seq(v_1,v_2))$ & $\dn$ & $\ldots$\\ | |
| 372 | $\inj\,(r_1 \cdot r_2)\,c\,\,Right(v)$ & $\dn$ & $\ldots$\\ | |
| 373 | \end{tabular}
 | |
| 374 | \end{center}
 | |
| 375 | ||
| 376 | \noindent | |
| 377 | In the first case we even know that it is not just a $\Left$-value, but | |
| 378 | $\Left(Seq(\ldots))$, because the corresponding regular expression | |
| 379 | in the derivation is \mbox{$(\der\,c\,r_1) \cdot r_2$}. Hence we know 
 | |
| 380 | it must be a $Seq$-value enclosed inside $\Left$. | |
| 381 | The third clause for $r_1\cdot r_2$ in the $\inj$-function | |
| 382 | ||
| 383 | \begin{center}
 | |
| 384 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 385 | $\inj\,(r_1 \cdot r_2)\,c\,\,Seq(v_1,v_2)$ & $\dn$ & $\ldots$\\ | |
| 386 |   \end{tabular}
 | |
| 387 |   \end{center}
 | |
| 388 | ||
| 389 | \noindent corresponds to the else-branch in the derivative. In this | |
| 390 | case we know the derivative is of the form $(\der\,c\,r_1) \cdot r_2$ and | |
| 391 | therefore the value must be of the form $Seq(\ldots)$. | |
| 392 | ||
| 393 | Hopefully the $inj$-function makes now more sense. I let you explain | |
| 394 | the $r^*$ case. What do the derivative of $r^*$ and | |
| 395 | the corresponding value look like? Does this explain the shape of | |
| 396 | the clause? | |
| 397 | ||
| 398 | \begin{center}
 | |
| 399 |   \begin{tabular}{l@{\hspace{1mm}}c@{\hspace{1mm}}l}
 | |
| 400 | $\inj\,(r^*)\,c\,\,Seq(v,Stars\,vs)$ & $\dn$ & $Stars(\inj\,r\,c\,v\,::\,vs)$\\ | |
| 401 |   \end{tabular}
 | |
| 402 | \end{center}
 | |
| 403 | ||
| 404 | \noindent If yes, you made sense of the left-hand sides of | |
| 405 | the $\inj$-definition. | |
| 406 | ||
| 407 | How about the right-hand sides? Well, in the | |
| 408 | $r^*$ case for example we have according to the square in the picture | |
| 409 | above a value $v_{der}$ which says how the derivative $r_{der}$
 | |
| 410 | matched the string. Since the derivative is of the form | |
| 411 | $(\der\,c\,r) \cdot (r^*)$ the corresponding value is of the | |
| 412 | form $Seq(v, Stars\,vs)$. But for $r^*$ we are looking for a value | |
| 413 | for the original (top left) regular expression --- so it cannot | |
| 414 | be a value of the shape $Seq(\ldots, Stars\ldots)$ because that is the | |
| 415 | shape for sequence regular expressions. What we need is a value | |
| 416 | of the form $Stars \ldots$ (remember the correspondence between | |
| 417 | values and regular expressions). This suggests to define the right hand | |
| 418 | side as | |
| 419 | ||
| 420 | \begin{center}
 | |
| 421 | $\ldots \dn Stars(\inj\,r\,c\,v\,::\,vs)$ | |
| 422 | \end{center}  
 | |
| 423 | ||
| 424 | \noindent It is a value of the right shape, namely $Stars$. It injected | |
| 425 | $c$ into the first-value, which is in fact the value where we need to | |
| 426 | undo the derivative. Remember again the shape of the derivative of $r^*$. | |
| 427 | In place where we chopped off the $c$, we now need to do the $\inj$ of $c$. | |
| 428 | Therefore $\inj\,r\,c\,v$ in the definition above. That is the same with | |
| 429 | the other clauses in $\inj$. | |
| 430 | ||
| 431 | ||
| 671 | 432 | Phew\ldots{}Sweat\ldots!\#@$\skull$\%\ldots Unfortunately, there is
 | 
| 433 | a gigantic problem with the described algorithm so far: it is very | |
| 720 | 434 | slow. To make it faster, we have to include in all this the simplification | 
| 435 | from Lecture 2\ldots{}and what rotten luck: simplification messes things 
 | |
| 436 | up and we need to rectify the mess. This is what we shall do next. | |
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 437 | |
| 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 438 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 439 | \subsubsection*{Simplification}
 | 
| 282 
3e3b927a85cf
added ho04
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
251diff
changeset | 440 | |
| 283 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 441 | Generally the matching algorithms based on derivatives do | 
| 
c14e5ebf0c3b
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
282diff
changeset | 442 | poorly unless the regular expressions are simplified after | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 443 | each derivative step. But this is a bit more involved in the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 444 | algorithm of Sulzmann \& Lu. So what follows might require you | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 445 | to read several times before it makes sense and also might | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 446 | require that you do some example calculations yourself. As a | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 447 | first example consider the last derivation step in our earlier | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 448 | example: | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 449 | |
| 422 | 450 | \begin{equation}\label{regexfour}
 | 
| 669 | 451 | r_4 = \der\,c\,r_3 = | 
| 422 | 452 | (\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE) | 
| 453 | \end{equation}
 | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 454 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 455 | \noindent Simplifying this regular expression would just give | 
| 422 | 456 | us $\ONE$. Running $\textit{mkeps}$ with this $\ONE$ as
 | 
| 457 | input, however, would give us with the value $Empty$ instead of | |
| 458 | ||
| 459 | \[Right(Right(Empty))\] | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 460 | |
| 422 | 461 | \noindent | 
| 462 | that was obtained without the simplification. The problem is we need | |
| 463 | to recreate this more complicated value, rather than just return | |
| 464 | $Empty$. This will require what I call \emph{rectification functions}.
 | |
| 465 | They need to be calculated whenever a regular expression gets | |
| 466 | simplified. | |
| 467 | ||
| 468 | Rectification functions take a value as argument and return a | |
| 469 | (rectified) value. In the example above the argument would be $Empty$ | |
| 470 | and the output $Right(Right(Empty))$. Before we define these | |
| 471 | rectification functions in general, let us first take a look again at | |
| 472 | our simplification rules: | |
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 473 | |
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 474 | \begin{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 475 | \begin{tabular}{l@{\hspace{2mm}}c@{\hspace{2mm}}l}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 476 | $r \cdot \ZERO$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 477 | $\ZERO \cdot r$ & $\mapsto$ & $\ZERO$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 478 | $r \cdot \ONE$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 479 | $\ONE \cdot r$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 480 | $r + \ZERO$ & $\mapsto$ & $r$\\ | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 481 | $\ZERO + r$ & $\mapsto$ & $r$\\ | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 482 | $r + r$ & $\mapsto$ & $r$\\ | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 483 | \end{tabular}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 484 | \end{center}
 | 
| 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 485 | |
| 422 | 486 | \noindent Applying them to $r_4$ in \eqref{regexfour} will require several nested
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 487 | simplifications in order end up with just $\ONE$. However, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 488 | it is possible to apply them in a depth-first, or inside-out, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 489 | manner in order to calculate this simplified regular | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 490 | expression. | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 491 | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 492 | The rectification we can implement by letting simp return not | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 493 | just a (simplified) regular expression, but also a | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 494 | rectification function. Let us consider the alternative case, | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 495 | $r_1 + r_2$, first. By going depth-first, we first simplify | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 496 | the component regular expressions $r_1$ and $r_2.$ This will | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 497 | return simplified versions, say $r_{1s}$ and $r_{2s}$, of the
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 498 | component regular expressions (if they can be simplified) but | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 499 | also two rectification functions $f_{1s}$ and $f_{2s}$. We
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 500 | need to assemble them in order to obtain a rectified value for | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 501 | $r_1 + r_2$. In case $r_{1s}$ simplified to $\ZERO$, we
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 502 | continue the derivative calculation with $r_{2s}$. The
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 503 | Sulzmann \& Lu algorithm would return a corresponding value, | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 504 | say $v_{2s}$. But now this value needs to be ``rectified'' to
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 505 | the value | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 506 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 507 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 508 | $Right(v_{2s})$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 509 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 510 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 511 | \noindent The reason is that we look for the value that tells | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 512 | us how $r_1 + r_2$ could have matched the string, not just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 513 | $r_2$ or $r_{2s}$. Unfortunately, this is still not the right
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 514 | value in general because there might be some simplifications | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 515 | that happened inside $r_2$ and for which the simplification | 
| 520 | 516 | function returned also a rectification function $f_{2s}$. So in
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 517 | fact we need to apply this one too which gives | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 518 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 519 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 520 | $Right(f_{2s}(v_{2s}))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 521 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 522 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 523 | \noindent This is now the correct, or rectified, value. Since | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 524 | the simplification will be done in the first phase of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 525 | algorithm, but the rectification needs to be done to the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 526 | values in the second phase, it is advantageous to calculate | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 527 | the rectification as a function, remember this function and | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 528 | then apply the value to this function during the second phase. | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 529 | So if we want to implement the rectification as function, we | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 530 | would need to return | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 531 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 532 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 533 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 534 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 535 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 536 | \noindent which is the lambda-calculus notation for | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 537 | a function that expects a value $v$ and returns everything | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 538 | after the dot where $v$ is replaced by whatever value is | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 539 | given. | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 540 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 541 | Let us package this idea with rectification functions | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 542 | into a single function (still only considering the alternative | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 543 | case): | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 544 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 545 | \begin{center}
 | 
| 422 | 546 | \begin{tabular}{ll}
 | 
| 547 | $_1$ & $simp(r)$:\\ | |
| 548 | $_2$ & \quad case $r = r_1 + r_2$\\ | |
| 549 | $_3$ & \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | |
| 550 | $_4$ & \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | |
| 551 | $_5$ & \qquad case $r_{1s} = \ZERO$: 
 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 552 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 422 | 553 | $_6$ & \qquad case $r_{2s} = \ZERO$: 
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 554 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 555 | $_7$ & \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 556 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 422 | 557 | $_8$ & \qquad otherwise: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 558 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$\\
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 559 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 560 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 561 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 562 | \noindent We first recursively call the simplification with | 
| 422 | 563 | $r_1$ and $r_2$ (Lines 3 and 4). This gives simplified regular expressions, | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 564 | $r_{1s}$ and $r_{2s}$, as well as two rectification functions
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 565 | $f_{1s}$ and $f_{2s}$. We next need to test whether the
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 566 | simplified regular expressions are $\ZERO$ so as to make | 
| 422 | 567 | further simplifications. In case $r_{1s}$ is $\ZERO$ (Line 5),
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 568 | then we can return $r_{2s}$ (the other alternative). However
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 569 | for this we need to build a corresponding rectification | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 570 | function, which as said above is | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 571 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 572 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 573 | $\lambda v.\,Right(f_{2s}(v))$
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 574 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 575 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 576 | \noindent The case where $r_{2s} = \ZERO$ is similar:
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 577 | We return $r_{1s}$ and rectify with $\Left(\_)$ and the
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 578 | other calculated rectification function $f_{1s}$. This gives
 | 
| 284 
0afe43616b6a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
283diff
changeset | 579 | |
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 580 | \begin{center}
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 581 | $\lambda v.\,\Left(f_{1s}(v))$
 | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 582 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 583 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 584 | \noindent The next case where $r_{1s} = r_{2s}$ can be treated
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 585 | like the one where $r_{2s} = \ZERO$. We return $r_{1s}$
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 586 | and rectify with $\Left(\_)$ and so on. | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 587 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 588 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 589 | The otherwise-case is slightly more complicated. In this case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 590 | neither $r_{1s}$ nor $r_{2s}$ are $\ZERO$ and also
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 591 | $r_{1s} \not= r_{2s}$, which means no further simplification
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 592 | can be applied. Accordingly, we return $r_{1s} + r_{2s}$ as
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 593 | the simplified regular expression. In principle we also do not | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 594 | have to do any rectification, because no simplification was | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 595 | done in this case. But this is actually not true: There might | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 596 | have been simplifications inside $r_{1s}$ and $r_{2s}$. We
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 597 | therefore need to take into account the calculated | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 598 | rectification functions $f_{1s}$ and $f_{2s}$. We can do this
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 599 | by defining a rectification function $f_{alt}$ which takes two
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 600 | rectification functions as arguments and applies them | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 601 | according to whether the value is of the form $\Left(\_)$ or | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 602 | $Right(\_)$: | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 603 | |
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 604 | \begin{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 605 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 606 | $f_{alt}(f_1, f_2) \dn$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 607 | \qquad $\lambda v.\,$ case $v = \Left(v')$: | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 608 | & return $\Left(f_1(v'))$\\ | 
| 285 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 609 | \qquad \phantom{$\lambda v.\,$} case $v = Right(v')$: 
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 610 | & return $Right(f_2(v'))$\\ | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 611 | \end{tabular}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 612 | \end{center}
 | 
| 
8a222559278f
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
284diff
changeset | 613 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 614 | The other interesting case with simplification is the sequence | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 615 | case. In this case the main simplification function is as | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 616 | follows | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 617 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 618 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 619 | \begin{tabular}{l}
 | 
| 422 | 620 | $simp(r)$:\hspace{5cm} (continued)\\
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 621 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 622 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 623 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 624 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 625 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 626 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 627 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 628 | \qquad case $r_{1s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 629 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 630 | \qquad case $r_{2s} = \ONE$: 
 | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 631 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 632 | \qquad otherwise: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 633 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 634 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 635 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 636 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 637 | \noindent whereby in the last line $f_{seq}$ is again pushing
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 638 | the two rectification functions into the two components of the | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 639 | Seq-value: | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 640 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 641 | \begin{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 642 | \begin{tabular}{l@{\hspace{1mm}}l}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 643 | $f_{seq}(f_1, f_2) \dn$\\
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 644 | \qquad $\lambda v.\,$ case $v = Seq(v_1, v_2)$: | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 645 | & return $Seq(f_1(v_1), f_2(v_2))$\\ | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 646 | \end{tabular}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 647 | \end{center}
 | 
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 648 | |
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 649 | \noindent Note that in the case of $r_{1s} = \ZERO$
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 650 | (similarly $r_{2s}$) we use the function $f_{error}$ for
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 651 | rectification. If you think carefully, then you will realise | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 652 | that this function will actually never been called. This is | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 653 | because a sequence with $\ZERO$ will never recognise any | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 654 | string and therefore the second phase of the algorithm would | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 655 | never been called. The simplification function still expects | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 656 | us to give a function. So in my own implementation I just | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 657 | returned a function that raises an error. In the case | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 658 | where $r_{1s} = \ONE$ (similarly $r_{2s}$) we have
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 659 | to create a sequence where the first component is a rectified | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 660 | version of $Empty$. Therefore we call $f_{1s}$ with $Empty$.
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 661 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 662 | Since we only simplify regular expressions of the form $r_1 + | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 663 | r_2$ and $r_1 \cdot r_2$ we do not have to do anything else | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 664 | in the remaining cases. The rectification function will | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 665 | be just the identity, which in lambda-calculus terms is | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 666 | just | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 667 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 668 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 669 | $\lambda v.\,v$ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 670 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 671 | |
| 422 | 672 | \noindent This completes the high-level version of the simplification | 
| 673 | function, which is shown again in Figure~\ref{simp}. The Scala code
 | |
| 444 
3056a4c071b0
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
422diff
changeset | 674 | for the simplification function is in Figure~\ref{simprect}.
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 675 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 676 | \begin{figure}[t]
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 677 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 678 | \begin{tabular}{l}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 679 | $simp(r)$:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 680 | \quad case $r = r_1 + r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 681 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 682 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 683 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 684 |        return $(r_{2s}, \lambda v. \,Right(f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 685 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 686 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 687 | \qquad case $r_{1s} = r_{2s}$:
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 688 |        return $(r_{1s}, \lambda v. \,\Left(f_{1s}(v)))$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 689 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 690 |        return $(r_{1s} + r_{2s}, f_{alt}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 691 | \medskip\\ | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 692 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 693 | \quad case $r = r_1 \cdot r_2$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 694 | \qquad let $(r_{1s}, f_{1s}) = simp(r_1)$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 695 | \qquad \phantom{let} $(r_{2s}, f_{2s}) = simp(r_2)$\smallskip\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 696 | \qquad case $r_{1s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 697 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 698 | \qquad case $r_{2s} = \ZERO$: 
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 699 |        return $(\ZERO, f_{error})$\\
 | 
| 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 700 | \qquad case $r_{1s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 701 | return $(r_{2s}, \lambda v. \,Seq(f_{1s}(Empty), f_{2s}(v)))$\\
 | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 702 | \qquad case $r_{2s} = \ONE$: 
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 703 | return $(r_{1s}, \lambda v. \,Seq(f_{1s}(v), f_{2s}(Empty)))$\\
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 704 | \qquad otherwise: | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 705 |        return $(r_{1s} \cdot r_{2s}, f_{seq}(f_{1s}, f_{2s}))$
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 706 | \medskip\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 707 | |
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 708 | \quad otherwise:\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 709 | \qquad return $(r, \lambda v.\,v)$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 710 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 711 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 712 | \caption{The simplification function that returns a simplified 
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 713 | regular expression and a rectification function.\label{simp}}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 714 | \end{figure}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 715 | |
| 422 | 716 | \begin{figure}[p]
 | 
| 492 | 717 | \small | 
| 490 | 718 | \lstinputlisting[numbers=left,linebackgroundcolor= | 
| 719 |    {\ifodd\value{lstnumber}\color{capri!3}\fi}]{../progs/app61.scala}
 | |
| 422 | 720 | |
| 721 | \caption{The Scala code for the simplification function. The
 | |
| 520 | 722 | first part defines some auxiliary functions for the rectification. | 
| 422 | 723 | The second part give the simplification function. | 
| 724 | \label{simprect}}
 | |
| 725 | \end{figure}
 | |
| 726 | ||
| 727 | ||
| 728 | We are now in the position to define a \emph{lexing function} as follows:
 | |
| 729 | ||
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 730 | \begin{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 731 | \begin{tabular}{lcl}
 | 
| 669 | 732 | $lex\,r\,[]$ & $\dn$ & if $\nullable(r)$ then $\textit{mkeps}(r)$\\
 | 
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 733 | & & else $error$\\ | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 734 | $lex\,r\,c\!::\!s$ & $\dn$ & let | 
| 669 | 735 |    $(r_{simp}, f_{rect}) = simp(\der(c, r))$\\
 | 
| 736 | & & $\inj\,r\,c\,f_{rect}(lex\,r_{simp}\,s)$              
 | |
| 287 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 737 | \end{tabular}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 738 | \end{center}
 | 
| 
2c50b8b5886c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
286diff
changeset | 739 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 740 | \noindent This corresponds to the $matches$ function we have | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 741 | seen in earlier lectures. In the first clause we are given an | 
| 520 | 742 | empty string, $[]$, and need to test whether the regular | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 743 | expression is $nullable$. If yes, we can proceed normally and | 
| 422 | 744 | just return the value calculated by $\textit{mkeps}$. The second clause
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 745 | is for strings where the first character is $c$, say, and the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 746 | rest of the string is $s$. We first build the derivative of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 747 | $r$ with respect to $c$; simplify the resulting regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 748 | expression. We continue lexing with the simplified regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 749 | expression and the string $s$. Whatever will be returned as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 750 | value, we sill need to rectify using the $f_{rect}$ from the
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 751 | simplification and finally inject $c$ back into the | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 752 | (rectified) value. | 
| 286 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 753 | |
| 
19020b75d75e
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
285diff
changeset | 754 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 755 | \subsubsection*{Records}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 756 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 757 | Remember we wanted to tokenize input strings, that means | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 758 | splitting strings into their ``word'' components. Furthermore | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 759 | we want to classify each token as being a keyword or | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 760 | identifier and so on. For this one more feature will be | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 761 | required, which I call a \emph{record} regular expression.
 | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 762 | While values encode how a regular expression matches a string, | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 763 | records can be used to ``focus'' on some particular parts of | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 764 | the regular expression and ``forget'' about others. | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 765 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 766 | Let us look at an example. Suppose you have the regular | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 767 | expression $a\cdot b + a\cdot c$. Clearly this regular expression can only | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 768 | recognise two strings. But suppose you are not interested | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 769 | whether it can recognise $ab$ or $ac$, but rather if it | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 770 | matched, then what was the last character of the matched | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 771 | string\ldots either $b$ or $c$. You can do this by annotating | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 772 | the regular expression with a record, written in general | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 773 | $(x:r)$, where $x$ is just an identifier (in my implementation | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 774 | a plain string) and $r$ is a regular expression. A record will | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 775 | be regarded as a regular expression. The extended definition | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 776 | in Scala therefore looks as follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 777 | |
| 490 | 778 | {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
| 779 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 780 | {../progs/app03.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 781 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 782 | \noindent Since we regard records as regular expressions we | 
| 669 | 783 | need to extend the functions $\nullable$ and $\der$. Similarly | 
| 784 | $\textit{mkeps}$ and $\inj$ need to be extended. This means we also need
 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 785 | to extend the definition of values, which in Scala looks as | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 786 | follows: | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 787 | |
| 490 | 788 | {\small\lstinputlisting[language=Scala, numbers=none,linebackgroundcolor=
 | 
| 789 |                   {\ifodd\value{lstnumber}\color{capri!3}\fi}]
 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 790 | {../progs/app04.scala}}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 791 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 792 | \noindent Let us now look at the purpose of records more | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 793 | closely and let us return to our question whether the string | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 794 | terminated in a $b$ or $c$. We can do this as follows: we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 795 | annotate the regular expression $ab + ac$ with a record | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 796 | as follows | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 797 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 798 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 799 | $a\cdot (x:b) + a\cdot (x:c)$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 800 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 801 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 802 | \noindent This regular expression can still only recognise | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 803 | the strings $ab$ and $ac$, but we can now use a function | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 804 | that takes a value and returns all records. I call this | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 805 | function \emph{env} for environment\ldots it builds a list
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 806 | of identifiers associated with a string. This function | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 807 | can be defined as follows: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 808 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 809 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 810 | \begin{tabular}{lcl}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 811 | $env(Empty)$ & $\dn$ & $[]$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 812 | $env(Char(c))$ & $\dn$ & $[]$\\ | 
| 400 
e4afe3f46c29
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
394diff
changeset | 813 | $env(\Left(v))$ & $\dn$ & $env(v)$\\ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 814 | $env(Right(v))$ & $\dn$ & $env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 815 | $env(Seq(v_1,v_2))$& $\dn$ & $env(v_1) \,@\, env(v_2)$\\ | 
| 422 | 816 | $env(Stars\,[v_1,\ldots ,v_n])$ & $\dn$ & | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 817 | $env(v_1) \,@\ldots @\, env(v_n)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 818 | $env(Rec(x:v))$ & $\dn$ & $(x:|v|) :: env(v)$\\ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 819 | \end{tabular}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 820 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 821 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 822 | \noindent where in the last clause we use the flatten function | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 823 | defined earlier. As can be seen, the function $env$ ``picks'' | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 824 | out all underlying strings where a record is given. Since | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 825 | there can be more than one, the environment will potentially | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 826 | contain many ``records''. If we now postprocess the value | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 827 | calculated by $lex$ extracting all records using $env$, we can | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 828 | answer the question whether the last element in the string was | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 829 | an $b$ or a $c$. Lets see this in action: if we use $a\cdot b | 
| 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 830 | + a\cdot c$ and $ac$ the calculated value will be | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 831 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 832 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 833 | $Right(Seq(Char(a), Char(c)))$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 834 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 835 | |
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 836 | \noindent If we use instead $a\cdot (x:b) + a\cdot (x:c)$ and | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 837 | use the $env$ function to extract the recording for | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 838 | $x$ we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 839 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 840 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 841 | $[(x:c)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 842 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 843 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 844 | \noindent If we had given the string $ab$ instead, then the | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 845 | record would have been $[(x:b)]$. The fun starts if we | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 846 | iterate this. Consider the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 847 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 848 | \begin{center}
 | 
| 326 
94700593a2d5
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
319diff
changeset | 849 | $(a\cdot (x:b) + a\cdot (y:c))^*$ | 
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 850 | \end{center}
 | 
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 851 | |
| 288 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 852 | \noindent and the string $ababacabacab$. This string is | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 853 | clearly matched by the regular expression, but we are only | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 854 | interested in the sequence of $b$s and $c$s. Using $env$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 855 | we obtain | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 856 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 857 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 858 | $[(x:b), (x:b), (y:c), (x:b), (y:c), (x:b)]$ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 859 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 860 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 861 | \noindent While this feature might look silly, it is in fact | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 862 | quite useful. For example if we want to match the name of | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 863 | an email we might use the regular expression | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 864 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 865 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 866 | (name: [a\mbox{-}z0\mbox{-}9\_\!\_\,.-]^+)\cdot @\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 867 | (domain: [a\mbox{-}z0\mbox{-}9\,.-]^+)\cdot .\cdot 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 868 | (top\_level: [a\mbox{-}z\,.]^{\{2,6\}})
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 869 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 870 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 871 | \noindent Then if we match the email address | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 872 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 873 | \[ | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 874 | \texttt{christian.urban@kcl.ac.uk}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 875 | \] | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 876 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 877 | \noindent we can use the $env$ function and find out | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 878 | what the name, domain and top-level part of the email | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 879 | address are: | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 880 | |
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 881 | \begin{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 882 | $[(name:\texttt{christian.urban}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 883 |   (domain:\texttt{kcl}), 
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 884 |   (top\_level:\texttt{ac.uk})]$
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 885 | \end{center}
 | 
| 
39aeca14af8c
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
287diff
changeset | 886 | |
| 721 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 887 | Recall that we want to lex a little programming language, called the | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 888 | \emph{While}-language. A simple program in this language is shown in
 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 889 | Figure~\ref{while}. The main keywords in this language are
 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 890 | \pcode{while}, \pcode{if}, \pcode{then} and
 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 891 | \pcode{else}.\footnote{Contrast this with the COBOL programming
 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 892 | language, which was developed around 1960 and thought to be dead for | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 893 | many decades---even your friendly lecturer is not old enough to have | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 894 | been taught this language. Anyway, this language had over 600 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 895 |   keywords (or what they called \emph{reserved words}). Interestingly
 | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 896 | though this language is still used in 2020: during the height of | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 897 | Corona crisis the State of New Jewrsey in the US was looking for | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 898 | COBOL programers who could fix the state's national insurance | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 899 | webpage. You were probably paid in gold and diamonds, if you were | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 900 | able to program in COBOL. If you fixed their webpage, surely you | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 901 | were allowed to marry the governer's son/daughter. } As usual we | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 902 | have syntactic categories for identifiers, operators, numbers and so | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 903 | on. For this we would need to design the corresponding regular | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 904 | expressions to recognise these syntactic categories. I let you do this | 
| 
e3c64f22dd31
added slides from Rochester
 Christian Urban <christian.urban@kcl.ac.uk> parents: 
720diff
changeset | 905 | design task. Having these regular expressions at our disposal, we can | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 906 | form the regular expression | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 907 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 908 | \begin{figure}[t]
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 909 | \begin{center}
 | 
| 778 | 910 | \mbox{\lstinputlisting[language=while]{../progs/while-tests/fib.while}}
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 911 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 912 | \caption{The Fibonacci program in the While-language.\label{while}}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 913 | \end{figure}
 | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 914 | |
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 915 | \begin{center}
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 916 | $\textit{WhileRegs} \;\dn\; 
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 917 | \left(\begin{array}{l}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 918 | (k, KeyWords)\; +\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 919 | (i, Ids)\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 920 | (o, Ops)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 921 | (n, Nums)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 922 | (s, Semis)\;+ \\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 923 | (p, (LParens + RParens))\;+\\ | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 924 | (b, (Begin + End))\;+ \\ | 
| 551 | 925 | (w, WhiteSpaces) | 
| 394 
2f9fe225ecc8
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
357diff
changeset | 926 |       \end{array}\right)^{\mbox{\LARGE{}*}}$
 | 
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 927 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 928 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 929 | \noindent and ask the algorithm by Sulzmann \& Lu to lex, say | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 930 | the following string | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 931 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 932 | \begin{center}\large
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 933 | \code{"if true then then 42 else +"}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 934 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 935 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 936 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 937 | By using the records and extracting the environment, the | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 938 | result is the following list: | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 939 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 940 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 941 | \begin{tabular}{l}
 | 
| 643 | 942 | (k, if),\\ | 
| 943 | (w, " "),\\ | |
| 944 | (i, true),\\ | |
| 945 | (w, " "),\\ | |
| 946 | (k, then),\\ | |
| 947 | (w, " "),\\ | |
| 948 | (k, then),\\ | |
| 949 | (w, " "),\\ | |
| 950 | (n, 42),\\ | |
| 951 | (w, " "),\\ | |
| 952 | (k, else),\\ | |
| 953 | (w, " "),\\ | |
| 954 | (o, +) | |
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 955 | \end{tabular}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 956 | \end{center}
 | 
| 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 957 | |
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 958 | \noindent Typically we are not interested in the whitespaces | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 959 | and comments and would filter them out: this gives | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 960 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 961 | \begin{center}\tt
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 962 | \begin{tabular}{l}
 | 
| 643 | 963 | (k, if),\\ | 
| 964 | (i, true),\\ | |
| 965 | (k, then),\\ | |
| 966 | (k, then),\\ | |
| 967 | (n, 42),\\ | |
| 968 | (k, else),\\ | |
| 969 | (o, +) | |
| 357 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 970 | \end{tabular}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 971 | \end{center}
 | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 972 | |
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 973 | \noindent | 
| 
603e171a7b48
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
356diff
changeset | 974 | which will be the input for the next phase of our compiler. | 
| 356 
d9c784c71305
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
352diff
changeset | 975 | |
| 251 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 976 | \end{document}
 | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 977 | |
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 978 | %%% Local Variables: | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 979 | %%% mode: latex | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 980 | %%% TeX-master: t | 
| 
5b5a68df6d16
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 981 | %%% End: |