151
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 1
\documentclass{article}
292
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 2
\usepackage{../style}
217
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 3
\usepackage{../langs}
292
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 4
\usepackage{../graphics}
151
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 5
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 6
\begin{document}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 7
292
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 8
\section*{Handout 5 (Lexing)}
151
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 9
292
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 10
Whenever you want to design a new programming language or
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 11
implement a compiler for an existing language, the first task
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 12
is to fix the basic ``words'' of the language. For example
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 13
what are the keywords, or reserved words, of the language,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 14
what are permitted identifiers, numbers, expressions and so
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 15
on. One convenient way to do this is, of course, by using
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 16
regular expressions.
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 17
292
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 18
In this course we want to take a closer look at the WHILE
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 19
programming language. This is a simple imperative programming
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 20
language consisting of arithmetic expressions, assignments,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 21
if-statements and loops. For example the Fibonacci program can
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 22
be written in this language as follows:
151
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 23
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 24
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 25
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 26
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 27
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 28
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 29
The keywords in this language will be
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 30
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 31
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 32
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 33
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 34
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 35
\noindent
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 36
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 37
and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 38
is in our programming language and what comments should look like.
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 39
As a first try, we might specify the regular expressions for our language roughly as follows
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 40
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 41
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 42
\begin{tabular}{lcl}
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 43
$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 44
$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 45
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 46
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 47
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 48
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 49
$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 50
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 51
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 52
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 53
Having the regular expressions in place, the problem we have to solve is:
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 54
given a string of our programming language, which regular expression
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 55
matches which part of the string. By solving this problem, we can split up a string
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 56
of our language into components.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 57
For example given the input string
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 58
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 59
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 60
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 61
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 62
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 63
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 64
we expect it is split up as follows
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 65
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 66
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 67
\tt
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 68
\Grid{if}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 69
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 70
\Grid{true}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 71
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 72
\Grid{then}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 73
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 74
\Grid{x}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 75
\Grid{+}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 76
\Grid{2}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 77
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 78
\Grid{else}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 79
\Grid{\VS}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 80
\Grid{x}\,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 81
\Grid{+}\,
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 82
\Grid{3}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 83
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 84
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 85
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 86
This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 87
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 88
be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 89
in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 90
not separated by any whitespace. Another reason for recognising whitespaces explicitly is
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 91
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 92
our small language we will eventually just filter out all whitespaces and also all comments.
153
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 93
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 94
Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 95
For the moment, though, we will only focus on the simpler problem of just splitting up
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 96
a string into components.
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 97
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 98
There are a few subtleties we need to consider first. For example, say the string is
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 99
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 100
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 101
\texttt{\Grid{iffoo\VS}}\;\ldots
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 102
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 103
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 104
\noindent
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 105
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
154
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 106
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 107
single identifier. The choice that is often made in lexers is to look for the longest possible match.
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 108
This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 109
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 110
Unfortunately, the convention about the longest match does not yet make the process
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 111
of lexing completely deterministic. Consider the string
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 112
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 113
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 114
\texttt{\Grid{then\VS}}\;\dots
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 115
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 116
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 117
\noindent
159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 118
Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our
156
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 119
regular expressions. In our running example we just use the ranking
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 120
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 121
\[
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 122
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 123
\]
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 124
156
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 125
\noindent
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 126
So even if both regular expressions match in the example above,
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 127
we give preference to the regular expression for keywords.
155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 128
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 129
Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 130
and $der$, it will use the function \emph{zeroable} defined as follows:
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 131
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 132
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 133
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 134
$zeroable(\varnothing)$ & $\dn$ & $true$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 135
$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 136
$zeroable (c)$ & $\dn$ & $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 137
$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 138
$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 139
$zeroable (r^*)$ & $\dn$ & $f\!alse$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 140
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 141
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 142
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 143
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 144
Recall that the function $nullable(r)$ tests whether a regular expression
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 145
can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 146
expression cannot match anything at all. The mathematical way of stating this
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 147
property is
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 148
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 149
\begin{center}
160
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 150
$zeroable(r)$ if and only if $L(r) = \varnothing$
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 151
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 152
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 153
\noindent
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 154
For what follows let us fix a set of regular expressions $rs$ as being
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 155
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 156
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 157
\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 158
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 159
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 160
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 161
specifying the ``words''
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 162
of our programming language. The algorithm takes as input the $rs$ and a string, say
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 163
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 164
\begin{center}
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 165
\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 166
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 167
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 168
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 169
and tries to chop off one word from the beginning of the string. If none of the
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 170
regular expression in $rs$ matches, we will just return
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 171
the empty string.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 172
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 173
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 174
to the first character $c_1$. Then we take the results and continue with
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 175
building the derivatives with respect to $c_2$ until we have either exhausted our
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 176
input string or all of the regular expressions are ``zeroable''. Suppose the input string is
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 177
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 178
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 179
\texttt{\Grid{if2\VS}}\;\dots
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 180
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 181
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 182
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 183
then building the derivatives with respect to \texttt{i} gives
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 184
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 185
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 186
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 187
& $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 188
$der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 189
$der\;\texttt{i}\;(\textit{IDENT})$ & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 190
$der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 191
\end{tabular}
158
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 192
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 193
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 194
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 195
We can eliminate \textit{WHITESPACE} as a potential candidate, because
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 196
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 197
two regular expressions as potential candidate and we have to consider the
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 198
next character, \texttt{f}, from the input string
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 199
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 200
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 201
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 202
& $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 203
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 204
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 205
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 206
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 207
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 208
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 209
Since both are `no', we have to continue with \texttt{2} from the input string
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 210
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 211
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 212
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 213
& $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 214
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 215
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 216
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 217
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 218
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 219
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 220
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 221
know how much of the input string should be considered as an \textit{IDENT}. So we
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 222
still have to continue and consider the next derivative.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 223
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 224
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 225
\begin{tabular}{l|c}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 226
& $zeroable$\\\hline
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 227
$der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 228
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 229
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 230
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 231
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 232
Since the answer is now `yes' also in this case, we can stop: once all derivatives are
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 233
zeroable, we know the regular expressions cannot match any more letters from
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 234
the input. In this case we only have to go back to the derivative that is nullable. In this case it
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 235
is
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 236
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 237
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 238
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 239
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 240
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 241
\noindent
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 242
which means we recognised an identifier. In case where there is a choice of more than one
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 243
regular expressions that are nullable, then we choose the one with the highest precedence.
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 244
You can try out such a case with the input string
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 245
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 246
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 247
\texttt{\Grid{then\VS}}\;\dots
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 248
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 249
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 250
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 251
which can both be recognised as a keyword, but also an identifier.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 252
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 253
While in the example above the last nullable derivative is the one directly before
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 254
the derivative turns zeroable, this is not always the case. Imagine, identifiers can be
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 255
letters, as permuted by the regular expression \textit{IDENT}, but must end with an
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 256
undercore.
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 257
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 258
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 259
\begin{tabular}{lcl}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 260
$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 261
\end{tabular}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 262
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 263
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 264
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 265
If we use \textit{NEWIDENT} with the input string
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 266
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 267
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 268
\texttt{\Grid{iffoo\VS}}\;\ldots
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 269
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 270
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 271
\noindent
163
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 272
then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the
162
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 273
first \texttt{f} because only
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 274
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 275
\begin{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 276
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 277
\end{center}
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 278
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 279
\noindent
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 280
is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 281
string needs to be consumed by other regular expressions or lead to a lexing error.
161
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 282
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 283
157
Christian Urban <christian dot urban at kcl dot ac dot uk>
diff
changeset
+ − 284
151
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 285
\end{document}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 286
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 287
%%% Local Variables:
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 288
%%% mode: latex
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 289
%%% TeX-master: t
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
+ − 290
%%% End: