\documentclass{article}+ −
\usepackage{../style}+ −
\usepackage{../langs}+ −
\usepackage{../graphics}+ −
+ −
\begin{document}+ −
+ −
\section*{Handout 5 (Lexing)}+ −
+ −
Whenever you want to design a new programming language or+ −
implement a compiler for an existing language, the first task+ −
is to fix the basic ``words'' of the language. For example+ −
what are the keywords, or reserved words, of the language,+ −
what are permitted identifiers, numbers, expressions and so+ −
on. One convenient way to do this is, of course, by using+ −
regular expressions. + −
+ −
In this course we want to take a closer look at the WHILE+ −
programming language. This is a simple imperative programming+ −
language consisting of arithmetic expressions, assignments,+ −
if-statements and loops. For example the Fibonacci program can+ −
be written in this language as follows:+ −
+ −
\begin{center}+ −
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}+ −
\end{center}+ −
+ −
\noindent+ −
The keywords in this language will be+ −
+ −
\begin{center}+ −
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}+ −
\end{center}+ −
+ −
\noindent+ −
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers+ −
and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''+ −
is in our programming language and what comments should look like.+ −
As a first try, we might specify the regular expressions for our language roughly as follows+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\+ −
$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\+ −
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\+ −
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ + −
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\+ −
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\+ −
$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$+ −
\end{tabular}+ −
\end{center}+ −
+ −
Having the regular expressions in place, the problem we have to solve is: + −
given a string of our programming language, which regular expression + −
matches which part of the string. By solving this problem, we can split up a string + −
of our language into components. + −
For example given the input string+ −
+ −
\begin{center}+ −
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}+ −
\end{center}+ −
+ −
\noindent+ −
we expect it is split up as follows+ −
+ −
\begin{center}+ −
\tt+ −
\Grid{if}\,+ −
\Grid{\VS}\, + −
\Grid{true}\, + −
\Grid{\VS}\, + −
\Grid{then}\, + −
\Grid{\VS}\, + −
\Grid{x}\, + −
\Grid{+}\, + −
\Grid{2}\,+ −
\Grid{\VS}\, + −
\Grid{else}\,+ −
\Grid{\VS}\,+ −
\Grid{x}\,+ −
\Grid{+}\,+ −
\Grid{3}+ −
\end{center}+ −
+ −
\noindent+ −
This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.+ −
It is usually the first phase of a compiler. Note that the separation into words cannot, in general, + −
be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace+ −
in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are + −
not separated by any whitespace. Another reason for recognising whitespaces explicitly is+ −
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in + −
our small language we will eventually just filter out all whitespaces and also all comments.+ −
+ −
Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.+ −
For the moment, though, we will only focus on the simpler problem of just splitting up + −
a string into components.+ −
+ −
There are a few subtleties we need to consider first. For example, say the string is+ −
+ −
\begin{center}+ −
\texttt{\Grid{iffoo\VS}}\;\ldots+ −
\end{center}+ −
+ −
\noindent+ −
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed+ −
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a + −
single identifier. The choice that is often made in lexers is to look for the longest possible match.+ −
This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).+ −
+ −
Unfortunately, the convention about the longest match does not yet make the process + −
of lexing completely deterministic. Consider the string+ −
+ −
\begin{center}+ −
\texttt{\Grid{then\VS}}\;\dots+ −
\end{center}+ −
+ −
\noindent+ −
Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our + −
regular expressions. In our running example we just use the ranking+ −
+ −
\[+ −
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots+ −
\]+ −
+ −
\noindent+ −
So even if both regular expressions match in the example above,+ −
we give preference to the regular expression for keywords.+ −
+ −
Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$+ −
and $der$, it will use the function \emph{zeroable} defined as follows:+ −
+ −
\begin{center}+ −
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}+ −
$zeroable(\varnothing)$ & $\dn$ & $true$\\+ −
$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\+ −
$zeroable (c)$ & $\dn$ & $f\!alse$\\+ −
$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\ + −
$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\+ −
$zeroable (r^*)$ & $\dn$ & $f\!alse$\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
Recall that the function $nullable(r)$ tests whether a regular expression+ −
can match the empty string. The function $zeroable$, on the other hand, tests whether a regular+ −
expression cannot match anything at all. The mathematical way of stating this+ −
property is+ −
+ −
\begin{center}+ −
$zeroable(r)$ if and only if $L(r) = \varnothing$+ −
\end{center}+ −
+ −
\noindent+ −
For what follows let us fix a set of regular expressions $rs$ as being + −
+ −
\begin{center}+ −
\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}+ −
\end{center}+ −
+ −
\noindent+ −
specifying the ``words'' + −
of our programming language. The algorithm takes as input the $rs$ and a string, say+ −
+ −
\begin{center}+ −
\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots+ −
\end{center}+ −
+ −
\noindent+ −
and tries to chop off one word from the beginning of the string. If none of the+ −
regular expression in $rs$ matches, we will just return+ −
the empty string.+ −
+ −
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect+ −
to the first character $c_1$. Then we take the results and continue with + −
building the derivatives with respect to $c_2$ until we have either exhausted our + −
input string or all of the regular expressions are ``zeroable''. Suppose the input string is + −
+ −
\begin{center}+ −
\texttt{\Grid{if2\VS}}\;\dots+ −
\end{center}+ −
+ −
\noindent+ −
then building the derivatives with respect to \texttt{i} gives+ −
+ −
\begin{center}+ −
\begin{tabular}{l|c}+ −
& $zeroable$\\\hline+ −
$der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\+ −
$der\;\texttt{i}\;(\textit{IDENT})$ & no\\+ −
$der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
We can eliminate \textit{WHITESPACE} as a potential candidate, because+ −
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other+ −
two regular expressions as potential candidate and we have to consider the+ −
next character, \texttt{f}, from the input string+ −
+ −
\begin{center}+ −
\begin{tabular}{l|c}+ −
& $zeroable$\\\hline+ −
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\+ −
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
Since both are `no', we have to continue with \texttt{2} from the input string+ −
+ −
\begin{center}+ −
\begin{tabular}{l|c}+ −
& $zeroable$\\\hline+ −
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\+ −
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet+ −
know how much of the input string should be considered as an \textit{IDENT}. So we+ −
still have to continue and consider the next derivative.+ −
+ −
\begin{center}+ −
\begin{tabular}{l|c}+ −
& $zeroable$\\\hline+ −
$der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
Since the answer is now `yes' also in this case, we can stop: once all derivatives are+ −
zeroable, we know the regular expressions cannot match any more letters from+ −
the input. In this case we only have to go back to the derivative that is nullable. In this case it+ −
is+ −
+ −
\begin{center}+ −
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ + −
\end{center} + −
+ −
\noindent+ −
which means we recognised an identifier. In case where there is a choice of more than one+ −
regular expressions that are nullable, then we choose the one with the highest precedence. + −
You can try out such a case with the input string+ −
+ −
\begin{center}+ −
\texttt{\Grid{then\VS}}\;\dots+ −
\end{center}+ −
+ −
\noindent+ −
which can both be recognised as a keyword, but also an identifier. + −
+ −
While in the example above the last nullable derivative is the one directly before+ −
the derivative turns zeroable, this is not always the case. Imagine, identifiers can be + −
letters, as permuted by the regular expression \textit{IDENT}, but must end with an + −
undercore.+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ + −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
If we use \textit{NEWIDENT} with the input string+ −
+ −
\begin{center}+ −
\texttt{\Grid{iffoo\VS}}\;\ldots+ −
\end{center}+ −
+ −
\noindent+ −
then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the+ −
first \texttt{f} because only + −
+ −
\begin{center}+ −
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ + −
\end{center}+ −
+ −
\noindent+ −
is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining+ −
string needs to be consumed by other regular expressions or lead to a lexing error.+ −
+ −
+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −