\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../graphics}\begin{document}\section*{Handout 5 (Lexing)}Whenever you want to design a new programming language orimplement a compiler for an existing language, the first taskis to fix the basic ``words'' of the language. For examplewhat are the keywords, or reserved words, of the language,what are permitted identifiers, numbers, expressions and soon. One convenient way to do this is, of course, by usingregular expressions. In this course we want to take a closer look at the WHILEprogramming language. This is a simple imperative programminglanguage consisting of arithmetic expressions, assignments,if-statements and loops. For example the Fibonacci program canbe written in this language as follows:\begin{center}\mbox{\lstinputlisting[language=while]{../progs/fib.while}}\end{center}\noindentThe keywords in this language will be\begin{center}\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}\end{center}\noindentIn addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbersand strings (which we however ignore for the moment). We also need to specify what the ``whitespace''is in our programming language and what comments should look like.As a first try, we might specify the regular expressions for our language roughly as follows\begin{center}\begin{tabular}{lcl}$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ $\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$\end{tabular}\end{center}Having the regular expressions in place, the problem we have to solve is: given a string of our programming language, which regular expression matches which part of the string. By solving this problem, we can split up a string of our language into components. For example given the input string\begin{center}\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}\end{center}\noindentwe expect it is split up as follows\begin{center}\tt\Grid{if}\,\Grid{\VS}\, \Grid{true}\, \Grid{\VS}\, \Grid{then}\, \Grid{\VS}\, \Grid{x}\, \Grid{+}\, \Grid{2}\,\Grid{\VS}\, \Grid{else}\,\Grid{\VS}\,\Grid{x}\,\Grid{+}\,\Grid{3}\end{center}\noindentThis process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.It is usually the first phase of a compiler. Note that the separation into words cannot, in general, be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespacein the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any whitespace. Another reason for recognising whitespaces explicitly isthat in some languages, for example Python, whitespaces matters, that is carry meaning. However in our small language we will eventually just filter out all whitespaces and also all comments.Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.For the moment, though, we will only focus on the simpler problem of just splitting up a string into components.There are a few subtleties we need to consider first. For example, say the string is\begin{center}\texttt{\Grid{iffoo\VS}}\;\ldots\end{center}\noindentthen there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followedby the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a single identifier. The choice that is often made in lexers is to look for the longest possible match.This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).Unfortunately, the convention about the longest match does not yet make the process of lexing completely deterministic. Consider the string\begin{center}\texttt{\Grid{then\VS}}\;\dots\end{center}\noindentClearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our regular expressions. In our running example we just use the ranking\[\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots\]\noindentSo even if both regular expressions match in the example above,we give preference to the regular expression for keywords.Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$and $der$, it will use the function \emph{zeroable} defined as follows:\begin{center}\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}$zeroable(\varnothing)$ & $\dn$ & $true$\\$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\$zeroable (c)$ & $\dn$ & $f\!alse$\\$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\ $zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\$zeroable (r^*)$ & $\dn$ & $f\!alse$\\\end{tabular}\end{center}\noindentRecall that the function $nullable(r)$ tests whether a regular expressioncan match the empty string. The function $zeroable$, on the other hand, tests whether a regularexpression cannot match anything at all. The mathematical way of stating thisproperty is\begin{center}$zeroable(r)$ if and only if $L(r) = \varnothing$\end{center}\noindentFor what follows let us fix a set of regular expressions $rs$ as being \begin{center}\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}\end{center}\noindentspecifying the ``words'' of our programming language. The algorithm takes as input the $rs$ and a string, say\begin{center}\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots\end{center}\noindentand tries to chop off one word from the beginning of the string. If none of theregular expression in $rs$ matches, we will just returnthe empty string.The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respectto the first character $c_1$. Then we take the results and continue with building the derivatives with respect to $c_2$ until we have either exhausted our input string or all of the regular expressions are ``zeroable''. Suppose the input string is \begin{center}\texttt{\Grid{if2\VS}}\;\dots\end{center}\noindentthen building the derivatives with respect to \texttt{i} gives\begin{center}\begin{tabular}{l|c} & $zeroable$\\\hline $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\ $der\;\texttt{i}\;(\textit{IDENT})$ & no\\ $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\\end{tabular}\end{center}\noindentWe can eliminate \textit{WHITESPACE} as a potential candidate, becauseno derivative can go from $zeroable = \text{yes}$ to no. That leaves the othertwo regular expressions as potential candidate and we have to consider thenext character, \texttt{f}, from the input string\begin{center}\begin{tabular}{l|c} & $zeroable$\\\hline $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\\end{tabular}\end{center}\noindentSince both are `no', we have to continue with \texttt{2} from the input string\begin{center}\begin{tabular}{l|c} & $zeroable$\\\hline $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\\end{tabular}\end{center}\noindentAlthough we now know that the beginning is definitely an \textit{IDENT}, we do not yetknow how much of the input string should be considered as an \textit{IDENT}. So westill have to continue and consider the next derivative.\begin{center}\begin{tabular}{l|c} & $zeroable$\\\hline $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\\end{tabular}\end{center}\noindentSince the answer is now `yes' also in this case, we can stop: once all derivatives arezeroable, we know the regular expressions cannot match any more letters fromthe input. In this case we only have to go back to the derivative that is nullable. In this case itis\begin{center}$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ \end{center} \noindentwhich means we recognised an identifier. In case where there is a choice of more than oneregular expressions that are nullable, then we choose the one with the highest precedence. You can try out such a case with the input string\begin{center}\texttt{\Grid{then\VS}}\;\dots\end{center}\noindentwhich can both be recognised as a keyword, but also an identifier. While in the example above the last nullable derivative is the one directly beforethe derivative turns zeroable, this is not always the case. Imagine, identifiers can be letters, as permuted by the regular expression \textit{IDENT}, but must end with an undercore.\begin{center}\begin{tabular}{lcl}$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ \end{tabular}\end{center}\noindentIf we use \textit{NEWIDENT} with the input string\begin{center}\texttt{\Grid{iffoo\VS}}\;\ldots\end{center}\noindentthen it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to thefirst \texttt{f} because only \begin{center} $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ \end{center}\noindentis nullable. As a result we recognise successfully the keyword \texttt{if} and the remainingstring needs to be consumed by other regular expressions or lead to a lexing error.\end{document}%%% Local Variables: %%% mode: latex%%% TeX-master: t%%% End: