diff -r 603e171a7b48 -r b3129cff41e9 handouts/ho05-bak.tex --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/handouts/ho05-bak.tex Mon Oct 19 23:49:25 2015 +0100 @@ -0,0 +1,290 @@ +\documentclass{article} +\usepackage{../style} +\usepackage{../langs} +\usepackage{../graphics} + +\begin{document} + +\section*{Handout 5 (Lexing)} + +Whenever you want to design a new programming language or +implement a compiler for an existing language, the first task +is to fix the basic ``words'' of the language. For example +what are the keywords, or reserved words, of the language, +what are permitted identifiers, numbers, expressions and so +on. One convenient way to do this is, of course, by using +regular expressions. + +In this course we want to take a closer look at the WHILE +programming language. This is a simple imperative programming +language consisting of arithmetic expressions, assignments, +if-statements and loops. For example the Fibonacci program can +be written in this language as follows: + +\begin{center} +\mbox{\lstinputlisting[language=while]{../progs/fib.while}} +\end{center} + +\noindent +The keywords in this language will be + +\begin{center} +\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} +\end{center} + +\noindent +In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers +and strings (which we however ignore for the moment). We also need to specify what the ``whitespace'' +is in our programming language and what comments should look like. +As a first try, we might specify the regular expressions for our language roughly as follows + +\begin{center} +\begin{tabular}{lcl} +$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\ +$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\ +$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ +$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ +$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\ +$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\ +$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$ +\end{tabular} +\end{center} + +Having the regular expressions in place, the problem we have to solve is: +given a string of our programming language, which regular expression +matches which part of the string. By solving this problem, we can split up a string +of our language into components. +For example given the input string + +\begin{center} +\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} +\end{center} + +\noindent +we expect it is split up as follows + +\begin{center} +\tt +\Grid{if}\, +\Grid{\VS}\, +\Grid{true}\, +\Grid{\VS}\, +\Grid{then}\, +\Grid{\VS}\, +\Grid{x}\, +\Grid{+}\, +\Grid{2}\, +\Grid{\VS}\, +\Grid{else}\, +\Grid{\VS}\, +\Grid{x}\, +\Grid{+}\, +\Grid{3} +\end{center} + +\noindent +This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}. +It is usually the first phase of a compiler. Note that the separation into words cannot, in general, +be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace +in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are +not separated by any whitespace. Another reason for recognising whitespaces explicitly is +that in some languages, for example Python, whitespaces matters, that is carry meaning. However in +our small language we will eventually just filter out all whitespaces and also all comments. + +Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. +For the moment, though, we will only focus on the simpler problem of just splitting up +a string into components. + +There are a few subtleties we need to consider first. For example, say the string is + +\begin{center} +\texttt{\Grid{iffoo\VS}}\;\ldots +\end{center} + +\noindent +then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed +by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a +single identifier. The choice that is often made in lexers is to look for the longest possible match. +This leaves \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}). + +Unfortunately, the convention about the longest match does not yet make the process +of lexing completely deterministic. Consider the string + +\begin{center} +\texttt{\Grid{then\VS}}\;\dots +\end{center} + +\noindent +Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our +regular expressions. In our running example we just use the ranking + +\[ +\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots +\] + +\noindent +So even if both regular expressions match in the example above, +we give preference to the regular expression for keywords. + +Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$ +and $der$, it will use the function \emph{zeroable} defined as follows: + +\begin{center} +\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} +$zeroable(\varnothing)$ & $\dn$ & $true$\\ +$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\ +$zeroable (c)$ & $\dn$ & $f\!alse$\\ +$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\ +$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\ +$zeroable (r^*)$ & $\dn$ & $f\!alse$\\ +\end{tabular} +\end{center} + +\noindent +Recall that the function $nullable(r)$ tests whether a regular expression +can match the empty string. The function $zeroable$, on the other hand, tests whether a regular +expression cannot match anything at all. The mathematical way of stating this +property is + +\begin{center} +$zeroable(r)$ if and only if $L(r) = \varnothing$ +\end{center} + +\noindent +For what follows let us fix a set of regular expressions $rs$ as being + +\begin{center} +\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE} +\end{center} + +\noindent +specifying the ``words'' +of our programming language. The algorithm takes as input the $rs$ and a string, say + +\begin{center} +\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots +\end{center} + +\noindent +and tries to chop off one word from the beginning of the string. If none of the +regular expression in $rs$ matches, we will just return +the empty string. + +The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect +to the first character $c_1$. Then we take the results and continue with +building the derivatives with respect to $c_2$ until we have either exhausted our +input string or all of the regular expressions are ``zeroable''. Suppose the input string is + +\begin{center} +\texttt{\Grid{if2\VS}}\;\dots +\end{center} + +\noindent +then building the derivatives with respect to \texttt{i} gives + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\ + $der\;\texttt{i}\;(\textit{IDENT})$ & no\\ + $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ +\end{tabular} +\end{center} + +\noindent +We can eliminate \textit{WHITESPACE} as a potential candidate, because +no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other +two regular expressions as potential candidate and we have to consider the +next character, \texttt{f}, from the input string + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ + $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ +\end{tabular} +\end{center} + +\noindent +Since both are `no', we have to continue with \texttt{2} from the input string + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ + $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\ +\end{tabular} +\end{center} + +\noindent +Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet +know how much of the input string should be considered as an \textit{IDENT}. So we +still have to continue and consider the next derivative. + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ +\end{tabular} +\end{center} + +\noindent +Since the answer is now `yes' also in this case, we can stop: once all derivatives are +zeroable, we know the regular expressions cannot match any more letters from +the input. In this case we only have to go back to the derivative that is nullable. In this case it +is + +\begin{center} +$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ +\end{center} + +\noindent +which means we recognised an identifier. In case where there is a choice of more than one +regular expressions that are nullable, then we choose the one with the highest precedence. +You can try out such a case with the input string + +\begin{center} +\texttt{\Grid{then\VS}}\;\dots +\end{center} + +\noindent +which can both be recognised as a keyword, but also an identifier. + +While in the example above the last nullable derivative is the one directly before +the derivative turns zeroable, this is not always the case. Imagine, identifiers can be +letters, as permuted by the regular expression \textit{IDENT}, but must end with an +undercore. + +\begin{center} +\begin{tabular}{lcl} +$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ +\end{tabular} +\end{center} + +\noindent +If we use \textit{NEWIDENT} with the input string + +\begin{center} +\texttt{\Grid{iffoo\VS}}\;\ldots +\end{center} + +\noindent +then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the +first \texttt{f} because only + +\begin{center} + $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ +\end{center} + +\noindent +is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining +string needs to be consumed by other regular expressions or lead to a lexing error. + + + +\end{document} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% End: