handouts/ho05-bak.tex
changeset 358 b3129cff41e9
parent 292 7ed2a25dd115
child 871 94b84d880c2b
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/handouts/ho05-bak.tex	Mon Oct 19 23:49:25 2015 +0100
@@ -0,0 +1,290 @@
+\documentclass{article}
+\usepackage{../style}
+\usepackage{../langs}
+\usepackage{../graphics}
+
+\begin{document}
+
+\section*{Handout 5 (Lexing)}
+
+Whenever you want to design a new programming language or
+implement a compiler for an existing language, the first task
+is to fix the basic ``words'' of the language. For example
+what are the keywords, or reserved words, of the language,
+what are permitted identifiers, numbers, expressions and so
+on. One convenient way to do this is, of course, by using
+regular expressions. 
+
+In this course we want to take a closer look at the WHILE
+programming language. This is a simple imperative programming
+language consisting of arithmetic expressions, assignments,
+if-statements and loops. For example the Fibonacci program can
+be written in this language as follows:
+
+\begin{center}
+\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
+\end{center}
+
+\noindent
+The keywords in this language will be
+
+\begin{center}
+\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
+\end{center}
+
+\noindent
+In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
+and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
+is in our programming language and what comments should look like.
+As a first try, we might specify the regular expressions for our language roughly as follows
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
+$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
+$\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
+$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
+$\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
+$\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
+$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
+\end{tabular}
+\end{center}
+
+Having the regular expressions in place, the problem we have to solve is: 
+given a string of our programming language, which regular expression 
+matches which part of the string. By solving this problem, we can split up a string 
+of our language into components. 
+For example given the input string
+
+\begin{center}
+\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
+\end{center}
+
+\noindent
+we expect it is split up as follows
+
+\begin{center}
+\tt
+\Grid{if}\,
+\Grid{\VS}\, 
+\Grid{true}\, 
+\Grid{\VS}\, 
+\Grid{then}\, 
+\Grid{\VS}\, 
+\Grid{x}\, 
+\Grid{+}\, 
+\Grid{2}\,
+\Grid{\VS}\, 
+\Grid{else}\,
+\Grid{\VS}\,
+\Grid{x}\,
+\Grid{+}\,
+\Grid{3}
+\end{center}
+
+\noindent
+This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
+It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
+be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
+in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
+not separated by any whitespace. Another reason for recognising whitespaces explicitly is
+that in some languages, for example Python, whitespaces matters, that is carry meaning. However in 
+our small language we will eventually just filter out all whitespaces and also all comments.
+
+Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
+For the moment, though, we will only focus on the simpler problem of just splitting up 
+a string into components.
+
+There are a few subtleties  we need to consider first. For example, say the string is
+
+\begin{center}
+\texttt{\Grid{iffoo\VS}}\;\ldots
+\end{center}
+
+\noindent
+then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
+by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
+single identifier. The choice that is often made in lexers is to look for the longest possible match.
+This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
+
+Unfortunately, the convention about the longest match does not yet make the process 
+of lexing completely deterministic. Consider the string
+
+\begin{center}
+\texttt{\Grid{then\VS}}\;\dots
+\end{center}
+
+\noindent
+Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
+regular expressions. In our running example we just use the ranking
+
+\[
+\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
+\]
+
+\noindent
+So even if both regular expressions match in the example above,
+we give preference to the regular expression for keywords.
+
+Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
+and $der$, it will  use the function \emph{zeroable} defined as follows:
+
+\begin{center}
+\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
+$zeroable(\varnothing)$      & $\dn$ & $true$\\
+$zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
+$zeroable (c)$                    & $\dn$ &  $f\!alse$\\
+$zeroable (r_1 + r_2)$        & $\dn$ &  $zeroable(r_1) \wedge zeroable(r_2)$ \\ 
+$zeroable (r_1 \cdot r_2)$  & $\dn$ &  $zeroable(r_1) \vee zeroable(r_2)$ \\
+$zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
+\end{tabular}
+\end{center}
+
+\noindent
+Recall that the function $nullable(r)$ tests whether a regular expression
+can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
+expression cannot match anything at all. The mathematical way of stating this
+property is
+
+\begin{center}
+$zeroable(r)$ if and only if $L(r) = \varnothing$
+\end{center}
+
+\noindent
+For what follows let us fix a set of regular expressions $rs$ as being 
+
+\begin{center}
+\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
+\end{center}
+
+\noindent
+specifying the ``words'' 
+of our programming language. The algorithm takes as input the $rs$ and a string, say
+
+\begin{center}
+\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
+\end{center}
+
+\noindent
+and tries to chop off one word from the beginning of the string. If none of the
+regular expression in $rs$ matches, we will just return
+the empty string.
+
+The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
+to the first character $c_1$. Then we take the results and continue with 
+building the derivatives with respect to $c_2$ until we have either exhausted our 
+input string or all of the regular expressions are ``zeroable''.  Suppose the input string is 
+
+\begin{center}
+\texttt{\Grid{if2\VS}}\;\dots
+\end{center}
+
+\noindent
+then building the derivatives with respect to \texttt{i} gives
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
+ $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
+ $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
+\end{tabular}
+\end{center}
+
+\noindent
+We can eliminate \textit{WHITESPACE} as a potential candidate, because
+no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
+two regular expressions as potential candidate and we have to consider the
+next character, \texttt{f}, from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since both are `no', we have to continue with \texttt{2} from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
+know how much of the input string should be considered as an \textit{IDENT}. So we
+still have to continue and consider the next derivative.
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since the answer is now `yes' also in this case, we can stop: once all derivatives are
+zeroable, we know the regular expressions cannot match any more letters from
+the input. In this case we only have to go back to the derivative that is nullable.  In this case it
+is
+
+\begin{center}
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
+\end{center} 
+
+\noindent
+which means we recognised an identifier. In case where there is a choice of more than one
+regular expressions that are nullable, then we choose the one with the highest precedence. 
+You can try out such a case with the input string
+
+\begin{center}
+\texttt{\Grid{then\VS}}\;\dots
+\end{center}
+
+\noindent
+which can both be recognised as a keyword, but also an identifier. 
+
+While in the example above the last nullable derivative is the one directly before
+the derivative turns zeroable, this is not always the case. Imagine, identifiers can be 
+letters, as permuted by the regular expression \textit{IDENT}, but must end with an 
+undercore.
+
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\ 
+\end{tabular}
+\end{center}
+
+\noindent
+If we use \textit{NEWIDENT} with the input string
+
+\begin{center}
+\texttt{\Grid{iffoo\VS}}\;\ldots
+\end{center}
+
+\noindent
+then it will only become $zeroable$ after the $\VS$ has been analysed. In this case we have to go back to the
+first \texttt{f} because only 
+
+\begin{center}
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$  
+\end{center}
+
+\noindent
+is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
+string needs to be consumed by other regular expressions or lead to a lexing error.
+
+
+
+\end{document}
+
+%%% Local Variables: 
+%%% mode: latex
+%%% TeX-master: t
+%%% End: