diff -r 8134c3b981e0 -r bfcf838dda4d handouts/ho05.tex --- a/handouts/ho05.tex Sat Oct 26 10:06:58 2013 +0100 +++ b/handouts/ho05.tex Sun Oct 27 00:08:57 2013 +0100 @@ -146,7 +146,7 @@ \end{center} \noindent -we expect it will be split up as follows +we expect it is split up as follows \begin{center} \tt @@ -168,21 +168,22 @@ \end{center} \noindent -This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. +This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}. It is usually the first phase of a compiler. Note that the separation into words cannot, in general, -be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, -this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any -whitespace. Another reason for recognising whitespaces explicitly is -that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments. +be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace +in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are +not separated by any whitespace. Another reason for recognising whitespaces explicitly is +that in some languages, for example Python, whitespaces matters, that is carry meaning. However in +our small language we will eventually just filter out all whitespaces and also all comments. -Lexing will not just separate a string into its components, but also classify the components, that -is explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. -For the moment, though, we will only focus on the simpler problem of just splitting a string into components. +Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. +For the moment, though, we will only focus on the simpler problem of just splitting up +a string into components. There are a few subtleties we need to consider first. For example, say the string is \begin{center} -\texttt{\Grid{iffoo\VS\ldots}} +\texttt{\Grid{iffoo\VS}}\;\ldots \end{center} \noindent @@ -191,11 +192,11 @@ single identifier. The choice that is often made in lexers is to look for the longest possible match. This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). -Unfortunately, the convention about the longest match does not yet make the whole process +Unfortunately, the convention about the longest match does not yet make the process of lexing completely deterministic. Consider the string \begin{center} -\texttt{\Grid{then\VS\dots}} +\texttt{\Grid{then\VS}}\;\dots \end{center} \noindent @@ -210,10 +211,8 @@ So even if both regular expressions match in the example above, we give preference to the regular expression for keywords. -Let us see how our algorithm for lexing works in detail. The regular -expressions and their ranking are shown above. For our algorithm it will -be helpful to have a look at the function \emph{zeroable} -defined as follows: +Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$ +and $der$, it will use the function \emph{zeroable} defined as follows: \begin{center} \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}} @@ -227,8 +226,8 @@ \end{center} \noindent -In contrast to the function $nullable(r)$, which test whether a regular expression -can match the empty string, the $zeroable$ function identifies whether a regular +Recall that the function $nullable(r)$ tests whether a regular expression +can match the empty string. The function $zeroable$, on the other hand, tests whether a regular expression cannot match anything at all. The mathematical way of stating this property is @@ -237,19 +236,99 @@ \end{center} \noindent -Let us fix a set of regular expressions $rs$. -The crucial idea of the algorithm working with $rs$ and the string, say +For what follows let us fix a set of regular expressions $rs$ as being + +\begin{center} +\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE} +\end{center} + +\noindent +specifying the ``words'' +of our programming language. The algorithm takes as input the $rs$ and a string, say \begin{center} -\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}} +\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots +\end{center} + +\noindent +and tries to chop off one word from the beginning of the string. If none of the +regular expression in $rs$ matches, we will just return +the empty string. + +The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect +to the first character $c_1$. Then we take the results and continue with +building the derivatives with respect to $c_2$ until we have either exhausted our +input string or all of the regular expressions are ``zeroable''. If the input string is + +\begin{center} +\texttt{\Grid{if2\VS}}\;\dots +\end{center} + +\noindent +then building the derivatives with respect to \texttt{i} gives + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\ + $der\;\texttt{i}\;(\textit{IDENT})$ & no\\ + $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\ +\end{tabular} \end{center} \noindent -is to build the derivative of all regular expressions in $rs$ with respect -to the first character $c_1$. Then we take the results and continue with -building the derivatives with respect to $c_2$ -until we have either exhausted our input string or all of the regular expressions -are ``zeroable''. +We can eliminate then \textit{WHITESPACE} as a potential candidate, because +no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other +two regular expressions as potential candidate and we have to consider the +next character from the input string + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\ + $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\ +\end{tabular} +\end{center} + +\noindent +Since both are `no', we have to go on with \texttt{2} from the input string + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\ + $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\ +\end{tabular} +\end{center} + +\noindent +Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet +know how much of the input string should be considered as an \textit{IDENT}. So we +also have to go on and consider the next derivative. + +\begin{center} +\begin{tabular}{l|c} + & $zeroable$\\\hline + $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\ +\end{tabular} +\end{center} + +\noindent +Since the answer is now `yes' we can stop: once all derivatives are +zeroable, we know the regular expressions cannot match any more letters from +the input. In this case we only have to go back to the derivative that is nullable. In this case it +is + +\begin{center} +$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ +\end{center} + +\noindent +which means we recognised an identifier. In case there is a choice of more than one +regular expressions that are nullable, then we choose the one with the highest precedence. +You can try out this case + + \end{document}