added
authorChristian Urban <christian dot urban at kcl dot ac dot uk>
Sun, 27 Oct 2013 00:08:57 +0100
changeset 161 bfcf838dda4d
parent 160 8134c3b981e0
child 162 edcd84c7b491
added
handouts/ho05.pdf
handouts/ho05.tex
Binary file handouts/ho05.pdf has changed
--- a/handouts/ho05.tex	Sat Oct 26 10:06:58 2013 +0100
+++ b/handouts/ho05.tex	Sun Oct 27 00:08:57 2013 +0100
@@ -146,7 +146,7 @@
 \end{center}
 
 \noindent
-we expect it will be split up as follows
+we expect that it is split up as follows
 
 \begin{center}
 \tt
@@ -168,21 +168,22 @@
 \end{center}
 
 \noindent
-This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
+This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
-be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
-this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any
-whitespace. Another reason for recognising whitespaces explicitly is
-that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments.
+be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
+in the example above, this is not always the case. As can be seen, the three components in \texttt{x+2} are 
+not separated by any whitespace. Another reason for recognising whitespaces explicitly is
+that in some languages, for example Python, whitespace matters, that is, carries meaning. However, in 
+our small language we will eventually just filter out all whitespaces and also all comments.
 
-Lexing will not just separate a string into its components, but also classify the components, that
-is explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
-For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
+Lexing does not just separate a string into its components, but also classifies the components, that is, 
+it explicitly records that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+For the moment, though, we will only focus on the simpler problem of just splitting up 
+a string into components.
 
 There are a few subtleties  we need to consider first. For example, say the string is
 
 \begin{center}
-\texttt{\Grid{iffoo\VS\ldots}}
+\texttt{\Grid{iffoo\VS}}\;\ldots
 \end{center}
 
 \noindent
@@ -191,11 +192,11 @@
 single identifier. The choice that is often made in lexers is to look for the longest possible match.
 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
 
-Unfortunately, the convention about the longest match does not yet make the whole process 
+Unfortunately, the convention about the longest match does not yet make the process 
 of lexing completely deterministic. Consider the string
 
 \begin{center}
-\texttt{\Grid{then\VS\dots}}
+\texttt{\Grid{then\VS}}\;\dots
 \end{center}
 
 \noindent
@@ -210,10 +211,8 @@
 So even if both regular expressions match in the example above,
 we give preference to the regular expression for keywords.
 
-Let us see how our algorithm for lexing works in detail. The regular 
-expressions and their ranking are shown above. For our algorithm it will  
-be helpful to have a look at the function \emph{zeroable}
-defined as follows:
+Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
+and $der$, it will  use the function \emph{zeroable} defined as follows:
 
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
@@ -227,8 +226,8 @@
 \end{center}
 
 \noindent
-In contrast to the function $nullable(r)$, which test whether a regular expression
-can match the empty string, the $zeroable$ function identifies whether a regular
+Recall that the function $nullable(r)$ tests whether a regular expression
+can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
 expression cannot match anything at all. The mathematical way of stating this
 property is
 
@@ -237,19 +236,99 @@
 \end{center}
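For concreteness, $zeroable$ can also be sketched in code alongside $nullable$. The following Scala fragment is only an illustration: the datatype \texttt{Rexp} and the definition of \texttt{nullable} are the usual ones from the earlier handouts, and the clauses of \texttt{zeroable} are a plausible rendering of the definition above.

```scala
// Sketch: the usual datatype of regular expressions, nullable, and a
// plausible clause-by-clause definition of zeroable (tests L(r) == {}).
abstract class Rexp
case object ZERO extends Rexp                       // matches nothing
case object ONE extends Rexp                        // matches the empty string
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

// does r match the empty string?
def nullable(r: Rexp): Boolean = r match {
  case ZERO        => false
  case ONE         => true
  case CHAR(_)     => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_)     => true
}

// does r match no string at all?
def zeroable(r: Rexp): Boolean = r match {
  case ZERO        => true
  case ONE         => false
  case CHAR(_)     => false
  case ALT(r1, r2) => zeroable(r1) && zeroable(r2)  // both alternatives empty
  case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)  // one empty factor suffices
  case STAR(_)     => false                         // always matches ""
}
```

Note the duality with $nullable$: alternatives swap from `or' to `and', sequences from `and' to `or', and \texttt{STAR} is never zeroable because it always matches the empty string.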
 
 \noindent
-Let us fix a set of regular expressions $rs$.
-The crucial idea of the algorithm working with $rs$ and the string, say
+For what follows let us fix a set of regular expressions $rs$ as being 
+
+\begin{center}
+\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
+\end{center}
+
+\noindent
+specifying the ``words'' 
+of our programming language. The algorithm takes as input the regular expressions $rs$ and a string, say
 
 \begin{center}
-\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
+\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
+\end{center}
+
+\noindent
+and tries to chop off one word from the beginning of the string. If none of the
+regular expressions in $rs$ matches, we will just return
+the empty string.
+
+The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
+to the first character $c_1$. Then we take the results and continue with 
+building the derivatives with respect to $c_2$ until we have either exhausted our 
+input string or all of the regular expressions are ``zeroable''.  If the input string is 
+
+\begin{center}
+\texttt{\Grid{if2\VS}}\;\dots
+\end{center}
+
+\noindent
+then building the derivatives with respect to \texttt{i} gives
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
+ $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
+ $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
+\end{tabular}
 \end{center}
 
 \noindent
-is to build the derivative of all regular expressions in $rs$ with respect
-to the first character $c_1$. Then we take the results and continue with 
-building the derivatives with respect to $c_2$
-until we have either exhausted our input string or all of the regular expressions
-are ``zeroable''.
+We can then eliminate \textit{WHITESPACE} as a potential candidate, because
+no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
+two regular expressions as potential candidates and we have to consider the
+next character from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
+ $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since both answers are `no', we have to go on with the next character \texttt{2} from the input string
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
+ $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
+\end{tabular}
+\end{center}
+
+\noindent
+Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
+know how much of the input string should be considered as an \textit{IDENT}. So we
+also have to go on and consider the next derivative.
+
+\begin{center}
+\begin{tabular}{l|c}
+ & $zeroable$\\\hline
+ $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
+\end{tabular}
+\end{center}
+
+\noindent
+Since the answer is now `yes', we can stop: once all derivatives are
+zeroable, we know the regular expressions cannot match any more letters from
+the input. We then have to go back to the last derivative that is nullable. In this example it
+is
+
+\begin{center}
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ 
+\end{center} 
+
+\noindent
+which means we recognised an identifier. In case there is a choice of more than one
+nullable regular expression, we choose the one with the highest precedence. 
+You can try out this case with the \texttt{then}-example from above.
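The whole chopping algorithm can be sketched in Scala as follows. This is only an illustration: the helper names \texttt{longest} and \texttt{chop} are made up for this sketch, and the concrete shapes of \textit{KEYWORD}, \textit{IDENT} and \textit{WHITESPACE} are assumptions, not definitions from the handout.

```scala
// Self-contained sketch of the one-word chopping algorithm described above.
abstract class Rexp
case object ZERO extends Rexp
case object ONE extends Rexp
case class CHAR(c: Char) extends Rexp
case class ALT(r1: Rexp, r2: Rexp) extends Rexp
case class SEQ(r1: Rexp, r2: Rexp) extends Rexp
case class STAR(r: Rexp) extends Rexp

def nullable(r: Rexp): Boolean = r match {
  case ZERO => false
  case ONE => true
  case CHAR(_) => false
  case ALT(r1, r2) => nullable(r1) || nullable(r2)
  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
  case STAR(_) => true
}

def zeroable(r: Rexp): Boolean = r match {
  case ZERO => true
  case ONE => false
  case CHAR(_) => false
  case ALT(r1, r2) => zeroable(r1) && zeroable(r2)
  case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)
  case STAR(_) => false
}

// derivative of r with respect to the character c
def der(c: Char, r: Rexp): Rexp = r match {
  case ZERO => ZERO
  case ONE => ZERO
  case CHAR(d) => if (c == d) ONE else ZERO
  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
  case SEQ(r1, r2) =>
    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
    else SEQ(der(c, r1), r2)
  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
}

// length of the longest prefix of s matched by r (None if no prefix matches):
// record the position whenever the current derivative is nullable and stop
// as soon as it becomes zeroable
def longest(r: Rexp, s: List[Char], pos: Int, last: Option[Int]): Option[Int] = {
  val last2 = if (nullable(r)) Some(pos) else last
  s match {
    case Nil => last2
    case c :: cs => if (zeroable(r)) last2 else longest(der(c, r), cs, pos + 1, last2)
  }
}

// chop one word off the front of s: longest match wins, ties are broken by
// the order of the regular expressions in rs (highest precedence first);
// returns the index of the winning regular expression and the match length
def chop(rs: List[Rexp], s: String): Option[(Int, Int)] = {
  val ms = rs.map(r => longest(r, s.toList, 0, None))
  ms.flatten match {
    case Nil => None
    case ls  => val len = ls.max; Some((ms.indexOf(Some(len)), len))
  }
}

// hypothetical concrete regular expressions for the examples in the text
def str(s: String): Rexp = s.foldRight(ONE: Rexp)((c, r) => SEQ(CHAR(c), r))
def anyOf(cs: String): Rexp = cs.toList.map(c => CHAR(c): Rexp).reduce(ALT(_, _))
val LETTER = anyOf("abcdefghijklmnopqrstuvwxyz")
val DIGIT = anyOf("0123456789")
val KEYWORD = ALT(str("if"), str("then"))
val IDENT = SEQ(LETTER, STAR(ALT(LETTER, DIGIT)))
val WHITESPACE = SEQ(CHAR(' '), STAR(CHAR(' ')))
val rs = List(KEYWORD, IDENT, WHITESPACE)
```

On the input \texttt{if2}\ldots{} this sketch reports an identifier of length~3, exactly as in the derivative tables above; on \texttt{then}\ldots{} both \textit{KEYWORD} and \textit{IDENT} match four characters and the tie is resolved in favour of the keyword, matching the precedence rule stated earlier.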
+
+
 
 \end{document}