afl-material: comparison handouts/ho05.tex

equal deleted inserted replaced

-:8134c3b981e0
+:bfcf838dda4d
 \begin{center}
 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
 \end{center}
 \noindent
-we expect it will be split up as follows
+we expect it is split up as follows
 \begin{center}
 \tt
 \Grid{if}\,
 \Grid{\VS}\,
 \Grid{+}\,
 \Grid{3}
 \end{center}
 \noindent
-This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
+This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
 It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
-be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
+be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
-this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any
+in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are
-whitespace. Another reason for recognising whitespaces explicitly is
+not separated by any whitespace. Another reason for recognising whitespaces explicitly is
-that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments.
+that in some languages, for example Python, whitespaces matters, that is carry meaning. However in
+our small language we will eventually just filter out all whitespaces and also all comments.
-Lexing will not just separate a string into its components, but also classify the components, that
-is explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
+Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
-For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
+For the moment, though, we will only focus on the simpler problem of just splitting up
+a string into components.
 There are a few subtleties  we need to consider first. For example, say the string is
 \begin{center}
-\texttt{\Grid{iffoo\VS\ldots}}
+\texttt{\Grid{iffoo\VS}}\;\ldots
 \end{center}
 \noindent
 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
 single identifier. The choice that is often made in lexers is to look for the longest possible match.
 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
-Unfortunately, the convention about the longest match does not yet make the whole process
+Unfortunately, the convention about the longest match does not yet make the process
 of lexing completely deterministic. Consider the string
 \begin{center}
-\texttt{\Grid{then\VS\dots}}
+\texttt{\Grid{then\VS}}\;\dots
 \end{center}
 \noindent
 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our
 regular expressions. In our running example we just use the ranking
 \noindent
 So even if both regular expressions match in the example above,
 we give preference to the regular expression for keywords.
-Let us see how our algorithm for lexing works in detail. The regular
+Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
-expressions and their ranking are shown above. For our algorithm it will
+and $der$, it will  use the function \emph{zeroable} defined as follows:
-be helpful to have a look at the function \emph{zeroable}
-defined as follows:
 \begin{center}
 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
 $zeroable(\varnothing)$      & $\dn$ & $true$\\
 $zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
 $zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
 \end{tabular}
 \end{center}
 \noindent
-In contrast to the function $nullable(r)$, which test whether a regular expression
+Recall that the function $nullable(r)$ tests whether a regular expression
-can match the empty string, the $zeroable$ function identifies whether a regular
+can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
 expression cannot match anything at all. The mathematical way of stating this
 property is
 \begin{center}
 $zeroable(r)$ if and only if $L(r) = \varnothing$
 \end{center}
 \noindent
-Let us fix a set of regular expressions $rs$.
+For what follows let us fix a set of regular expressions $rs$ as being
-The crucial idea of the algorithm working with $rs$ and the string, say
+\begin{center}
-\begin{center}
+\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
-\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
+\end{center}
-\end{center}
+\noindent
-\noindent
+specifying the ``words''
-is to build the derivative of all regular expressions in $rs$ with respect
+of our programming language. The algorithm takes as input the $rs$ and a string, say
+\begin{center}
+\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
+\end{center}
+\noindent
+and tries to chop off one word from the beginning of the string. If none of the
+regular expression in $rs$ matches, we will just return
+the empty string.
+The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
 to the first character $c_1$. Then we take the results and continue with
-building the derivatives with respect to $c_2$
+building the derivatives with respect to $c_2$ until we have either exhausted our
-until we have either exhausted our input string or all of the regular expressions
+input string or all of the regular expressions are ``zeroable''.  If the input string is
-are ``zeroable''.
+\begin{center}
+\texttt{\Grid{if2\VS}}\;\dots
+\end{center}
+\noindent
+then building the derivatives with respect to \texttt{i} gives
+\begin{center}
+\begin{tabular}{l|c}
+& $zeroable$\\\hline
+$der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
+$der\;\texttt{i}\;(\textit{IDENT})$              & no\\
+$der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
+\end{tabular}
+\end{center}
+\noindent
+We can eliminate then \textit{WHITESPACE} as a potential candidate, because
+no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
+two regular expressions as potential candidate and we have to consider the
+next character from the input string
+\begin{center}
+\begin{tabular}{l|c}
+& $zeroable$\\\hline
+$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
+$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
+\end{tabular}
+\end{center}
+\noindent
+Since both are `no', we have to go on with \texttt{2} from the input string
+\begin{center}
+\begin{tabular}{l|c}
+& $zeroable$\\\hline
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
+\end{tabular}
+\end{center}
+\noindent
+Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
+know how much of the input string should be considered as an \textit{IDENT}. So we
+also have to go on and consider the next derivative.
+\begin{center}
+\begin{tabular}{l|c}
+& $zeroable$\\\hline
+$der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
+\end{tabular}
+\end{center}
+\noindent
+Since the answer is now `yes' we can stop: once all derivatives are
+zeroable, we know the regular expressions cannot match any more letters from
+the input. In this case we only have to go back to the derivative that is nullable.  In this case it
+is
+\begin{center}
+$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$
+\end{center}
+\noindent
+which means we recognised an identifier. In case there is a choice of more than one
+regular expressions that are nullable, then we choose the one with the highest precedence.
+You can try out this case
 \end{document}
 %%% Local Variables:
 %%% mode: latex

changeset 161	bfcf838dda4d
parent 160	8134c3b981e0
child 162	edcd84c7b491