afl-material: comparison handouts/ho05.tex

equal deleted inserted replaced

-:bfcf838dda4d
+:edcd84c7b491
 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are
 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
 that in some languages, for example Python, whitespaces matters, that is carry meaning. However in
 our small language we will eventually just filter out all whitespaces and also all comments.
-Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
+Lexing not just separates a string into its components, but also classifies the components, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
 For the moment, though, we will only focus on the simpler problem of just splitting up
 a string into components.
 There are a few subtleties  we need to consider first. For example, say the string is
 \noindent
 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
 single identifier. The choice that is often made in lexers is to look for the longest possible match.
-This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
+This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
 Unfortunately, the convention about the longest match does not yet make the process
 of lexing completely deterministic. Consider the string
 \begin{center}
 the empty string.
 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
 to the first character $c_1$. Then we take the results and continue with
 building the derivatives with respect to $c_2$ until we have either exhausted our
-input string or all of the regular expressions are ``zeroable''.  If the input string is
+input string or all of the regular expressions are ``zeroable''.  Suppose the input string is
 \begin{center}
 \texttt{\Grid{if2\VS}}\;\dots
 \end{center}
 $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
 \end{tabular}
 \end{center}
 \noindent
-We can eliminate then \textit{WHITESPACE} as a potential candidate, because
+We can eliminate \textit{WHITESPACE} as a potential candidate, because
 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
 two regular expressions as potential candidate and we have to consider the
-next character from the input string
+next character, \texttt{f}, from the input string
 \begin{center}
 \begin{tabular}{l|c}
 & $zeroable$\\\hline
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
 $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
 \end{tabular}
 \end{center}
 \noindent
-Since both are `no', we have to go on with \texttt{2} from the input string
+Since both are `no', we have to continue with \texttt{2} from the input string
 \begin{center}
 \begin{tabular}{l|c}
 & $zeroable$\\\hline
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
 \end{center}
 \noindent
 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
 know how much of the input string should be considered as an \textit{IDENT}. So we
-also have to go on and consider the next derivative.
+still have to continue and consider the next derivative.
 \begin{center}
 \begin{tabular}{l|c}
 & $zeroable$\\\hline
 $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
 \end{tabular}
 \end{center}
 \noindent
-Since the answer is now `yes' we can stop: once all derivatives are
+Since the answer is now `yes' also in this case, we can stop: once all derivatives are
 zeroable, we know the regular expressions cannot match any more letters from
 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
 is
 \begin{center}
 $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$
 \end{center}
 \noindent
-which means we recognised an identifier. In case there is a choice of more than one
+which means we recognised an identifier. In case where there is a choice of more than one
 regular expressions that are nullable, then we choose the one with the highest precedence.
-You can try out this case
+You can try out such a case with the input string
+\begin{center}
+\texttt{\Grid{then\VS}}\;\dots
+\end{center}
+\noindent
+which can both be recognised as a keyword, but also an identifier.
+While in the example above the last nullable derivative is the one directly before
+the derivative turns zeroable, this is not always the case. Imagine, identifiers can be
+letters, as permuted by the regular expression \textit{IDENT}, but must end with an
+undercore.
+\begin{center}
+\begin{tabular}{lcl}
+$\textit{NEWIDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^* \cdot \_$\\
+\end{tabular}
+\end{center}
+\noindent
+If we use \textit{NEWIDENT} with the input string
+\begin{center}
+\texttt{\Grid{iffoo\VS}}\;\ldots
+\end{center}
+\noindent
+then it will only become $zeroable$ after the $\VS$ has been analysed. Then we have to go back to the
+first \texttt{f} because only
+\begin{center}
+$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$
+\end{center}
+\noindent
+is nullable. As a result we recognise successfully the keyword \texttt{if} and the remaining
+string needs to be consumed by other regular expressions or lead to a lexing error.
 \end{document}

changeset 162	edcd84c7b491
parent 161	bfcf838dda4d
child 163	89d6d89d9844