--- a/handouts/ho05.tex Sat Oct 26 01:22:21 2013 +0100
+++ b/handouts/ho05.tex Sat Oct 26 10:02:29 2013 +0100
@@ -97,12 +97,12 @@
Whenever you want to design a new programming language or implement a compiler for an
existing language, the first task is to fix the basic ``words'' of the language. For example what are the
-keywords, or reserved words, of the language, what are permitted identifiers, numbers and so
+keywords, or reserved words, of the language, what are permitted identifiers, numbers, expressions and so
on. One convenient way to do this is, of
course, by using regular expressions.
In this course we want to take a closer look at the
-WHILE-language. This is a simple imperative programming language consisting of arithmetic
+WHILE programming language. This is a simple imperative language consisting of arithmetic
expressions, assignments, if-statements and loops. For example the Fibonacci program can be
written in this language as follows:
@@ -119,34 +119,34 @@
\noindent
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
-and strings (which we however ignore for the moment). We also need to specify what the ``white space''
+and strings (which we, however, ignore for the moment). We also need to specify what the ``whitespace''
is in our programming language and what comments should look like.
-As a first try, we specify the regular expressions for our language roughly as follows
+As a first try, we might specify the regular expressions for our language roughly as follows
\begin{center}
\begin{tabular}{lcl}
+$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
+$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\
-$\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$
+$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
\end{tabular}
\end{center}
-\noindent
-with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
-
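+\noindent
+For instance, according to these definitions the string \texttt{x2} is matched by the
+regular expression \textit{IDENT} (a \textit{LETTER} followed by a \textit{DIGIT}),
+but not by \textit{NUM}; conversely, \texttt{42} is matched by \textit{NUM}, but not
+by \textit{IDENT}.
+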
Having the regular expressions in place, the problem we have to solve is:
given a string of our programming language, which regular expression
-matches which part of the string. In this way we can split up a string into components.
-For example we expect that the input string
+matches which part of the string. By solving this problem, we can split up a string
+of our language into components.
+For example given the input string
\begin{center}
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
\end{center}
\noindent
-is split up as follows
+we expect it will be split up as follows
\begin{center}
\tt
@@ -170,35 +170,36 @@
\noindent
This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
-be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
-the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
-that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments.
+be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
+this is not always the case. As can be seen, the three components in \texttt{x+2} are not separated by any
+whitespace. Another reason for recognising whitespaces explicitly is
+that in some languages, for example Python, whitespace matters. However, in our small language we will eventually just filter out all whitespaces and also all comments.
-Lexing will not just separate the input into its components, but also classify the components, that
-is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
-For the moment, though, we will only focus on the simpler problem of splitting a string into components.
+Lexing will not just separate a string into its components, but also classify the components, that
+is, explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
+For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
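+
+To give an idea of what this classification could look like in code, one could use a
+datatype for tokens roughly along the following lines (written in Scala; the constructor
+names such as \texttt{T\_KEYWORD} are merely illustrative and not fixed by anything in
+this handout):
+
+{\small
+\begin{verbatim}
+abstract class Token
+case class T_KEYWORD(s: String) extends Token
+case class T_IDENT(s: String)   extends Token
+case class T_OP(s: String)      extends Token
+case class T_NUM(n: Int)        extends Token
+case object T_WHITESPACE        extends Token
+\end{verbatim}}
+
+\noindent
+A lexer would then, for example, translate the beginning of the string above into the
+token list \texttt{List(T\_KEYWORD("if"), T\_WHITESPACE, T\_IDENT("true"), \ldots)}.
+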
-There are a few subtleties we need to consider first. For example, say the input string is
+There are a few subtleties we need to consider first. For example, say the string is
\begin{center}
\texttt{\Grid{iffoo\VS\ldots}}
\end{center}
\noindent
-then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed
+then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers is to look for the longest possible match.
This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
-Unfortuantely, the convention of the longs match does not yet make the whole process
-completely deterministic. Consider the input string
+Unfortunately, the convention of the longest match does not yet make the whole process
+of lexing completely deterministic. Consider the string
\begin{center}
\texttt{\Grid{then\VS\dots}}
\end{center}
\noindent
-Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To overcome this ambiguity we need to rank our
+Clearly, this string should be identified as a keyword. The problem is that the regular expression \textit{IDENT} for identifiers also matches this string. To overcome this ambiguity we need to rank our
regular expressions. In our running example we just use the ranking
\[
@@ -245,8 +246,9 @@
\noindent
and build the derivative of all regular expressions in $rs$ with respect
-to the first character. Then we take the result and continue with $c_2$
-until we have either exhausted our input string or all of the regula
+to the first character $c_1$. Then we take the result and continue with $c_2$
+until either we have exhausted our input string or all of the regular expressions
+have become ``zeroable''.
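+
+\noindent
+To make this process more concrete, the following is a small sketch in Scala of the
+loop just described. The concrete datatype for regular expressions and the definitions
+of \texttt{der}, \texttt{nullable} and \texttt{zeroable} given below are only meant as
+an illustration; they are not fixed by anything said above.
+
+{\small
+\begin{verbatim}
+abstract class Rexp
+case object ZERO extends Rexp                    // matches nothing
+case object ONE extends Rexp                     // matches the empty string
+case class CHAR(c: Char) extends Rexp
+case class ALT(r1: Rexp, r2: Rexp) extends Rexp  // r1 + r2
+case class SEQ(r1: Rexp, r2: Rexp) extends Rexp  // r1 . r2
+case class STAR(r: Rexp) extends Rexp            // r*
+
+// can the regular expression r match the empty string?
+def nullable(r: Rexp): Boolean = r match {
+  case ZERO => false
+  case ONE => true
+  case CHAR(_) => false
+  case ALT(r1, r2) => nullable(r1) || nullable(r2)
+  case SEQ(r1, r2) => nullable(r1) && nullable(r2)
+  case STAR(_) => true
+}
+
+// the derivative of r with respect to the character c
+def der(c: Char, r: Rexp): Rexp = r match {
+  case ZERO => ZERO
+  case ONE => ZERO
+  case CHAR(d) => if (c == d) ONE else ZERO
+  case ALT(r1, r2) => ALT(der(c, r1), der(c, r2))
+  case SEQ(r1, r2) =>
+    if (nullable(r1)) ALT(SEQ(der(c, r1), r2), der(c, r2))
+    else SEQ(der(c, r1), r2)
+  case STAR(r1) => SEQ(der(c, r1), STAR(r1))
+}
+
+// can r match no string at all?
+def zeroable(r: Rexp): Boolean = r match {
+  case ZERO => true
+  case ONE => false
+  case CHAR(_) => false
+  case ALT(r1, r2) => zeroable(r1) && zeroable(r2)
+  case SEQ(r1, r2) => zeroable(r1) || zeroable(r2)
+  case STAR(_) => false
+}
+
+// repeatedly take derivatives of all regular expressions in rs with
+// respect to the characters of the input, stopping as soon as the
+// input is exhausted or all regular expressions have become zeroable
+def lex_step(rs: List[Rexp], cs: List[Char]): List[Rexp] = cs match {
+  case Nil => rs
+  case c :: rest =>
+    val ds = rs.map(der(c, _))
+    if (ds.forall(zeroable)) ds else lex_step(ds, rest)
+}
+\end{verbatim}}
+
+\noindent
+One way to then determine the longest match is to additionally record the last
+position in the input at which at least one of the derivatives is nullable.
+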
\end{document}