diff -r 51d6b8b828c4 -r 9b2d128765e1 handouts/ho05.tex --- a/handouts/ho05.tex Fri Oct 25 17:55:35 2013 +0100 +++ b/handouts/ho05.tex Sat Oct 26 00:39:27 2013 +0100 @@ -95,14 +95,16 @@ \section*{Handout 5} -Whenever you want to design a programming language or implement a compiler for an -existing language, the first task is to fix the basic ``words'' of the language, like what are the -keywords or reserved words of the language, what are permitted identifiers, numbers and so +Whenever you want to design a new programming language or implement a compiler for an +existing language, the first task is to fix the basic ``words'' of the language. For example what are the +keywords, or reserved words, of the language, what are permitted identifiers, numbers and so on. One convenient way to do this is, of -course, to use regular expressions. In this course we want to take a closer look at the -WHILE-language. This is a simple imperative language consisting of arithmetic -expressions, assignments and loops only. For example the Fibonacci program can be -written in this language as follows +course, by using regular expressions. + +In this course we want to take a closer look at the +WHILE-language. This is a simple imperative programming language consisting of arithmetic +expressions, assignments, if-statements and loops. For example the Fibonacci program can be +written in this language as follows: \begin{center} \mbox{\lstinputlisting[language=while]{../progs/fib.while}} @@ -116,10 +118,10 @@ \end{center} \noindent -In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers +In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers and strings (which we however ignore for the moment). We also need to specify what the ``white space'' -is in our programming language as well as what comments should look like. -We might specify the regular expressions for our language roughly as follows +is in our programming language and what comments should look like. +As a first try, we specify the regular expressions for our language roughly as follows \begin{center} \begin{tabular}{lcl} @@ -134,56 +136,71 @@ \noindent with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$. -The problem we have to solve is given a string of our programming language, which regular expression -matches which part of the string. For example the input string +Having the regular expressions in place, the problem we have to solve is: +given a string of our programming language, which regular expression +matches which part of the string. In this way we can split up a string into components. +For example we expect that the input string \begin{center} \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} \end{center} \noindent -needs to be recognised as +is split up as follows \begin{center} \tt -\Grid{if} -\Grid{\VS} -\Grid{true} -\Grid{\VS} -\Grid{then} -\Grid{\VS} -\Grid{x} -\Grid{+} -\Grid{2} -\Grid{\VS} -\Grid{else} -\Grid{\VS} -\Grid{x} -\Grid{+} +\Grid{if}\, +\Grid{\VS}\, +\Grid{true}\, +\Grid{\VS}\, +\Grid{then}\, +\Grid{\VS}\, +\Grid{x}\, +\Grid{+}\, +\Grid{2}\, +\Grid{\VS}\, +\Grid{else}\, +\Grid{\VS}\, +\Grid{x}\, +\Grid{+}\, \Grid{3} \end{center} \noindent -Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}. +This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. It is usually the first phase of a compiler. Note that the separation into words cannot, in general, be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is -that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments. +that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments. Lexing will not just separate the input into its components, but also classify the components, that is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. -But for the moment we will only focus on the simpler problem of separating a string into components. -There are a few subtleties we need to consider first. For example if the input string is +For the moment, though, we will only focus on the simpler problem of splitting a string into components. + +There are a few subtleties we need to consider first. For example, say the input string is \begin{center} \texttt{\Grid{iffoo\VS\ldots}} \end{center} \noindent -then there are two possibilities: either we regard the input as the keyword \texttt{if} followed +then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a -single identifier. The choice that is often made in lexers to look for the longest possible match, -that is regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}). +single identifier. The choice that is often made in lexers is to look for the longest possible match. +This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). + +Unfortuantely, the convention of the longs match does not yet make the whole process +completely deterministic. Consider the input string + +\begin{center} +\texttt{\Grid{then\VS\dots}} +\end{center} + +\noindent +Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To solve this ambiguity we need to rank our regular expressions. + + \end{document}