95 |
95 |
96 \section*{Handout 5} |
96 \section*{Handout 5} |
97 |
97 |
98 Whenever you want to design a new programming language or implement a compiler for an |
98 Whenever you want to design a new programming language or implement a compiler for an |
99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the |
99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the |
100 keywords, or reserved words, of the language, what are permitted identifiers, numbers and so |
100 keywords, or reserved words, of the language, what are permitted identifiers, numbers, expressions and so |
101 on. One convenient way to do this is, of |
101 on. One convenient way to do this is, of |
102 course, by using regular expressions. |
102 course, by using regular expressions. |
103 |
103 |
104 In this course we want to take a closer look at the |
104 In this course we want to take a closer look at the |
105 WHILE-language. This is a simple imperative programming language consisting of arithmetic |
105 WHILE programming language. This is a simple imperative programming language consisting of arithmetic |
106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be |
106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be |
107 written in this language as follows: |
107 written in this language as follows: |
108 |
108 |
109 \begin{center} |
109 \begin{center} |
110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}} |
110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}} |
117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} |
117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} |
118 \end{center} |
118 \end{center} |
119 |
119 |
120 \noindent |
120 \noindent |
121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers |
121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers |
122 and strings (which we however ignore for the moment). We also need to specify what the ``white space'' |
122 and strings (which we however ignore for the moment). We also need to specify what the ``whitespace'' |
123 is in our programming language and what comments should look like. |
123 is in our programming language and what comments should look like. |
124 As a first try, we specify the regular expressions for our language roughly as follows |
124 As a first try, we might specify the regular expressions for our language roughly as follows |
125 |
125 |
126 \begin{center} |
126 \begin{center} |
127 \begin{tabular}{lcl} |
127 \begin{tabular}{lcl} |
|
128 $\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\ |
|
129 $\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\ |
128 $\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ |
130 $\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ |
129 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ |
131 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ |
130 $\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\ |
132 $\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\ |
131 $\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\ |
133 $\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\ |
132 $\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$ |
134 $\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$ |
133 \end{tabular} |
135 \end{tabular} |
134 \end{center} |
136 \end{center} |
135 |
|
136 \noindent |
|
137 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$. |
|
138 |
137 |
139 Having the regular expressions in place, the problem we have to solve is: |
138 Having the regular expressions in place, the problem we have to solve is: |
140 given a string of our programming language, which regular expression |
139 given a string of our programming language, which regular expression |
141 matches which part of the string. In this way we can split up a string into components. |
140 matches which part of the string. By solving this problem, we can split up a string |
142 For example we expect that the input string |
141 of our language into components. |
|
142 For example given the input string |
143 |
143 |
144 \begin{center} |
144 \begin{center} |
145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} |
145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} |
146 \end{center} |
146 \end{center} |
147 |
147 |
148 \noindent |
148 \noindent |
149 is split up as follows |
149 we expect it will be split up as follows |
150 |
150 |
151 \begin{center} |
151 \begin{center} |
152 \tt |
152 \tt |
153 \Grid{if}\, |
153 \Grid{if}\, |
154 \Grid{\VS}\, |
154 \Grid{\VS}\, |
168 \end{center} |
168 \end{center} |
169 |
169 |
170 \noindent |
170 \noindent |
171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. |
171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. |
172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
173 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, |
173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, |
174 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is |
174 this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any |
175 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments. |
175 whitespace. Another reason for recognising whitespaces explicitly is |
176 |
176 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments. |
177 Lexing will not just separate the input into its components, but also classify the components, that |
177 |
178 is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
178 Lexing will not just separate a string into its components, but also classify the components, that |
179 For the moment, though, we will only focus on the simpler problem of splitting a string into components. |
179 is explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
180 |
180 For the moment, though, we will only focus on the simpler problem of just splitting a string into components. |
181 There are a few subtleties we need to consider first. For example, say the input string is |
181 |
|
182 There are a few subtleties we need to consider first. For example, say the string is |
182 |
183 |
183 \begin{center} |
184 \begin{center} |
184 \texttt{\Grid{iffoo\VS\ldots}} |
185 \texttt{\Grid{iffoo\VS\ldots}} |
185 \end{center} |
186 \end{center} |
186 |
187 |
187 \noindent |
188 \noindent |
188 then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed |
189 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed |
189 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
190 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
190 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
191 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
191 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
192 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
192 |
193 |
193 Unfortuantely, the convention of the longs match does not yet make the whole process |
194 Unfortunately, the convention about the longest match does not yet make the whole process |
194 completely deterministic. Consider the input string |
195 of lexing completely deterministic. Consider the string |
195 |
196 |
196 \begin{center} |
197 \begin{center} |
197 \texttt{\Grid{then\VS\dots}} |
198 \texttt{\Grid{then\VS\dots}} |
198 \end{center} |
199 \end{center} |
199 |
200 |
200 \noindent |
201 \noindent |
201 Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To overcome this ambiguity we need to rank our |
202 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our |
202 regular expressions. In our running example we just use the ranking |
203 regular expressions. In our running example we just use the ranking |
203 |
204 |
204 \[ |
205 \[ |
205 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots |
206 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots |
206 \] |
207 \] |