93 |
93 |
94 \begin{document} |
94 \begin{document} |
95 |
95 |
96 \section*{Handout 5} |
96 \section*{Handout 5} |
97 |
97 |
98 Whenever you want to design a programming language or implement a compiler for an |
98 Whenever you want to design a new programming language or implement a compiler for an |
99 existing language, the first task is to fix the basic ``words'' of the language, like what are the |
99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the |
100 keywords or reserved words of the language, what are permitted identifiers, numbers and so |
100 keywords, or reserved words, of the language, what are permitted identifiers, numbers and so |
101 on. One convenient way to do this is, of |
101 on. One convenient way to do this is, of |
102 course, to use regular expressions. In this course we want to take a closer look at the |
102 course, by using regular expressions. |
103 WHILE-language. This is a simple imperative language consisting of arithmetic |
103 |
104 expressions, assignments and loops only. For example the Fibonacci program can be |
104 In this course we want to take a closer look at the |
105 written in this language as follows |
105 WHILE-language. This is a simple imperative programming language consisting of arithmetic |
|
106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be |
|
107 written in this language as follows: |
106 |
108 |
107 \begin{center} |
109 \begin{center} |
108 \mbox{\lstinputlisting[language=while]{../progs/fib.while}} |
110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}} |
109 \end{center} |
111 \end{center} |
110 |
112 |
114 \begin{center} |
116 \begin{center} |
115 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} |
117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read} |
116 \end{center} |
118 \end{center} |
117 |
119 |
118 \noindent |
120 \noindent |
119 In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers |
121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers |
120 and strings (which we however ignore for the moment). We also need to specify what the ``white space'' |
122 and strings (which we however ignore for the moment). We also need to specify what the ``white space'' |
121 is in our programming language as well as what comments should look like. |
123 is in our programming language and what comments should look like. |
122 We might specify the regular expressions for our language roughly as follows |
124 As a first try, we specify the regular expressions for our language roughly as follows |
123 |
125 |
124 \begin{center} |
126 \begin{center} |
125 \begin{tabular}{lcl} |
127 \begin{tabular}{lcl} |
126 $\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ |
128 $\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\ |
127 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ |
129 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ |
132 \end{center} |
134 \end{center} |
133 |
135 |
134 \noindent |
136 \noindent |
135 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$. |
137 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$. |
136 |
138 |
137 The problem we have to solve is given a string of our programming language, which regular expression |
139 Having the regular expressions in place, the problem we have to solve is: |
138 matches which part of the string. For example the input string |
140 given a string of our programming language, which regular expression |
|
141 matches which part of the string. In this way we can split up a string into components. |
|
142 For example we expect that the input string |
139 |
143 |
140 \begin{center} |
144 \begin{center} |
141 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} |
145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}} |
142 \end{center} |
146 \end{center} |
143 |
147 |
144 \noindent |
148 \noindent |
145 needs to be recognised as |
149 is split up as follows |
146 |
150 |
147 \begin{center} |
151 \begin{center} |
148 \tt |
152 \tt |
149 \Grid{if} |
153 \Grid{if}\, |
150 \Grid{\VS} |
154 \Grid{\VS}\, |
151 \Grid{true} |
155 \Grid{true}\, |
152 \Grid{\VS} |
156 \Grid{\VS}\, |
153 \Grid{then} |
157 \Grid{then}\, |
154 \Grid{\VS} |
158 \Grid{\VS}\, |
155 \Grid{x} |
159 \Grid{x}\, |
156 \Grid{+} |
160 \Grid{+}\, |
157 \Grid{2} |
161 \Grid{2}\, |
158 \Grid{\VS} |
162 \Grid{\VS}\, |
159 \Grid{else} |
163 \Grid{else}\, |
160 \Grid{\VS} |
164 \Grid{\VS}\, |
161 \Grid{x} |
165 \Grid{x}\, |
162 \Grid{+} |
166 \Grid{+}\, |
163 \Grid{3} |
167 \Grid{3} |
164 \end{center} |
168 \end{center} |
165 |
169 |
166 \noindent |
170 \noindent |
167 Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}. |
171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}. |
168 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, |
169 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, |
173 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace, |
170 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is |
174 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is |
171 that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments. |
175 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments. |
172 |
176 |
173 Lexing will not just separate the input into its components, but also classify the components, that |
177 Lexing will not just separate the input into its components, but also classify the components, that |
174 is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
178 is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on. |
175 But for the moment we will only focus on the simpler problem of separating a string into components. |
179 For the moment, though, we will only focus on the simpler problem of splitting a string into components. |
176 There are a few subtleties we need to consider first. For example if the input string is |
180 |
|
181 There are a few subtleties we need to consider first. For example, say the input string is |
177 |
182 |
178 \begin{center} |
183 \begin{center} |
179 \texttt{\Grid{iffoo\VS\ldots}} |
184 \texttt{\Grid{iffoo\VS\ldots}} |
180 \end{center} |
185 \end{center} |
181 |
186 |
182 \noindent |
187 \noindent |
183 then there are two possibilities: either we regard the input as the keyword \texttt{if} followed |
188 then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed |
184 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
189 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a |
185 single identifier. The choice that is often made in lexers to look for the longest possible match, |
190 single identifier. The choice that is often made in lexers is to look for the longest possible match. |
186 that is regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}). |
191 This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}). |
|
192 |
|
193 Unfortuantely, the convention of the longs match does not yet make the whole process |
|
194 completely deterministic. Consider the input string |
|
195 |
|
196 \begin{center} |
|
197 \texttt{\Grid{then\VS\dots}} |
|
198 \end{center} |
|
199 |
|
200 \noindent |
|
201 Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To solve this ambiguity we need to rank our regular expressions. |
|
202 |
|
203 |
187 |
204 |
188 \end{document} |
205 \end{document} |
189 |
206 |
190 %%% Local Variables: |
207 %%% Local Variables: |
191 %%% mode: latex |
208 %%% mode: latex |