handouts/ho05.tex
changeset 155 9b2d128765e1
parent 154 51d6b8b828c4
child 156 95eaee695636
equal deleted inserted replaced
154:51d6b8b828c4 155:9b2d128765e1
    93 	
    93 	
    94 \begin{document}
    94 \begin{document}
    95 
    95 
    96 \section*{Handout 5}
    96 \section*{Handout 5}
    97 
    97 
    98 Whenever you want to design a programming language or implement a compiler for an
    98 Whenever you want to design a new programming language or implement a compiler for an
    99 existing language, the first task is to fix the basic ``words'' of the language, like what are the 
    99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the 
   100 keywords or reserved words of the language, what are permitted identifiers, numbers and so 
   100 keywords, or reserved words, of the language, what are permitted identifiers, numbers and so 
   101 on. One convenient way to do this is, of 
   101 on. One convenient way to do this is, of 
   102 course, to use regular expressions. In this course we want to take a closer look at the 
   102 course, by using regular expressions. 
   103 WHILE-language. This is a simple imperative language consisting of arithmetic
   103 
   104 expressions, assignments and loops only. For example the Fibonacci program can be
   104 In this course we want to take a closer look at the 
   105 written in this language as follows
   105 WHILE-language. This is a simple imperative programming language consisting of arithmetic
       
   106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be
       
   107 written in this language as follows:
   106 
   108 
   107 \begin{center}
   109 \begin{center}
   108 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
   110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
   109 \end{center}
   111 \end{center}
   110 
   112 
   114 \begin{center}
   116 \begin{center}
   115 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
   117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
   116 \end{center}
   118 \end{center}
   117 
   119 
   118 \noindent
   120 \noindent
   119 In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers
   121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
   120 and strings (which we however ignore for the moment). We also need to specify what the ``white space''
   122 and strings (which we however ignore for the moment). We also need to specify what the ``white space''
   121 is in our programming language as well as what comments should look like.
   123 is in our programming language and what comments should look like.
   122 We might specify the regular expressions for our language roughly as follows
   124 As a first try, we specify the regular expressions for our language roughly as follows
   123 
   125 
   124 \begin{center}
   126 \begin{center}
   125 \begin{tabular}{lcl}
   127 \begin{tabular}{lcl}
   126 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
   128 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
   127 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
   129 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
   132 \end{center}
   134 \end{center}
   133 
   135 
   134 \noindent
   136 \noindent
   135 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
   137 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
   136 
   138 
   137 The problem we have to solve is given a string of our programming language, which regular expression 
   139 Having the regular expressions in place, the problem we have to solve is: 
   138 matches which part of the string. For example the input string
   140 given a string of our programming language, which regular expression 
       
   141 matches which part of the string. In this way we can split up a string into components. 
       
   142 For example we expect that the input string
   139 
   143 
   140 \begin{center}
   144 \begin{center}
   141 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   142 \end{center}
   146 \end{center}
   143 
   147 
   144 \noindent
   148 \noindent
   145 needs to be recognised as
   149 is split up as follows
   146 
   150 
   147 \begin{center}
   151 \begin{center}
   148 \tt
   152 \tt
   149 \Grid{if} 
   153 \Grid{if}\,
   150 \Grid{\VS} 
   154 \Grid{\VS}\, 
   151 \Grid{true} 
   155 \Grid{true}\, 
   152 \Grid{\VS} 
   156 \Grid{\VS}\, 
   153 \Grid{then} 
   157 \Grid{then}\, 
   154 \Grid{\VS} 
   158 \Grid{\VS}\, 
   155 \Grid{x} 
   159 \Grid{x}\, 
   156 \Grid{+} 
   160 \Grid{+}\, 
   157 \Grid{2}
   161 \Grid{2}\,
   158 \Grid{\VS} 
   162 \Grid{\VS}\, 
   159 \Grid{else}
   163 \Grid{else}\,
   160 \Grid{\VS}
   164 \Grid{\VS}\,
   161 \Grid{x}
   165 \Grid{x}\,
   162 \Grid{+}
   166 \Grid{+}\,
   163 \Grid{3}
   167 \Grid{3}
   164 \end{center}
   168 \end{center}
   165 
   169 
   166 \noindent
   170 \noindent
   167 Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
   171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
   168 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   169 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   173 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   170 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
   174 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
   171 that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.
   175 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments.
   172 
   176 
   173 Lexing will not just separate the input into its components, but also classify the components, that
   177 Lexing will not just separate the input into its components, but also classify the components, that
   174 is explicitly record that \texttt{it} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   178 is explicitly record that \texttt{it} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   175 But for the moment we will only focus on the simpler problem of separating a string into components.
   179 For the moment, though, we will only focus on the simpler problem of splitting a string into components.
   176 There are a few subtleties  we need to consider first. For example if the input string is
   180 
       
   181 There are a few subtleties  we need to consider first. For example, say the input string is
   177 
   182 
   178 \begin{center}
   183 \begin{center}
   179 \texttt{\Grid{iffoo\VS\ldots}}
   184 \texttt{\Grid{iffoo\VS\ldots}}
   180 \end{center}
   185 \end{center}
   181 
   186 
   182 \noindent
   187 \noindent
   183 then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
   188 then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed
   184 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   189 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   185 single identifier. The choice that is often made in lexers to look for the longest possible match,
   190 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   186 that is regard the input as a single identifier  \texttt{iffoo} (since it is longer than \texttt{if}).
   191 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
       
   192 
       
   193 Unfortuantely, the convention of the longs match does not yet make the whole process 
       
   194 completely deterministic. Consider the input string
       
   195 
       
   196 \begin{center}
       
   197 \texttt{\Grid{then\VS\dots}}
       
   198 \end{center}
       
   199 
       
   200 \noindent
       
   201 Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To solve this ambiguity we need to rank our regular expressions.
       
   202 
       
   203 
   187 
   204 
   188 \end{document}
   205 \end{document}
   189 
   206 
   190 %%% Local Variables: 
   207 %%% Local Variables: 
   191 %%% mode: latex
   208 %%% mode: latex