handouts/ho05.tex
changeset 159 ae5ceef5355e
parent 158 77e8397ec2ec
child 160 8134c3b981e0
equal deleted inserted replaced
158:77e8397ec2ec 159:ae5ceef5355e
    95 
    95 
    96 \section*{Handout 5}
    96 \section*{Handout 5}
    97 
    97 
    98 Whenever you want to design a new programming language or implement a compiler for an
    98 Whenever you want to design a new programming language or implement a compiler for an
    99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the 
    99 existing language, the first task is to fix the basic ``words'' of the language. For example what are the 
   100 keywords, or reserved words, of the language, what are permitted identifiers, numbers and so 
   100 keywords, or reserved words, of the language, what are permitted identifiers, numbers, expressions and so 
   101 on. One convenient way to do this is, of 
   101 on. One convenient way to do this is, of 
   102 course, by using regular expressions. 
   102 course, by using regular expressions. 
   103 
   103 
   104 In this course we want to take a closer look at the 
   104 In this course we want to take a closer look at the 
   105 WHILE-language. This is a simple imperative programming language consisting of arithmetic
   105 WHILE programming language. This is a simple imperative programming language consisting of arithmetic
   106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be
   106 expressions, assignments, if-statements and loops. For example the Fibonacci program can be
   107 written in this language as follows:
   107 written in this language as follows:
   108 
   108 
   109 \begin{center}
   109 \begin{center}
   110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
   110 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
   117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
   117 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
   118 \end{center}
   118 \end{center}
   119 
   119 
   120 \noindent
   120 \noindent
   121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
   121 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
   122 and strings (which we however ignore for the moment). We also need to specify what the ``white space''
   122 and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
   123 is in our programming language and what comments should look like.
   123 is in our programming language and what comments should look like.
   124 As a first try, we specify the regular expressions for our language roughly as follows
   124 As a first try, we might specify the regular expressions for our language roughly as follows
   125 
   125 
   126 \begin{center}
   126 \begin{center}
   127 \begin{tabular}{lcl}
   127 \begin{tabular}{lcl}
       
   128 $\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
       
   129 $\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
   128 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
   130 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
   129 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
   131 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
   130 $\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
   132 $\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
   131 $\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
   133 $\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
   132 $\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$
   134 $\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
   133 \end{tabular}
   135 \end{tabular}
   134 \end{center}
   136 \end{center}
   135 
       
   136 \noindent
       
   137 with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
       
   138 
   137 
   139 Having the regular expressions in place, the problem we have to solve is: 
   138 Having the regular expressions in place, the problem we have to solve is: 
   140 given a string of our programming language, which regular expression 
   139 given a string of our programming language, which regular expression 
   141 matches which part of the string. In this way we can split up a string into components. 
   140 matches which part of the string. By solving this problem, we can split up a string 
   142 For example we expect that the input string
   141 of our language into components. 
       
   142 For example given the input string
   143 
   143 
   144 \begin{center}
   144 \begin{center}
   145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   145 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
   146 \end{center}
   146 \end{center}
   147 
   147 
   148 \noindent
   148 \noindent
   149 is split up as follows
   149 we expect it will be split up as follows
   150 
   150 
   151 \begin{center}
   151 \begin{center}
   152 \tt
   152 \tt
   153 \Grid{if}\,
   153 \Grid{if}\,
   154 \Grid{\VS}\, 
   154 \Grid{\VS}\, 
   168 \end{center}
   168 \end{center}
   169 
   169 
   170 \noindent
   170 \noindent
   171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
   171 This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
   172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   172 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   173 be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   173 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
   174 the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
   174 this is not always the case. As can be seen the three components in \texttt{x+2} are not separated by any
   175 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments.
   175 whitespace. Another reason for recognising whitespaces explicitly is
   176 
   176 that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also all comments.
   177 Lexing will not just separate the input into its components, but also classify the components, that
   177 
   178 is explicitly record that \texttt{it} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   178 Lexing will not just separate a string into its components, but also classify the components, that
   179 For the moment, though, we will only focus on the simpler problem of splitting a string into components.
   179 is explicitly record that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   180 
   180 For the moment, though, we will only focus on the simpler problem of just splitting a string into components.
   181 There are a few subtleties  we need to consider first. For example, say the input string is
   181 
       
   182 There are a few subtleties  we need to consider first. For example, say the string is
   182 
   183 
   183 \begin{center}
   184 \begin{center}
   184 \texttt{\Grid{iffoo\VS\ldots}}
   185 \texttt{\Grid{iffoo\VS\ldots}}
   185 \end{center}
   186 \end{center}
   186 
   187 
   187 \noindent
   188 \noindent
   188 then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed
   189 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   189 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   190 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   190 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   191 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   191 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
   192 This leaves  \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
   192 
   193 
   193 Unfortuantely, the convention of the longs match does not yet make the whole process 
   194 Unfortunately, the convention about the longest match does not yet make the whole process 
   194 completely deterministic. Consider the input string
   195 of lexing completely deterministic. Consider the string
   195 
   196 
   196 \begin{center}
   197 \begin{center}
   197 \texttt{\Grid{then\VS\dots}}
   198 \texttt{\Grid{then\VS\dots}}
   198 \end{center}
   199 \end{center}
   199 
   200 
   200 \noindent
   201 \noindent
   201 Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To overcome this ambiguity we need to rank our 
   202 Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our 
   202 regular expressions. In our running example we just use the ranking
   203 regular expressions. In our running example we just use the ranking
   203 
   204 
   204 \[
   205 \[
   205 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
   206 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
   206 \]
   207 \]
   243 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
   244 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}\grid{\ldots}}
   244 \end{center}
   245 \end{center}
   245 
   246 
   246 \noindent
   247 \noindent
   247 and build the derivative of all regular expressions in $rs$ with respect
   248 and build the derivative of all regular expressions in $rs$ with respect
   248 to the first character. Then we take the result and continue with $c_2$
   249 to the first character $c_1$. Then we take the result and continue with $c_2$
   249 until we have either exhausted our input string or all of the regula
   250 until we have either exhausted our input string or all of the regular expressions
       
   251 are ``zeroable''.
   250 
   252 
   251 \end{document}
   253 \end{document}
   252 
   254 
   253 %%% Local Variables: 
   255 %%% Local Variables: 
   254 %%% mode: latex
   256 %%% mode: latex