handouts/ho06.tex
changeset 183 b17eff695c7f
parent 177 53def1fbf472
child 217 cd6066f1056a
equal deleted inserted replaced
182:9ce2414e470e 183:b17eff695c7f
    83 \newcommand\Grid[1]{%
    83 \newcommand\Grid[1]{%
    84   \@tfor\z:=#1\do{\grid{\z}}}
    84   \@tfor\z:=#1\do{\grid{\z}}}
    85 \makeatother	
    85 \makeatother	
    86 
    86 
    87 \newcommand\Vspace[1][.3em]{%
    87 \newcommand\Vspace[1][.3em]{%
    88   \mbox{\kern.06em\vrle height.3ex}%
    88   \mbox{\kern.06em\vrule height.3ex}%
    89   \vbox{\hrule width#1}%
    89   \vbox{\hrule width#1}%
    90   \hbox{\vrule height.3ex}}
    90   \hbox{\vrule height.3ex}}
    91 
    91 
    92 \def\VS{\Vspace[0.6em]}
    92 \def\VS{\Vspace[0.6em]}
    93 	
    93 	
    94 \begin{document}
    94 \begin{document}
    95 
    95 
    96 \section*{Handout 6}
    96 \section*{Handout 6}
    97 
    97 
    98 While regular expressions are very useful for lexing and for recognising
    98 While regular expressions are very useful for lexing and for recognising
    99 many patterns (like email addresses), they have their limitations. For
    99 many patterns in strings (like email addresses), they have their limitations. For
   100 example there is no regular expression that can recognise the language 
   100 example there is no regular expression that can recognise the language 
   101 $a^nb^n$. Another example is the language of well-parenthesised 
   101 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
   102 expressions.  In languages like Lisp, which use parentheses rather
   102 expressions.  In languages like Lisp, which use parentheses rather
   103 extensively, it might be of interest whether the following two expressions
   103 extensively, it might be of interest whether the following two expressions
   104 are well-parenthesised (the left one is, the right one is not):
   104 are well-parenthesised (the left one is, the right one is not):
   105 
   105 
   106 \begin{center}
   106 \begin{center}
   107 $(((()()))())$  \hspace{10mm} $(((()()))()))$
   107 $(((()()))())$  \hspace{10mm} $(((()()))()))$
   108 \end{center}
   108 \end{center}
   109 
   109 
       
   110 \noindent
       
   111 Not being able to solve such recognition problems is a serious limitation.
   110 In order to solve such recognition problems, we need more powerful 
   112 In order to solve such recognition problems, we need more powerful 
   111 techniques than regular expressions. We will in particular look at \emph{context-free
   113 techniques than regular expressions. We will in particular look at \emph{context-free
   112 languages}. They include the regular languages as the picture below shows:
   114 languages}. They include the regular languages as the picture below shows:
   113 
   115 
   114 
   116 
   132 \begin{center}
   134 \begin{center}
   133 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
   135 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
   134 \end{center}
   136 \end{center}
   135  
   137  
   136 \noindent
   138 \noindent
   137 In general grammars consist of finitely many rules built up from terminal symbols (usually lower-case letters)
   139 In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
   138 and non-terminal symbols (upper-case letters).  Rules have the shape
   140 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
       
   141 have the shape
   139 
   142 
   140 \begin{center}
   143 \begin{center}
   141 $NT \;\;\rightarrow\;\; \textit{rhs}$
   144 $NT \;\;\rightarrow\;\; \textit{rhs}$
   142 \end{center}
   145 \end{center}
   143  
   146  
   234 \begin{center}
   237 \begin{center}
   235 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
   238 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
   236 \end{center}
   239 \end{center}
   237 
   240 
   238 \noindent
   241 \noindent
   239 It can be easily seems that the grammar above for arithmetic expressions is left-recursive:
   242 It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
   240 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
   243 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
   241 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
   244 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
   242 grammars. Fortunately every left-recursive grammar can be transformed into one that is
   245 grammars. Fortunately every left-recursive grammar can be transformed into one that is
   243 not left-recursive, although this transformation might make the grammar less human-readable.
   246 not left-recursive, although this transformation might make the grammar less human-readable.
   244 For example if we want to give a non-left-recursive grammar for numbers we might
   247 For example if we want to give a non-left-recursive grammar for numbers we might
   250 
   253 
   251 \noindent
   254 \noindent
   252 Using this grammar we can still derive every number string, but we will never be able 
   255 Using this grammar we can still derive every number string, but we will never be able 
   253 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
   256 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
   254 
   257 
   255 The other property we have to watch out is when a grammar is
   258 The other property we have to watch out for is when a grammar is
   256 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
   259 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
   257 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
   260 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
   258 While the shown parse tree for the string $(1 + 23) + 4$ is unique, there are two parse
   261 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
       
   262 general. For example there are two parse
   259 trees for the string $1 + 2 + 3$, namely
   263 trees for the string $1 + 2 + 3$, namely
   260 
   264 
   261 \begin{center}
   265 \begin{center}
   262 \begin{tabular}{c@{\hspace{10mm}}c}
   266 \begin{tabular}{c@{\hspace{10mm}}c}
   263 \begin{tikzpicture}[level distance=8mm, black]
   267 \begin{tikzpicture}[level distance=8mm, black]
   289 \noindent
   293 \noindent
   290 In particular in programming languages we will try to avoid ambiguous
   294 In particular in programming languages we will try to avoid ambiguous
   291 grammars because two different parse-trees for a string mean a program can
   295 grammars because two different parse-trees for a string mean a program can
   292 be interpreted in two different ways. In such cases we have to somehow make sure
   296 be interpreted in two different ways. In such cases we have to somehow make sure
   293 the two different ways do not matter, or disambiguate the grammar in
   297 the two different ways do not matter, or disambiguate the grammar in
   294 some way (for example making the $+$ left-associative). Unfortunately already 
   298 some other way (for example making the $+$ left-associative). Unfortunately already 
   295 the problem of deciding whether a grammar
   299 the problem of deciding whether a grammar
   296 is ambiguous or not is in general undecidable. 
   300 is ambiguous or not is in general undecidable. 
   297 
   301 
   298 Let us now turn to the problem of generating a parse-tree for a grammar and string.
   302 Let us now turn to the problem of generating a parse-tree for a grammar and string.
   299 In what follows we explain \emph{parser combinators}, because they are easy
   303 In what follows we explain \emph{parser combinators}, because they are easy
   300 to implement and closely resemble grammar rules. Imagine that a grammar
   304 to implement and closely resemble grammar rules. Imagine that a grammar
   301 describes the strings of natural numbers, such as the grammar $N$ shown above.
   305 describes the strings of natural numbers, such as the grammar $N$ shown above.
   302 For all such strings we want to generate the parse-trees or later on we actually 
   306 For all such strings we want to generate the parse-trees or later on we actually 
   303 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
   307 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
   304 these strings. The parser combinators will be functions of type
   308 these strings. In Scala the parser combinators will be functions of type
   305 
   309 
   306 \begin{center}
   310 \begin{center}
   307 \texttt{I $\Rightarrow$ Set[(T, I)]}
   311 \texttt{I $\Rightarrow$ Set[(T, I)]}
   308 \end{center}
   312 \end{center}
   309 
   313 
   310 \noindent 
   314 \noindent 
   311 that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
   315 that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
   312 and return a set of pairs. The first component of these pairs corresponds to what the
   316 and return a set of pairs. The first component of these pairs corresponds to what the
   313 parser combinator was able to process from the input and the second is the unprocessed 
   317 parser combinator was able to process from the input and the second is the unprocessed 
   314 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
   318 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
   315 with the idea that there are potentially several ways how to interpret the input.
   319 with the idea that there are potentially several ways how to interpret the input. As a concrete
   316 
   320 example, consider the case where the input is of type string, say the string
   317 The abstract class for parser combinators requires the implementation of the function
   321 
       
   322 \begin{center}
       
   323 \tt\Grid{iffoo\VS testbar}
       
   324 \end{center}
       
   325 
       
   326 \noindent
       
   327 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
       
   328 an identifier (\texttt{iffoo}). Then the output will be the set
       
   329 
       
   330 \begin{center}
       
   331 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
       
   332            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
       
   333 \end{center}
       
   334 
       
   335 \noindent
       
   336 where the first pair means the parser could recognise \texttt{if} from the input and leaves 
       
   337 the rest as `unprocessed' as the second component of the pair; in the other case
       
   338 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
       
   339 cannot recognise anything from the input then parser combinators just return the empty 
       
   340 set $\varnothing$. This will indicate something ``went wrong''.
       
   341 
       
   342 The main attraction is that we can easily build parser combinators out of smaller components
       
   343 following very closely the structure of a grammar. In order to implement this in an object
       
   344 oriented programming language, like Scala, we need to specify an abstract class for parser 
       
   345 combinators. This abstract class requires the implementation of the function
   318 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
   346 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
   319 \mbox{\texttt{Set[(T, I)]}}.
   347 \mbox{\texttt{Set[(T, I)]}}.
   320 
   348 
   321 \begin{center}
   349 \begin{center}
   322 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   350 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   329 }
   357 }
   330 \end{lstlisting}
   358 \end{lstlisting}
   331 \end{center}
   359 \end{center}
   332 
   360 
   333 \noindent
   361 \noindent
       
   362 From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
       
   363 which just filters out all pairs whose second component is not empty (that is has still some
       
   364 unprocessed part). The reason is that at the end of parsing we are only interested in the
       
   365 results where all the input has been consumed and no unprocessed part is left.
       
   366 
   334 One of the simplest parser combinators recognises just a character, say $c$, 
   367 One of the simplest parser combinators recognises just a character, say $c$, 
   335 from the beginning of strings. Its behaviour is as follows:
   368 from the beginning of strings. Its behaviour is as follows:
   336 
   369 
   337 \begin{itemize}
   370 \begin{itemize}
   338 \item if the head of the input string starts with a $c$, it returns 
   371 \item if the head of the input string starts with a $c$, it returns 
   352     if (sb.head == c) Set((c, sb.tail)) else Set()
   385     if (sb.head == c) Set((c, sb.tail)) else Set()
   353 }
   386 }
   354 \end{lstlisting}
   387 \end{lstlisting}
   355 \end{center}
   388 \end{center}
   356 
   389 
       
   390 \noindent
       
   391 The \texttt{parse} function tests whether the first character of the 
       
   392 input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
       
   393 string into the recognised part \texttt{c} and the unprocessed part
       
   394 \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
       
   395 the parser returns the empty set (in Scala \texttt{Set()}).
       
   396 
       
   397 More interesting are the parser combinators that build larger parsers
       
   398 out of smaller component parsers. For example the alternative 
       
   399 parser combinator is as follows.
       
   400 
       
   401 \begin{center}
       
   402 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   403 class AltParser[I, T]
       
   404        (p: => Parser[I, T], 
       
   405         q: => Parser[I, T]) extends Parser[I, T] {
       
   406   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
       
   407 }
       
   408 \end{lstlisting}
       
   409 \end{center}
       
   410 
       
   411 \noindent
       
   412 The types of this parser combinator are polymorphic (we just have \texttt{I}
       
   413 for the input type, and \texttt{T} for the output type). The alternative parser
       
   414 builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
       
   415 Both need to be able to process input of type \texttt{I} and return the same
       
   416 output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
       
   417 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
       
   418 evaluation of the arguments before they are used. This is often called 
       
   419 \emph{lazy evaluation} of the arguments.) The alternative parser should run
       
   420 the input with the first parser \texttt{p} (producing a set of outputs) and then
       
   421 run the same input with \texttt{q}. The result should be then just the union
       
   422 of both sets, which is the operation \texttt{++} in Scala.
       
   423 
       
   424 This parser combinator already allows us to construct a parser that either 
       
   425 a character \texttt{a} or \texttt{b}, as
       
   426 
       
   427 \begin{center}
       
   428 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   429 new AltParser(CharParser('a'), CharParser('b'))
       
   430 \end{lstlisting}
       
   431 \end{center}
       
   432 
       
   433 \noindent
       
   434 Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
       
   435 We can call this parser combinator with the strings
       
   436 
       
   437 \begin{center}
       
   438 \begin{tabular}{rcl}
       
   439 input string & & output\medskip\\
       
   440 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
       
   441 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
       
   442 \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
       
   443 \end{tabular}
       
   444 \end{center}
       
   445 
       
   446 \noindent
       
   447 We receive in the first two cases a successful output (that is a non-empty set).
       
   448 
       
   449 A bit more interesting is the \emph{sequence parser combinator} implemented in
       
   450 Scala as follows:
       
   451 
       
   452 \begin{center}
       
   453 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   454 class SeqParser[I, T, S]
       
   455        (p: => Parser[I, T], 
       
   456         q: => Parser[I, S]) extends Parser[I, (T, S)] {
       
   457   def parse(sb: I) = 
       
   458     for ((head1, tail1) <- p.parse(sb); 
       
   459          (head2, tail2) <- q.parse(tail1)) 
       
   460             yield ((head1, head2), tail2)
       
   461 }
       
   462 \end{lstlisting}
       
   463 \end{center}
       
   464 
       
   465 \noindent
       
   466 This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
       
   467 as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
       
   468 The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
       
   469 Let \texttt{q} run on these unprocessed parts
       
   470 producing again a set of pairs. The output of the sequence parser combinator is then a set
       
   471 containing pairs where the first components are again pairs, namely what the first parser could parse
       
   472 together with what the second parser could parse; the second component is the unprocessed
       
   473 part left over after running the second parser \texttt{q}. Therefore the input type of
       
   474 the sequence parser combinator is as usual \texttt{I}, but the output type is
       
   475 
       
   476 \begin{center}
       
   477 \texttt{Set[((T, S), I)]}
       
   478 \end{center}
       
   479 
       
   480 Scala allows us to provide some
       
   481 shorthand notation for the sequence parser combinator. So we can write for 
       
   482 example \texttt{'a'  $\sim$ 'b'}, which is the
       
   483 parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
       
   484 Calling this parser combinator with the strings
       
   485 
       
   486 \begin{center}
       
   487 \begin{tabular}{rcl}
       
   488 input string & & output\medskip\\
       
   489 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
       
   490 \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
       
   491 \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
       
   492 \end{tabular}
       
   493 \end{center}
       
   494 
       
   495 \noindent
       
   496 A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
       
   497 an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
       
   498 
       
   499 \begin{center}
       
   500 \begin{tabular}{rcl}
       
   501 input string & & output\medskip\\
       
   502 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
       
   503 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
       
   504 \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
       
   505 \end{tabular}
       
   506 \end{center}
       
   507 
       
   508 Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
       
   509 The first parser has as output type a single character (recall the type of \texttt{CharParser}),
       
   510 but the second parser produces a pair of characters as output. The alternative parser is however
       
   511 required to have both component parsers to have the same type. We will see later how we can 
       
   512 build this parser without the typing error.
       
   513 
       
   514 The next parser combinator does not actually combine smaller parsers, but applies
       
   515 a function to the result of the parser. It is implemented in Scala as follows
       
   516 
       
   517 \begin{center}
       
   518 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   519 class FunParser[I, T, S]
       
   520          (p: => Parser[I, T], 
       
   521           f: T => S) extends Parser[I, S] {
       
   522   def parse(sb: I) = 
       
   523     for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
       
   524 }
       
   525 \end{lstlisting}
       
   526 \end{center}
       
   527 
       
   528 
       
   529 \noindent
       
   530 This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
       
   531 input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
       
   532 produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
       
   533 applies the function \texttt{f} to all the parer outputs. Since this function
       
   534 is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
       
   535 Again Scala lets us introduce some shorthand notation for this parser combinator. 
       
   536 Therefore we will write \texttt{p ==> f} for it.
       
   537 
       
   538 %\bigskip
       
   539 %takes advantage of the full generality---have a look
       
   540 %what it produces if we call it with the string \texttt{abc}
       
   541 %
       
   542 %\begin{center}
       
   543 %\begin{tabular}{rcl}
       
   544 %input string & & output\medskip\\
       
   545 %\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
       
   546 %\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
       
   547 %\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
       
   548 %\end{tabular}
       
   549 %\end{center}
       
   550 
       
   551 
   357 
   552 
   358 \end{document}
   553 \end{document}
   359 
   554 
   360 %%% Local Variables: 
   555 %%% Local Variables: 
   361 %%% mode: latex  
   556 %%% mode: latex