83 \newcommand\Grid[1]{% |
83 \newcommand\Grid[1]{% |
84 \@tfor\z:=#1\do{\grid{\z}}} |
84 \@tfor\z:=#1\do{\grid{\z}}} |
85 \makeatother |
85 \makeatother |
86 |
86 |
87 \newcommand\Vspace[1][.3em]{% |
87 \newcommand\Vspace[1][.3em]{% |
88 \mbox{\kern.06em\vrle height.3ex}% |
88 \mbox{\kern.06em\vrule height.3ex}% |
89 \vbox{\hrule width#1}% |
89 \vbox{\hrule width#1}% |
90 \hbox{\vrule height.3ex}} |
90 \hbox{\vrule height.3ex}} |
91 |
91 |
92 \def\VS{\Vspace[0.6em]} |
92 \def\VS{\Vspace[0.6em]} |
93 |
93 |
94 \begin{document} |
94 \begin{document} |
95 |
95 |
96 \section*{Handout 6} |
96 \section*{Handout 6} |
97 |
97 |
98 While regular expressions are very useful for lexing and for recognising |
98 While regular expressions are very useful for lexing and for recognising |
99 many patterns (like email addresses), they have their limitations. For |
99 many patterns in strings (like email addresses), they have their limitations. For |
100 example there is no regular expression that can recognise the language |
100 example there is no regular expression that can recognise the language |
101 $a^nb^n$. Another example is the language of well-parenthesised |
101 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised |
102 expressions. In languages like Lisp, which use parentheses rather |
102 expressions. In languages like Lisp, which use parentheses rather |
103 extensively, it might be of interest whether the following two expressions |
103 extensively, it might be of interest whether the following two expressions |
104 are well-parenthesised (the left one is, the right one is not): |
104 are well-parenthesised (the left one is, the right one is not): |
105 |
105 |
106 \begin{center} |
106 \begin{center} |
107 $(((()()))())$ \hspace{10mm} $(((()()))()))$ |
107 $(((()()))())$ \hspace{10mm} $(((()()))()))$ |
108 \end{center} |
108 \end{center} |
109 |
109 |
|
110 \noindent |
|
111 Not being able to solve such recognition problems is a serious limitation. |
110 In order to solve such recognition problems, we need more powerful |
112 In order to solve such recognition problems, we need more powerful |
111 techniques than regular expressions. We will in particular look at \emph{context-free |
113 techniques than regular expressions. We will in particular look at \emph{context-free |
112 languages}. They include the regular languages as the picture below shows: |
114 languages}. They include the regular languages as the picture below shows: |
113 |
115 |
114 |
116 |
289 \noindent |
293 \noindent |
290 In particular in programming languages we will try to avoid ambiguous |
294 In particular in programming languages we will try to avoid ambiguous |
291 grammars because two different parse-trees for a string mean a program can |
295 grammars because two different parse-trees for a string mean a program can |
292 be interpreted in two different ways. In such cases we have to somehow make sure |
296 be interpreted in two different ways. In such cases we have to somehow make sure |
293 the two different ways do not matter, or disambiguate the grammar in |
297 the two different ways do not matter, or disambiguate the grammar in |
294 some way (for example making the $+$ left-associative). Unfortunately already |
298 some other way (for example making the $+$ left-associative). Unfortunately already |
295 the problem of deciding whether a grammar |
299 the problem of deciding whether a grammar |
296 is ambiguous or not is in general undecidable. |
300 is ambiguous or not is in general undecidable. |
297 |
301 |
298 Let us now turn to the problem of generating a parse-tree for a grammar and string. |
302 Let us now turn to the problem of generating a parse-tree for a grammar and string. |
299 In what follows we explain \emph{parser combinators}, because they are easy |
303 In what follows we explain \emph{parser combinators}, because they are easy |
300 to implement and closely resemble grammar rules. Imagine that a grammar |
304 to implement and closely resemble grammar rules. Imagine that a grammar |
301 describes the strings of natural numbers, such as the grammar $N$ shown above. |
305 describes the strings of natural numbers, such as the grammar $N$ shown above. |
302 For all such strings we want to generate the parse-trees or later on we actually |
306 For all such strings we want to generate the parse-trees or later on we actually |
303 want to extract the meaning of these strings, that is the concrete integers ``behind'' |
307 want to extract the meaning of these strings, that is the concrete integers ``behind'' |
304 these strings. The parser combinators will be functions of type |
308 these strings. In Scala the parser combinators will be functions of type |
305 |
309 |
306 \begin{center} |
310 \begin{center} |
307 \texttt{I $\Rightarrow$ Set[(T, I)]} |
311 \texttt{I $\Rightarrow$ Set[(T, I)]} |
308 \end{center} |
312 \end{center} |
309 |
313 |
310 \noindent |
314 \noindent |
311 that is they take as input something of type \texttt{I}, typically a list of tokens or a string, |
315 that is they take as input something of type \texttt{I}, typically a list of tokens or a string, |
312 and return a set of pairs. The first component of these pairs corresponds to what the |
316 and return a set of pairs. The first component of these pairs corresponds to what the |
313 parser combinator was able to process from the input and the second is the unprocessed |
317 parser combinator was able to process from the input and the second is the unprocessed |
314 part of the input. As we shall see shortly, a parser combinator might return more than one such pair, |
318 part of the input. As we shall see shortly, a parser combinator might return more than one such pair, |
315 with the idea that there are potentially several ways how to interpret the input. |
319 with the idea that there are potentially several ways how to interpret the input. As a concrete |
316 |
320 example, consider the case where the input is of type string, say the string |
317 The abstract class for parser combinators requires the implementation of the function |
321 |
|
322 \begin{center} |
|
323 \tt\Grid{iffoo\VS testbar} |
|
324 \end{center} |
|
325 |
|
326 \noindent |
|
327 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or |
|
328 an identifier (\texttt{iffoo}). Then the output will be the set |
|
329 |
|
330 \begin{center} |
|
331 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
|
332 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
|
333 \end{center} |
|
334 |
|
335 \noindent |
|
336 where the first pair means the parser could recognise \texttt{if} from the input and leaves |
|
337 the rest as `unprocessed' as the second component of the pair; in the other case |
|
338 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser |
|
339 cannot recognise anything from the input then parser combinators just return the empty |
|
340 set $\varnothing$. This will indicate something ``went wrong''. |
|
341 |
|
342 The main attraction is that we can easily build parser combinators out of smaller components |
|
343 following very closely the structure of a grammar. In order to implement this in an object |
|
344 oriented programming language, like Scala, we need to specify an abstract class for parser |
|
345 combinators. This abstract class requires the implementation of the function |
318 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type |
346 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type |
319 \mbox{\texttt{Set[(T, I)]}}. |
347 \mbox{\texttt{Set[(T, I)]}}. |
320 |
348 |
321 \begin{center} |
349 \begin{center} |
322 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
350 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
352 if (sb.head == c) Set((c, sb.tail)) else Set() |
385 if (sb.head == c) Set((c, sb.tail)) else Set() |
353 } |
386 } |
354 \end{lstlisting} |
387 \end{lstlisting} |
355 \end{center} |
388 \end{center} |
356 |
389 |
|
390 \noindent |
|
391 The \texttt{parse} function tests whether the first character of the |
|
392 input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the |
|
393 string into the recognised part \texttt{c} and the unprocessed part |
|
394 \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then |
|
395 the parser returns the empty set (in Scala \texttt{Set()}). |
|
396 |
|
397 More interesting are the parser combinators that build larger parsers |
|
398 out of smaller component parsers. For example the alternative |
|
399 parser combinator is as follows. |
|
400 |
|
401 \begin{center} |
|
402 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
403 class AltParser[I, T] |
|
404 (p: => Parser[I, T], |
|
405 q: => Parser[I, T]) extends Parser[I, T] { |
|
406 def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
|
407 } |
|
408 \end{lstlisting} |
|
409 \end{center} |
|
410 |
|
411 \noindent |
|
412 The types of this parser combinator are polymorphic (we just have \texttt{I} |
|
413 for the input type, and \texttt{T} for the output type). The alternative parser |
|
414 builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}. |
|
415 Both need to be able to process input of type \texttt{I} and return the same |
|
416 output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the |
|
417 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the |
|
418 evaluation of the arguments before they are used. This is often called |
|
419 \emph{lazy evaluation} of the arguments.) The alternative parser should run |
|
420 the input with the first parser \texttt{p} (producing a set of outputs) and then |
|
421 run the same input with \texttt{q}. The result should be then just the union |
|
422 of both sets, which is the operation \texttt{++} in Scala. |
|
423 |
|
424 This parser combinator already allows us to construct a parser that either |
|
425 a character \texttt{a} or \texttt{b}, as |
|
426 |
|
427 \begin{center} |
|
428 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
429 new AltParser(CharParser('a'), CharParser('b')) |
|
430 \end{lstlisting} |
|
431 \end{center} |
|
432 |
|
433 \noindent |
|
434 Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. |
|
435 We can call this parser combinator with the strings |
|
436 |
|
437 \begin{center} |
|
438 \begin{tabular}{rcl} |
|
439 input string & & output\medskip\\ |
|
440 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ |
|
441 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ |
|
442 \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$ |
|
443 \end{tabular} |
|
444 \end{center} |
|
445 |
|
446 \noindent |
|
447 We receive in the first two cases a successful output (that is a non-empty set). |
|
448 |
|
449 A bit more interesting is the \emph{sequence parser combinator} implemented in |
|
450 Scala as follows: |
|
451 |
|
452 \begin{center} |
|
453 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
454 class SeqParser[I, T, S] |
|
455 (p: => Parser[I, T], |
|
456 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
|
457 def parse(sb: I) = |
|
458 for ((head1, tail1) <- p.parse(sb); |
|
459 (head2, tail2) <- q.parse(tail1)) |
|
460 yield ((head1, head2), tail2) |
|
461 } |
|
462 \end{lstlisting} |
|
463 \end{center} |
|
464 |
|
465 \noindent |
|
466 This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} |
|
467 as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}). |
|
468 The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. |
|
469 Let \texttt{q} run on these unprocessed parts |
|
470 producing again a set of pairs. The output of the sequence parser combinator is then a set |
|
471 containing pairs where the first components are again pairs, namely what the first parser could parse |
|
472 together with what the second parser could parse; the second component is the unprocessed |
|
473 part left over after running the second parser \texttt{q}. Therefore the input type of |
|
474 the sequence parser combinator is as usual \texttt{I}, but the output type is |
|
475 |
|
476 \begin{center} |
|
477 \texttt{Set[((T, S), I)]} |
|
478 \end{center} |
|
479 |
|
480 Scala allows us to provide some |
|
481 shorthand notation for the sequence parser combinator. So we can write for |
|
482 example \texttt{'a' $\sim$ 'b'}, which is the |
|
483 parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}. |
|
484 Calling this parser combinator with the strings |
|
485 |
|
486 \begin{center} |
|
487 \begin{tabular}{rcl} |
|
488 input string & & output\medskip\\ |
|
489 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
|
490 \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\ |
|
491 \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$ |
|
492 \end{tabular} |
|
493 \end{center} |
|
494 |
|
495 \noindent |
|
496 A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character either |
|
497 an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results. |
|
498 |
|
499 \begin{center} |
|
500 \begin{tabular}{rcl} |
|
501 input string & & output\medskip\\ |
|
502 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
|
503 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
|
504 \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
|
505 \end{tabular} |
|
506 \end{center} |
|
507 |
|
508 Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error. |
|
509 The first parser has as output type a single character (recall the type of \texttt{CharParser}), |
|
510 but the second parser produces a pair of characters as output. The alternative parser is however |
|
511 required to have both component parsers to have the same type. We will see later how we can |
|
512 build this parser without the typing error. |
|
513 |
|
514 The next parser combinator does not actually combine smaller parsers, but applies |
|
515 a function to the result of the parser. It is implemented in Scala as follows |
|
516 |
|
517 \begin{center} |
|
518 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
519 class FunParser[I, T, S] |
|
520 (p: => Parser[I, T], |
|
521 f: T => S) extends Parser[I, S] { |
|
522 def parse(sb: I) = |
|
523 for ((head, tail) <- p.parse(sb)) yield (f(head), tail) |
|
524 } |
|
525 \end{lstlisting} |
|
526 \end{center} |
|
527 |
|
528 |
|
529 \noindent |
|
530 This parser combinator takes a parser \texttt{p} with output type \texttt{T} as |
|
531 input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p} |
|
532 produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then |
|
533 applies the function \texttt{f} to all the parer outputs. Since this function |
|
534 is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}. |
|
535 Again Scala lets us introduce some shorthand notation for this parser combinator. |
|
536 Therefore we will write \texttt{p ==> f} for it. |
|
537 |
|
538 %\bigskip |
|
539 %takes advantage of the full generality---have a look |
|
540 %what it produces if we call it with the string \texttt{abc} |
|
541 % |
|
542 %\begin{center} |
|
543 %\begin{tabular}{rcl} |
|
544 %input string & & output\medskip\\ |
|
545 %\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
|
546 %\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
|
547 %\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
|
548 %\end{tabular} |
|
549 %\end{center} |
|
550 |
|
551 |
357 |
552 |
358 \end{document} |
553 \end{document} |
359 |
554 |
360 %%% Local Variables: |
555 %%% Local Variables: |
361 %%% mode: latex |
556 %%% mode: latex |