# HG changeset patch # User Christian Urban # Date 1540381763 -3600 # Node ID 529460073b2424ad82d4435d0331e7e06fa606b4 # Parent 200d2a3eb1b1708c9729198643ef4869589d334b updated diff -r 200d2a3eb1b1 -r 529460073b24 handouts/ho06.pdf Binary file handouts/ho06.pdf has changed diff -r 200d2a3eb1b1 -r 529460073b24 handouts/ho06.tex --- a/handouts/ho06.tex Tue Oct 23 08:56:28 2018 +0100 +++ b/handouts/ho06.tex Wed Oct 24 12:49:23 2018 +0100 @@ -1,3 +1,4 @@ + \documentclass{article} \usepackage{../style} \usepackage{../langs} @@ -7,17 +8,18 @@ \section*{Handout 6 (Parser Combinators)} -In what follows we explain \emph{parser combinators}. Their -distinguishing feature is that they are very easy to -implement. However, they only work when the grammar to be -parsed is \emph{not} left-recursive and they are efficient -only when the grammar is unambiguous. It is the responsibility -of the grammar designer to ensure these two properties. +This handout explains how \emph{parser combinators} work and how they +can be implemented in Scala. Their distinguishing feature is that they +are very easy to implement (admittedly it is only easy in a functional +programming language). However, parser combinators require that the +grammar to be parsed is \emph{not} left-recursive and they are +efficient only when the grammar is unambiguous. It is the +responsibility of the grammar designer to ensure these two properties. -Parser combinators can deal with any kind of input as long as -this input is a kind of sequence, for example a string or a -list of tokens. The general idea behind parser combinators is -to transform the input into sets of pairs, like so +Another good point of parser combinators is that they can deal with any kind +of input as long as this input of ``sequence-kind'', for example a +string or a list of tokens. The general idea behind parser combinators +is to transform the input into sets of pairs, like so \begin{center} $\underbrace{\text{list of tokens}}_{\text{input}}$ @@ -25,22 +27,33 @@ $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ \end{center} -\noindent In Scala parser combinators are functions of type +\noindent As said, the input can be anything as long as it is a +``sequence''. The only property of the input we need is to be able to +test when it is empty. Obviously we can do this for strings and lists. +For more lucidity we shall below often use strings as input in order +to illustrate matters. However, this does not make our previous work +on lexers obsolete (remember they transform a string into a list of +tokens). Lexers will still be needed to build a somewhat realistic +compiler. + + +In my Scala code I use the following polymorphic types for parser combinators: \begin{center} -\texttt{I $\Rightarrow$ Set[(T, I)]} +input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} \end{center} -\noindent that is they take as input something of type -\texttt{I} and return a set of pairs. The first component of -these pairs corresponds to what the parser combinator was able -to process from the input and the second is the unprocessed -part of the input. As we shall see shortly, a parser -combinator might return more than one such pair, the idea -being that there are potentially several ways of how to -interpret the input. To simplify matters we will first look at -the case where the input to the parser combinators is just -strings. As a concrete example, consider the string +\noindent that is they take as input something of type \texttt{I} and +return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs +to be of ``sequence-kind'' I actually have to often write \texttt{I + <\% Seq[\_]} for the input type in order to indicate it is a +subtype of Scala sequences. The first component of the generated pairs +corresponds to what the parser combinator was able to process from the +input and the second is the unprocessed part of the input (therefore +the type of this unprocessed part is the same as the input). As we +shall see shortly, a parser combinator might return more than one such +pair; the idea being that there are potentially several ways of how to +parse the input. As a concrete example, consider the string \begin{center} \tt\Grid{iffoo\VS testbar} @@ -65,14 +78,22 @@ ``went wrong''\ldots or more precisely, nothing could be parsed. -The idea of parser combinators is that we can easily build -parser combinators out of smaller components following very -closely the structure of a grammar. In order to implement this -in an object-oriented programming language, like Scala, we -need to specify an abstract class for parser combinators. This -abstract class requires the implementation of the function -\texttt{parse} taking an argument of type \texttt{I} and -returns a set of type \mbox{\texttt{Set[(T, I)]}}. +Also important to note is that the type \texttt{T} for the processed +part is different from the input type. The reason is that in general +we are interested in transform our input into something +``different''\ldots for example into a tree, or if we implement the +grammar for arithmetic expressions we might be interested in the +actual integer number the arithmetic expression, say \texttt{1 + 2 * + 3}, stands for. In this way we can use parser combinators to +implement relativaley easily a calculator. + +The main idea of parser combinators is that we can easily build parser +combinators out of smaller components following very closely the +structure of a grammar. In order to implement this in an +object-oriented programming language, like Scala, we need to specify +an abstract class for parser combinators. This abstract class states +that the function \texttt{parse} takes an argument of type \texttt{I} +and returns a set of type \mbox{\texttt{Set[(T, I)]}}. \begin{center} \begin{lstlisting}[language=Scala] @@ -86,23 +107,25 @@ \end{lstlisting} \end{center} -\noindent From the function \texttt{parse} we can then -``centrally'' derive the function \texttt{parse\_all}, which -just filters out all pairs whose second component is not empty -(that is has still some unprocessed part). The reason is that -at the end of the parsing we are only interested in the results -where all the input has been consumed and no unprocessed part -is left over. +\noindent It is the obligation in each instance (parser combinator) to +supply an implementation for \texttt{parse}. From this function we +can then ``centrally'' derive the function \texttt{parse\_all}, which +just filters out all pairs whose second component is not empty (that +is has still some unprocessed part). The reason is that at the end of +the parsing we are only interested in the results where all the input +has been consumed and no unprocessed part is left over. One of the simplest parser combinators recognises just a -character, say $c$, from the beginning of strings. Its +single character, say $c$, from the beginning of strings. Its behaviour can be described as follows: \begin{itemize} -\item if the head of the input string starts with a $c$, then returns - the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} - $s$ is the unprocessed part of the input string -\item otherwise return the empty set $\{\}$ +\item If the head of the input string starts with a $c$, then return + the set + \[\{(c, \textit{tail of}\; s)\}\] + where \textit{tail of} + $s$ is the unprocessed part of the input string. +\item Otherwise return the empty set $\{\}$. \end{itemize} \noindent @@ -119,21 +142,49 @@ \end{lstlisting} \end{center} -\noindent The \texttt{parse} function tests whether the first -character of the input string \texttt{sb} is equal to -\texttt{c}. If yes, then it splits the string into the -recognised part \texttt{c} and the unprocessed part -\texttt{sb.tail}. In case \texttt{sb} does not start with -\texttt{c} then the parser returns the empty set (in Scala -\texttt{Set()}). +\noindent You can see the \texttt{parse} function tests whether the +first character of the input string \texttt{sb} is equal to +\texttt{c}. If yes, then it splits the string into the recognised part +\texttt{c} and the unprocessed part \texttt{sb.tail}. In case +\texttt{sb} does not start with \texttt{c} then the parser returns the +empty set (in Scala \texttt{Set()}). Since this parser recognises +characters and just returns characters as the processed part, the +output type of the parser is \texttt{Char}. + +If we want to parse a list of tokens and interested in recognising a +number token, we could write something like this + +\begin{center} +\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none] +case object NumParser extends Parser[List[Token], Int] { + def parse(ts: List[Token]) = ts match { + case Num_token(s)::ts => Set((s.toInt, ts)) + case _ => Set () + } +} +\end{lstlisting} +\end{center} + +\noindent +In this parser the input is of type \texttt{List[Token]}. The function +parse looks at the input \texttt{ts} and checks whether the first +token is a \texttt{Num\_token}. Let us assume our lexer generated +these tokens for numbers. But this parser does not just return this +token (and the rest of the list), like the \texttt{CharParser} above, +rather extracts the string \texttt{s} from the token and converts it +into an integer. The hope is that the lexer did its work well and this +conversion always succeeds. The consequence of this is that the output +type for this parser is \texttt{Int}. Such a conversion would be +needed if we want to implement a simple calculator program. -More interesting are the parser combinators that build larger -parsers out of smaller component parsers. For example the -alternative parser combinator is as follows: given two -parsers, say, $p$ and $q$, we apply both parsers to the -input (remember parsers are functions) and combine the output -(remember they are sets) +These simple parsers that just look at the input and do a simple +transformation are often called \emph{atomic} parser combinators. +More interesting are the parser combinators that build larger parsers +out of smaller component parsers. For example the \emph{alternative + parser combinator} is as follows: given two parsers, say, $p$ and +$q$, we apply both parsers to the input (remember parsers are +functions) and combine the output (remember they are sets of pairs) \begin{center} $p(\text{input}) \cup q(\text{input})$ @@ -152,20 +203,20 @@ \end{lstlisting} \end{center} -\noindent The types of this parser combinator are polymorphic -(we just have \texttt{I} for the input type, and \texttt{T} -for the output type). The alternative parser builds a new -parser out of two existing parsers \texttt{p} and \texttt{q}. -Both need to be able to process input of type \texttt{I} and -return the same output type \texttt{Set[(T, -I)]}.\footnote{There is an interesting detail of Scala, namely -the \texttt{=>} in front of the types of \texttt{p} and -\texttt{q}. They will prevent the evaluation of the arguments -before they are used. This is often called \emph{lazy -evaluation} of the arguments. We will explain this later.} The alternative parser should -run the input with the first parser \texttt{p} (producing a -set of outputs) and then run the same input with \texttt{q}. -The result should be then just the union of both sets, which +\noindent The types of this parser combinator are again generic (we +just have \texttt{I} for the input type, and \texttt{T} for the output +type). The alternative parser builds a new parser out of two existing +parsers \texttt{p} and \texttt{q}. Both need to be able to process +input of type \texttt{I} and return the same output type +\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of + Scala, namely the \texttt{=>} in front of the types of \texttt{p} + and \texttt{q}. They will prevent the evaluation of the arguments + before they are used. This is often called \emph{lazy evaluation} of + the arguments. We will explain this later.} Therefore the output +type of this parser is \texttt{T}. The alternative parser should run +the input with the first parser \texttt{p} (producing a set of pairs) +and then run the same input with \texttt{q} (producing another set of +pairs). The result should be then just the union of both sets, which is the operation \texttt{++} in Scala. The alternative parser combinator already allows us to @@ -178,42 +229,132 @@ \end{lstlisting} \end{center} -\noindent Scala allows us to introduce some more readable -shorthand notation for this, like \texttt{'a' || 'b'}. We can -call this parser combinator with the strings +\noindent Later on we will use again Scala mechanism for introducing some +more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings \begin{center} \begin{tabular}{rcl} input strings & & output\medskip\\ -\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$ +\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ +\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ +\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ \end{tabular} \end{center} \noindent We receive in the first two cases a successful output (that is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in the processed part, and -\pcode{c} in the unprocessed part. Clearly this parser cannot -parse anything in the string \pcode{cc}, therefore the empty +\pcode{cde} in the unprocessed part. Clearly this parser cannot +parse anything in the string \pcode{ccde}, therefore the empty set. -A bit more interesting is the \emph{sequence parser -combinator}. Given two parsers, say, $p$ and $q$, apply first -the input to $p$ producing a set of pairs; then apply $q$ to -all the unparsed parts; then combine the results like +A bit more interesting is the \emph{sequence parser combinator}. Given +two parsers, say again, $p$ and $q$, we want to apply first the input +to $p$ producing a set of pairs; then apply $q$ to all the unparsed +parts in the pairs; and then combine the results like \begin{center} \begin{tabular}{lcl} $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & $(\textit{output}_1, u_1) \in p(\text{input}) \;\wedge\;$\\ -&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$ +&& $(\textit{output}_2, u_2) \in q(u_1)\}$ \end{tabular} \end{center} -\noindent -This can be implemented in Scala as follows: +\noindent Notice that the $p$ wil first be run on the input, producing +pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands +for the unprocessed, or left-over, parts. We want that $q$ runs on all +these unprocessed parts $u_1$. This again will produce some +processed part , $p$ and +$q$, we apply both parsers to the input (remember parsers are +functions) and combine the output (remember they are sets of pairs) + +\begin{center} +$p(\text{input}) \cup q(\text{input})$ +\end{center} + +\noindent In Scala we would implement alternative parser +combinator as follows + +\begin{center} +\begin{lstlisting}[language=Scala, numbers=none] +class AltParser[I, T] + (p: => Parser[I, T], + q: => Parser[I, T]) extends Parser[I, T] { + def parse(sb: I) = p.parse(sb) ++ q.parse(sb) +} +\end{lstlisting} +\end{center} + +\noindent The types of this parser combinator are again generic (we +just have \texttt{I} for the input type, and \texttt{T} for the output +type). The alternative parser builds a new parser out of two existing +parsers \texttt{p} and \texttt{q}. Both need to be able to process +input of type \texttt{I} and return the same output type +\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of + Scala, namely the \texttt{=>} in front of the types of \texttt{p} + and \texttt{q}. They will prevent the evaluation of the arguments + before they are used. This is often called \emph{lazy evaluation} of + the arguments. We will explain this later.} Therefore the output +type of this parser is \texttt{T}. The alternative parser should run +the input with the first parser \texttt{p} (producing a set of pairs) +and then run the same input with \texttt{q} (producing another set of +pairs). The result should be then just the union of both sets, which +is the operation \texttt{++} in Scala. + +The alternative parser combinator already allows us to +construct a parser that parses either a character \texttt{a} +or \texttt{b}, as + +\begin{center} +\begin{lstlisting}[language=Scala, numbers=none] +new AltParser(CharParser('a'), CharParser('b')) +\end{lstlisting} +\end{center} + +\noindent Later on we will use again Scala mechanism for introducing some +more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings + +\begin{center} +\begin{tabular}{rcl} +input strings & & output\medskip\\ +\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ +\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ +\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ +\end{tabular} +\end{center} + +\noindent We receive in the first two cases a successful +output (that is a non-empty set). In each case, either +\pcode{a} or \pcode{b} is in the processed part, and +\pcode{cde} in the unprocessed part. Clearly this parser cannot +parse anything in the string \pcode{ccde}, therefore the empty +set. + +A bit more interesting is the \emph{sequence parser combinator}. Given +two parsers, say again, $p$ and $q$, we want to apply first the input +to $p$ producing a set of pairs; then apply $q$ to all the unparsed +parts in the pairs; and then combine the results like + +\begin{center} +\begin{tabular}{lcl} +$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & +$(\textit{output}_1, u_1) \in p(\text{input}) +\;\wedge\;$\\ +&& $(\textit{output}_2, u_2) \in q(u_1)\}$ +\end{tabular} +\end{center} + +\noindent Notice that the $p$ wil first be run on the input, producing +pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The +overall result of the sequence parser combinator is pairs of the form +$((\textit{output}_1, \textit{output}_2), u_2)$. This means the +unprocessed parts of both parsers is the unprocessed parts the second +parser $q$ produces as left-over. The processed parts of both parsers +is just the pair of the outputs +$(\textit{output}_1, \textit{output}_2)$. This behavious can be +implemented in Scala as follows: \begin{center} \begin{lstlisting}[language=Scala,numbers=none] @@ -228,7 +369,7 @@ \end{lstlisting} \end{center} -\noindent This parser takes as input two parsers, \texttt{p} +\noindent This parser takes again as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1} @@ -247,26 +388,44 @@ \texttt{Set[((T, S), I)]} \end{center} -Scala allows us to provide some shorthand notation for the -sequence parser combinator. We can write for example -\pcode{'a' ~ 'b'}, which is the parser combinator that -first consumes the character \texttt{a} from a string and then -\texttt{b}. Three examples of this parser combinator are as -follows: +\noindent +If any of the runs of \textit{p} and \textit{q} fail, that is produce +the empty set, then \texttt{parse} will also produce the empty set. +Notice that we have now two output types for the sequence parser +combinator, because in general \textit{p} and \textit{q} might produce +differetn things (for example first we recognise a number and then a +string corresponding to an operator). + + +We have not yet looked at this in detail, but Scala allows us to +provide some shorthand notation for the sequence parser combinator. We +can write for example \pcode{"a" ~ "b"}, which is the parser +combinator that first recognises the character \texttt{a} from a +string and then \texttt{b}. Let us look again at three examples of +how this parser combinator processes strings: \begin{center} \begin{tabular}{rcl} input strings & & output\medskip\\ -\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\ -\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$ +\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ +\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\ +\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ \end{tabular} \end{center} -\noindent A slightly more complicated parser is \pcode{('a' -|| 'b') ~ 'b'} which parses as first character either an -\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser -produces the following outputs. +\noindent In the first line we have a sucessful parse, because the +string started with \texttt{ab}, which is the prefix we are looking +for. But since the parsing combinator is constructed as sequence of +the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the +result is a nested pair of the form \texttt{((a, b), cde)}. It is +\emph{not} a simple pair \texttt{(ab, cde)} as one might errorneously +expects. The parser returns the ampty set in the other examples, +because they do not fit with what the parser is supposed to parse. + + +A slightly more complicated parser is \pcode{("a" || "b") ~ "c"}. +which parses as first character either an \texttt{a} or \texttt{b} +followed by a \texttt{c}. This parser produces the following outputs. \begin{center} \begin{tabular}{rcl}