diff -r 7f8516ff408d -r 31295bb945c6 handouts/ho06.tex --- a/handouts/ho06.tex Wed Nov 25 15:59:32 2015 +0000 +++ b/handouts/ho06.tex Thu Nov 26 00:11:10 2015 +0000 @@ -7,17 +7,17 @@ \section*{Handout 6 (Parser Combinators)} -In what follows we explain \emph{parser combinators}, because -they are very easy to implement. However, they only work when -the grammar to be parsed is \emph{not} left-recursive and they -are efficient only when the grammar is unambiguous. It is the -responsibility of the grammar designer to ensure these two -properties. +In what follows we explain \emph{parser combinators}. Their +distinguishing feature is that they are very easy to +implement. However, they only work when the grammar to be +parsed is \emph{not} left-recursive and they are efficient +only when the grammar is unambiguous. It is the responsibility +of the grammar designer to ensure these two properties. Parser combinators can deal with any kind of input as long as this input is a kind of sequence, for example a string or a -list of tokens. If the input are lists of tokens, then parser -combinators transform them into sets of pairs, like so +list of tokens. The general idea behind parser combinators is +to transform the input into sets of pairs, like so \begin{center} $\underbrace{\text{list of tokens}}_{\text{input}}$ @@ -25,8 +25,7 @@ $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ \end{center} -\noindent In Scala parser combinators will therefore be -functions of type +\noindent In Scala parser combinators are functions of type \begin{center} \texttt{I $\Rightarrow$ Set[(T, I)]} @@ -37,24 +36,23 @@ these pairs corresponds to what the parser combinator was able to process from the input and the second is the unprocessed part of the input. As we shall see shortly, a parser -combinator might return more than one such pair, with the idea -that there are potentially several ways how to interpret the -input. To simplify matters we will first look at the case +combinator might return more than one such pair, the idea +being that there are potentially several ways how to interpret +the input. To simplify matters we will first look at the case where the input to the parser combinators is just strings. As -a concrete example, consider the case where the input is the -string +a concrete example, consider the string \begin{center} \tt\Grid{iffoo\VS testbar} \end{center} \noindent We might have a parser combinator which tries to -interpret this string as a keyword (\texttt{if}) or an +interpret this string as a keyword (\texttt{if}) or as an identifier (\texttt{iffoo}). Then the output will be the set \begin{center} -$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), - \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ +$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), + \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ \end{center} \noindent where the first pair means the parser could @@ -62,7 +60,7 @@ `unprocessed' as the second component of the pair; in the other case it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser cannot -recognise anything from the input then parser combinators just +recognise anything from the input, then parser combinators just return the empty set $\{\}$. This will indicate something ``went wrong''\ldots or more precisely, nothing could be parsed. @@ -93,19 +91,19 @@ ``centrally'' derive the function \texttt{parse\_all}, which just filters out all pairs whose second component is not empty (that is has still some unprocessed part). The reason is that -at the end of parsing we are only interested in the results +at the end of the parsing we are only interested in the results where all the input has been consumed and no unprocessed part -is left. +is left over. One of the simplest parser combinators recognises just a character, say $c$, from the beginning of strings. Its behaviour can be described as follows: \begin{itemize} -\item if the head of the input string starts with a $c$, it returns +\item if the head of the input string starts with a $c$, then returns the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} $s$ is the unprocessed part of the input string -\item otherwise it returns the empty set $\{\}$ +\item otherwise return the empty set $\{\}$ \end{itemize} \noindent @@ -135,14 +133,15 @@ parsers out of smaller component parsers. For example the alternative parser combinator is as follows: given two parsers, say, $p$ and $q$, then we apply both parsers to the -input (remember parsers are functions) and combine the input +input (remember parsers are functions) and combine the output +(remember they are sets) \begin{center} $p(\text{input}) \cup q(\text{input})$ \end{center} -\noindent -In Scala we would implement alternative parsers as follows +\noindent In Scala we would implement alternative parser +combinator as follows \begin{center} \begin{lstlisting}[language=Scala, numbers=none] @@ -157,21 +156,22 @@ \noindent The types of this parser combinator are polymorphic (we just have \texttt{I} for the input type, and \texttt{T} for the output type). The alternative parser builds a new -parser out of two existing parser combinator \texttt{p} and -\texttt{q}. Both need to be able to process input of type -\texttt{I} and return the same output type \texttt{Set[(T, -I)]}. (There is an interesting detail of Scala, namely the -\texttt{=>} in front of the types of \texttt{p} and +parser out of two existing parsers \texttt{p} and \texttt{q}. +Both need to be able to process input of type \texttt{I} and +return the same output type \texttt{Set[(T, +I)]}.\footnote{There is an interesting detail of Scala, namely +the \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the evaluation of the arguments before they are used. This is often called \emph{lazy -evaluation} of the arguments.) The alternative parser should +evaluation} of the arguments. We will explain this later.} The alternative parser should run the input with the first parser \texttt{p} (producing a set of outputs) and then run the same input with \texttt{q}. The result should be then just the union of both sets, which is the operation \texttt{++} in Scala. -This parser combinator already allows us to construct a parser -that either a character \texttt{a} or \texttt{b}, as +The alternative parser combinator already allows us to +construct a parser that parses either a character \texttt{a} +or \texttt{b}, as \begin{center} \begin{lstlisting}[language=Scala, numbers=none] @@ -194,22 +194,22 @@ \noindent We receive in the first two cases a successful output (that is a non-empty set). In each case, either -\pcode{a} or \pcode{b} are the processed parts, and \pcode{c} -is the unprocessed part. Clearly this parser cannot parse -anything with the string \pcode{cc}. +\pcode{a} or \pcode{b} are in the processed parts, and +\pcode{c} in the unprocessed part. Clearly this parser cannot +parse anything in the string \pcode{cc}, therefore the empty +set. A bit more interesting is the \emph{sequence parser -combinator}. Given two parsers, say, $p$ and $q$, -apply first the input to $p$ producing a set of pairs; -then apply $q$ to the unparsed parts; then combine the -results like +combinator}. Given two parsers, say, $p$ and $q$, apply first +the input to $p$ producing a set of pairs; then apply $q$ to +all the unparsed parts; then combine the results like \begin{center} \begin{tabular}{lcl} $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & $(\textit{output}_1, u_1) \in p(\text{input}) \;\wedge\;$\\ -&& $(\textit{output}_2, u_2) \in q(u_1)\}$ +&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$ \end{tabular} \end{center} @@ -222,9 +222,9 @@ (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { def parse(sb: I) = - for ((head1, tail1) <- p.parse(sb); - (head2, tail2) <- q.parse(tail1)) - yield ((head1, head2), tail2) + for ((output1, u1) <- p.parse(sb); + (output2, u2) <- q.parse(u1)) + yield ((output1, output2), u2) } \end{lstlisting} \end{center} @@ -232,7 +232,7 @@ \noindent This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} as follows: let first run the parser \texttt{p} on the input producing a set -of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1} +of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the unprocessed parts left over by \texttt{p}. Let \texttt{q} run on these unprocessed parts producing again a set of pairs. The output of the sequence parser combinator is @@ -249,8 +249,8 @@ \end{center} Scala allows us to provide some shorthand notation for the -sequence parser combinator. So we can write for example -\texttt{'a' $\sim$ 'b'}, which is the parser combinator that +sequence parser combinator. We can write for example +\pcode{'a' ~ 'b'}, which is the parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}. Three examples of this parser combinator are as follows: @@ -264,10 +264,10 @@ \end{tabular} \end{center} -\noindent A slightly more complicated parser is \texttt{('a' -|| 'b') $\sim$ 'b'} which parses as first character either an +\noindent A slightly more complicated parser is \pcode{('a' +|| 'b') ~ 'b'} which parses as first character either an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser -produces the following results. +produces the following outputs. \begin{center} \begin{tabular}{rcl} @@ -289,10 +289,11 @@ \end{tabular} \end{center} -\noindent Notice how the results nest deeper as pairs (the -last \pcode{a} is in the unprocessed part). To consume -everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~ -'a'}. Then the output is as follows: +\noindent Notice how the results nest deeper and deeper as +pairs (the last \pcode{a} is in the unprocessed part). To +consume everything of this string we can use the parser +\pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is as +follows: \begin{center} \begin{tabular}{rcl} @@ -307,8 +308,8 @@ empty string. -Note carefully that constructing a parser such \texttt{'a' || -('a' $\sim$ 'b')} will result in a tying error. The first +Note carefully that constructing a parser such \pcode{'a' || +('a' ~ 'b')} will result in a typing error. The first parser has as output type a single character (recall the type of \texttt{CharParser}), but the second parser produces a pair of characters as output. The alternative parser is however @@ -317,7 +318,7 @@ typing error. The next parser combinator does not actually combine smaller -parsers, but applies a function to the result of the parser. +parsers, but applies a function to the result of a parser. It is implemented in Scala as follows \begin{center} @@ -332,14 +333,20 @@ \end{center} -\noindent -This parser combinator takes a parser \texttt{p} with output type \texttt{T} as -input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p} -produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then -applies the function \texttt{f} to all the parer outputs. Since this function -is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}. -Again Scala lets us introduce some shorthand notation for this parser combinator. -Therefore we will write \texttt{p ==> f} for it. +\noindent This parser combinator takes a parser \texttt{p} +with output type \texttt{T} as one argument as well as a +function \texttt{f} with type \texttt{T => S}. The parser +\texttt{p} produces sets of type \texttt{(T, I)}. The +\texttt{FunParser} combinator then applies the function +\texttt{f} to all the parser outputs. Since this function is of +type \texttt{T => S}, we obtain a parser with output type +\texttt{S}. Again Scala lets us introduce some shorthand +notation for this parser combinator. Therefore we will write +\texttt{p ==> f} for it. + +\subsubsection*{How to build parsers using parser combinators?} + +\subsubsection*{Implementing an Interpreter} %\bigskip %takes advantage of the full generality---have a look