afl-material: comparison handouts/ho06.tex

equal deleted inserted replaced

-:200d2a3eb1b1
+:529460073b24
 \documentclass{article}
 \usepackage{../style}
 \usepackage{../langs}
 \begin{document}
 \section*{Handout 6 (Parser Combinators)}
-In what follows we explain \emph{parser combinators}. Their
+This handout explains how \emph{parser combinators} work and how they
-distinguishing feature is that they are very easy to
+can be implemented in Scala. Their distinguishing feature is that they
-implement. However, they only work when the grammar to be
+are very easy to implement (admittedly it is only easy in a functional
-parsed is \emph{not} left-recursive and they are efficient
+programming language). However, parser combinators require that the
-only when the grammar is unambiguous. It is the responsibility
+grammar to be parsed is \emph{not} left-recursive and they are
-of the grammar designer to ensure these two properties.
+efficient only when the grammar is unambiguous. It is the
+responsibility of the grammar designer to ensure these two properties.
-Parser combinators can deal with any kind of input as long as
-this input is a kind of sequence, for example a string or a
+Another good point of parser combinators is that they can deal with any kind
-list of tokens. The general idea behind parser combinators is
+of input as long as this input of ``sequence-kind'', for example a
-to transform the input into sets of pairs, like so
+string or a list of tokens. The general idea behind parser combinators
+is to transform the input into sets of pairs, like so
 \begin{center}
 $\underbrace{\text{list of tokens}}_{\text{input}}$
 $\Rightarrow$
 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
 \end{center}
-\noindent In Scala parser combinators are functions of type
+\noindent As said, the input can be anything as long as it is a
+``sequence''. The only property of the input we need is to be able to
-\begin{center}
+test when it is empty. Obviously we can do this for strings and lists.
-\texttt{I $\Rightarrow$ Set[(T, I)]}
+For more lucidity we shall below often use strings as input in order
-\end{center}
+to illustrate matters. However, this does not make our previous work
+on lexers obsolete (remember they transform a string into a list of
-\noindent that is they take as input something of type
+tokens). Lexers will still be needed to build a somewhat realistic
-\texttt{I} and return a set of pairs. The first component of
+compiler.
-these pairs corresponds to what the parser combinator was able
-to process from the input and the second is the unprocessed
-part of the input. As we shall see shortly, a parser
+In my Scala code I use the following polymorphic types for parser combinators:
-combinator might return more than one such pair, the idea
-being that there are potentially several ways of how to
+\begin{center}
-interpret the input. To simplify matters we will first look at
+input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
-the case where the input to the parser combinators is just
+\end{center}
-strings. As a concrete example, consider the string
+\noindent that is they take as input something of type \texttt{I} and
+return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs
+to be of ``sequence-kind'' I actually have to often write \texttt{I
+<\% Seq[\_]} for the input type in order to indicate it is a
+subtype of Scala sequences. The first component of the generated pairs
+corresponds to what the parser combinator was able to process from the
+input and the second is the unprocessed part of the input (therefore
+the type of this unprocessed part is the same as the input). As we
+shall see shortly, a parser combinator might return more than one such
+pair; the idea being that there are potentially several ways of how to
+parse the input.  As a concrete example, consider the string
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
 recognise anything from the input, then parser combinators just
 return the empty set $\{\}$. This will indicate something
 ``went wrong''\ldots or more precisely, nothing could be
 parsed.
-The idea of parser combinators is that we can easily build
+Also important to note is that the type \texttt{T} for the processed
-parser combinators out of smaller components following very
+part is different from the input type. The reason is that in general
-closely the structure of a grammar. In order to implement this
+we are interested in transform our input into something
-in an object-oriented programming language, like Scala, we
+``different''\ldots for example into a tree, or if we implement the
-need to specify an abstract class for parser combinators. This
+grammar for arithmetic expressions we might be interested in the
-abstract class requires the implementation of the function
+actual integer number the arithmetic expression, say \texttt{1 + 2 *
-\texttt{parse} taking an argument of type \texttt{I} and
+3}, stands for. In this way we can use parser combinators to
-returns a set of type \mbox{\texttt{Set[(T, I)]}}.
+implement relativaley easily a calculator.
+The main idea of parser combinators is that we can easily build parser
+combinators out of smaller components following very closely the
+structure of a grammar. In order to implement this in an
+object-oriented programming language, like Scala, we need to specify
+an abstract class for parser combinators. This abstract class states
+that the function \texttt{parse} takes an argument of type \texttt{I}
+and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
 \begin{center}
 \begin{lstlisting}[language=Scala]
 abstract class Parser[I, T] {
 def parse(ts: I): Set[(T, I)]
 yield head
 }
 \end{lstlisting}
 \end{center}
-\noindent From the function \texttt{parse} we can then
+\noindent It is the obligation in each instance (parser combinator) to
-``centrally'' derive the function \texttt{parse\_all}, which
+supply an implementation for \texttt{parse}.  From this function we
-just filters out all pairs whose second component is not empty
+can then ``centrally'' derive the function \texttt{parse\_all}, which
-(that is has still some unprocessed part). The reason is that
+just filters out all pairs whose second component is not empty (that
-at the end of the parsing we are only interested in the results
+is has still some unprocessed part). The reason is that at the end of
-where all the input has been consumed and no unprocessed part
+the parsing we are only interested in the results where all the input
-is left over.
+has been consumed and no unprocessed part is left over.
 One of the simplest parser combinators recognises just a
-character, say $c$, from the beginning of strings. Its
+single character, say $c$, from the beginning of strings. Its
 behaviour can be described as follows:
 \begin{itemize}
-\item if the head of the input string starts with a $c$, then returns
+\item If the head of the input string starts with a $c$, then return
-	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of}
+the set
-	$s$ is the unprocessed part of the input string
+\[\{(c, \textit{tail of}\; s)\}\]
-\item otherwise return the empty set $\{\}$
+where \textit{tail of}
+$s$ is the unprocessed part of the input string.
+\item Otherwise return the empty set $\{\}$.
 \end{itemize}
 \noindent
 The input type of this simple parser combinator for characters is
 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
 if (sb.head == c) Set((c, sb.tail)) else Set()
 }
 \end{lstlisting}
 \end{center}
-\noindent The \texttt{parse} function tests whether the first
+\noindent You can see the \texttt{parse} function tests whether the
-character of the input string \texttt{sb} is equal to
+first character of the input string \texttt{sb} is equal to
-\texttt{c}. If yes, then it splits the string into the
+\texttt{c}. If yes, then it splits the string into the recognised part
-recognised part \texttt{c} and the unprocessed part
+\texttt{c} and the unprocessed part \texttt{sb.tail}. In case
-\texttt{sb.tail}. In case \texttt{sb} does not start with
+\texttt{sb} does not start with \texttt{c} then the parser returns the
-\texttt{c} then the parser returns the empty set (in Scala
+empty set (in Scala \texttt{Set()}). Since this parser recognises
-\texttt{Set()}).
+characters and just returns characters as the processed part, the
+output type of the parser is \texttt{Char}.
-More interesting are the parser combinators that build larger
+If we want to parse a list of tokens and interested in recognising a
-parsers out of smaller component parsers. For example the
+number token, we could write something like this
-alternative parser combinator is as follows: given two
-parsers, say, $p$ and $q$, we apply both parsers to the
+\begin{center}
-input (remember parsers are functions) and combine the output
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
-(remember they are sets)
+case object NumParser extends Parser[List[Token], Int] {
+def parse(ts: List[Token]) = ts match {
+case Num_token(s)::ts => Set((s.toInt, ts))
+case _ => Set ()
+}
+}
+\end{lstlisting}
+\end{center}
+\noindent
+In this parser the input is of type \texttt{List[Token]}. The function
+parse looks at the input \texttt{ts} and checks whether the first
+token is a \texttt{Num\_token}. Let us assume our lexer generated
+these tokens for numbers. But this parser does not just return this
+token (and the rest of the list), like the \texttt{CharParser} above,
+rather extracts the string \texttt{s} from the token and converts it
+into an integer. The hope is that the lexer did its work well and this
+conversion always succeeds. The consequence of this is that the output
+type for this parser is \texttt{Int}. Such a conversion would be
+needed if we want to implement a simple calculator program.
+These simple parsers that just look at the input and do a simple
+transformation are often called \emph{atomic} parser combinators.
+More interesting are the parser combinators that build larger parsers
+out of smaller component parsers. For example the \emph{alternative
+parser combinator} is as follows: given two parsers, say, $p$ and
+$q$, we apply both parsers to the input (remember parsers are
+functions) and combine the output (remember they are sets of pairs)
 \begin{center}
 $p(\text{input}) \cup q(\text{input})$
 \end{center}
 def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
 }
 \end{lstlisting}
 \end{center}
-\noindent The types of this parser combinator are polymorphic
+\noindent The types of this parser combinator are again generic (we
-(we just have \texttt{I} for the input type, and \texttt{T}
+just have \texttt{I} for the input type, and \texttt{T} for the output
-for the output type). The alternative parser builds a new
+type). The alternative parser builds a new parser out of two existing
-parser out of two existing parsers \texttt{p} and \texttt{q}.
+parsers \texttt{p} and \texttt{q}.  Both need to be able to process
-Both need to be able to process input of type \texttt{I} and
+input of type \texttt{I} and return the same output type
-return the same output type \texttt{Set[(T,
+\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
-I)]}.\footnote{There is an interesting detail of Scala, namely
+Scala, namely the \texttt{=>} in front of the types of \texttt{p}
-the \texttt{=>} in front of the types of \texttt{p} and
+and \texttt{q}. They will prevent the evaluation of the arguments
-\texttt{q}. They will prevent the evaluation of the arguments
+before they are used. This is often called \emph{lazy evaluation} of
-before they are used. This is often called \emph{lazy
+the arguments. We will explain this later.}  Therefore the output
-evaluation} of the arguments. We will explain this later.} The alternative parser should
+type of this parser is \texttt{T}. The alternative parser should run
-run the input with the first parser \texttt{p} (producing a
+the input with the first parser \texttt{p} (producing a set of pairs)
-set of outputs) and then run the same input with \texttt{q}.
+and then run the same input with \texttt{q} (producing another set of
-The result should be then just the union of both sets, which
+pairs).  The result should be then just the union of both sets, which
 is the operation \texttt{++} in Scala.
 The alternative parser combinator already allows us to
 construct a parser that parses either a character \texttt{a}
 or \texttt{b}, as
 \begin{lstlisting}[language=Scala, numbers=none]
 new AltParser(CharParser('a'), CharParser('b'))
 \end{lstlisting}
 \end{center}
-\noindent Scala allows us to introduce some more readable
+\noindent Later on we will use again Scala mechanism for introducing some
-shorthand notation for this, like \texttt{'a' || 'b'}. We can
+more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings
-call this parser combinator with the strings
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$
+\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
 \noindent We receive in the first two cases a successful
 output (that is a non-empty set). In each case, either
 \pcode{a} or \pcode{b} is in the processed part, and
-\pcode{c} in the unprocessed part. Clearly this parser cannot
+\pcode{cde} in the unprocessed part. Clearly this parser cannot
-parse anything in the string \pcode{cc}, therefore the empty
+parse anything in the string \pcode{ccde}, therefore the empty
 set.
-A bit more interesting is the \emph{sequence parser
+A bit more interesting is the \emph{sequence parser combinator}. Given
-combinator}. Given two parsers, say, $p$ and $q$, apply first
+two parsers, say again, $p$ and $q$, we want to apply first the input
-the input to $p$ producing a set of pairs; then apply $q$ to
+to $p$ producing a set of pairs; then apply $q$ to all the unparsed
-all the unparsed parts; then combine the results like
+parts in the pairs; and then combine the results like
 \begin{center}
 \begin{tabular}{lcl}
 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
 $(\textit{output}_1, u_1) \in p(\text{input})
 \;\wedge\;$\\
-&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$
+&& $(\textit{output}_2, u_2) \in q(u_1)\}$
 \end{tabular}
 \end{center}
-\noindent
+\noindent Notice that the $p$ wil first be run on the input, producing
-This can be implemented in Scala as follows:
+pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
+for the unprocessed, or left-over, parts. We want that $q$ runs on all
+these unprocessed parts $u_1$. This again will produce some
+processed part , $p$ and
+$q$, we apply both parsers to the input (remember parsers are
+functions) and combine the output (remember they are sets of pairs)
+\begin{center}
+$p(\text{input}) \cup q(\text{input})$
+\end{center}
+\noindent In Scala we would implement alternative parser
+combinator as follows
+\begin{center}
+\begin{lstlisting}[language=Scala, numbers=none]
+class AltParser[I, T]
+(p: => Parser[I, T],
+q: => Parser[I, T]) extends Parser[I, T] {
+def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
+}
+\end{lstlisting}
+\end{center}
+\noindent The types of this parser combinator are again generic (we
+just have \texttt{I} for the input type, and \texttt{T} for the output
+type). The alternative parser builds a new parser out of two existing
+parsers \texttt{p} and \texttt{q}.  Both need to be able to process
+input of type \texttt{I} and return the same output type
+\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
+Scala, namely the \texttt{=>} in front of the types of \texttt{p}
+and \texttt{q}. They will prevent the evaluation of the arguments
+before they are used. This is often called \emph{lazy evaluation} of
+the arguments. We will explain this later.}  Therefore the output
+type of this parser is \texttt{T}. The alternative parser should run
+the input with the first parser \texttt{p} (producing a set of pairs)
+and then run the same input with \texttt{q} (producing another set of
+pairs).  The result should be then just the union of both sets, which
+is the operation \texttt{++} in Scala.
+The alternative parser combinator already allows us to
+construct a parser that parses either a character \texttt{a}
+or \texttt{b}, as
+\begin{center}
+\begin{lstlisting}[language=Scala, numbers=none]
+new AltParser(CharParser('a'), CharParser('b'))
+\end{lstlisting}
+\end{center}
+\noindent Later on we will use again Scala mechanism for introducing some
+more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings
+\begin{center}
+\begin{tabular}{rcl}
+input strings & & output\medskip\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
+\end{tabular}
+\end{center}
+\noindent We receive in the first two cases a successful
+output (that is a non-empty set). In each case, either
+\pcode{a} or \pcode{b} is in the processed part, and
+\pcode{cde} in the unprocessed part. Clearly this parser cannot
+parse anything in the string \pcode{ccde}, therefore the empty
+set.
+A bit more interesting is the \emph{sequence parser combinator}. Given
+two parsers, say again, $p$ and $q$, we want to apply first the input
+to $p$ producing a set of pairs; then apply $q$ to all the unparsed
+parts in the pairs; and then combine the results like
+\begin{center}
+\begin{tabular}{lcl}
+$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
+$(\textit{output}_1, u_1) \in p(\text{input})
+\;\wedge\;$\\
+&& $(\textit{output}_2, u_2) \in q(u_1)\}$
+\end{tabular}
+\end{center}
+\noindent Notice that the $p$ wil first be run on the input, producing
+pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The
+overall result of the sequence parser combinator is pairs of the form
+$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
+unprocessed parts of both parsers is the unprocessed parts the second
+parser $q$ produces as left-over. The processed parts of both parsers
+is just the pair of the outputs
+$(\textit{output}_1, \textit{output}_2)$. This behavious can be
+implemented in Scala as follows:
 \begin{center}
 \begin{lstlisting}[language=Scala,numbers=none]
 class SeqParser[I, T, S]
 (p: => Parser[I, T],
 yield ((output1, output2), u2)
 }
 \end{lstlisting}
 \end{center}
-\noindent This parser takes as input two parsers, \texttt{p}
+\noindent This parser takes again as input two parsers, \texttt{p}
 and \texttt{q}. It implements \texttt{parse} as follows: let
 first run the parser \texttt{p} on the input producing a set
 of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
 stands for the unprocessed parts left over by \texttt{p}. Let
 \texttt{q} run on these unprocessed parts producing again a
 \begin{center}
 \texttt{Set[((T, S), I)]}
 \end{center}
-Scala allows us to provide some shorthand notation for the
+\noindent
-sequence parser combinator. We can write for example
+If any of the runs of \textit{p} and \textit{q} fail, that is produce
-\pcode{'a' ~ 'b'}, which is the parser combinator that
+the empty set, then \texttt{parse} will also produce the empty set.
-first consumes the character \texttt{a} from a string and then
+Notice that we have now two output types for the sequence parser
-\texttt{b}. Three examples of this parser combinator are as
+combinator, because in general \textit{p} and \textit{q} might produce
-follows:
+differetn things (for example first we recognise a number and then a
+string corresponding to an operator).
+We have not yet looked at this in detail, but Scala allows us to
+provide some shorthand notation for the sequence parser combinator. We
+can write for example \pcode{"a" ~ "b"}, which is the parser
+combinator that first recognises the character \texttt{a} from a
+string and then \texttt{b}. Let us look again at three examples of
+how this parser combinator processes strings:
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
+\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
-\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
+\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
-\noindent A slightly more complicated parser is \pcode{('a'
+\noindent In the first line we have a sucessful parse, because the
-|| 'b') ~ 'b'} which parses as first character either an
+string started with \texttt{ab}, which is the prefix we are looking
-\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
+for. But since the parsing combinator is constructed as sequence of
-produces the following outputs.
+the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
+result is a nested pair of the form \texttt{((a, b), cde)}. It is
+\emph{not} a simple pair \texttt{(ab, cde)} as one might errorneously
+expects.  The parser returns the ampty set in the other examples,
+because they do not fit with what the parser is supposed to parse.
+A slightly more complicated parser is \pcode{("a" || "b") ~ "c"}.
+which parses as first character either an \texttt{a} or \texttt{b}
+followed by a \texttt{c}. This parser produces the following outputs.
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\

changeset 584	529460073b24
parent 392	2d0a59127694
child 585	6ee22f196884