afl-material: comparison handouts/ho06.tex

equal deleted inserted replaced

-:7f8516ff408d
+:31295bb945c6
 \begin{document}
 \section*{Handout 6 (Parser Combinators)}
-In what follows we explain \emph{parser combinators}, because
+In what follows we explain \emph{parser combinators}. Their
-they are very easy to implement. However, they only work when
+distinguishing feature is that they are very easy to
-the grammar to be parsed is \emph{not} left-recursive and they
+implement. However, they only work when the grammar to be
-are efficient only when the grammar is unambiguous. It is the
+parsed is \emph{not} left-recursive and they are efficient
-responsibility of the grammar designer to ensure these two
+only when the grammar is unambiguous. It is the responsibility
-properties.
+of the grammar designer to ensure these two properties.
 Parser combinators can deal with any kind of input as long as
 this input is a kind of sequence, for example a string or a
-list of tokens. If the input are lists of tokens, then parser
+list of tokens. The general idea behind parser combinators is
-combinators transform them into sets of pairs, like so
+to transform the input into sets of pairs, like so
 \begin{center}
 $\underbrace{\text{list of tokens}}_{\text{input}}$
 $\Rightarrow$
 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
 \end{center}
-\noindent In Scala parser combinators will therefore be
+\noindent In Scala parser combinators are functions of type
-functions of type
 \begin{center}
 \texttt{I $\Rightarrow$ Set[(T, I)]}
 \end{center}
 \noindent that is they take as input something of type
 \texttt{I} and return a set of pairs. The first component of
 these pairs corresponds to what the parser combinator was able
 to process from the input and the second is the unprocessed
 part of the input. As we shall see shortly, a parser
-combinator might return more than one such pair, with the idea
+combinator might return more than one such pair, the idea
-that there are potentially several ways how to interpret the
+being that there are potentially several ways how to interpret
-input. To simplify matters we will first look at the case
+the input. To simplify matters we will first look at the case
 where the input to the parser combinators is just strings. As
-a concrete example, consider the case where the input is the
+a concrete example, consider the string
-string
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
 \noindent We might have a parser combinator which tries to
-interpret this string as a keyword (\texttt{if}) or an
+interpret this string as a keyword (\texttt{if}) or as an
 identifier (\texttt{iffoo}). Then the output will be the set
 \begin{center}
-$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
+$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
-\left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
+\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
 \end{center}
 \noindent where the first pair means the parser could
 recognise \texttt{if} from the input and leaves the rest as
 `unprocessed' as the second component of the pair; in the
 other case it could recognise \texttt{iffoo} and leaves
 \texttt{\VS testbar} as unprocessed. If the parser cannot
-recognise anything from the input then parser combinators just
+recognise anything from the input, then parser combinators just
 return the empty set $\{\}$. This will indicate something
 ``went wrong''\ldots or more precisely, nothing could be
 parsed.
 The main attraction of parser combinators is that we can
 \noindent From the function \texttt{parse} we can then
 ``centrally'' derive the function \texttt{parse\_all}, which
 just filters out all pairs whose second component is not empty
 (that is has still some unprocessed part). The reason is that
-at the end of parsing we are only interested in the results
+at the end of the parsing we are only interested in the results
 where all the input has been consumed and no unprocessed part
-is left.
+is left over.
 One of the simplest parser combinators recognises just a
 character, say $c$, from the beginning of strings. Its
 behaviour can be described as follows:
 \begin{itemize}
-\item if the head of the input string starts with a $c$, it returns
+\item if the head of the input string starts with a $c$, then returns
 	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of}
 	$s$ is the unprocessed part of the input string
-\item otherwise it returns the empty set $\{\}$
+\item otherwise return the empty set $\{\}$
 \end{itemize}
 \noindent
 The input type of this simple parser combinator for characters is
 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
 More interesting are the parser combinators that build larger
 parsers out of smaller component parsers. For example the
 alternative parser combinator is as follows: given two
 parsers, say, $p$ and $q$, then we apply both parsers to the
-input (remember parsers are functions) and combine the input
+input (remember parsers are functions) and combine the output
+(remember they are sets)
 \begin{center}
 $p(\text{input}) \cup q(\text{input})$
 \end{center}
-\noindent
+\noindent In Scala we would implement alternative parser
-In Scala we would implement alternative parsers as follows
+combinator as follows
 \begin{center}
 \begin{lstlisting}[language=Scala, numbers=none]
 class AltParser[I, T]
 (p: => Parser[I, T],
 \end{center}
 \noindent The types of this parser combinator are polymorphic
 (we just have \texttt{I} for the input type, and \texttt{T}
 for the output type). The alternative parser builds a new
-parser out of two existing parser combinator \texttt{p} and
+parser out of two existing parsers \texttt{p} and \texttt{q}.
-\texttt{q}. Both need to be able to process input of type
+Both need to be able to process input of type \texttt{I} and
-\texttt{I} and return the same output type \texttt{Set[(T,
+return the same output type \texttt{Set[(T,
-I)]}. (There is an interesting detail of Scala, namely the
+I)]}.\footnote{There is an interesting detail of Scala, namely
-\texttt{=>} in front of the types of \texttt{p} and
+the \texttt{=>} in front of the types of \texttt{p} and
 \texttt{q}. They will prevent the evaluation of the arguments
 before they are used. This is often called \emph{lazy
-evaluation} of the arguments.) The alternative parser should
+evaluation} of the arguments. We will explain this later.} The alternative parser should
 run the input with the first parser \texttt{p} (producing a
 set of outputs) and then run the same input with \texttt{q}.
 The result should be then just the union of both sets, which
 is the operation \texttt{++} in Scala.
-This parser combinator already allows us to construct a parser
+The alternative parser combinator already allows us to
-that either a character \texttt{a} or \texttt{b}, as
+construct a parser that parses either a character \texttt{a}
+or \texttt{b}, as
 \begin{center}
 \begin{lstlisting}[language=Scala, numbers=none]
 new AltParser(CharParser('a'), CharParser('b'))
 \end{lstlisting}
 \end{tabular}
 \end{center}
 \noindent We receive in the first two cases a successful
 output (that is a non-empty set). In each case, either
-\pcode{a} or \pcode{b} are the processed parts, and \pcode{c}
+\pcode{a} or \pcode{b} are in the processed parts, and
-is the unprocessed part. Clearly this parser cannot parse
+\pcode{c} in the unprocessed part. Clearly this parser cannot
-anything with the string \pcode{cc}.
+parse anything in the string \pcode{cc}, therefore the empty
+set.
 A bit more interesting is the \emph{sequence parser
-combinator}. Given two parsers, say, $p$ and $q$,
+combinator}. Given two parsers, say, $p$ and $q$, apply first
-apply first the input to $p$ producing a set of pairs;
+the input to $p$ producing a set of pairs; then apply $q$ to
-then apply $q$ to the unparsed parts; then combine the
+all the unparsed parts; then combine the results like
-results like
 \begin{center}
 \begin{tabular}{lcl}
 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
 $(\textit{output}_1, u_1) \in p(\text{input})
 \;\wedge\;$\\
-&& $(\textit{output}_2, u_2) \in q(u_1)\}$
+&& $\;(\textit{output}_2, u_2) \in q(u_1)\}$
 \end{tabular}
 \end{center}
 \noindent
 This can be implemented in Scala as follows:
 \begin{lstlisting}[language=Scala,numbers=none]
 class SeqParser[I, T, S]
 (p: => Parser[I, T],
 q: => Parser[I, S]) extends Parser[I, (T, S)] {
 def parse(sb: I) =
-for ((head1, tail1) <- p.parse(sb);
+for ((output1, u1) <- p.parse(sb);
-(head2, tail2) <- q.parse(tail1))
+(output2, u2) <- q.parse(u1))
-yield ((head1, head2), tail2)
+yield ((output1, output2), u2)
 }
 \end{lstlisting}
 \end{center}
 \noindent This parser takes as input two parsers, \texttt{p}
 and \texttt{q}. It implements \texttt{parse} as follows: let
 first run the parser \texttt{p} on the input producing a set
-of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1}
+of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
 stands for the unprocessed parts left over by \texttt{p}. Let
 \texttt{q} run on these unprocessed parts producing again a
 set of pairs. The output of the sequence parser combinator is
 then a set containing pairs where the first components are
 again pairs, namely what the first parser could parse together
 \begin{center}
 \texttt{Set[((T, S), I)]}
 \end{center}
 Scala allows us to provide some shorthand notation for the
-sequence parser combinator. So we can write for example
+sequence parser combinator. We can write for example
-\texttt{'a' $\sim$ 'b'}, which is the parser combinator that
+\pcode{'a' ~ 'b'}, which is the parser combinator that
 first consumes the character \texttt{a} from a string and then
 \texttt{b}. Three examples of this parser combinator are as
 follows:
 \begin{center}
 \texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
 \texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
-\noindent A slightly more complicated parser is \texttt{('a'
+\noindent A slightly more complicated parser is \pcode{('a'
-|| 'b') $\sim$ 'b'} which parses as first character either an
+|| 'b') ~ 'b'} which parses as first character either an
 \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
-produces the following results.
+produces the following outputs.
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 \texttt{\Grid{aaaa}} & $\rightarrow$ &
 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
 \end{tabular}
 \end{center}
-\noindent Notice how the results nest deeper as pairs (the
+\noindent Notice how the results nest deeper and deeper as
-last \pcode{a} is in the unprocessed part). To consume
+pairs (the last \pcode{a} is in the unprocessed part). To
-everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~
+consume everything of this string we can use the parser
-'a'}. Then the output is as follows:
+\pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is as
+follows:
 \begin{center}
 \begin{tabular}{rcl}
 input string & & output\medskip\\
 \texttt{\Grid{aaaa}} & $\rightarrow$ &
 \noindent This is an instance where the parser consumed
 completely the input, meaning the unprocessed part is just the
 empty string.
-Note carefully that constructing a parser such \texttt{'a' ||
+Note carefully that constructing a parser such \pcode{'a' ||
-('a' $\sim$ 'b')} will result in a tying error. The first
+('a' ~ 'b')} will result in a typing error. The first
 parser has as output type a single character (recall the type
 of \texttt{CharParser}), but the second parser produces a pair
 of characters as output. The alternative parser is however
 required to have both component parsers to have the same type.
 We will see later how we can build this parser without the
 typing error.
 The next parser combinator does not actually combine smaller
-parsers, but applies a function to the result of the parser.
+parsers, but applies a function to the result of a parser.
 It is implemented in Scala as follows
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 class FunParser[I, T, S]
 }
 \end{lstlisting}
 \end{center}
-\noindent
+\noindent This parser combinator takes a parser \texttt{p}
-This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
+with output type \texttt{T} as one argument as well as a
-input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
+function \texttt{f} with type \texttt{T => S}. The parser
-produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
+\texttt{p} produces sets of type \texttt{(T, I)}. The
-applies the function \texttt{f} to all the parer outputs. Since this function
+\texttt{FunParser} combinator then applies the function
-is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
+\texttt{f} to all the parser outputs. Since this function is of
-Again Scala lets us introduce some shorthand notation for this parser combinator.
+type \texttt{T => S}, we obtain a parser with output type
-Therefore we will write \texttt{p ==> f} for it.
+\texttt{S}. Again Scala lets us introduce some shorthand
+notation for this parser combinator. Therefore we will write
+\texttt{p ==> f} for it.
+\subsubsection*{How to build parsers using parser combinators?}
+\subsubsection*{Implementing an Interpreter}
 %\bigskip
 %takes advantage of the full generality---have a look
 %what it produces if we call it with the string \texttt{abc}
 %

changeset 386	31295bb945c6
parent 385	7f8516ff408d
child 392	2d0a59127694