diff -r 0451b8b67f62 -r c6a1e19e9801 handouts/ho06.tex
--- a/handouts/ho06.tex Thu Oct 25 06:53:16 2018 +0100
+++ b/handouts/ho06.tex Thu Oct 25 18:36:04 2018 +0100
@@ -32,15 +32,15 @@
\end{center}
\noindent
-Given the extended effort we have spent implementing a lexer
-in order to generate list of tokens, it might be surprising that in
-what follows we shall often use strings as input. This is for making
-the explanation more lucid. It does not make our previous work on
-lexers obsolete (remember they transform a string into a list of
-tokens). Lexers will still be needed for building a somewhat realistic
-compiler.
+Given the extended effort we have spent implementing a lexer in order
+to generate a list of tokens, it might be surprising that in what
+follows we shall often use strings as input, rather than lists of
+tokens. This is to make the explanation more lucid. It does not
+make our previous work on lexers obsolete (remember they transform a
+string into a list of tokens). Lexers will still be needed for
+building a somewhat realistic compiler.

-But as said, parser combinators are relatively agnostic about what
+As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

@@ -49,16 +49,16 @@
\end{center}

\noindent That is they take as input something of type \texttt{I} and
-return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
-to be of ``sequence-kind'', I actually have to often write \texttt{I
- <\% Seq[\_]} for the input type. This ensures the input is a
+return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
+needs to be of ``sequence-kind'', I actually often have to write
+\texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
subtype of Scala sequences. The first component of the generated pairs
corresponds to what the parser combinator was able to process from the
-input and the second is the unprocessed part of the input (therefore
-the type of this unprocessed part is the same as the input). As we
-shall see shortly, a parser combinator might return more than one such
-pair; the idea is that there are potentially several ways of how to
-parse the input. As a concrete example, consider the string
+input and the second is the unprocessed part, or leftover, of the
+input (therefore the type of this unprocessed part is the same as the
+input). A parser combinator might return more than one such pair; the
+idea is that there are potentially several ways of how to parse the
+input. As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

@@ -74,8 +74,8 @@
\end{center}

\noindent where the first pair means the parser could recognise
-\texttt{if} from the input and leaves the rest as `unprocessed' as the
-second component of the pair; in the other case it could recognise
+\texttt{if} from the input and leaves \texttt{foo\VS testbar} as the
+`unprocessed' part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
@@ -83,14 +83,15 @@
parsed.

Also important to note is that the type \texttt{T} for the processed
-part is different from the input type. In the example above is just
-happens to be the same.
The reason for the difference is that in
-general we are interested in transforming our input into something
-``different''\ldots for example into a tree; or if we implement the
-grammar for arithmetic expressions, we might be interested in the
-actual integer number the arithmetic expression, say \texttt{1 + 2 *
- 3}, stands for. In this way we can use parser combinators to
-implement relatively easily a calculator, for instance.
+part is different from the input type \texttt{I} of the parser. In the
+example above it just happens to be the same. The reason for the
+potential difference is that in general we are interested in
+transforming our input into something ``different''\ldots for example
+into a tree; or if we implement the grammar for arithmetic
+expressions, we might be interested in the actual integer number the
+arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
+we can use parser combinators to implement relatively easily a
+calculator, for instance.

The main idea of parser combinators is that we can easily build
parser combinators out of smaller components following very closely
the
@@ -103,10 +104,10 @@

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
- def parse(ts: I): Set[(T, I)]
+ def parse(in: I) : Set[(T, I)]

- def parse_all(ts: I): Set[T] =
- for ((head, tail) <- parse(ts); if (tail.isEmpty))
+ def parse_all(in: I) : Set[T] =
+ for ((head, tail) <- parse(in); if (tail.isEmpty))
 yield head
}
\end{lstlisting}
\end{center}
@@ -134,10 +135,9 @@
\end{itemize}

\noindent
-The input type of this simple parser combinator for characters is
-\texttt{String} and the output type is \texttt{Char}. This means
-\texttt{parse} returns \mbox{\texttt{Set[(Char, String)]}}.
-The code in Scala is as follows:
+The input type of this simple parser combinator is \texttt{String} and
+the output type is \texttt{Char}. This means \texttt{parse} returns
+\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -158,7 +158,7 @@
output type of the parser is \texttt{Char}.

If we want to parse a list of tokens and interested in recognising a
-number token, we could write something like this
+number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
@@ -177,12 +177,14 @@
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above,
-rather it extracts the string \texttt{s} from the token and converts it
-into an integer. The hope is that the lexer did its work well and this
-conversion always succeeds. The consequence of this is that the output
-type for this parser is \texttt{Int}, not \texttt{Token}. Such a
-conversion would be needed if we want to implement a simple calculator
-program.
+rather it also extracts the string \texttt{s} from the token and
+converts it into an integer. The hope is that the lexer did its work
+well and this conversion always succeeds. The consequence of this is
+that the output type for this parser is \texttt{Int}, not
+\texttt{Token}. Such a conversion would be needed if we want to
+implement a simple calculator program, because string-numbers need to
+be transformed into \texttt{Int}-numbers in order to do the
+calculations.
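+
+To make this conversion step concrete, here is a minimal
+self-contained sketch of how such a token parser could look and how it
+could be called (the \texttt{Token} datatype and the constructors
+below are merely assumed for illustration and might differ from the
+definitions our actual lexer uses):
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+// assumed tokens produced by the lexer (for illustration only)
+abstract class Token
+case class Num_token(s: String) extends Token
+case object Semi_token extends Token
+
+case object NumParserSketch extends Parser[List[Token], Int] {
+  def parse(in: List[Token]) : Set[(Int, List[Token])] = in match {
+    // extract the string s from the token and convert it into an Int
+    case Num_token(s) :: rest => Set((s.toInt, rest))
+    case _                    => Set()
+  }
+}
+
+// NumParserSketch.parse(List(Num_token("123"), Semi_token))
+// would produce Set((123, List(Semi_token)))
+\end{lstlisting}
+\end{center}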
These simple parsers that just look at the input and do a simple
@@ -214,18 +216,18 @@
\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
-parsers \texttt{p} and \texttt{q} given as arguments. Both parsers
-need to be able to process input of type \texttt{I} and return in
-\texttt{parse} the same output type \texttt{Set[(T,
+parsers \texttt{p} and \texttt{q} which are given as arguments. Both
+parsers need to be able to process input of type \texttt{I} and return
+in \texttt{parse} the same output type \texttt{Set[(T,
I)]}.\footnote{There is an interesting detail of Scala, namely the
 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
 will prevent the evaluation of the arguments before they are
 used. This is often called \emph{lazy evaluation} of the
- arguments. We will explain this later.} The alternative parser
-should run the input with the first parser \texttt{p} (producing a set
-of pairs) and then run the same input with \texttt{q} (producing
-another set of pairs). The result should be then just the union of
-both sets, which is the operation \texttt{++} in Scala.
+ arguments. We will explain this later.} The alternative parser runs
+the input with the first parser \texttt{p} (producing a set of pairs)
+and then runs the same input with \texttt{q} (producing another set of
+pairs). The result is then just the union of both sets, which
+is the operation \texttt{++} in Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
@@ -240,7 +242,7 @@
\noindent Later on we will use Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" |
"b"}. Let us look in detail at what this parser combinator produces
-with some sample strings
+with some sample strings.

\begin{center}
\begin{tabular}{rcl}
@@ -260,9 +262,9 @@

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
-to $p$ producing a set of pairs; then apply $q$ to all the unparsed
+to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we would
-write something like this for the expected set of pairs:
+write something like this for the result set of pairs:

\begin{center}
\begin{tabular}{lcl}
@@ -273,100 +275,18 @@
\end{tabular}
\end{center}

-\noindent Notice that the $p$ wil first be run on the input, producing
-pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
-for the unprocessed, or leftover, parts. We want that $q$ runs on all
-these unprocessed parts $u_1$.
This again will produce some -processed part , $p$ and -$q$, we apply both parsers to the input (remember parsers are -functions) and combine the output (remember they are sets of pairs) - -\begin{center} -$p(\text{input}) \cup q(\text{input})$ -\end{center} - -\noindent In Scala we would implement alternative parser -combinator as follows - -\begin{center} -\begin{lstlisting}[language=Scala, numbers=none] -class AltParser[I, T] - (p: => Parser[I, T], - q: => Parser[I, T]) extends Parser[I, T] { - def parse(in: I) = p.parse(in) ++ q.parse(in) -} -\end{lstlisting} -\end{center} - -\noindent The types of this parser combinator are again generic (we -just have \texttt{I} for the input type, and \texttt{T} for the output -type). The alternative parser builds a new parser out of two existing -parsers \texttt{p} and \texttt{q}. Both need to be able to process -input of type \texttt{I} and return the same output type -\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of - Scala, namely the \texttt{=>} in front of the types of \texttt{p} - and \texttt{q}. They will prevent the evaluation of the arguments - before they are used. This is often called \emph{lazy evaluation} of - the arguments. We will explain this later.} Therefore the output -type of this parser is \texttt{T}. The alternative parser should run -the input with the first parser \texttt{p} (producing a set of pairs) -and then run the same input with \texttt{q} (producing another set of -pairs). The result should be then just the union of both sets, which -is the operation \texttt{++} in Scala. - -The alternative parser combinator already allows us to -construct a parser that parses either a character \texttt{a} -or \texttt{b}, as - -\begin{center} -\begin{lstlisting}[language=Scala, numbers=none] -new AltParser(CharParser('a'), CharParser('b')) -\end{lstlisting} -\end{center} - -\noindent Later on we will use again Scala mechanism for introducing some -more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings - -\begin{center} -\begin{tabular}{rcl} -input strings & & output\medskip\\ -\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ -\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ -\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ -\end{tabular} -\end{center} - -\noindent We receive in the first two cases a successful -output (that is a non-empty set). In each case, either -\pcode{a} or \pcode{b} is in the processed part, and -\pcode{cde} in the unprocessed part. Clearly this parser cannot -parse anything in the string \pcode{ccde}, therefore the empty -set. - -A bit more interesting is the \emph{sequence parser combinator}. Given -two parsers, say again, $p$ and $q$, we want to apply first the input -to $p$ producing a set of pairs; then apply $q$ to all the unparsed -parts in the pairs; and then combine the results like - -\begin{center} -\begin{tabular}{lcl} -$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & -$(\textit{output}_1, u_1) \in p(\text{input}) -\;\wedge\;$\\ -&& $(\textit{output}_2, u_2) \in q(u_1)\}$ -\end{tabular} -\end{center} - \noindent Notice that the $p$ will first be run on the input, -producing pairs of the form $\textit{output}_1$ and unprocessed part -$u_1$. This unprocessed part is fed into the second parser $q$. 
The
-overall result of the sequence parser combinator is pairs of the form
+producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
+stands for the unprocessed, or leftover, parts of $p$. We want that
+$q$ runs on all these unprocessed parts $u_1$. Therefore these
+unprocessed parts are fed into the second parser $q$. The overall
+result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
-unprocessed part of both parsers is the unprocessed part the second
-parser $q$ leaves as leftover. The processed parts of both
-parsers is a pair consisting of the outputs of $p$ and $q$, namely
-$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
-implemented in Scala as follows:
+unprocessed part of the sequence parser combinator is the unprocessed
+part the second parser $q$ leaves as leftover. The processed parts of
+the two component parsers are returned as a pair consisting of the
+outputs of $p$ and $q$, namely $(\textit{output}_1,
+\textit{output}_2)$. This behaviour can be implemented in Scala as
+follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
@@ -395,16 +315,18 @@
combinator is as usual \texttt{I}, but the output type is

\begin{center}
-\texttt{Set[((T, S), I)]}
+\texttt{(T, S)}
\end{center}

\noindent
-If any of the runs of \textit{p} and \textit{q} fail, that is produce
-the empty set, then \texttt{parse} will also produce the empty set.
-Notice that we have now two output types for the sequence parser
-combinator, because in general \textit{p} and \textit{q} might produce
-different things (for example first we recognise a number and then a
-string corresponding to an operator).
+This means \texttt{parse} in the sequence parser combinator returns
+sets of type \texttt{Set[((T, S), I)]}. Notice that we have
+essentially two output types for the sequence parser combinator
+(packaged in a pair), because in general \textit{p} and \textit{q}
+might produce different things (for example first we recognise a
+number and then a string corresponding to an operator). If any of the
+runs of \textit{p} and \textit{q} fail, that is produce the empty set,
+then \texttt{parse} will also produce the empty set.

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
@@ -460,11 +382,11 @@

\noindent
The second and third example fail, because something is
-``missing'' in the sequence we are looking for. Also notice how the
-results nest with sequences: the parsed part is a nested pair of the
-form \pcode{((a, b), c)}. If we nest the sequence parser differently,
-for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
-differently
+``missing'' in the sequence we are looking for. The first succeeds, but
+notice how the results nest with sequences: the parsed part is a
+nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
+parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
+our output pairs nest differently

\begin{center}
\begin{tabular}{rcl}
@@ -510,19 +432,21 @@
\noindent
where the unprocessed (empty) parts have been stripped away from the
pairs; everything where the second part was not empty has
been thrown away as well, because they represent
-ultimately-unsuccessful-parses.
+ultimately-unsuccessful-parses. The main point is that the sequence
+parser combinator returns pairs that can nest according to the
+nesting of the component parsers.
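+
+To see this nesting without any shorthand notation, the examples can
+also be spelled out with explicit constructor calls. The following
+snippet assumes that the sequence parser combinator class from the
+listing above is called \texttt{SeqParser} and uses the character
+parser from earlier; the results in the comments are what one would
+expect under these assumptions:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+// ("a" ~ "b") ~ "c" written out with explicit constructors
+val left  = new SeqParser(new SeqParser(CharParser('a'), CharParser('b')),
+                          CharParser('c'))
+// "a" ~ ("b" ~ "c")
+val right = new SeqParser(CharParser('a'),
+                          new SeqParser(CharParser('b'), CharParser('c')))
+
+left.parse("abcde")    // Set(((('a', 'b'), 'c'), "de"))
+right.parse("abcde")   // Set((('a', ('b', 'c')), "de"))
+\end{lstlisting}
+\end{center}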
-
-Note carefully that constructing a parser such \pcode{"a" | ("a" ~
- "b")} will result in a typing error. The intention is that we want
-to parse an \texttt{a}, or an \texttt{a} followed by a
-\texttt{b}. However, the first parser has as output type a single
-character (recall the type of \texttt{CharParser}), but the second
-parser produces a pair of characters as output. The alternative parser
-is required to have both component parsers to have the same type---we
-need to be able to build the union of two sets, which means in Scala
-they need to be of the same type. Therefore the typing error.
-We will see later how we can build
+Consider also carefully that constructing a parser such as \pcode{"a" |
+ ("a" ~ "b")} will result in a typing error. The intention with this
+parser is that we want to parse an \texttt{a}, or an \texttt{a}
+followed by a \texttt{b}. However, the first parser has as output type
+a single character (recall the type of \texttt{CharParser}), but the
+second parser produces a pair of characters as output. The alternative
+parser requires both component parsers to have the same
+type---we need to be able to build the union of two sets, which means
+in Scala they need to be of the same type. Since they are not, there
+is a typing error in this example. We will see later how we can build
this parser without the typing error.

The next parser combinator, called \emph{semantic action}, does not
@@ -541,19 +465,18 @@
\end{center}

-\noindent This parser combinator takes a parser \texttt{p}
-with output type \texttt{T} as one argument as well as a
-function \texttt{f} with type \texttt{T => S}. The parser
-\texttt{p} produces sets of type \texttt{(T, I)}. The
-semantic action combinator then applies the function
-\texttt{f} to all the parser outputs. Since this function is of
-type \texttt{T => S}, we obtain a parser with output type
-\texttt{S}. Again Scala lets us introduce some shorthand
-notation for this parser combinator. Therefore we will write
-\texttt{p ==> f} for it.
+\noindent This parser combinator takes a parser \texttt{p} (with input
+type \texttt{I} and output type \texttt{T}) as one argument, as well as a
+function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
+produces sets of type \texttt{Set[(T, I)]}. The semantic action
+combinator then applies the function \texttt{f} to all the `processed'
+parser outputs. Since this function is of type \texttt{T => S}, we
+obtain a parser with output type \texttt{S}. Again Scala lets us
+introduce some shorthand notation for this parser
+combinator. Therefore we will write \texttt{p ==> f} for it.

What are semantic actions good for? Well, they allow you to transform
-the parsed input into a datastructure you can use for further
+the parsed input into datastructures you can use for further
processing. A simple example would be to transform parsed characters
into ASCII numbers. Suppose we define a function \texttt{f} (from
characters to ints) and use a \texttt{CharParser} for parsing
@@ -625,9 +548,9 @@

\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
-still a string (the double-quotes are not printed by Scala). We need
-to convert it into the corresponding \texttt{Int}. We can do this as
-follows using a semantic action
+still a string (the double-quotes indicating a string are not printed by
+Scala). We want to convert this string into the corresponding
+\texttt{Int}.
We can do this as follows using a semantic action \begin{center} \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] @@ -644,22 +567,33 @@ Before we proceed, let us just explain the shorthand notation for parser combinators. Like for regular expressions, the shorthand notation -will make our life much easier when writing actual parsers. +will make our life much easier when writing actual parsers. We can define +some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and +\pcode{p ==> f} as well as to use plain strings for specifying simple +string parsers. + +The idea is that this shorthand notation allows us to easily translate +context-free grammars into code. For example recall our context-free +grammar for palindromes: + +\begin{plstx}[margin=3cm] +: \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ +\end{plstx} + +\noindent +Each alternative in this grammar translates into an alternative parser +combinator. The $\cdot$ can be translated to a sequence parser +combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply +written as \texttt{"a"}, \texttt{"b"} and \texttt{""}. + \subsubsection*{How to build parsers using parser combinators?} The beauty of parser combinators is the ease with which they can be implemented and how easy it is to translate context-free grammars into code (though the grammars need to be non-left-recursive). To -demonstrate this recall our context-free grammar for palindromes: - -\begin{plstx}[margin=3cm] -: \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ -\end{plstx} - -\noindent -Given the parser combinators for alternatives and sequeneces, this grammar should be -straightforward to implement. The first idea would be +demonstrate this recall the grammar for palindromes from above. +The first idea would be to translate it into the following code \begin{center} \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] @@ -669,15 +603,15 @@ \end{center} \noindent -Unfortunately, this does not work as it produces a typing error. The -reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""} -all produce strings and therefore can be put into an alternative -\texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and -\pcode{"b" ~ Pal ~ "b"} produce pairs of the form -$(((\_, \_), \_), \_)$---that is how the sequence parser combinator -nests results when \pcode{\~} is used between two components. The -solution is to use a semantic action that ``flattens'' these pairs and -appends the corresponding strings, like +Unfortunately, this does not quite work yet as it produces a typing +error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and +\texttt{""} all produce strings as output type and therefore can be +put into an alternative \texttt{...| "a" | "b" | ""}. But both +\pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of +the form $(((\_, \_), \_), \_)$---that is how the sequence parser +combinator nests results when \pcode{\~} is used between two +components. The solution is to use a semantic action that ``flattens'' +these pairs and appends the corresponding strings, like \begin{center} \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] @@ -689,6 +623,7 @@ \end{center} \noindent +Now in all cases we have strings as output type for the parser variants. The semantic action