--- a/handouts/ho06.tex Thu Oct 25 06:53:16 2018 +0100
+++ b/handouts/ho06.tex Thu Oct 25 18:36:04 2018 +0100
@@ -32,15 +32,15 @@
\end{center}
\noindent
-Given the extended effort we have spent implementing a lexer
-in order to generate list of tokens, it might be surprising that in
-what follows we shall often use strings as input. This is for making
-the explanation more lucid. It does not make our previous work on
-lexers obsolete (remember they transform a string into a list of
-tokens). Lexers will still be needed for building a somewhat realistic
-compiler.
+Given the extended effort we have spent implementing a lexer in order
+to generate a list of tokens, it might be surprising that in what
+follows we shall often use strings as input, rather than lists of
+tokens. This is in order to make the explanation more lucid. It does
+not make our previous work on lexers obsolete (remember they transform
+a string into a list of tokens). Lexers will still be needed for
+building a somewhat realistic compiler.
-But as said, parser combinators are relatively agnostic about what
+As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:
@@ -49,16 +49,16 @@
\end{center}
\noindent That is they take as input something of type \texttt{I} and
-return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
-to be of ``sequence-kind'', I actually have to often write \texttt{I
- <\% Seq[\_]} for the input type. This ensures the input is a
+return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
+needs to be of ``sequence-kind'', I often have to write
+\texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
subtype of Scala sequences. The first component of the generated pairs
corresponds to what the parser combinator was able to process from the
-input and the second is the unprocessed part of the input (therefore
-the type of this unprocessed part is the same as the input). As we
-shall see shortly, a parser combinator might return more than one such
-pair; the idea is that there are potentially several ways of how to
-parse the input. As a concrete example, consider the string
+input and the second is the unprocessed part, or leftover, of the
+input (therefore the type of this unprocessed part is the same as the
+input). A parser combinator might return more than one such pair; the
+idea is that there are potentially several ways of parsing the
+input. As a concrete example, consider the string
\begin{center}
\tt\Grid{iffoo\VS testbar}
@@ -74,8 +74,8 @@
\end{center}
\noindent where the first pair means the parser could recognise
-\texttt{if} from the input and leaves the rest as `unprocessed' as the
-second component of the pair; in the other case it could recognise
+\texttt{if} from the input and leaves \texttt{foo\VS testbar} as the
+`unprocessed' part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
@@ -83,14 +83,15 @@
parsed.
Also important to note is that the type \texttt{T} for the processed
-part is different from the input type. In the example above is just
-happens to be the same. The reason for the difference is that in
-general we are interested in transforming our input into something
-``different''\ldots for example into a tree; or if we implement the
-grammar for arithmetic expressions, we might be interested in the
-actual integer number the arithmetic expression, say \texttt{1 + 2 *
- 3}, stands for. In this way we can use parser combinators to
-implement relatively easily a calculator, for instance.
+part is different from the input type \texttt{I} of the parser. In the
+example above it just happens to be the same. The reason for the
+potential difference is that in general we are interested in
+transforming our input into something ``different''\ldots for example
+into a tree; or if we implement the grammar for arithmetic
+expressions, we might be interested in the actual integer number the
+arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
+we can use parser combinators to implement relatively easily a
+calculator, for instance.
The main idea of parser combinators is that we can easily build parser
combinators out of smaller components following very closely the
@@ -103,10 +104,10 @@
\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
- def parse(ts: I): Set[(T, I)]
+ def parse(in: I) : Set[(T, I)]
- def parse_all(ts: I): Set[T] =
- for ((head, tail) <- parse(ts); if (tail.isEmpty))
+ def parse_all(in: I) : Set[T] =
+ for ((head, tail) <- parse(in); if (tail.isEmpty))
yield head
}
\end{lstlisting}
@@ -134,10 +135,9 @@
\end{itemize}
\noindent
-The input type of this simple parser combinator for characters is
-\texttt{String} and the output type is \texttt{Char}. This means
-\texttt{parse} returns \mbox{\texttt{Set[(Char, String)]}}.
-The code in Scala is as follows:
+The input type of this simple parser combinator is \texttt{String} and
+the output type is \texttt{Char}. This means \texttt{parse} returns
+\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -158,7 +158,7 @@
output type of the parser is \texttt{Char}.
-If we want to parse a list of tokens and interested in recognising a
+If we want to parse a list of tokens and are interested in recognising a
-number token, we could write something like this
+number token, for example, we could write something like this
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
@@ -177,12 +177,14 @@
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above,
-rather it extracts the string \texttt{s} from the token and converts it
-into an integer. The hope is that the lexer did its work well and this
-conversion always succeeds. The consequence of this is that the output
-type for this parser is \texttt{Int}, not \texttt{Token}. Such a
-conversion would be needed if we want to implement a simple calculator
-program.
+rather it also extracts the string \texttt{s} from the token and
+converts it into an integer. The hope is that the lexer did its work
+well and this conversion always succeeds. The consequence of this is
+that the output type for this parser is \texttt{Int}, not
+\texttt{Token}. Such a conversion would be needed if we want to
+implement a simple calculator program, because the strings
+representing numbers need to be converted into actual
+\texttt{Int}-numbers before we can calculate with them.
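+
+For instance, assuming the parser above is called \texttt{NumParser}
+and the lexer also produced tokens for operators (say
+\texttt{Op\_token}, a name made up here only for illustration), a run
+of this parser could look as follows:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+// a hypothetical token list produced by the lexer for "123 + ..."
+val toks = List(Num_token("123"), Op_token("+"))
+
+NumParser.parse(toks)
+// => Set((123, List(Op_token("+"))))
+\end{lstlisting}
+\end{center}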
These simple parsers that just look at the input and do a simple
@@ -214,18 +216,18 @@
\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
-parsers \texttt{p} and \texttt{q} given as arguments. Both parsers
-need to be able to process input of type \texttt{I} and return in
-\texttt{parse} the same output type \texttt{Set[(T,
+parsers \texttt{p} and \texttt{q} which are given as arguments. Both
+parsers need to be able to process input of type \texttt{I} and return
+in \texttt{parse} the same output type \texttt{Set[(T,
I)]}.\footnote{There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
will prevent the evaluation of the arguments before they are
used. This is often called \emph{lazy evaluation} of the
- arguments. We will explain this later.} The alternative parser
-should run the input with the first parser \texttt{p} (producing a set
-of pairs) and then run the same input with \texttt{q} (producing
-another set of pairs). The result should be then just the union of
-both sets, which is the operation \texttt{++} in Scala.
+ arguments. We will explain this later.} The alternative parser runs
+the input with the first parser \texttt{p} (producing a set of pairs)
+and then runs the same input with \texttt{q} (producing another set of
+pairs). The result is then just the union of both sets, which
+is the operation \texttt{++} in Scala.
The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
@@ -240,7 +242,7 @@
-\noindent Later on we will use Scala mechanism for introducing some
+\noindent Later on we will use a Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{"a" |
"b"}. Let us look in detail at what this parser combinator produces
-with some sample strings
+with some sample strings.
\begin{center}
\begin{tabular}{rcl}
@@ -260,9 +262,9 @@
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
-to $p$ producing a set of pairs; then apply $q$ to all the unparsed
+to $p$ producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we would
-write something like this for the expected set of pairs:
+write something like this for the result set of pairs:
\begin{center}
\begin{tabular}{lcl}
@@ -273,100 +275,18 @@
\end{tabular}
\end{center}
-\noindent Notice that the $p$ wil first be run on the input, producing
-pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
-for the unprocessed, or leftover, parts. We want that $q$ runs on all
-these unprocessed parts $u_1$. This again will produce some
-processed part , $p$ and
-$q$, we apply both parsers to the input (remember parsers are
-functions) and combine the output (remember they are sets of pairs)
-
-\begin{center}
-$p(\text{input}) \cup q(\text{input})$
-\end{center}
-
-\noindent In Scala we would implement alternative parser
-combinator as follows
-
-\begin{center}
-\begin{lstlisting}[language=Scala, numbers=none]
-class AltParser[I, T]
- (p: => Parser[I, T],
- q: => Parser[I, T]) extends Parser[I, T] {
- def parse(in: I) = p.parse(in) ++ q.parse(in)
-}
-\end{lstlisting}
-\end{center}
-
-\noindent The types of this parser combinator are again generic (we
-just have \texttt{I} for the input type, and \texttt{T} for the output
-type). The alternative parser builds a new parser out of two existing
-parsers \texttt{p} and \texttt{q}. Both need to be able to process
-input of type \texttt{I} and return the same output type
-\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
- Scala, namely the \texttt{=>} in front of the types of \texttt{p}
- and \texttt{q}. They will prevent the evaluation of the arguments
- before they are used. This is often called \emph{lazy evaluation} of
- the arguments. We will explain this later.} Therefore the output
-type of this parser is \texttt{T}. The alternative parser should run
-the input with the first parser \texttt{p} (producing a set of pairs)
-and then run the same input with \texttt{q} (producing another set of
-pairs). The result should be then just the union of both sets, which
-is the operation \texttt{++} in Scala.
-
-The alternative parser combinator already allows us to
-construct a parser that parses either a character \texttt{a}
-or \texttt{b}, as
-
-\begin{center}
-\begin{lstlisting}[language=Scala, numbers=none]
-new AltParser(CharParser('a'), CharParser('b'))
-\end{lstlisting}
-\end{center}
-
-\noindent Later on we will use again Scala mechanism for introducing some
-more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings
-
-\begin{center}
-\begin{tabular}{rcl}
-input strings & & output\medskip\\
-\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
-\end{tabular}
-\end{center}
-
-\noindent We receive in the first two cases a successful
-output (that is a non-empty set). In each case, either
-\pcode{a} or \pcode{b} is in the processed part, and
-\pcode{cde} in the unprocessed part. Clearly this parser cannot
-parse anything in the string \pcode{ccde}, therefore the empty
-set.
-
-A bit more interesting is the \emph{sequence parser combinator}. Given
-two parsers, say again, $p$ and $q$, we want to apply first the input
-to $p$ producing a set of pairs; then apply $q$ to all the unparsed
-parts in the pairs; and then combine the results like
-
-\begin{center}
-\begin{tabular}{lcl}
-$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
-$(\textit{output}_1, u_1) \in p(\text{input})
-\;\wedge\;$\\
-&& $(\textit{output}_2, u_2) \in q(u_1)\}$
-\end{tabular}
-\end{center}
-
\noindent Notice that the $p$ will first be run on the input,
-producing pairs of the form $\textit{output}_1$ and unprocessed part
-$u_1$. This unprocessed part is fed into the second parser $q$. The
-overall result of the sequence parser combinator is pairs of the form
+producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
+stands for the unprocessed, or leftover, parts of $p$. We want $q$ to
+run on all these unprocessed parts $u_1$. Therefore these
+unprocessed parts are fed into the second parser $q$. The overall
+result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
-unprocessed part of both parsers is the unprocessed part the second
-parser $q$ leaves as leftover. The processed parts of both
-parsers is a pair consisting of the outputs of $p$ and $q$, namely
-$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
-implemented in Scala as follows:
+unprocessed part of the sequence parser combinator is the unprocessed
+part the second parser $q$ leaves as leftover. The processed part of
+the combined parser is a pair consisting of the outputs of $p$
+and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This
+behaviour can be implemented in Scala as follows:
\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
@@ -395,16 +315,18 @@
combinator is as usual \texttt{I}, but the output type is
\begin{center}
-\texttt{Set[((T, S), I)]}
+\texttt{(T, S)}
\end{center}
\noindent
-If any of the runs of \textit{p} and \textit{q} fail, that is produce
-the empty set, then \texttt{parse} will also produce the empty set.
-Notice that we have now two output types for the sequence parser
-combinator, because in general \textit{p} and \textit{q} might produce
-different things (for example first we recognise a number and then a
-string corresponding to an operator).
+This means \texttt{parse} in the sequence parser combinator returns
+sets of type \texttt{Set[((T, S), I)]}. Notice that we have
+essentially two output types for the sequence parser combinator
+(packaged in a pair), because in general \textit{p} and \textit{q}
+might produce different things (for example first we recognise a
+number and then a string corresponding to an operator). If any of the
+runs of \textit{p} and \textit{q} fail, that is, produce the empty set,
+then \texttt{parse} will also produce the empty set.
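+
+Written with the explicit parser classes (that is, before any
+shorthand notation is introduced), a sequence of two character parsers
+can be constructed and run as in the following sketch:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+// recognise an 'a' followed by a 'b'
+val ab = new SeqParser(CharParser('a'), CharParser('b'))
+
+ab.parse("abcde")   // => Set((('a','b'), "cde"))
+ab.parse("acde")    // => Set()  (the second parser fails on "cde")
+\end{lstlisting}
+\end{center}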
With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
@@ -460,11 +382,11 @@
\noindent The second and third example fail, because something is
-``missing'' in the sequence we are looking for. Also notice how the
-results nest with sequences: the parsed part is a nested pair of the
-form \pcode{((a, b), c)}. If we nest the sequence parser differently,
-for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
-differently
+``missing'' in the sequence we are looking for. The first succeeds, but
+notice how the results nest with sequences: the parsed part is a
+nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
+parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
+our output pairs nest differently
\begin{center}
\begin{tabular}{rcl}
@@ -510,19 +432,21 @@
\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as well, because they represent
-ultimately-unsuccessful-parses.
+ultimately-unsuccessful-parses. The main point is that the sequence
+parser combinator returns pairs that can nest according to the
+nesting of the component parsers.
-Note carefully that constructing a parser such \pcode{"a" | ("a" ~
- "b")} will result in a typing error. The intention is that we want
-to parse an \texttt{a}, or an \texttt{a} followed by a
-\texttt{b}. However, the first parser has as output type a single
-character (recall the type of \texttt{CharParser}), but the second
-parser produces a pair of characters as output. The alternative parser
-is required to have both component parsers to have the same type---we
-need to be able to build the union of two sets, which means in Scala
-they need to be of the same type. Therefore the typing error.
-We will see later how we can build
+Consider also carefully that constructing a parser such as \pcode{"a" |
+  ("a" ~ "b")} will result in a typing error. The intention with this
+parser is that we want to parse an \texttt{a}, or an \texttt{a}
+followed by a \texttt{b}. However, the first parser has as output type
+a single character (recall the type of \texttt{CharParser}), but the
+second parser produces a pair of characters as output. The alternative
+parser requires both component parsers to have the same
+type---we need to be able to build the union of two sets, which means
+in Scala they need to be of the same type. Since they are not, there
+is a typing error in this example. We will see later how we can build
this parser without the typing error.
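+
+To see the type clash concretely, here is a small sketch using the
+explicit parser classes from above; the last line is rejected by
+Scala's type checker, which is exactly the point:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+val p = CharParser('a')                            // output type Char
+val q = new SeqParser(CharParser('a'),
+                      CharParser('b'))             // output type (Char, Char)
+
+new AltParser(p, q)  // type error: Char and (Char, Char) do not agree
+\end{lstlisting}
+\end{center}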
The next parser combinator, called \emph{semantic action}, does not
@@ -541,19 +465,18 @@
\end{center}
-\noindent This parser combinator takes a parser \texttt{p}
-with output type \texttt{T} as one argument as well as a
-function \texttt{f} with type \texttt{T => S}. The parser
-\texttt{p} produces sets of type \texttt{(T, I)}. The
-semantic action combinator then applies the function
-\texttt{f} to all the parser outputs. Since this function is of
-type \texttt{T => S}, we obtain a parser with output type
-\texttt{S}. Again Scala lets us introduce some shorthand
-notation for this parser combinator. Therefore we will write
-\texttt{p ==> f} for it.
+\noindent This parser combinator takes a parser \texttt{p} (with input
+type \texttt{I} and output type \texttt{T}) as one argument, as well as a
+function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
+produces sets of type \texttt{Set[(T, I)]}. The semantic action
+combinator then applies the function \texttt{f} to all the `processed'
+parser outputs. Since this function is of type \texttt{T => S}, we
+obtain a parser with output type \texttt{S}. Again Scala lets us
+introduce some shorthand notation for this parser
+combinator. Therefore we will write \texttt{p ==> f} for it.
What are semantic actions good for? Well, they allow you to transform
-the parsed input into a datastructure you can use for further
+the parsed input into datastructures you can use for further
processing. A simple example would be to transform parsed characters
into ASCII numbers. Suppose we define a function \texttt{f} (from
characters to ints) and use a \texttt{CharParser} for parsing
@@ -625,9 +548,9 @@
\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
-still a string (the double-quotes are not printed by Scala). We need
-to convert it into the corresponding \texttt{Int}. We can do this as
-follows using a semantic action
+still a string (the surrounding double-quotes are not printed by
+Scala). We want to convert this string into the corresponding
+\texttt{Int}. We can do this as follows using a semantic action
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -644,22 +567,33 @@
Before we proceed, let us just explain the shorthand notation for
parser combinators. Like for regular expressions, the shorthand notation
-will make our life much easier when writing actual parsers.
+will make our life much easier when writing actual parsers. We can define
+some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and
+\pcode{p ==> f} as well as to use plain strings for specifying simple
+string parsers.
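+For example, a sketch of such implicit definitions (mirroring the code
+in \texttt{comb1.scala}) looks as follows:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+// lifts a plain string into a parser for exactly that string
+implicit def string2parser(s: String) = StringParser(s)
+
+// adds the infix operations |, ==> and ~ to every parser
+implicit def ParserOps[I <% Seq[_], T](p: Parser[I, T]) = new {
+  def | (q : => Parser[I, T]) = new AltParser[I, T](p, q)
+  def ==>[S] (f: => T => S) = new FunParser[I, T, S](p, f)
+  def ~[S] (q : => Parser[I, S]) = new SeqParser[I, T, S](p, q)
+}
+\end{lstlisting}
+\end{center}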
+
+The idea is that this shorthand notation allows us to easily translate
+context-free grammars into code. For example, recall our context-free
+grammar for palindromes:
+
+\begin{plstx}[margin=3cm]
+: \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
+\end{plstx}
+
+\noindent
+Each alternative in this grammar translates into an alternative parser
+combinator. The $\cdot$ can be translated to a sequence parser
+combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
+written as \texttt{"a"}, \texttt{"b"} and \texttt{""}.
+
\subsubsection*{How to build parsers using parser combinators?}
The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
code (though the grammars need to be non-left-recursive). To
-demonstrate this recall our context-free grammar for palindromes:
-
-\begin{plstx}[margin=3cm]
-: \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
-\end{plstx}
-
-\noindent
-Given the parser combinators for alternatives and sequeneces, this grammar should be
-straightforward to implement. The first idea would be
+demonstrate this, recall the grammar for palindromes from above.
+The first idea would be to translate it into the following code
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -669,15 +603,15 @@
\end{center}
\noindent
-Unfortunately, this does not work as it produces a typing error. The
-reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""}
-all produce strings and therefore can be put into an alternative
-\texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and
-\pcode{"b" ~ Pal ~ "b"} produce pairs of the form
-$(((\_, \_), \_), \_)$---that is how the sequence parser combinator
-nests results when \pcode{\~} is used between two components. The
-solution is to use a semantic action that ``flattens'' these pairs and
-appends the corresponding strings, like
+Unfortunately, this does not quite work yet as it produces a typing
+error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
+\texttt{""} all produce strings as output type and therefore can be
+put into an alternative \texttt{...| "a" | "b" | ""}. But both
+\pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of
+the form $((\_, \_), \_)$---that is how the sequence parser
+combinator nests results when \pcode{\~} is used between two
+components. The solution is to use a semantic action that ``flattens''
+these pairs and appends the corresponding strings, like
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -689,6 +623,7 @@
\end{center}
\noindent
+Now in all cases we have strings as the output type for the parser variants.
The semantic action
--- a/progs/comb1.scala Thu Oct 25 06:53:16 2018 +0100
+++ b/progs/comb1.scala Thu Oct 25 18:36:04 2018 +0100
@@ -38,6 +38,7 @@
}
import scala.util.matching.Regex
+
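+// a parser that tries to match the regular expression reg
+// at the start of the input string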
case class RegexParser(reg: Regex) extends Parser[String, String] {
def parse(sb: String) = reg.findPrefixMatchOf(sb) match {
case None => Set()
@@ -51,17 +52,17 @@
// convenience
implicit def string2parser(s: String) = StringParser(s)
-implicit def char2parser(c: Char) = CharParser(c)
+//implicit def char2parser(c: Char) = CharParser(c)
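+// ParserOps below adds the infix operations |, ==> and ~ to every parser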
implicit def ParserOps[I<% Seq[_], T](p: Parser[I, T]) = new {
- def || (q : => Parser[I, T]) = new AltParser[I, T](p, q)
+ def | (q : => Parser[I, T]) = new AltParser[I, T](p, q)
def ==>[S] (f: => T => S) = new FunParser[I, T, S](p, f)
def ~[S] (q : => Parser[I, S]) = new SeqParser[I, T, S](p, q)
}
implicit def StringOps(s: String) = new {
- def || (q : => Parser[String, String]) = new AltParser[String, String](s, q)
- def || (r: String) = new AltParser[String, String](s, r)
+ def | (q : => Parser[String, String]) = new AltParser[String, String](s, q)
+ def | (r: String) = new AltParser[String, String](s, r)
def ==>[S] (f: => String => S) = new FunParser[String, String, S](s, f)
def ~[S] (q : => Parser[String, S]) =
new SeqParser[String, String, S](s, q)