--- a/handouts/ho06.tex Mon Oct 02 23:10:56 2023 +0100
+++ b/handouts/ho06.tex Tue Oct 03 14:29:12 2023 +0100
@@ -37,10 +37,11 @@
Given the extended effort we have spent implementing a lexer in order
to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
-tokens. This is for making the explanation more lucid and for quick
-examples. It does not make our previous work on lexers obsolete
+tokens. This is for making the explanation more lucid and ensure the
+examples are simple. It does not make our previous work on lexers obsolete
(remember they transform a string into a list of tokens). Lexers will
-still be needed for building a somewhat realistic compiler.
+still be needed for building a somewhat realistic compiler. See also
+a question in the homework regarding this issue.
As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
@@ -53,14 +54,16 @@
\noindent That is they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I actually have to often write
-\texttt{I <\% Seq[\_]} for the input type. This ensures the
-input is a subtype of Scala sequences. The first component of the
-generated pairs corresponds to what the parser combinator was able to
-parse from the input and the second is the unprocessed, or
-leftover, part of the input (therefore the type of this unprocessed part is
-the same as the input). A parser combinator might return more than one
-such pair; the idea is that there are potentially several ways of how
-to parse the input. As a concrete example, consider the string
+\code{(using is: I => Seq[_])} for the input type. This ensures the
+input is a subtype of Scala sequences.\footnote{This is a new feature
+ in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties
+ with running my code.} The first component of the generated pairs
+corresponds to what the parser combinator was able to parse from the
+input and the second is the unprocessed, or leftover, part of the
+input (therefore the type of this unprocessed part is the same as the
+input). A parser combinator might return more than one such pair; the
+idea is that there are potentially several ways of how to parse the
+input. As a concrete example, consider the string
\begin{center}
\tt\Grid{iffoo\VS testbar}
@@ -108,12 +111,12 @@
\begin{center}
\begin{lstlisting}[language=Scala]
-abstract class Parser[I, T] {
- def parse(in: I) : Set[(T, I)]
+abstract class Parser[I, T](using is: I => Seq[_]) {
+ def parse(in: I): Set[(T, I)]
def parse_all(in: I) : Set[T] =
- for ((head, tail) <- parse(in); if (tail.isEmpty))
- yield head
+ for ((hd, tl) <- parse(in);
+ if is(tl).isEmpty) yield hd
}
\end{lstlisting}
\end{center}
@@ -131,9 +134,9 @@
behaviour can be described as follows:
\begin{itemize}
-\item If the head of the input string starts with a $c$, then return
+\item If the head of the input string $s$ starts with a $c$, then return
the set
- \[\{(c, \textit{tail of}\; s)\}\]
+ \[\{(c, \textit{tail-of-}s)\}\]
where \textit{tail of}
$s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.
@@ -147,17 +150,17 @@
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
- def parse(in: String) =
- if (in.head == c) Set((c, in.tail)) else Set()
+ def parse(s: String) =
+ if (s != "" && s.head == c) Set((c, s.tail)) else Set()
}
\end{lstlisting}
\end{center}
-\noindent You can see \texttt{parse} tests whether the
-first character of the input string \texttt{in} is equal to
+\noindent You can see \texttt{parse} tests here whether the
+first character of the input string \texttt{s} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
-\texttt{c} and the unprocessed part \texttt{in.tail}. In case
-\texttt{in} does not start with \texttt{c} then the parser returns the
+\texttt{c} and the unprocessed part \texttt{s.tail}. In case
+\texttt{s} does not start with \texttt{c} then the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.
@@ -169,7 +172,7 @@
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
def parse(ts: List[Token]) = ts match {
- case Num_token(s)::ts => Set((s.toInt, ts))
+ case Num_token(s)::rest => Set((s.toInt, rest))
case _ => Set ()
}
}
@@ -186,10 +189,9 @@
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
-\texttt{Token}. Such a conversion would be needed if we want to
-implement a simple calculator program, because string-numbers need to
-be transformed into \texttt{Int}-numbers in order to do the
-calculations.
+\texttt{Token}. Such a conversion would be needed in our parser,
+because when we encounter a number in our program, we want to do
+some calculations based on integers, not strings (or tokens).
These simple parsers that just look at the input and do a simple
@@ -211,10 +213,11 @@
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
- (p: => Parser[I, T],
- q: => Parser[I, T]) extends Parser[I, T] {
- def parse(in: I) = p.parse(in) ++ q.parse(in)
-}
+ (p: => Parser[I, T],
+ q: => Parser[I, T])(using I => Seq[_])
+ extends Parser[I, T] {
+ def parse(in: I) = p.parse(in) ++ q.parse(in)
+}
\end{lstlisting}
\end{center}
@@ -225,7 +228,7 @@
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type \texttt{Set[(T,
I)]}.\footnote{There is an interesting detail of Scala, namely the
- \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
+ \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
will prevent the evaluation of the arguments before they are
used. This is often called \emph{lazy evaluation} of the
arguments. We will explain this later.} The alternative parser runs
@@ -236,7 +239,11 @@
The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
-\texttt{CharParser} shown above. For this we can write
+\texttt{CharParser} shown above. For this we can write\footnote{Note
+ that we cannot use a \texttt{case}-class for \texttt{AltParser}s
+ because of the problem with laziness and Scala quirks. Hating
+ \texttt{new} like the plague, we will work around this later with
+ some syntax tricks. ;o)}
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
@@ -246,24 +253,25 @@
\noindent Later on we will use Scala mechanism for introducing some
more readable shorthand notation for this, like \texttt{p"a" ||
- p"b"}. Let us look in detail at what this parser combinator produces
+ p"b"}. But first let us look in detail at what this parser combinator produces
with some sample strings.
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
-\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
-\noindent We receive in the first two cases a successful
-output (that is a non-empty set). In each case, either
-\pcode{a} or \pcode{b} is in the parsed part, and
-\pcode{cde} in the unprocessed part. Clearly this parser cannot
-parse anything with \pcode{ccde}, therefore the empty
-set is returned.
+\noindent We receive in the first two cases a successful output (that
+is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
+the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
+parser cannot parse anything of the form \pcode{ccde}, therefore the
+empty set is returned in the last case. Observe that parser
+combinators only look at the beginning of the given input: they do not
+fish out something in the ``middle'' of the input.
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to apply first the input
@@ -298,7 +306,8 @@
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
(p: => Parser[I, T],
- q: => Parser[I, S]) extends Parser[I, (T, S)] {
+ q: => Parser[I, S])(using I => Seq[_])
+ extends Parser[I, (T, S)] {
def parse(in: I) =
for ((output1, u1) <- p.parse(in);
(output2, u2) <- q.parse(u1))
@@ -312,7 +321,7 @@
parser \texttt{p} on the input producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
-several such pairs). Let then \texttt{q} run on these unprocessed
+several such pairs). Let then \texttt{q} run on these unprocessed
parts producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
@@ -339,13 +348,16 @@
With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
is the parser combinator that first recognises the character
-\texttt{a} from a string and then \texttt{b}. Let us look again at
-some examples of how this parser combinator processes some strings:
+\texttt{a} from a string and then \texttt{b}. (Actually, we will be
+able to write just \pcode{p"ab"} for such parsers, but it is good to
+ understand first what happens behind the scenes.) Let us look again
+ at some examples of how the sequence parser combinator processes some
+ strings:
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
@@ -368,8 +380,8 @@
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
-\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
-\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}
@@ -382,7 +394,7 @@
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
@@ -399,7 +411,7 @@
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}
@@ -445,8 +457,14 @@
nesting of the component parsers.
-Consider also carefully that constructing a parser such \pcode{p"a" ||
- (p"a" ~ p"b")} will result in a typing error. The intention with this
+Consider also carefully that constructing a parser such
+
+\begin{center}
+\pcode{p"a" || (p"a" ~ p"b")}
+\end{center}
+
+\noindent
+will result in a typing error. The intention with this
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
followed by a \texttt{b}. However, the first parser has as output type
a single character (recall the type of \texttt{CharParser}), but the
@@ -457,17 +475,17 @@
they are not in this case, there is a typing error. We will see later
how we can build this parser without the typing error.
-The next parser combinator, called \emph{semantic action}, does not
+The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
actually combine two smaller parsers, but applies a function to the result
of a parser. It is implemented in Scala as follows
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class MapParser[I, T, S]
- (p: => Parser[I, T],
- f: T => S) extends Parser[I, S] {
- def parse(in: I) =
- for ((head, tail) <- p.parse(in)) yield (f(head), tail)
+ (p: => Parser[I, T],
+ f: T => S)(using I => Seq[_]) extends Parser[I, S] {
+ def parse(in: I) =
+ for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
}
\end{lstlisting}
\end{center}
@@ -476,7 +494,7 @@
\noindent This parser combinator takes a parser \texttt{p} (with input
type \texttt{I} and output type \texttt{T}) as one argument but also a
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
-produces sets of type \texttt{Set[(T, I)]}. The semantic action
+produces sets of type \texttt{Set[(S, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
@@ -560,7 +578,7 @@
\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
-still a string (the required double-quotes are not printed by
+still a string (the expected double-quotes are not printed by
Scala). We want to convert this string into the corresponding
\texttt{Int}. We can do this as follows using a semantic action
@@ -573,16 +591,19 @@
\noindent
The function in the semantic action converts a string into an
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
-but this time \texttt{123} is an \texttt{Int}. Let us come back to
-semantic actions when we are going to implement actual context-free
-grammars.
+but this time \texttt{123} is an \texttt{Int}. Think carefully what
+the input and output type of the parser is without the semantic action
+adn what with the semantic action (the type of the function can
+already tell you). Let us come back to semantic actions when we are
+going to implement actual context-free grammars.
\subsubsection*{Shorthand notation for parser combinators}
Before we proceed, let us just explain the shorthand notation for
-parser combinators. Like for regular expressions, the shorthand notation
-will make our life much easier when writing actual parsers. We can define
-some implicits which allow us to write
+parser combinators. Like for regular expressions, the shorthand
+notation will make our life much easier when writing actual
+parsers. We can define some extensions\footnote{In Scala 2 this was
+ generically called as ``implicits''.} which allow us to write
\begin{center}
\begin{tabular}{ll}
@@ -593,7 +614,7 @@
\end{center}
\noindent
-as well as to use string interpolations for specifying simple string parsers.
+We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
The idea is that this shorthand notation allows us to easily translate
context-free grammars into code. For example recall our context-free
@@ -610,7 +631,7 @@
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
-\subsubsection*{How to build parsers using parser combinators?}
+\subsubsection*{How to build more interesting parsers using parser combinators?}
The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
@@ -740,11 +761,11 @@
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
(p: => Parser[I, T],
- q: => Parser[I, T]) extends Parser[I, T] {...}
+ q: => Parser[I, T]) ... extends Parser[I, T] {...}
class SeqParser[I, T, S]
(p: => Parser[I, T],
- q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
+ q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}
@@ -862,7 +883,7 @@
\noindent
Recall what left-recursive means from Handout 5 and make sure you see
why this grammar is \emph{non} left-recursive. This version of the grammar
-also deals with the fact that $*$ should have a higher precedence. This does not
+also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
affect which strings this grammar can recognise, but in which order we are going
to evaluate any arithmetic expression. We can translate this grammar into
parsing combinators as follows: