afl-material: comparison handouts/ho06.tex

equal deleted inserted replaced

-:0b5f06539a84
+:dc5ab66b11cc
 \noindent
 Given the extended effort we have spent implementing a lexer in order
 to generate lists of tokens, it might be surprising that in what
 follows we shall often use strings as input, rather than lists of
-tokens. This is for making the explanation more lucid and for quick
+tokens. This is for making the explanation more lucid and ensure the
-examples. It does not make our previous work on lexers obsolete
+examples are simple. It does not make our previous work on lexers obsolete
 (remember they transform a string into a list of tokens). Lexers will
-still be needed for building a somewhat realistic compiler.
+still be needed for building a somewhat realistic compiler. See also
+a question in the homework regarding this issue.
 As mentioned above, parser combinators are relatively agnostic about what
 kind of input they process. In my Scala code I use the following
 polymorphic types for parser combinators:
 \end{center}
 \noindent That is they take as input something of type \texttt{I} and
 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
 needs to be of ``sequence-kind'', I actually have to often write
-\texttt{I <\% Seq[\_]} for the input type. This ensures the
+\code{(using is: I => Seq[_])} for the input type. This ensures the
-input is a subtype of Scala sequences. The first component of the
+input is a subtype of Scala sequences.\footnote{This is a new feature
-generated pairs corresponds to what the parser combinator was able to
+in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties
-parse from the input and the second is the unprocessed, or
+with running my code.} The first component of the generated pairs
-leftover, part of the input (therefore the type of this unprocessed part is
+corresponds to what the parser combinator was able to parse from the
-the same as the input). A parser combinator might return more than one
+input and the second is the unprocessed, or leftover, part of the
-such pair; the idea is that there are potentially several ways of how
+input (therefore the type of this unprocessed part is the same as the
-to parse the input.  As a concrete example, consider the string
+input). A parser combinator might return more than one such pair; the
+idea is that there are potentially several ways of how to parse the
+input.  As a concrete example, consider the string
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
 implies that the function \texttt{parse} takes an argument of type
 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
 \begin{center}
 \begin{lstlisting}[language=Scala]
-abstract class Parser[I, T] {
+abstract class Parser[I, T](using is: I => Seq[_])  {
-def parse(in: I) : Set[(T, I)]
+def parse(in: I): Set[(T, I)]
 def parse_all(in: I) : Set[T] =
-for ((head, tail) <- parse(in); if (tail.isEmpty))
+for ((hd, tl) <- parse(in);
-yield head
+if is(tl).isEmpty) yield hd
 }
 \end{lstlisting}
 \end{center}
 \noindent It is the obligation in each instance of this class to
 One of the simplest parser combinators recognises just a
 single character, say $c$, from the beginning of strings. Its
 behaviour can be described as follows:
 \begin{itemize}
-\item If the head of the input string starts with a $c$, then return
+\item If the head of the input string $s$ starts with a $c$, then return
 the set
-\[\{(c, \textit{tail of}\; s)\}\]
+\[\{(c, \textit{tail-of-}s)\}\]
 where \textit{tail of}
 $s$ is the unprocessed part of the input string.
 \item Otherwise return the empty set $\{\}$.
 \end{itemize}
 \mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 case class CharParser(c: Char) extends Parser[String, Char] {
-def parse(in: String) =
+def parse(s: String) =
-if (in.head == c) Set((c, in.tail)) else Set()
+if (s != "" && s.head == c) Set((c, s.tail)) else Set()
 }
 \end{lstlisting}
 \end{center}
-\noindent You can see \texttt{parse} tests whether the
+\noindent You can see \texttt{parse} tests here whether the
-first character of the input string \texttt{in} is equal to
+first character of the input string \texttt{s} is equal to
 \texttt{c}. If yes, then it splits the string into the recognised part
-\texttt{c} and the unprocessed part \texttt{in.tail}. In case
+\texttt{c} and the unprocessed part \texttt{s.tail}. In case
-\texttt{in} does not start with \texttt{c} then the parser returns the
+\texttt{s} does not start with \texttt{c} then the parser returns the
 empty set (in Scala \texttt{Set()}). Since this parser recognises
 characters and just returns characters as the processed part, the
 output type of the parser is \texttt{Char}.
 If we want to parse a list of tokens and interested in recognising a
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
 case object NumParser extends Parser[List[Token], Int] {
 def parse(ts: List[Token]) = ts match {
-case Num_token(s)::ts => Set((s.toInt, ts))
+case Num_token(s)::rest => Set((s.toInt, rest))
 case _ => Set ()
 }
 }
 \end{lstlisting}
 \end{center}
 token (and the rest of the list), like the \texttt{CharParser} above,
 rather it extracts also the string \texttt{s} from the token and
 converts it into an integer. The hope is that the lexer did its work
 well and this conversion always succeeds. The consequence of this is
 that the output type for this parser is \texttt{Int}, not
-\texttt{Token}. Such a conversion would be needed if we want to
+\texttt{Token}. Such a conversion would be needed in our parser,
-implement a simple calculator program, because string-numbers need to
+because when we encounter a number in our program, we want to do
-be transformed into \texttt{Int}-numbers in order to do the
+some calculations based on integers, not strings (or tokens).
-calculations.
 These simple parsers that just look at the input and do a simple
 transformation are often called \emph{atomic} parser combinators.
 More interesting are the parser combinators that build larger parsers
 combinator as follows
 \begin{center}
 \begin{lstlisting}[language=Scala, numbers=none]
 class AltParser[I, T]
 (p: => Parser[I, T],
-q: => Parser[I, T]) extends Parser[I, T] {
+q: => Parser[I, T])(using I => Seq[_])
-def parse(in: I) = p.parse(in) ++ q.parse(in)
+extends Parser[I, T] {
-}
+def parse(in: I) = p.parse(in) ++ q.parse(in)
+}
 \end{lstlisting}
 \end{center}
 \noindent The types of this parser combinator are again generic (we
 have \texttt{I} for the input type, and \texttt{T} for the output
 type). The alternative parser builds a new parser out of two existing
 parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
 parsers need to be able to process input of type \texttt{I} and return
 in \texttt{parse} the same output type \texttt{Set[(T,
 I)]}.\footnote{There is an interesting detail of Scala, namely the
-\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
+\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
 will prevent the evaluation of the arguments before they are
 used. This is often called \emph{lazy evaluation} of the
 arguments. We will explain this later.}  The alternative parser runs
 the input with the first parser \texttt{p} (producing a set of pairs)
 and then runs the same input with \texttt{q} (producing another set of
 pairs).  The result should be then just the union of both sets, which
 is the operation \texttt{++} in Scala.
 The alternative parser combinator allows us to construct a parser that
 parses either a character \texttt{a} or \texttt{b} using the
-\texttt{CharParser} shown above. For this we can write
+\texttt{CharParser} shown above. For this we can write\footnote{Note
+that we cannot use a \texttt{case}-class for \texttt{AltParser}s
+because of the problem with laziness and Scala quirks. Hating
+\texttt{new} like the plague, we will work around this later with
+some syntax tricks. ;o)}
 \begin{center}
 \begin{lstlisting}[language=Scala, numbers=none]
 new AltParser(CharParser('a'), CharParser('b'))
 \end{lstlisting}
 \end{center}
 \noindent Later on we will use Scala mechanism for introducing some
 more readable shorthand notation for this, like \texttt{p"a" ||
-p"b"}. Let us look in detail at what this parser combinator produces
+p"b"}. But first let us look in detail at what this parser combinator produces
 with some sample strings.
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
-\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
-\noindent We receive in the first two cases a successful
+\noindent We receive in the first two cases a successful output (that
-output (that is a non-empty set). In each case, either
+is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
-\pcode{a} or \pcode{b} is in the parsed part, and
+the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
-\pcode{cde} in the unprocessed part. Clearly this parser cannot
+parser cannot parse anything of the form \pcode{ccde}, therefore the
-parse anything with \pcode{ccde}, therefore the empty
+empty set is returned in the last case. Observe that parser
-set is returned.
+combinators only look at the beginning of the given input: they do not
+fish out something in the ``middle'' of the input.
 A bit more interesting is the \emph{sequence parser combinator}. Given
 two parsers, say again, $p$ and $q$, we want to apply first the input
 to $p$ producing a set of pairs; then apply $q$ to all the unparsed
 parts in the pairs; and then combine the results. Mathematically we would
 \begin{center}
 \begin{lstlisting}[language=Scala,numbers=none]
 class SeqParser[I, T, S]
 (p: => Parser[I, T],
-q: => Parser[I, S]) extends Parser[I, (T, S)] {
+q: => Parser[I, S])(using I => Seq[_])
+extends Parser[I, (T, S)] {
 def parse(in: I) =
 for ((output1, u1) <- p.parse(in);
 (output2, u2) <- q.parse(u1))
 yield ((output1, output2), u2)
 }
 \noindent This parser takes again as arguments two parsers, \texttt{p}
 and \texttt{q}. It implements \texttt{parse} as follows: first run the
 parser \texttt{p} on the input producing a set of pairs
 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
 unprocessed parts left over by \texttt{p} (recall that there can be
-several such pairs). Let then \texttt{q} run on these unprocessed
+several such pairs).  Let then \texttt{q} run on these unprocessed
 parts producing again a set of pairs. The output of the sequence
 parser combinator is then a set containing pairs where the first
 components are again pairs, namely what the first parser could parse
 together with what the second parser could parse; the second component
 is the unprocessed part left over after running the second parser
 produce the empty set.
 With the shorthand notation we shall introduce later for the sequence
 parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
 is the parser combinator that first recognises the character
-\texttt{a} from a string and then \texttt{b}. Let us look again at
+\texttt{a} from a string and then \texttt{b}. (Actually, we will be
-some examples of how this parser combinator processes some strings:
+able to write just \pcode{p"ab"} for such parsers, but it is good to
+understand first what happens behind the scenes.) Let us look again
+at some examples of how the sequence parser combinator processes some
+strings:
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
 \texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
 by a \texttt{c}. This parser produces the following outputs.
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
-\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
 \noindent
 the following outputs.
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
 \texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
 \end{tabular}
 \end{center}
 our output pairs nest differently
 \begin{center}
 \begin{tabular}{rcl}
 input strings & & output\medskip\\
-\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
+\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
 \end{tabular}
 \end{center}
 \noindent
 Two more examples: first consider the parser
 ultimately-unsuccessful-parses. The main point is that the sequence
 parser combinator returns pairs that can nest according to the
 nesting of the component parsers.
-Consider also carefully that constructing a parser such \pcode{p"a" ||
+Consider also carefully that constructing a parser such
-(p"a" ~ p"b")} will result in a typing error. The intention with this
+\begin{center}
+\pcode{p"a" || (p"a" ~ p"b")}
+\end{center}
+\noindent
+will result in a typing error. The intention with this
 parser is that we want to parse either an \texttt{a}, or an \texttt{a}
 followed by a \texttt{b}. However, the first parser has as output type
 a single character (recall the type of \texttt{CharParser}), but the
 second parser produces a pair of characters as output. The alternative
 parser is required to have both component parsers to have the same
 type---the reason is that we need to be able to build the union of two
 sets, which requires in Scala that the sets have the same type.  Since
 they are not in this case, there is a typing error.  We will see later
 how we can build this parser without the typing error.
-The next parser combinator, called \emph{semantic action}, does not
+The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
 actually combine two smaller parsers, but applies a function to the result
 of a parser.  It is implemented in Scala as follows
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 class MapParser[I, T, S]
 (p: => Parser[I, T],
-f: T => S) extends Parser[I, S] {
+f: T => S)(using I => Seq[_]) extends Parser[I, S] {
 def parse(in: I) =
-for ((head, tail) <- p.parse(in)) yield (f(head), tail)
+for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
 }
 \end{lstlisting}
 \end{center}
 \noindent This parser combinator takes a parser \texttt{p} (with input
 type \texttt{I} and output type \texttt{T}) as one argument but also a
 function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
-produces sets of type \texttt{Set[(T, I)]}. The semantic action
+produces sets of type \texttt{Set[(S, I)]}. The semantic action
 combinator then applies the function \texttt{f} to all the `processed'
 parser outputs. Since this function is of type \texttt{T => S}, we
 obtain a parser with output type \texttt{S}. Again Scala lets us
 introduce some shorthand notation for this parser
 combinator. Therefore we will write short \texttt{p.map(f)} for it.
 \end{lstlisting}
 \end{center}
 \noindent
 produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
-still a string (the required double-quotes are not printed by
+still a string (the expected double-quotes are not printed by
 Scala). We want to convert this string into the corresponding
 \texttt{Int}. We can do this as follows using a semantic action
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 \end{center}
 \noindent
 The function in the semantic action converts a string into an
 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
-but this time \texttt{123} is an \texttt{Int}. Let us come back to
+but this time \texttt{123} is an \texttt{Int}. Think carefully what
-semantic actions when we are going to implement actual context-free
+the input and output type of the parser is without the semantic action
-grammars.
+adn what with the semantic action (the type of the function can
+already tell you). Let us come back to semantic actions when we are
+going to implement actual context-free grammars.
 \subsubsection*{Shorthand notation for parser combinators}
 Before we proceed, let us just explain the shorthand notation for
-parser combinators. Like for regular expressions, the shorthand notation
+parser combinators. Like for regular expressions, the shorthand
-will make our life much easier when writing actual parsers. We can define
+notation will make our life much easier when writing actual
-some implicits which allow us to write
+parsers. We can define some extensions\footnote{In Scala 2 this was
+generically called as ``implicits''.} which allow us to write
 \begin{center}
 \begin{tabular}{ll}
 \pcode{p || q} & alternative parser\\
 \pcode{p ~ q} & sequence parser\\
 \pcode{p.map(f)} & semantic action parser
 \end{tabular}
 \end{center}
 \noindent
-as well as to use string interpolations for specifying simple string parsers.
+We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
 The idea is that this shorthand notation allows us to easily translate
 context-free grammars into code. For example recall our context-free
 grammar for palindromes:
 combinator.  The $\cdot$ can be translated to a sequence parser
 combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
 written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
-\subsubsection*{How to build parsers using parser combinators?}
+\subsubsection*{How to build more interesting parsers using parser combinators?}
 The beauty of parser combinators is the ease with which they can be
 implemented and how easy it is to translate context-free grammars into
 code (though the grammars need to be non-left-recursive). To
 demonstrate this consider again the grammar for palindromes from above.
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 class AltParser[I, T]
 (p: => Parser[I, T],
-q: => Parser[I, T]) extends Parser[I, T] {...}
+q: => Parser[I, T]) ... extends Parser[I, T] {...}
 class SeqParser[I, T, S]
 (p: => Parser[I, T],
-q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
+q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
 \end{lstlisting}
 \end{center}
 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
 the argument types for \texttt{p} and \texttt{q} prevent Scala from
 \end{plstx}
 \noindent
 Recall what left-recursive means from Handout 5 and make sure you see
 why this grammar is \emph{non} left-recursive. This version of the grammar
-also deals with the fact that $*$ should have a higher precedence. This does not
+also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
 affect which strings this grammar can recognise, but in which order we are going
 to evaluate any arithmetic expression. We can translate this grammar into
 parsing combinators as follows:

changeset 937	dc5ab66b11cc
parent 799	85267be9a5ed
child 941	66adcae6c762