
% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}

\begin{document}

\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly it is only easy in a
functional programming language). Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties we need of the input are that we can
test whether it is empty and that we can take it apart
``sequentially''. Strings and lists fit this bill. However, parser
combinators also have their drawbacks. For example, they require that
the grammar to be parsed is \emph{not} left-recursive and they are
efficient only when the grammar is unambiguous. It is the
responsibility of the grammar designer to ensure these two properties
hold.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$
$\quad\Rightarrow\quad$
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
\end{center}

\noindent
Given the extended effort we have spent implementing a lexer in order
to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
tokens. This is to make the explanation more lucid and to ensure the
examples are simple. It does not make our previous work on lexers obsolete
(remember they transform a string into a list of tokens). Lexers will
still be needed for building a somewhat realistic compiler. See also
a question in the homework regarding this issue.

As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}
\end{center}

\noindent That is, they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I often have to write
\code{(using is: I => Seq[_])} for the input type. This ensures the
input can be viewed as a Scala sequence.\footnote{This is a new feature
in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties
with running my code.} The first component of the generated pairs
corresponds to what the parser combinator was able to parse from the
input and the second is the unprocessed, or leftover, part of the
input (therefore the type of this unprocessed part is the same as the
input). A parser combinator might return more than one such pair; the
idea is that there are potentially several ways to parse the
input. As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leaves \texttt{foo\VS testbar} as the
unprocessed part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then it
just returns the empty set $\{\}$. This indicates that
something ``went wrong''\ldots or more precisely, nothing could be
parsed.
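
To make the shape of this output concrete, here is a minimal sketch in
Scala. It assumes that both the input and the parsed part are plain
strings, and it writes an ordinary space where the grid above shows
\texttt{\VS}. The function \texttt{keywordOrIdent} is a hypothetical
helper that hard-wires the two readings from the example; it is not one
of the parser combinators of this handout.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// Hypothetical helper (an assumption for illustration only): it
// hard-wires the readings "keyword if" and "identifier".
def keywordOrIdent(in: String): Set[(String, String)] = {
  val none = Set.empty[(String, String)]
  // reading 1: the keyword "if", if the input starts with it
  val asKeyword =
    if (in.startsWith("if")) Set(("if", in.drop(2))) else none
  // reading 2: the longest non-empty prefix of letters
  val id = in.takeWhile(_.isLetter)
  val asIdent =
    if (id.nonEmpty) Set((id, in.drop(id.length))) else none
  asKeyword ++ asIdent
}

val both = keywordOrIdent("iffoo testbar")
// both == Set(("if", "foo testbar"), ("iffoo", " testbar"))
val failed = keywordOrIdent("42")
// failed == Set(), that is nothing could be parsed
\end{lstlisting}
\end{center}

\noindent Both readings end up in one set, and an input that fits
neither reading yields the empty set.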

Also important to note is that the output type \texttt{T} for the
processed part can potentially be different from the input type
\texttt{I} in the parser. In the example above it just happens to be
the same. The reason for the difference is that in general we are
interested in transforming our input into something
``different''\ldots for example into a tree; or if we implement the
grammar for arithmetic expressions, we might be interested in the
actual integer number the arithmetic expression, say \texttt{1 + 2 *
  3}, stands for. In this way we can use parser combinators to
implement a calculator relatively easily, for instance (we shall do
this later on).

The main driving force behind parser combinators is that we can easily
build parser combinators out of smaller components following very
closely the structure of a grammar. In order to implement this in a
functional/object-oriented programming language, like Scala, we need
to specify an abstract class for parser combinators. In the abstract
class we specify that \texttt{I} is the \emph{input type} of the
parser combinator and that \texttt{T} is the \emph{output type}. This
implies that the function \texttt{parse} takes an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T](using is: I => Seq[_]) {
  def parse(in: I): Set[(T, I)]

  def parse_all(in: I) : Set[T] =
    for ((hd, tl) <- parse(in);
        if is(tl).isEmpty) yield hd
}
\end{lstlisting}
\end{center}

\noindent It is the obligation of each instance of this class to
supply an implementation for \texttt{parse}. From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is, pairs that still have some unprocessed part). The reason is that at the end of
the parsing we are only interested in the results where all the input
has been consumed and no unprocessed part is left over.
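
The filtering step can also be looked at in isolation. The following
is only a sketch: the set \texttt{results} is a made-up stand-in for
what some \texttt{parse} function might return on string input, and the
names \texttt{results} and \texttt{complete} are assumptions for this
example.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// a hypothetical result of parse: one partial, one complete reading
val results: Set[(String, String)] =
  Set(("if", "foo"), ("iffoo", ""))

// keep the first component of every pair whose leftover is empty
val complete: Set[String] =
  for ((hd, tl) <- results; if tl.isEmpty) yield hd
// complete == Set("iffoo")
\end{lstlisting}
\end{center}

\noindent The pair with the leftover \texttt{foo} is dropped, so only
the reading that consumed the whole input survives.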

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the input string $s$ starts with a $c$, then return
  the set
  \[\{(c, \textit{tail-of-}s)\}\]
  where \textit{tail-of-}$s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.
\end{itemize}

\noindent
The input type of this simple parser combinator is \texttt{String} and
the output type is \texttt{Char}. This means \texttt{parse} returns
\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(s: String) =
    if (s != "" && s.head == c) Set((c, s.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see that \texttt{parse} tests here whether the
first character of the input string \texttt{s} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
\texttt{s} does not start with \texttt{c}, the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.
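
As a small usage sketch, the following repeats the two definitions from
above so that it can be run on its own (it assumes Scala 3) and then
tries the parser on a few strings. The name \texttt{ap} and the sample
inputs are chosen for this example only; the explicit result type of
\texttt{parse} is added just for readability.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// definitions repeated from above (Scala 3)
abstract class Parser[I, T](using is: I => Seq[_]) {
  def parse(in: I): Set[(T, I)]
  def parse_all(in: I): Set[T] =
    for ((hd, tl) <- parse(in); if is(tl).isEmpty) yield hd
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(s: String): Set[(Char, String)] =
    if (s != "" && s.head == c) Set((c, s.tail)) else Set()
}

val ap = CharParser('a')
val r1 = ap.parse("abc")      // Set(('a', "bc"))
val r2 = ap.parse("bcd")      // Set(): wrong first character
val r3 = ap.parse("")         // Set(): empty input
val r4 = ap.parse_all("a")    // Set('a')
val r5 = ap.parse_all("abc")  // Set(): "bc" is left over
\end{lstlisting}
\end{center}

\noindent The test \texttt{s != ""} matters for the third call:
without it, \texttt{s.head} would be called on an empty string.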

If we want to parse a list of tokens and are interested in recognising a
number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    case Num_token(s)::rest => Set((s.toInt, rest))
    case _ => Set ()
  }
}
\end{lstlisting}
\end{center}

\noindent
In this parser the input is of type \texttt{List[Token]}. The function
\texttt{parse} looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above;
rather it also extracts the string \texttt{s} from the token and
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
\texttt{Token}. Such a conversion is needed in our parser,
because when we encounter a number in our program, we want to do
some calculations based on integers, not strings (or tokens).
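
To try this parser out we need some token type. The sketch below
assumes a very small one: \texttt{Token}, \texttt{Num\_token} and
\texttt{Op\_token} are hypothetical stand-ins for whatever the lexer
really produces, and the token list \texttt{toks} is made up. The
\texttt{Parser} class is repeated from above so that the example runs
on its own (Scala 3 is assumed).

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// repeated from above (Scala 3)
abstract class Parser[I, T](using is: I => Seq[_]) {
  def parse(in: I): Set[(T, I)]
  def parse_all(in: I): Set[T] =
    for ((hd, tl) <- parse(in); if is(tl).isEmpty) yield hd
}

// hypothetical token type; the real one comes from the lexer
abstract class Token
case class Num_token(s: String) extends Token
case class Op_token(s: String) extends Token

case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]): Set[(Int, List[Token])] =
    ts match {
      case Num_token(s)::rest => Set((s.toInt, rest))
      case _ => Set()
    }
}

val toks: List[Token] =
  List(Num_token("42"), Op_token("+"), Num_token("1"))

val n1 = NumParser.parse(toks)
// n1 == Set((42, List(Op_token("+"), Num_token("1"))))
val n2 = NumParser.parse(toks.tail)
// n2 == Set(): the first token is not a number
val n3 = NumParser.parse_all(List(Num_token("7")))
// n3 == Set(7)
\end{lstlisting}
\end{center}

\noindent Note that \texttt{s.toInt} throws an exception if the string
is not a valid integer, which is why we rely on the lexer to put only
digit strings into a \texttt{Num\_token}.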


These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
  parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):

\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement the alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T],
        q: => Parser[I, T])(using I => Seq[_])
                                 extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q}, which are given as arguments. Both
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type \texttt{Set[(T,
I)]}.\footnote{There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These
arrows prevent the arguments from being evaluated before they are
used (Scala calls such arguments \emph{by-name} parameters). This is
often called \emph{lazy evaluation} of the arguments. We will explain
this later.} The alternative parser runs the first parser \texttt{p}
on the input (producing a set of pairs) and then runs \texttt{q} on
the same input (producing another set of pairs). The result is then
just the union of both sets, which is the operation \texttt{++} in
Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write\footnote{Note
that we cannot use a \texttt{case}-class for \texttt{AltParser}s
because of a problem with laziness and Scala quirks. Hating
\texttt{new} like the plague, we will work around this later with
some syntax tricks. ;o)}

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use a Scala mechanism to introduce a more
readable shorthand notation for this, like \texttt{p"a" || p"b"}. But
first let us look in detail at what this parser combinator produces
for some sample strings.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first two cases we receive a successful output (that
is, a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
parser cannot parse a string like \pcode{ccde}, therefore the
empty set is returned in the last case. Observe that parser
combinators only look at the beginning of the given input: they do not
fish out something in the ``middle'' of the input.

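
The table above can be replayed with a small, self-contained sketch.
It is not the handout's full code: \texttt{Parser} and
\texttt{CharParser} are simplified stand-ins for the classes defined
earlier, explicit result types are added, and the \texttt{using}
clause is left out because this example does not need it. The name
\texttt{altDemo} is made up for illustration.

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
// Simplified stand-ins (assumptions) for the Parser and
// CharParser classes from earlier in the handout.
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  // union of what both parsers produce on the same input
  def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
}

@main def altDemo(): Unit = {
  val aOrB = new AltParser(CharParser('a'), CharParser('b'))
  println(aOrB.parse("acde"))  // Set((a,cde))
  println(aOrB.parse("bcde"))  // Set((b,cde))
  println(aOrB.parse("ccde"))  // Set()
}
\end{lstlisting}
\end{center}
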
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again $p$ and $q$, we first want to apply $p$ to the
input, producing a set of pairs; then apply $q$ to all the unparsed
parts in these pairs; and then combine the results. Mathematically we
would write something like this for the resulting set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ &
$(\textit{output}_1, u_1) \in p(\text{input})
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that $p$ will first be run on the input,
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
stands for the unprocessed, or leftover, parts of $p$. We want
$q$ to run on all these unprocessed parts $u_1$. Therefore these
unprocessed parts are fed into the second parser $q$. The overall
result of the sequence parser combinator is a set of pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of the sequence parser combinator is the unprocessed
part the second parser $q$ leaves as leftover. The parsed parts of the
component parsers are combined in a pair, namely
$(\textit{output}_1, \textit{output}_2)$. The reason is that we want to
know what $p$ and $q$ were each able to parse. This behaviour can be
implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
    (p: => Parser[I, T],
     q: => Parser[I, S])(using I => Seq[_])
    extends Parser[I, (T, S)] {
  def parse(in: I) =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
      yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser again takes two parsers as arguments, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: first run the
parser \texttt{p} on the input, producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
several such pairs). Then let \texttt{q} run on these unprocessed
parts, producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component
is the unprocessed part left over after running the second parser
\texttt{q}. Note that the input type of the sequence parser combinator
is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{(T, S)}
\end{center}

\noindent
Consequently, the function \texttt{parse} in the sequence parser
combinator returns sets of type \texttt{Set[((T, S), I)]}. That means
we have essentially two output types for the sequence parser
combinator (packaged in a pair), because in general \texttt{p} and
\texttt{q} might produce different things (for example we recognise a
number with \texttt{p} and then with \texttt{q} a string corresponding
to an operator). If \texttt{p} fails, or if \texttt{q} fails on every
unprocessed part left over by \texttt{p} (failing means producing the
empty set), then \texttt{parse} will also produce the empty set.

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
able to write just \pcode{p"ab"} for such parsers, but it is good to
understand first what happens behind the scenes.) Let us look again
at some examples of how the sequence parser combinator processes some
strings:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parser combinator is constructed as a sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect. The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.

A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"}, which
parses as first character either an \texttt{a} or \texttt{b}, followed
by a \texttt{c}. This parser produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

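
\noindent The sequence examples so far can be checked with the
following sketch. As the shorthand \pcode{p"a"}, \pcode{~} and
\pcode{||} is only introduced later, the parsers are spelled out with
\texttt{new}. \texttt{Parser} and \texttt{CharParser} are again
simplified stand-ins (assumptions) for the classes defined earlier,
without the \texttt{using} clause; \texttt{seqDemo} is a made-up name.

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
// Simplified stand-ins (assumptions) for classes from earlier.
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S])
    extends Parser[I, (T, S)] {
  // run q on every leftover u1 of p and pair up the outputs
  def parse(in: I): Set[((T, S), I)] =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
      yield ((output1, output2), u2)
}

@main def seqDemo(): Unit = {
  // corresponds to p"a" ~ p"b"
  val ab = new SeqParser(CharParser('a'), CharParser('b'))
  println(ab.parse("abcde"))   // Set(((a,b),cde))
  println(ab.parse("bacde"))   // Set()

  // corresponds to (p"a" || p"b") ~ p"c"
  val aOrB = new AltParser(CharParser('a'), CharParser('b'))
  val aOrBThenC = new SeqParser(aOrB, CharParser('c'))
  println(aOrBThenC.parse("acde"))  // Set(((a,c),de))
  println(aOrBThenC.parse("abde"))  // Set()
}
\end{lstlisting}
\end{center}
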
\noindent
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent The second and third examples fail, because something is
``missing'' in the sequence we are looking for. The first succeeds, but
notice how the results nest with sequences: the parsed part is a
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then
our output pairs also nest differently:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}

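
\noindent The different nestings also show up in the types. The
following sketch makes this visible with explicit type annotations;
it again relies on simplified stand-ins (assumptions) for
\texttt{Parser} and \texttt{CharParser}, and the names
\texttt{nestDemo}, \texttt{left} and \texttt{right} are made up for
illustration.

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
// Simplified stand-ins (assumptions) for classes from earlier.
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S])
    extends Parser[I, (T, S)] {
  def parse(in: I): Set[((T, S), I)] =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
      yield ((output1, output2), u2)
}

@main def nestDemo(): Unit = {
  val a = CharParser('a')
  val b = CharParser('b')
  val c = CharParser('c')

  // (p"a" ~ p"b") ~ p"c": the output type nests to the left
  val left: Parser[String, ((Char, Char), Char)] =
    new SeqParser(new SeqParser(a, b), c)

  // p"a" ~ (p"b" ~ p"c"): the output type nests to the right
  val right: Parser[String, (Char, (Char, Char))] =
    new SeqParser(a, new SeqParser(b, c))

  println(left.parse("abcde"))   // Set((((a,b),c),de))
  println(right.parse("abcde"))  // Set(((a,(b,c)),de))
}
\end{lstlisting}
\end{center}
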
\noindent
Two more examples: first consider the parser
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice again how the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume all of
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
p"a"}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ &
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser completely consumed
the input, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all} instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; every pair whose second part was not empty has
been thrown away as well, because it represents an
ultimately unsuccessful parse. The main point is that the sequence
parser combinator returns pairs that can nest according to the
nesting of the component parsers.

183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	459
Consider also carefully that constructing a parser such as

\begin{center}
\pcode{p"a" || (p"a" ~ p"b")}
\end{center}
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	465
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	466	\noindent
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	467	will result in a typing error. The intention with this
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	468	parser is that we want to parse either an \texttt{a}, or an \texttt{a}
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	469	followed by a \texttt{b}. However, the first parser has as output type
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	470	a single character (recall the type of \texttt{CharParser}), but the
second parser produces a pair of characters as output. The alternative
parser requires both component parsers to have the same output
type---the reason is that we need to be able to build the union of two
sets, which in Scala requires that the sets have the same type. Since
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	475	they are not in this case, there is a typing error. We will see later
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	476	how we can build this parser without the typing error.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	477
936 aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	478	The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	479	actually combine two smaller parsers, but applies a function to the result
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	480	of a parser. It is implemented in Scala as follows
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	481
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	482	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	483	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	484	class MapParser[I, T, S]
936 aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	485	(p: => Parser[I, T],
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	486	f: T => S)(using I => Seq[_]) extends Parser[I, S] {
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	487	def parse(in: I) =
aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	488	for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	489	}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	490	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	491	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	492
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	493
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	494	\noindent This parser combinator takes a parser \texttt{p} (with input
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	495	type \texttt{I} and output type \texttt{T}) as one argument but also a
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	496	function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
produces sets of type \texttt{Set[(T, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
introduce some shorthand notation for this parser
combinator: we will write \texttt{p.map(f)} for it.
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	503
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	504	What are semantic actions good for? Well, they allow you to transform
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	505	the parsed input into datastructures you can use for further
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	506	processing. A simple (contrived) example would be to transform parsed
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	507	characters into ASCII numbers. Suppose we define a function \texttt{f}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	508	(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	509	the character \texttt{c}.
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	510
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	511
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	512	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	513	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	514	val f = (c: Char) => c.toInt
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	515	val c = new CharParser('c')
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	516	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	517	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	518
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	519	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	520	We then can run the following two parsers on the input \texttt{cbd}:
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	521
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	522	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	523	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	524	c.parse("cbd")
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	525	c.map(f).parse("cbd")
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	526	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	527	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	528
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	529	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	530	In the first line we obtain the expected result \texttt{Set(('c',
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	531	"bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	532	character has been transformed into an ASCII number.
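
\noindent The two calls above can be tried out in a small
self-contained sketch. It makes simplifying assumptions that are not
part of the definitions in this handout: the input type is fixed to
\texttt{String} (so the type parameter \texttt{I} and the
\texttt{using} clause disappear), \texttt{map} is a method of
\texttt{Parser} rather than an extension, and \texttt{CharParser} is
re-declared in the same simplified form. The name \texttt{mapDemo} is
hypothetical.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Assumption: input is always a String (the handout's Parser
// is generic in the input type I).
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]

  // shorthand for the semantic action
  def map[S](f: T => S): Parser[S] = new MapParser(this, f)
}

// applies f to the processed part and keeps the rest unchanged
class MapParser[T, S](p: => Parser[T], f: T => S) extends Parser[S] {
  def parse(in: String): Set[(S, String)] =
    for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
}

// atomic parser for a single character
case class CharParser(c: Char) extends Parser[Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

@main def mapDemo(): Unit = {
  val f = (c: Char) => c.toInt
  val c = CharParser('c')
  println(c.parse("cbd"))         // Set((c,bd))
  println(c.map(f).parse("cbd"))  // Set((99,bd))
}
\end{lstlisting}
\end{center}

\noindent Note that the unprocessed part \texttt{"bd"} is untouched by
the semantic action; only the first component of each pair changes.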
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	533
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	534	A slightly less contrived example is about parsing numbers (recall
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	535	\texttt{NumParser} above). However, we want to do this here for
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	536	strings, not for tokens. For this assume we have the following
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	537	(atomic) \texttt{RegexParser}.
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	538
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	539	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	540	\begin{lstlisting}[language=Scala,xleftmargin=0mm,
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	541	basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	542	import scala.util.matching.Regex
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	543
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	544	case class RegexParser(reg: Regex) extends Parser[String, String] {
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	545	def parse(in: String) = reg.findPrefixMatchOf(in) match {
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	546	case None => Set()
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	547	case Some(m) => Set((m.matched, m.after.toString))
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	548	}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	549	}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	550	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	551	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	552
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	553	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	554	This parser takes a regex as argument and splits up a string into a
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	555	prefix and the rest according to this regex
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	556	(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	557	case---and the corresponding strings can be extracted with
\texttt{matched} and \texttt{after}). The input and output type for
this parser is \texttt{String}. Using \texttt{RegexParser} we can
define a \texttt{NumParser} that works on strings as
follows:
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	562
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	563	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	564	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	565	val NumParser = RegexParser("[0-9]+".r)
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	566	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	567	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	568
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	569	\noindent
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	570	This parser will recognise a number at the beginning of a string. For
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	571	example
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	572
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	573	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	574	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	575	NumParser.parse("123abc")
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	576	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	577	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	578
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	579	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	580	produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
936 aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	581	still a string (the expected double-quotes are not printed by
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	582	Scala). We want to convert this string into the corresponding
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	583	\texttt{Int}. We can do this as follows using a semantic action
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	584
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	585	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	586	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	587	NumParser.map{s => s.toInt}.parse("123abc")
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	588	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	589	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	590
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	591	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	592	The function in the semantic action converts a string into an
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	593	\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
but this time \texttt{123} is an \texttt{Int}. Think carefully about what
the input and output types of the parser are without the semantic action
and what they are with the semantic action (the type of the function
already tells you). We will come back to semantic actions when we
implement actual context-free grammars.
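
\noindent One way to check the answer is to write the types out
explicitly. The following sketch does this under the simplifying
assumption that the input type is fixed to \texttt{String}, so
\texttt{Parser} has only the output type as parameter; \texttt{map} is
defined directly inside \texttt{Parser}, and the names
\texttt{IntParser} and \texttt{numDemo} are hypothetical.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

// Assumption: input is always a String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]

  // semantic action: apply f to every processed part
  def map[S](f: T => S): Parser[S] = {
    val self = this
    new Parser[S] {
      def parse(in: String): Set[(S, String)] =
        for ((hd, tl) <- self.parse(in)) yield (f(hd), tl)
    }
  }
}

case class RegexParser(reg: Regex) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    reg.findPrefixMatchOf(in) match {
      case None    => Set()
      case Some(m) => Set((m.matched, m.after.toString))
    }
}

// without the semantic action the output type is String ...
val NumParser: Parser[String] = RegexParser("[0-9]+".r)
// ... and with it the output type is Int
val IntParser: Parser[Int] = NumParser.map(s => s.toInt)

@main def numDemo(): Unit = {
  println(NumParser.parse("123abc"))  // Set((123,abc)), 123: String
  println(IntParser.parse("123abc"))  // Set((123,abc)), 123: Int
}
\end{lstlisting}
\end{center}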
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	599
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	600	\subsubsection*{Shorthand notation for parser combinators}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	601
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	602	Before we proceed, let us just explain the shorthand notation for
parser combinators. As with regular expressions, the shorthand
notation will make our life much easier when writing actual
parsers. We can define some extensions\footnote{In Scala 2 these were
generically called ``implicits''.} which allow us to write
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	607
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	608	\begin{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	609	\begin{tabular}{ll}
\pcode{p || q} & alternative parser\\
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	611	\pcode{p ~ q} & sequence parser\\
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	612	\pcode{p.map(f)} & semantic action parser
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	613	\end{tabular}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	614	\end{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	615
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	616	\noindent
936 aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	617	We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	618
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	619	The idea is that this shorthand notation allows us to easily translate
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	620	context-free grammars into code. For example recall our context-free
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	621	grammar for palindromes:
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	622
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	623	\begin{plstx}[margin=3cm]
: \meta{Pal} ::= a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	625	\end{plstx}
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	626
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	627	\noindent
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	628	Each alternative in this grammar translates into an alternative parser
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	629	combinator. The $\cdot$ can be translated to a sequence parser
5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	630	combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	631	written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	632
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	633
936 aabd9168c7ac updated Christian Urban <christian.urban@kcl.ac.uk> parents: 799 diff changeset	634	\subsubsection*{How to build more interesting parsers using parser combinators?}
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	635
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	636	The beauty of parser combinators is the ease with which they can be
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	637	implemented and how easy it is to translate context-free grammars into
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	638	code (though the grammars need to be non-left-recursive). To
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	639	demonstrate this consider again the grammar for palindromes from above.
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	640	The first idea would be to translate it into the following code
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	641
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	642	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	643	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	644	lazy val Pal : Parser[String, String] =
   ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	646	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	647	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	648
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	649	\noindent
590 5cdefb0e309e updated Christian Urban <urbanc@in.tum.de> parents: 589 diff changeset	650	Unfortunately, this does not quite work yet as it produces a typing
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	651	error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	652	\texttt{p""} all produce strings as output type and therefore can be
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	654	sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	655	produce pairs of the form
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	656
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	657	\begin{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	658	(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	659	\end{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	660
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	661	\noindent That is how the
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	662	sequence parser combinator nests results when \pcode{\~} is used
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	663	between two components. The solution is to use a semantic action that
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	664	``flattens'' these pairs and appends the corresponding strings, like
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	665
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	666	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	667	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	668	lazy val Pal : Parser[String, String] =
   ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
    (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
     p"a" || p"b" || p"")
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	672	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	673	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	674
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	675	\noindent
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	676	How does this work? Well, recall again what the pairs look like for
799 c18b991eaad2 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 595 diff changeset	677	the parser \pcode{p"a" ~ Pal ~ p"a"}. The pattern in the semantic
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	678	action matches the nested pairs (the \texttt{x} with the
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	679	\texttt{a}-part and so on). Unfortunately when we have such nested
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	680	pairs, Scala requires us to define the function using the
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	681	\pcode{case}-syntax
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	682
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	683	\begin{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	684	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	685	{ case ((x, y), z) => ... }
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	686	\end{lstlisting}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	687	\end{center}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	688
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	689	\noindent
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	690	If we have more sequence parser combinators or have them differently nested,
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	691	then the pattern in the semantic action needs to be adjusted accordingly.
The action we implement above concatenates all three strings. After
the semantic action is applied, the output type of the parser is
therefore \texttt{String}, which fits with the alternative parsers
\texttt{...|| p"a" || p"b" || p""}.
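
\noindent To experiment with the palindrome parser, the following
self-contained sketch collects everything it needs in one place. It is
a simplified variant, not the definitions used elsewhere in this
handout: the input type is fixed to \texttt{String}, the combinators
\texttt{||}, \texttt{\~} and \texttt{map} are methods of
\texttt{Parser} instead of extensions, and \texttt{StrParser} and the
\texttt{p}-interpolator are re-declared under these assumptions. The
name \texttt{palDemo} is hypothetical.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Assumption: input is always a String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]

  // keep only the parses that consumed the whole input
  def parse_all(in: String): Set[T] =
    for ((hd, tl) <- parse(in); if tl.isEmpty) yield hd

  // by-name arguments allow recursive parsers such as Pal
  def ||(q: => Parser[T]): Parser[T] = new AltParser(this, q)
  def ~[S](q: => Parser[S]): Parser[(T, S)] = new SeqParser(this, q)
  def map[S](f: T => S): Parser[S] = new MapParser(this, f)
}

class AltParser[T](p: => Parser[T], q: => Parser[T])
    extends Parser[T] {
  def parse(in: String): Set[(T, String)] =
    p.parse(in) ++ q.parse(in)
}

class SeqParser[T, S](p: => Parser[T], q: => Parser[S])
    extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((hd1, tl1) <- p.parse(in);
         (hd2, tl2) <- q.parse(tl1)) yield ((hd1, hd2), tl2)
}

class MapParser[T, S](p: => Parser[T], f: T => S)
    extends Parser[S] {
  def parse(in: String): Set[(S, String)] =
    for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
}

// atomic parser for a fixed string (the empty string always matches)
case class StrParser(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
}

// lets us write p"a" instead of StrParser("a")
extension (sc: StringContext)
  def p(args: Any*): StrParser = StrParser(sc.s(args*))

lazy val Pal: Parser[String] =
  ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
   (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
    p"a" || p"b" || p"")

@main def palDemo(): Unit = {
  println(Pal.parse_all("abaaaba"))  // Set(abaaaba)
  println(Pal.parse_all("ab"))       // Set()
}
\end{lstlisting}
\end{center}

\noindent The recursion terminates because \texttt{Pal} is only
called again after \texttt{p"a"} or \texttt{p"b"} has consumed a
character, so each recursive call sees a strictly shorter input.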
591 e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	696
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	697	If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
as result \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
(an empty set would mean something is wrong). But also notice the
intermediate results generated by \pcode{Pal.parse("abaaaba")}
e3d10383ae37 updated Christian Urban <urbanc@in.tum.de> parents: 590 diff changeset	701
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
\end{lstlisting}
\end{center}

\noindent
That there is more than one output might be slightly unexpected, but it
can be explained as follows: the pairs represent all possible
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
corresponds to a complete parse (all input is consumed) and this is
what \pcode{Pal.parse_all} returns. The second pair is a small
``sub-palindrome'' that can also be parsed, but the parse fails with
the rest \pcode{aaba}, which is therefore left unprocessed. The
third one is an attempt to parse the whole string with the
single-character parser \pcode{a}. That of course only partially
succeeds, leaving \pcode{"baaaba"} as the unprocessed
part. Finally, since we allow the empty string to be a palindrome we
also obtain the last pair, where actually nothing is consumed from the
input string. While all this works as intended, we need to be careful
with this (especially with including the \pcode{""} parser in our
grammar): if during parsing the set of parsing attempts gets too big,
then the parsing process can become very slow as the potential
candidates for applying rules can snowball.
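
To make the difference between \pcode{parse} and \pcode{parse_all}
concrete, the following is a minimal, self-contained sketch and not the
actual code of this handout. It assumes a simplified \texttt{Parser[T]}
class that is fixed to \texttt{String} input, and a hypothetical helper
\texttt{p("...")} standing in for the \texttt{p"..."} interpolator. Under
these assumptions it should reproduce the four partial parses discussed
above.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
object PalDemo {
  // Assumption: a cut-down Parser, fixed to String input
  // (the handout's Parser[I, T] is generic in the input).
  abstract class Parser[T] {
    def parse(in: String): Set[(T, String)]

    // keep only the results where all input was consumed
    def parse_all(in: String): Set[T] =
      for ((hd, tl) <- parse(in); if tl.isEmpty) yield hd

    // by-name arguments (=>), so recursive parsers can be passed
    def ~[S](q: => Parser[S]): Parser[(T, S)] =
      new SeqParser(this, q)
    def ||(q: => Parser[T]): Parser[T] = new AltParser(this, q)
    def map[S](f: T => S): Parser[S] = new MapParser(this, f)
  }

  class AltParser[T](p: => Parser[T], q: => Parser[T])
      extends Parser[T] {
    def parse(in: String): Set[(T, String)] =
      p.parse(in) ++ q.parse(in)
  }

  class SeqParser[T, S](p: => Parser[T], q: => Parser[S])
      extends Parser[(T, S)] {
    def parse(in: String): Set[((T, S), String)] =
      for ((hd1, tl1) <- p.parse(in);
           (hd2, tl2) <- q.parse(tl1)) yield ((hd1, hd2), tl2)
  }

  class MapParser[T, S](p: => Parser[T], f: T => S)
      extends Parser[S] {
    def parse(in: String): Set[(S, String)] =
      for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
  }

  // succeeds if the input starts with the string s
  case class StrParser(s: String) extends Parser[String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }

  // hypothetical stand-in for the p"..." interpolator
  def p(s: String): Parser[String] = StrParser(s)

  lazy val Pal: Parser[String] =
    (p("a") ~ Pal ~ p("a")).map { case ((x, y), z) => x + y + z } ||
    (p("b") ~ Pal ~ p("b")).map { case ((x, y), z) => x + y + z } ||
    p("a") || p("b") || p("")

  def main(args: Array[String]): Unit = {
    println(Pal.parse("abaaaba"))      // all (result, rest) pairs
    println(Pal.parse_all("abaaaba"))  // only the complete parse
  }
}
\end{lstlisting}
\end{center}

\noindent
In this sketch \pcode{parse_all} is nothing more than \pcode{parse}
followed by a filter that discards every pair whose unprocessed rest is
non-empty, which is why only one of the four pairs survives.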

It is also important to note that we must define the
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
just written

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val Pal : Parser[String, String] = ...rhs...
\end{lstlisting}
\end{center}

\noindent
then Scala, before making this assignment to \texttt{Pal}, attempts to
find out what the expression on the right-hand side evaluates to. This
is straightforward in case of simple expressions like \texttt{2 + 3}, but
the expression above contains \texttt{Pal} on the right-hand
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
evaluates to and start a new recursion, which means it falls into an
infinite loop. The definition of \texttt{Pal} is recursive and the
\pcode{lazy} keyword prevents it from being fully evaluated. Therefore
whenever we want to define a recursive parser we have to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val SomeParser : Parser[...,...] = ...rhs...
\end{lstlisting}
\end{center}

\noindent That was not necessary for our atomic parsers, like
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
Note that this is also the reason why we had to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T],
        q: => Parser[I, T]) ... extends Parser[I, T] {...}

class SeqParser[I, T, S]
       (p: => Parser[I, T],
        q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}

\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
the argument types for \texttt{p} and \texttt{q} prevents Scala from
evaluating the arguments at the point where the combinator is
constructed; they are only evaluated when they are actually used.
Normally, Scala would first evaluate what
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
the alternative parser combinator or the sequence parser
combinator, respectively. Since the arguments can be recursive parsers, such as
\texttt{Pal}, this would lead again to an infinite loop.
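
The effect of \pcode{lazy} and of by-name arguments can be observed in
isolation. The following small sketch is independent of the parser code;
the names \texttt{LazyDemo}, \texttt{arg}, \texttt{strict} and
\texttt{byName} are made up for illustration. It simply counts how often
an argument expression is actually evaluated.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
object LazyDemo {
  // counts how often the expression arg() has been evaluated
  var evaluations = 0

  def arg(): Int = { evaluations += 1; 42 }

  // ordinary argument: evaluated before the function is called
  def strict(x: Int): Unit = ()

  // by-name argument: evaluated only where x is used (here: never)
  def byName(x: => Int): Unit = ()

  def main(args: Array[String]): Unit = {
    strict(arg())
    println(evaluations)   // 1: the argument was evaluated

    byName(arg())
    println(evaluations)   // still 1: the argument was not touched

    lazy val v = arg()     // nothing happens at the definition
    println(evaluations)   // still 1

    println(v)             // first use forces the evaluation: 42
    println(evaluations)   // 2
  }
}
\end{lstlisting}
\end{center}

\noindent
The parser combinators rely on exactly this behaviour: a recursive
parser that is passed as a by-name argument is only looked at once
parsing actually needs it.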

As a final example in this section, let us consider the grammar for
well-nested parentheses:

\begin{plstx}[margin=3cm]
: \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
\end{plstx}

\noindent
Let us assume we want to not just recognise strings of
well-nested parentheses but also transform round parentheses
into curly braces. We can do this by using a semantic
action:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
  xleftmargin=0mm, numbers=none]
lazy val P : Parser[String, String] =
  (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
\end{lstlisting}
\end{center}

\noindent
Here we define a function which ignores the parentheses in the
pairs, but replaces them in the right places with curly braces when
assembling the new string on the right-hand side. If we run
\pcode{P.parse_all("(((()()))())")} we obtain
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.


\subsubsection*{Implementing an Interpreter}

The first step before implementing an interpreter for a full-blown
language is to implement a simple calculator for arithmetic
expressions. Suppose our arithmetic expressions are given by the
grammar:

\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}
   | \meta{E} \cdot - \cdot \meta{E}
   | \meta{E} \cdot * \cdot \meta{E}
   | ( \cdot \meta{E} \cdot )
   | Number \\
\end{plstx}

\noindent
Naturally we want to implement the grammar in such a way that we can
calculate what the result of, for example, \texttt{4*2+3} is. We are
interested in an \texttt{Int} rather than a string. This means every
component parser needs to have as output type \texttt{Int} and when we
assemble the intermediate results, strings like \texttt{"+"},
\texttt{"*"} and so on, need to be translated into the appropriate
Scala operation of adding, multiplying and so on. Being inspired by
the parser for well-nested parentheses above and ignoring the fact
that we want $*$ to take precedence over $+$ and $-$, we might want to
write something like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] =
  ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
   (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
   (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
   NumParserInt)
\end{lstlisting}
\end{center}

\noindent
Consider again carefully how the semantic actions pick out the correct
arguments for the calculation. In case of plus, we need \texttt{x} and
\texttt{z}, because they correspond to the results of the component
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
an \texttt{Int} because the output type of \texttt{E} is
\texttt{Int}. The same holds for subtraction and multiplication. In
contrast, in the fourth clause we need to return \texttt{y}, because it
is the result enclosed inside the parentheses. The information about
parentheses, roughly speaking, we just throw away.

So far so good. The problem arises when we try to call \pcode{parse_all} with the
expression \texttt{"1+2+3"}. Let us try it:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
E.parse_all("1+2+3")
\end{lstlisting}
\end{center}

\noindent
\ldots and we wait and wait and \ldots still wait. What is the
problem? Actually, the parser just fell into an infinite loop! The
reason is that the above grammar is left-recursive; recall that our
parser combinators cannot deal with such
grammars. Fortunately, every left-recursive context-free grammar can be
transformed into a non-left-recursive grammar that still recognises
the same strings. This allows us to design the following grammar:

\begin{plstx}[margin=3cm]
: \meta{E} ::= \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
: \meta{T} ::= \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
\end{plstx}

\noindent
Recall what left-recursive means from Handout 5 and make sure you see
why this grammar is \emph{non} left-recursive. This version of the grammar
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
affect which strings this grammar can recognise, but it does affect the
order in which we are going to evaluate an arithmetic expression. We can
translate this grammar into parser combinators as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] =
  (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
  (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T
lazy val T: Parser[String, Int] =
  (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
lazy val F: Parser[String, Int] =
  (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
\end{lstlisting}
\end{center}

\noindent
Let us try out some examples:

\begin{center}
\begin{tabular}{rcl}
  input strings & & output of \pcode{parse_all}\medskip\\
  \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
  \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
  \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
  \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\
\end{tabular}
\end{center}

\noindent
Note that we call \pcode{parse_all}, not \pcode{parse}. The examples
should be quite self-explanatory. The last two examples do not produce
any integer result: our parser does not define what to do in
case of division (this could be easily added), and it also has no idea
what to do with whitespace. Dealing with whitespace is the task of the
lexer! Yes, we could deal with it inside the grammar, but that would
render many grammars unintelligible, including this one.\footnote{If you
  think an easy solution is to extend the notion of what a number
  should be, then think again: you would still have to deal with
  cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess
  you would have in a grammar for a full-blown language where there are
  numerous such cases.}
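
For experimenting with the calculator outside the course framework,
here is a self-contained sketch, not the code that accompanies this
handout. It assumes a simplified \texttt{Parser[A]} class fixed to
\texttt{String} input, a hypothetical helper \texttt{p("...")} in place
of the \texttt{p"..."} interpolator, and a \texttt{NumParserInt} built
from a regular expression for digits. The last line of \texttt{main}
removes whitespace before parsing, which is only a crude stand-in for a
proper lexer.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

object CalcDemo {
  // Assumption: a cut-down Parser, fixed to String input.
  abstract class Parser[A] {
    def parse(in: String): Set[(A, String)]
    def parse_all(in: String): Set[A] =
      for ((hd, tl) <- parse(in); if tl.isEmpty) yield hd
    def ~[B](q: => Parser[B]): Parser[(A, B)] =
      new SeqParser(this, q)
    def ||(q: => Parser[A]): Parser[A] = new AltParser(this, q)
    def map[B](f: A => B): Parser[B] = new MapParser(this, f)
  }
  class AltParser[A](p: => Parser[A], q: => Parser[A])
      extends Parser[A] {
    def parse(in: String): Set[(A, String)] =
      p.parse(in) ++ q.parse(in)
  }
  class SeqParser[A, B](p: => Parser[A], q: => Parser[B])
      extends Parser[(A, B)] {
    def parse(in: String): Set[((A, B), String)] =
      for ((hd1, tl1) <- p.parse(in);
           (hd2, tl2) <- q.parse(tl1)) yield ((hd1, hd2), tl2)
  }
  class MapParser[A, B](p: => Parser[A], f: A => B)
      extends Parser[B] {
    def parse(in: String): Set[(B, String)] =
      for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
  }
  // succeeds if the input starts with the string s
  case class StrParser(s: String) extends Parser[String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }
  // succeeds if a prefix of the input matches the regex
  case class RegexParser(reg: Regex) extends Parser[String] {
    def parse(in: String): Set[(String, String)] =
      reg.findPrefixMatchOf(in) match {
        case None    => Set()
        case Some(m) => Set((m.matched, m.after.toString))
      }
  }

  // hypothetical stand-in for the p"..." interpolator
  def p(s: String): Parser[String] = StrParser(s)
  val NumParserInt: Parser[Int] =
    RegexParser("[0-9]+".r).map(_.toInt)

  lazy val E: Parser[Int] =
    (T ~ p("+") ~ E).map { case ((x, _), z) => x + z } ||
    (T ~ p("-") ~ E).map { case ((x, _), z) => x - z } || T
  lazy val T: Parser[Int] =
    (F ~ p("*") ~ T).map { case ((x, _), z) => x * z } || F
  lazy val F: Parser[Int] =
    (p("(") ~ E ~ p(")")).map { case ((_, y), _) => y } ||
    NumParserInt

  def main(args: Array[String]): Unit = {
    println(E.parse_all("4*2+3"))      // Set(11)
    println(E.parse_all("4*(2+3)"))    // Set(20)
    println(E.parse_all("1 + 2 + 3"))  // Set(): whitespace
    // crude stand-in for a lexer: drop whitespace first
    println(E.parse_all("1 + 2 + 3".filterNot(_.isWhitespace)))
  }
}
\end{lstlisting}
\end{center}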

\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:

% author: Christian Urban <christian.urban@kcl.ac.uk>
% date: Tue, 03 Oct 2023 23:23:57 +0100
% changeset 937, b3a237a5f4ad