afl-material: handouts/ho06.tex@bdb9940be094 (annotated)

584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	1
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	2	\documentclass{article}
297 5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	3	\usepackage{../style}
217 cd6066f1056a updated handouts Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 183 diff changeset	4	\usepackage{../langs}
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	5	\usepackage{../grammar}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	6
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	7	\begin{document}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	8
292 7ed2a25dd115 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 217 diff changeset	9	\section*{Handout 6 (Parser Combinators)}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	10
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	11	This handout explains how \emph{parser combinators} work and how they
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	12	can be implemented in Scala. Their most distinguishing feature is that
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	13	they are very easy to implement (admittedly it is only easy in a
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	14	functional programming language). Another good point of parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	15	combinators is that they can deal with any kind of input as long as
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	16	this input is of ``sequence-kind'', for example a string or a list of
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	17	tokens. The only two properties of the input we need is to be able to
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	18	test when it is empty and ``sequentially'' take it apart. Strings and
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	19	lists fit this bill. However, parser combinators also have their
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	20	drawbacks. For example they require that the grammar to be parsed is
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	21	\emph{not} left-recursive and they are efficient only when the grammar
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	22	is unambiguous. It is the responsibility of the grammar designer to
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	23	ensure these two properties.
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	24
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	25	The general idea behind parser combinators is to transform the input
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	26	into sets of pairs, like so
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	27
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	28	\begin{center}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	29	$\underbrace{\text{list of tokens}}_{\text{input}}$
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	30	$\Rightarrow$
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	31	$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	32	\end{center}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	33
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	34	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	35	Given the extended effort we have spent implementing a lexer
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	36	in order to generate list of tokens, it might be surprising that in
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	37	what follows we shall often use strings as input. This is for making
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	38	the explanation more lucid. It does not make our previous work on
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	39	lexers obsolete (remember they transform a string into a list of
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	40	tokens). Lexers will still be needed for building a somewhat realistic
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	41	compiler.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	42
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	43	But as said, parser combinators are relatively agnostic about what
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	44	kind of input they process. In my Scala code I use the following
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	45	polymorphic types for parser combinators:
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	46
3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	47	\begin{center}
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	48	input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	49	\end{center}
3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	50
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	51	\noindent That is they take as input something of type \texttt{I} and
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	52	return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	53	to be of ``sequence-kind'', I actually have to often write \texttt{I
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	54	<\% Seq[\_]} for the input type. This ensures the input is a
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	55	subtype of Scala sequences. The first component of the generated pairs
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	56	corresponds to what the parser combinator was able to process from the
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	57	input and the second is the unprocessed part of the input (therefore
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	58	the type of this unprocessed part is the same as the input). As we
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	59	shall see shortly, a parser combinator might return more than one such
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	60	pair; the idea is that there are potentially several ways of how to
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	61	parse the input. As a concrete example, consider the string
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	62
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	63	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	64	\tt\Grid{iffoo\VS testbar}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	65	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	66
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	67	\noindent We might have a parser combinator which tries to
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	68	interpret this string as a keyword (\texttt{if}) or as an
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	69	identifier (\texttt{iffoo}). Then the output will be the set
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	70
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	71	\begin{center}
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	72	$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	73	\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	74	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	75
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	76	\noindent where the first pair means the parser could recognise
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	77	\texttt{if} from the input and leaves the rest as `unprocessed' as the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	78	second component of the pair; in the other case it could recognise
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	79	\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	80	parser cannot recognise anything from the input at all, then parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	81	combinators just return the empty set $\{\}$. This will indicate
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	82	something ``went wrong''\ldots or more precisely, nothing could be
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	83	parsed.
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	84
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	85	Also important to note is that the type \texttt{T} for the processed
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	86	part is different from the input type. In the example above is just
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	87	happens to be the same. The reason for the difference is that in
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	88	general we are interested in transforming our input into something
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	89	``different''\ldots for example into a tree; or if we implement the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	90	grammar for arithmetic expressions, we might be interested in the
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	91	actual integer number the arithmetic expression, say \texttt{1 + 2 *
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	92	3}, stands for. In this way we can use parser combinators to
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	93	implement relatively easily a calculator, for instance.
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	94
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	95	The main idea of parser combinators is that we can easily build parser
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	96	combinators out of smaller components following very closely the
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	97	structure of a grammar. In order to implement this in an
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	98	object-oriented programming language, like Scala, we need to specify
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	99	an abstract class for parser combinators. This abstract class states
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	100	that the function \texttt{parse} takes an argument of type \texttt{I}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	101	and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	102
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	103	\begin{center}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	104	\begin{lstlisting}[language=Scala]
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	105	abstract class Parser[I, T] {
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	106	def parse(ts: I): Set[(T, I)]
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	107
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	108	def parse_all(ts: I): Set[T] =
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	109	for ((head, tail) <- parse(ts); if (tail.isEmpty))
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	110	yield head
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	111	}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	112	\end{lstlisting}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	113	\end{center}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	114
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	115	\noindent It is the obligation in each instance (parser combinator) to
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	116	supply an implementation for \texttt{parse}. From this function we
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	117	can then ``centrally'' derive the function \texttt{parse\_all}, which
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	118	just filters out all pairs whose second component is not empty (that
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	119	is has still some unprocessed part). The reason is that at the end of
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	120	the parsing we are only interested in the results where all the input
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	121	has been consumed and no unprocessed part is left over.
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	122
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	123	One of the simplest parser combinators recognises just a
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	124	single character, say $c$, from the beginning of strings. Its
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	125	behaviour can be described as follows:
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	126
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	127	\begin{itemize}
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	128	\item If the head of the input string starts with a $c$, then return
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	129	the set
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	130	\[\{(c, \textit{tail of}\; s)\}\]
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	131	where \textit{tail of}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	132	$s$ is the unprocessed part of the input string.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	133	\item Otherwise return the empty set $\{\}$.
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	134	\end{itemize}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	135
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	136	\noindent
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	137	The input type of this simple parser combinator for characters is
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	138	\texttt{String} and the output type is \texttt{Char}. This means
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	139	\texttt{parse} returns \mbox{\texttt{Set[(Char, String)]}}.
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	140	The code in Scala is as follows:
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	141
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	142	\begin{center}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	143	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	144	case class CharParser(c: Char) extends Parser[String, Char] {
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	145	def parse(in: String) =
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	146	if (in.head == c) Set((c, in.tail)) else Set()
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	147	}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	148	\end{lstlisting}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	149	\end{center}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	150
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	151	\noindent You can see \texttt{parse} tests whether the
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	152	first character of the input string \texttt{in} is equal to
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	153	\texttt{c}. If yes, then it splits the string into the recognised part
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	154	\texttt{c} and the unprocessed part \texttt{in.tail}. In case
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	155	\texttt{in} does not start with \texttt{c} then the parser returns the
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	156	empty set (in Scala \texttt{Set()}). Since this parser recognises
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	157	characters and just returns characters as the processed part, the
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	158	output type of the parser is \texttt{Char}.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	159
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	160	If we want to parse a list of tokens and interested in recognising a
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	161	number token, we could write something like this
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	162
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	163	\begin{center}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	164	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	165	case object NumParser extends Parser[List[Token], Int] {
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	166	def parse(ts: List[Token]) = ts match {
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	167	case Num_token(s)::ts => Set((s.toInt, ts))
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	168	case _ => Set ()
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	169	}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	170	}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	171	\end{lstlisting}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	172	\end{center}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	173
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	174	\noindent
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	175	In this parser the input is of type \texttt{List[Token]}. The function
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	176	parse looks at the input \texttt{ts} and checks whether the first
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	177	token is a \texttt{Num\_token} (let us assume our lexer generated
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	178	these tokens for numbers). But this parser does not just return this
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	179	token (and the rest of the list), like the \texttt{CharParser} above,
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	180	rather it extracts the string \texttt{s} from the token and converts it
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	181	into an integer. The hope is that the lexer did its work well and this
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	182	conversion always succeeds. The consequence of this is that the output
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	183	type for this parser is \texttt{Int}, not \texttt{Token}. Such a
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	184	conversion would be needed if we want to implement a simple calculator
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	185	program.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	186
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	187
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	188	These simple parsers that just look at the input and do a simple
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	189	transformation are often called \emph{atomic} parser combinators.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	190	More interesting are the parser combinators that build larger parsers
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	191	out of smaller component parsers. There are three such parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	192	combinators that can be implemented generically. The \emph{alternative
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	193	parser combinator} is as follows: given two parsers, say, $p$ and
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	194	$q$, we apply both parsers to the input (remember parsers are
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	195	functions) and combine the output (remember they are sets of pairs):
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	196
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	197	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	198	$p(\text{input}) \cup q(\text{input})$
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	199	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	200
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	201	\noindent In Scala we can implement alternative parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	202	combinator as follows
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	203
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	204	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	205	\begin{lstlisting}[language=Scala, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	206	class AltParser[I, T]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	207	(p: => Parser[I, T],
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	208	q: => Parser[I, T]) extends Parser[I, T] {
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	209	def parse(in: I) = p.parse(in) ++ q.parse(in)
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	210	}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	211	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	212	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	213
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	214	\noindent The types of this parser combinator are again generic (we
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	215	have \texttt{I} for the input type, and \texttt{T} for the output
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	216	type). The alternative parser builds a new parser out of two existing
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	217	parsers \texttt{p} and \texttt{q} given as arguments. Both parsers
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	218	need to be able to process input of type \texttt{I} and return in
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	219	\texttt{parse} the same output type \texttt{Set[(T,
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	220	I)]}.\footnote{There is an interesting detail of Scala, namely the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	221	\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	222	will prevent the evaluation of the arguments before they are
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	223	used. This is often called \emph{lazy evaluation} of the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	224	arguments. We will explain this later.} The alternative parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	225	should run the input with the first parser \texttt{p} (producing a set
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	226	of pairs) and then run the same input with \texttt{q} (producing
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	227	another set of pairs). The result should be then just the union of
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	228	both sets, which is the operation \texttt{++} in Scala.
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	229
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	230	The alternative parser combinator allows us to construct a parser that
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	231	parses either a character \texttt{a} or \texttt{b} using the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	232	\texttt{CharParser} shown above. For this we can write
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	233
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	234	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	235	\begin{lstlisting}[language=Scala, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	236	new AltParser(CharParser('a'), CharParser('b'))
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	237	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	238	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	239
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	240	\noindent Later on we will use Scala mechanism for introducing some
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	241	more readable shorthand notation for this, like \texttt{"a" \|
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	242	"b"}. Let us look in detail at what this parser combinator produces
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	243	with some sample strings
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	244
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	245	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	246	\begin{tabular}{rcl}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	247	input strings & & output\medskip\\
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	248	\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	249	\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	250	\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	251	\end{tabular}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	252	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	253
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	254	\noindent We receive in the first two cases a successful
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	255	output (that is a non-empty set). In each case, either
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	256	\pcode{a} or \pcode{b} is in the processed part, and
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	257	\pcode{cde} in the unprocessed part. Clearly this parser cannot
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	258	parse anything with \pcode{ccde}, therefore the empty
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	259	set is returned.
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	260
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	261	A bit more interesting is the \emph{sequence parser combinator}. Given
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	262	two parsers, say again, $p$ and $q$, we want to apply first the input
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	263	to $p$ producing a set of pairs; then apply $q$ to all the unparsed
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	264	parts in the pairs; and then combine the results. Mathematically we would
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	265	write something like this for the expected set of pairs:
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	266
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	267	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	268	\begin{tabular}{lcl}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	269	$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,\|\,$ &
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	270	$(\textit{output}_1, u_1) \in p(\text{input})
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	271	\;\wedge\;$\\
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	272	&& $(\textit{output}_2, u_2) \in q(u_1)\}$
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	273	\end{tabular}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	274	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	275
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	276	\noindent Notice that the $p$ wil first be run on the input, producing
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	277	pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	278	for the unprocessed, or leftover, parts. We want that $q$ runs on all
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	279	these unprocessed parts $u_1$. This again will produce some
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	280	processed part , $p$ and
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	281	$q$, we apply both parsers to the input (remember parsers are
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	282	functions) and combine the output (remember they are sets of pairs)
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	283
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	284	\begin{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	285	$p(\text{input}) \cup q(\text{input})$
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	286	\end{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	287
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	288	\noindent In Scala we would implement alternative parser
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	289	combinator as follows
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	290
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	291	\begin{center}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	292	\begin{lstlisting}[language=Scala, numbers=none]
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	293	class AltParser[I, T]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	294	(p: => Parser[I, T],
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	295	q: => Parser[I, T]) extends Parser[I, T] {
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	296	def parse(in: I) = p.parse(in) ++ q.parse(in)
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	297	}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	298	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	299	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	300
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	301	\noindent The types of this parser combinator are again generic (we
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	302	just have \texttt{I} for the input type, and \texttt{T} for the output
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	303	type). The alternative parser builds a new parser out of two existing
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	304	parsers \texttt{p} and \texttt{q}. Both need to be able to process
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	305	input of type \texttt{I} and return the same output type
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	306	\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	307	Scala, namely the \texttt{=>} in front of the types of \texttt{p}
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	308	and \texttt{q}. They will prevent the evaluation of the arguments
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	309	before they are used. This is often called \emph{lazy evaluation} of
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	310	the arguments. We will explain this later.} Therefore the output
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	311	type of this parser is \texttt{T}. The alternative parser should run
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	312	the input with the first parser \texttt{p} (producing a set of pairs)
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	313	and then run the same input with \texttt{q} (producing another set of
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	314	pairs). The result should be then just the union of both sets, which
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	315	is the operation \texttt{++} in Scala.
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	316
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	317	The alternative parser combinator already allows us to
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	318	construct a parser that parses either a character \texttt{a}
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	319	or \texttt{b}, as
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	320
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	321	\begin{center}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	322	\begin{lstlisting}[language=Scala, numbers=none]
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	323	new AltParser(CharParser('a'), CharParser('b'))
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	324	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	325	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	326
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	327	\noindent Later on we will use again Scala mechanism for introducing some
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	328	more readable shorthand notation for this, like \texttt{"a" \| "b"}. Let us look in detail at what this parser combinator produces with some sample strings
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	329
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	330	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	331	\begin{tabular}{rcl}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	332	input strings & & output\medskip\\
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	333	\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	334	\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	335	\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	336	\end{tabular}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	337	\end{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	338
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	339	\noindent We receive in the first two cases a successful
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	340	output (that is a non-empty set). In each case, either
392 2d0a59127694 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 386 diff changeset	341	\pcode{a} or \pcode{b} is in the processed part, and
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	342	\pcode{cde} in the unprocessed part. Clearly this parser cannot
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	343	parse anything in the string \pcode{ccde}, therefore the empty
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	344	set.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	345
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	346	A bit more interesting is the \emph{sequence parser combinator}. Given
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	347	two parsers, say again, $p$ and $q$, we want to apply first the input
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	348	to $p$ producing a set of pairs; then apply $q$ to all the unparsed
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	349	parts in the pairs; and then combine the results like
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	350
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	351	\begin{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	352	\begin{tabular}{lcl}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	353	$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;\|\;$ &
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	354	$(\textit{output}_1, u_1) \in p(\text{input})
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	355	\;\wedge\;$\\
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	356	&& $(\textit{output}_2, u_2) \in q(u_1)\}$
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	357	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	358	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	359
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	360	\noindent Notice that the $p$ will first be run on the input,
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	361	producing pairs of the form $\textit{output}_1$ and unprocessed part
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	362	$u_1$. This unprocessed part is fed into the second parser $q$. The
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	363	overall result of the sequence parser combinator is pairs of the form
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	364	$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	365	unprocessed part of both parsers is the unprocessed part the second
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	366	parser $q$ leaves as leftover. The processed parts of both
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	367	parsers is a pair consisting of the outputs of $p$ and $q$, namely
586 9cb8dfcb7f30 typos Christian Urban <urbanc@in.tum.de> parents: 585 diff changeset	368	$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	369	implemented in Scala as follows:
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	370
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	371	\begin{center}
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	372	\begin{lstlisting}[language=Scala,numbers=none]
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	373	class SeqParser[I, T, S]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	374	(p: => Parser[I, T],
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	375	q: => Parser[I, S]) extends Parser[I, (T, S)] {
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	376	def parse(in: I) =
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	377	for ((output1, u1) <- p.parse(in);
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	378	(output2, u2) <- q.parse(u1))
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	379	yield ((output1, output2), u2)
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	380	}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	381	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	382	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	383
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	384	\noindent This parser takes again as arguments two parsers, \texttt{p}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	385	and \texttt{q}. It implements \texttt{parse} as follows: let first run
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	386	the parser \texttt{p} on the input producing a set of pairs
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	387	(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	388	unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	389	unprocessed parts producing again a set of pairs. The output of the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	390	sequence parser combinator is then a set containing pairs where the
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	391	first components are again pairs, namely what the first parser could
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	392	parse together with what the second parser could parse; the second
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	393	component is the unprocessed part left over after running the second
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	394	parser \texttt{q}. Therefore the input type of the sequence parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	395	combinator is as usual \texttt{I}, but the output type is
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	396
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	397	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	398	\texttt{Set[((T, S), I)]}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	399	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	400
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	401	\noindent
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	402	If any of the runs of \textit{p} and \textit{q} fail, that is produce
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	403	the empty set, then \texttt{parse} will also produce the empty set.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	404	Notice that we have now two output types for the sequence parser
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	405	combinator, because in general \textit{p} and \textit{q} might produce
586 9cb8dfcb7f30 typos Christian Urban <urbanc@in.tum.de> parents: 585 diff changeset	406	different things (for example first we recognise a number and then a
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	407	string corresponding to an operator).
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	408
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	409	With the shorthand notation we shall introduce later for the sequence
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	410	parser combinator, we can write for example \pcode{"a" ~ "b"}, which
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	411	is the parser combinator that first recognises the character
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	412	\texttt{a} from a string and then \texttt{b}. Let us look again at
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	413	three examples of how this parser combinator processes some strings:
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	414
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	415	\begin{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	416	\begin{tabular}{rcl}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	417	input strings & & output\medskip\\
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	418	\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	419	\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	420	\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	421	\end{tabular}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	422	\end{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	423
586 9cb8dfcb7f30 typos Christian Urban <urbanc@in.tum.de> parents: 585 diff changeset	424	\noindent In the first line we have a successful parse, because the
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	425	string starts with \texttt{ab}, which is the prefix we are looking
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	426	for. But since the parsing combinator is constructed as sequence of
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	427	the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	428	result is a nested pair of the form \texttt{((a, b), cde)}. It is
586 9cb8dfcb7f30 typos Christian Urban <urbanc@in.tum.de> parents: 585 diff changeset	429	\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	430	expect. The parser returns the empty set in the other examples,
584 7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	431	because they do not fit with what the parser is supposed to parse.
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	432
7b879f2a8f6a updated Christian Urban <urbanc@in.tum.de> parents: 392 diff changeset	433
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	434	A slightly more complicated parser is \pcode{("a" \| "b") ~ "c"} which
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	435	parses as first character either an \texttt{a} or \texttt{b}, followed
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	436	by a \texttt{c}. This parser produces the following outputs.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	437
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	438	\begin{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	439	\begin{tabular}{rcl}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	440	input strings & & output\medskip\\
585 3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	441	\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	442	\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	443	\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	444	\end{tabular}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	445	\end{center}
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	446
585 3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	447	\noindent
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	448	Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	449	\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	450	the following outputs.
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	451
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	452	\begin{center}
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	453	\begin{tabular}{rcl}
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	454	input strings & & output\medskip\\
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	455	\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	456	\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	457	\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	458	\end{tabular}
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	459	\end{center}
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	460
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	461
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	462	\noindent The second and third example fail, because something is
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	463	``missing'' in the sequence we are looking for. Also notice how the
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	464	results nest with sequences: the parsed part is a nested pair of the
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	465	form \pcode{((a, b), c)}. If we nest the sequence parser differently,
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	466	for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	467	differently
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	468
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	469	\begin{center}
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	470	\begin{tabular}{rcl}
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	471	input strings & & output\medskip\\
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	472	\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	473	\end{tabular}
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	474	\end{center}
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	475
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	476	\noindent
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	477	Two more examples: first consider the parser
585 3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	478	\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	479
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	480	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	481	\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	482	input string & & output\medskip\\
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	483	\texttt{\Grid{aaaa}} & $\rightarrow$ &
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	484	$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	485	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	486	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	487
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	488	\noindent Notice how again the results nest deeper and deeper as pairs (the
585 3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	489	last \pcode{a} is in the unprocessed part). To consume everything of
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	490	this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	491	"a"}. Then the output is as follows:
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	492
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	493	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	494	\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	495	input string & & output\medskip\\
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	496	\texttt{\Grid{aaaa}} & $\rightarrow$ &
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	497	$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	498	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	499	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	500
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	501	\noindent This is an instance where the parser consumed
7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	502	completely the input, meaning the unprocessed part is just the
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	503	empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
585 3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	504	we would get back the result
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	505
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	506	\[
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	507	\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	508	\]
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	509
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	510	\noindent where the unprocessed (empty) parts have been stripped away
3ebc7b45ecd5 updated Christian Urban <urbanc@in.tum.de> parents: 584 diff changeset	511	from the pairs; everything where the second part was not empty has
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	512	been thrown away as well, because they represent
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	513	ultimately-unsuccessful-parses.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	514
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	515
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	516	Note carefully that constructing a parser such \pcode{"a" \| ("a" ~
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	517	"b")} will result in a typing error. The intention is that we want
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	518	to parse an \texttt{a}, or an \texttt{a} followed by a
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	519	\texttt{b}. However, the first parser has as output type a single
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	520	character (recall the type of \texttt{CharParser}), but the second
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	521	parser produces a pair of characters as output. The alternative parser
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	522	is required to have both component parsers to have the same type---we
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	523	need to be able to build the union of two sets, which means in Scala
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	524	they need to be of the same type. Therefore the typing error.
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	525	We will see later how we can build
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	526	this parser without the typing error.
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 297 diff changeset	527
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	528	The next parser combinator, called \emph{semantic action}, does not
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	529	actually combine smaller parsers, but applies a function to the result
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	530	of a parser. It is implemented in Scala as follows
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	531
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	532	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	533	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	534	class FunParser[I, T, S]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	535	(p: => Parser[I, T],
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	536	f: T => S) extends Parser[I, S] {
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	537	def parse(in: I) =
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	538	for ((head, tail) <- p.parse(in)) yield (f(head), tail)
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	539	}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	540	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	541	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	542
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	543
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	544	\noindent This parser combinator takes a parser \texttt{p}
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	545	with output type \texttt{T} as one argument as well as a
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	546	function \texttt{f} with type \texttt{T => S}. The parser
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	547	\texttt{p} produces sets of type \texttt{(T, I)}. The
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	548	semantic action combinator then applies the function
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	549	\texttt{f} to all the parser outputs. Since this function is of
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	550	type \texttt{T => S}, we obtain a parser with output type
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	551	\texttt{S}. Again Scala lets us introduce some shorthand
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	552	notation for this parser combinator. Therefore we will write
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	553	\texttt{p ==> f} for it.
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	554
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	555	What are semantic actions good for? Well, they allow you to transform
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	556	the parsed input into a datastructure you can use for further
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	557	processing. A simple example would be to transform parsed characters
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	558	into ASCII numbers. Suppose we define a function \texttt{f} (from
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	559	characters to ints) and use a \texttt{CharParser} for parsing
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	560	the character \texttt{c}.
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	561
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	562	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	563	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	564	val f = (c: Char) => c.toInt
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	565	val c = new CharParser('c')
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	566	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	567	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	568
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	569	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	570	We then can run the following two parsers on the input \texttt{cbd}:
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	571
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	572	\begin{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	573	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	574	c.parse("cbd")
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	575	(c ==> f).parse("cbd")
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	576	\end{lstlisting}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	577	\end{center}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	578
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	579	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	580	In the first line we obtain the expected result \texttt{Set(('c',
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	581	"bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	582	character has been transformed into an ASCII number.
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	583
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	584	A slightly less contrived example is about parsing numbers (recall
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	585	\texttt{NumParser} above). However, we want to do this here for strings.
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	586	For this assume we have the following \texttt{RegexParser}.
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	587
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	588	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	589	\begin{lstlisting}[language=Scala,xleftmargin=0mm,
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	590	basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	591	import scala.util.matching.Regex
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	592
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	593	case class RegexParser(reg: Regex) extends Parser[String, String] {
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	594	def parse(in: String) = reg.findPrefixMatchOf(in) match {
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	595	case None => Set()
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	596	case Some(m) => Set((m.matched, m.after.toString))
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	597	}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	598	}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	599	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	600	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	601
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	602	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	603	This parser takes a regex as argument and splits up a string into a
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	604	prefix and the rest according to this regex
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	605	(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	606	case---and the corresponding strings can be extracted with
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	607	\texttt{matched} and \texttt{after}). Using this parser we can define a
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	608	\texttt{NumParser} for strings as follows:
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	609
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	610	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	611	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	612	val NumParser = RegexParser("[0-9]+".r)
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	613	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	614	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	615
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	616	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	617	This parser will recognise a number at the beginning of a string, for
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	618	example
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	619
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	620	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	621	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	622	NumParser.parse("123abc")
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	623	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	624	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	625
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	626	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	627	produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	628	still a string (the double-quotes are not printed by Scala). We need
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	629	to convert it into the corresponding \texttt{Int}. We can do this as
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	630	follows using a semantic action
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	631
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	632	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	633	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	634	(NumParser ==> (s => s.toInt)).parse("123abc")
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	635	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	636	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	637
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	638	\noindent
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	639	The function in the semantic action converts a string into an
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	640	\texttt{Int}. Let us come back to semantic actions when we are going
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	641	to implement actual context-free gammars.
587 b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	642
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	643	\subsubsection*{Shorthand notation for parser combinators}
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	644
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	645	Before we proceed, let us just explain the shorthand notation for
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	646	parser combinators. Like for regular expressions, the shorthand notation
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	647	will make our life much easier when writing actual parsers.
b2f9734e435a updated Christian Urban <urbanc@in.tum.de> parents: 586 diff changeset	648
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	649	\subsubsection*{How to build parsers using parser combinators?}
31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	650
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	651	The beauty of parser combinators is the ease with which they can be
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	652	implemented and how easy it is to translate context-free grammars into
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	653	code (though the grammars need to be non-left-recursive). To
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	654	demonstrate this recall our context-free grammar for palindromes:
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	655
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	656	\begin{plstx}[margin=3cm]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	657	: \meta{P} ::= a\cdot \meta{P}\cdot a \| b\cdot \meta{P}\cdot b \| a \| b \| \epsilon\\
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	658	\end{plstx}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	659
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	660	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	661	Given the parser combinators for alternatives and sequeneces, this grammar should be
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	662	straightforward to implement. The first idea would be
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	663
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	664	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	665	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	666	lazy val Pal : Parser[String, String] =
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	667	(("a" ~ Pal ~ "a") \| ("b" ~ Pal ~ "b") \| "a" \| "b" \| "")
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	668	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	669	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	670
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	671	\noindent
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	672	Unfortunately, this does not work as it produces a typing error. The
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	673	reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	674	all produce strings and therefore can be put into an alternative
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	675	\texttt{...\| "a" \| "b" \| ""}. But both \pcode{"a" ~ Pal ~ "a"} and
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	676	\pcode{"b" ~ Pal ~ "b"} produce pairs of the form
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	677	$(((\_, \_), \_), \_)$---that is how the sequence parser combinator
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	678	nests results when \pcode{\~} is used between two components. The
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	679	solution is to use a semantic action that ``flattens'' these pairs and
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	680	appends the corresponding strings, like
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	681
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	682	\begin{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	683	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	684	lazy val Pal : Parser[String, String] =
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	685	(("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } \|
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	686	("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } \|
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	687	"a" \| "b" \| "")
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	688	\end{lstlisting}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	689	\end{center}
3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	690
589 bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	691	\noindent
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	692	The semantic action
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	693
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	694
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	695	\noindent
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	696	Important to note is that we must define \texttt{Pal}-parser as a
bdb9940be094 updated Christian Urban <urbanc@in.tum.de> parents: 588 diff changeset	697	\emph{lazy} value.
588 3e317772acc9 updated Christian Urban <urbanc@in.tum.de> parents: 587 diff changeset	698
386 31295bb945c6 update Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 385 diff changeset	699	\subsubsection*{Implementing an Interpreter}
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	700
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	701	%\bigskip
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	702	%takes advantage of the full generality---have a look
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	703	%what it produces if we call it with the string \texttt{abc}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	704	%
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	705	%\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	706	%\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	707	%input string & & output\medskip\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	708	%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	709	%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	710	%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	711	%\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	712	%\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	713
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	714
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	715
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	716	\end{document}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	717
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	718	%%% Local Variables:
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	719	%%% mode: latex
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	720	%%% TeX-master: t
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	721	%%% End:

author	Christian Urban <urbanc@in.tum.de>
	Thu, 25 Oct 2018 06:53:16 +0100
changeset 589	bdb9940be094
parent 588	3e317772acc9
child 590	5cdefb0e309e
permissions	-rw-r--r--