| author | Christian Urban <urbanc@in.tum.de> | 
| Thu, 25 Oct 2018 06:53:16 +0100 | |
| changeset 589 | bdb9940be094 | 
| parent 588 | 3e317772acc9 | 
| child 590 | 5cdefb0e309e | 
| permissions | -rw-r--r-- | 
| 584 | 1  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
2  | 
\documentclass{article}
 | 
| 
297
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
3  | 
\usepackage{../style}
 | 
| 
217
 
cd6066f1056a
updated handouts
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
183 
diff
changeset
 | 
4  | 
\usepackage{../langs}
 | 
| 588 | 5  | 
\usepackage{../grammar}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
6  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
7  | 
\begin{document}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
8  | 
|
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
9  | 
\section*{Handout 6 (Parser Combinators)}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
10  | 
|
| 584 | 11  | 
This handout explains how \emph{parser combinators} work and how they
 | 
| 587 | 12  | 
can be implemented in Scala. Their most distinguishing feature is that  | 
13  | 
they are very easy to implement (admittedly it is only easy in a  | 
|
14  | 
functional programming language). Another good point of parser  | 
|
15  | 
combinators is that they can deal with any kind of input as long as  | 
|
16  | 
this input is of ``sequence-kind'', for example a string or a list of  | 
|
17  | 
tokens. The only two properties of the input we need is to be able to  | 
|
18  | 
test when it is empty and ``sequentially'' take it apart. Strings and  | 
|
19  | 
lists fit this bill. However, parser combinators also have their  | 
|
20  | 
drawbacks. For example they require that the grammar to be parsed is  | 
|
21  | 
\emph{not} left-recursive and they are efficient only when the grammar
 | 
|
22  | 
is unambiguous. It is the responsibility of the grammar designer to  | 
|
23  | 
ensure these two properties.  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
24  | 
|
| 587 | 25  | 
The general idea behind parser combinators is to transform the input  | 
26  | 
into sets of pairs, like so  | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
27  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
28  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
29  | 
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
30  | 
$\Rightarrow$  | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
31  | 
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
32  | 
\end{center} 
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
33  | 
|
| 587 | 34  | 
\noindent  | 
| 589 | 35  | 
Given the extended effort we have spent implementing a lexer  | 
| 587 | 36  | 
in order to generate list of tokens, it might be surprising that in  | 
37  | 
what follows we shall often use strings as input. This is for making  | 
|
38  | 
the explanation more lucid. It does not make our previous work on  | 
|
39  | 
lexers obsolete (remember they transform a string into a list of  | 
|
40  | 
tokens). Lexers will still be needed for building a somewhat realistic  | 
|
| 584 | 41  | 
compiler.  | 
42  | 
||
| 587 | 43  | 
But as said, parser combinators are relatively agnostic about what  | 
44  | 
kind of input they process. In my Scala code I use the following  | 
|
45  | 
polymorphic types for parser combinators:  | 
|
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
46  | 
|
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
47  | 
\begin{center}
 | 
| 584 | 48  | 
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
49  | 
\end{center}
 | 
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
50  | 
|
| 587 | 51  | 
\noindent That is they take as input something of type \texttt{I} and
 | 
52  | 
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
 | 
|
53  | 
to be of ``sequence-kind'', I actually have to often write \texttt{I
 | 
|
54  | 
<\% Seq[\_]} for the input type. This ensures the input is a  | 
|
| 584 | 55  | 
subtype of Scala sequences. The first component of the generated pairs  | 
56  | 
corresponds to what the parser combinator was able to process from the  | 
|
57  | 
input and the second is the unprocessed part of the input (therefore  | 
|
58  | 
the type of this unprocessed part is the same as the input). As we  | 
|
59  | 
shall see shortly, a parser combinator might return more than one such  | 
|
| 587 | 60  | 
pair; the idea is that there are potentially several ways of how to  | 
| 584 | 61  | 
parse the input. As a concrete example, consider the string  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
62  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
63  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
64  | 
\tt\Grid{iffoo\VS testbar}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
65  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
66  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
67  | 
\noindent We might have a parser combinator which tries to  | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
68  | 
interpret this string as a keyword (\texttt{if}) or as an
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
69  | 
identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
70  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
71  | 
\begin{center}
 | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
72  | 
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
73  | 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
74  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
75  | 
|
| 587 | 76  | 
\noindent where the first pair means the parser could recognise  | 
77  | 
\texttt{if} from the input and leaves the rest as `unprocessed' as the
 | 
|
78  | 
second component of the pair; in the other case it could recognise  | 
|
79  | 
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
 | 
|
80  | 
parser cannot recognise anything from the input at all, then parser  | 
|
81  | 
combinators just return the empty set $\{\}$. This will indicate
 | 
|
82  | 
something ``went wrong''\ldots or more precisely, nothing could be  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
83  | 
parsed.  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
84  | 
|
| 584 | 85  | 
Also important to note is that the type \texttt{T} for the processed
 | 
| 587 | 86  | 
part is different from the input type. In the example above is just  | 
87  | 
happens to be the same. The reason for the difference is that in  | 
|
88  | 
general we are interested in transforming our input into something  | 
|
89  | 
``different''\ldots for example into a tree; or if we implement the  | 
|
90  | 
grammar for arithmetic expressions, we might be interested in the  | 
|
| 584 | 91  | 
actual integer number the arithmetic expression, say \texttt{1 + 2 *
 | 
92  | 
3}, stands for. In this way we can use parser combinators to  | 
|
| 587 | 93  | 
implement relatively easily a calculator, for instance.  | 
| 584 | 94  | 
|
95  | 
The main idea of parser combinators is that we can easily build parser  | 
|
96  | 
combinators out of smaller components following very closely the  | 
|
97  | 
structure of a grammar. In order to implement this in an  | 
|
98  | 
object-oriented programming language, like Scala, we need to specify  | 
|
99  | 
an abstract class for parser combinators. This abstract class states  | 
|
100  | 
that the function \texttt{parse} takes an argument of type \texttt{I}
 | 
|
101  | 
and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
 | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
102  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
103  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
104  | 
\begin{lstlisting}[language=Scala]
 | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
105  | 
abstract class Parser[I, T] {
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
106  | 
def parse(ts: I): Set[(T, I)]  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
107  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
108  | 
def parse_all(ts: I): Set[T] =  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
109  | 
for ((head, tail) <- parse(ts); if (tail.isEmpty))  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
110  | 
yield head  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
111  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
112  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
113  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
114  | 
|
| 584 | 115  | 
\noindent It is the obligation in each instance (parser combinator) to  | 
116  | 
supply an implementation for \texttt{parse}.  From this function we
 | 
|
117  | 
can then ``centrally'' derive the function \texttt{parse\_all}, which
 | 
|
118  | 
just filters out all pairs whose second component is not empty (that  | 
|
119  | 
is has still some unprocessed part). The reason is that at the end of  | 
|
120  | 
the parsing we are only interested in the results where all the input  | 
|
121  | 
has been consumed and no unprocessed part is left over.  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
122  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
123  | 
One of the simplest parser combinators recognises just a  | 
| 584 | 124  | 
single character, say $c$, from the beginning of strings. Its  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
125  | 
behaviour can be described as follows:  | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
126  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
127  | 
\begin{itemize}
 | 
| 584 | 128  | 
\item If the head of the input string starts with a $c$, then return  | 
129  | 
the set  | 
|
130  | 
  \[\{(c, \textit{tail of}\; s)\}\]
 | 
|
131  | 
  where \textit{tail of} 
 | 
|
132  | 
$s$ is the unprocessed part of the input string.  | 
|
133  | 
\item Otherwise return the empty set $\{\}$.	
 | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
134  | 
\end{itemize}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
135  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
136  | 
\noindent  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
137  | 
The input type of this simple parser combinator for characters is  | 
| 587 | 138  | 
\texttt{String} and the output type is \texttt{Char}. This means
 | 
139  | 
\texttt{parse} returns  \mbox{\texttt{Set[(Char, String)]}}. 
 | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
140  | 
The code in Scala is as follows:  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
141  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
142  | 
\begin{center}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
143  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
144  | 
case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 587 | 145  | 
def parse(in: String) =  | 
146  | 
if (in.head == c) Set((c, in.tail)) else Set()  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
147  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
148  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
149  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
150  | 
|
| 589 | 151  | 
\noindent You can see \texttt{parse} tests whether the
 | 
| 587 | 152  | 
first character of the input string \texttt{in} is equal to
 | 
| 584 | 153  | 
\texttt{c}. If yes, then it splits the string into the recognised part
 | 
| 587 | 154  | 
\texttt{c} and the unprocessed part \texttt{in.tail}. In case
 | 
155  | 
\texttt{in} does not start with \texttt{c} then the parser returns the
 | 
|
| 584 | 156  | 
empty set (in Scala \texttt{Set()}). Since this parser recognises
 | 
157  | 
characters and just returns characters as the processed part, the  | 
|
158  | 
output type of the parser is \texttt{Char}.
 | 
|
159  | 
||
160  | 
If we want to parse a list of tokens and interested in recognising a  | 
|
161  | 
number token, we could write something like this  | 
|
162  | 
||
163  | 
\begin{center}
 | 
|
164  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
 | 
|
165  | 
case object NumParser extends Parser[List[Token], Int] {
 | 
|
166  | 
  def parse(ts: List[Token]) = ts match {
 | 
|
167  | 
case Num_token(s)::ts => Set((s.toInt, ts))  | 
|
168  | 
case _ => Set ()  | 
|
169  | 
}  | 
|
170  | 
}  | 
|
171  | 
\end{lstlisting}
 | 
|
172  | 
\end{center}
 | 
|
173  | 
||
174  | 
\noindent  | 
|
175  | 
In this parser the input is of type \texttt{List[Token]}. The function
 | 
|
176  | 
parse looks at the input \texttt{ts} and checks whether the first
 | 
|
| 589 | 177  | 
token is a \texttt{Num\_token} (let us assume our lexer generated
 | 
178  | 
these tokens for numbers). But this parser does not just return this  | 
|
| 584 | 179  | 
token (and the rest of the list), like the \texttt{CharParser} above,
 | 
| 589 | 180  | 
rather it extracts the string \texttt{s} from the token and converts it
 | 
| 584 | 181  | 
into an integer. The hope is that the lexer did its work well and this  | 
182  | 
conversion always succeeds. The consequence of this is that the output  | 
|
| 587 | 183  | 
type for this parser is \texttt{Int}, not \texttt{Token}. Such a
 | 
184  | 
conversion would be needed if we want to implement a simple calculator  | 
|
185  | 
program.  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
186  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
187  | 
|
| 584 | 188  | 
These simple parsers that just look at the input and do a simple  | 
189  | 
transformation are often called \emph{atomic} parser combinators.
 | 
|
190  | 
More interesting are the parser combinators that build larger parsers  | 
|
| 587 | 191  | 
out of smaller component parsers. There are three such parser  | 
192  | 
combinators that can be implemented generically. The \emph{alternative
 | 
|
| 584 | 193  | 
parser combinator} is as follows: given two parsers, say, $p$ and  | 
194  | 
$q$, we apply both parsers to the input (remember parsers are  | 
|
| 587 | 195  | 
functions) and combine the output (remember they are sets of pairs):  | 
196  | 
||
197  | 
\begin{center}
 | 
|
198  | 
$p(\text{input}) \cup q(\text{input})$
 | 
|
199  | 
\end{center}
 | 
|
200  | 
||
201  | 
\noindent In Scala we can implement alternative parser  | 
|
202  | 
combinator as follows  | 
|
203  | 
||
204  | 
\begin{center}
 | 
|
205  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
|
206  | 
class AltParser[I, T]  | 
|
207  | 
(p: => Parser[I, T],  | 
|
208  | 
        q: => Parser[I, T]) extends Parser[I, T] {
 | 
|
209  | 
def parse(in: I) = p.parse(in) ++ q.parse(in)  | 
|
210  | 
}  | 
|
211  | 
\end{lstlisting}
 | 
|
212  | 
\end{center}
 | 
|
213  | 
||
214  | 
\noindent The types of this parser combinator are again generic (we  | 
|
215  | 
have \texttt{I} for the input type, and \texttt{T} for the output
 | 
|
216  | 
type). The alternative parser builds a new parser out of two existing  | 
|
217  | 
parsers \texttt{p} and \texttt{q} given as arguments.  Both parsers
 | 
|
218  | 
need to be able to process input of type \texttt{I} and return in
 | 
|
219  | 
\texttt{parse} the same output type \texttt{Set[(T,
 | 
|
220  | 
  I)]}.\footnote{There is an interesting detail of Scala, namely the
 | 
|
221  | 
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
 | 
|
222  | 
will prevent the evaluation of the arguments before they are  | 
|
223  | 
  used. This is often called \emph{lazy evaluation} of the
 | 
|
224  | 
arguments. We will explain this later.} The alternative parser  | 
|
225  | 
should run the input with the first parser \texttt{p} (producing a set
 | 
|
226  | 
of pairs) and then run the same input with \texttt{q} (producing
 | 
|
227  | 
another set of pairs). The result should be then just the union of  | 
|
228  | 
both sets, which is the operation \texttt{++} in Scala.
 | 
|
229  | 
||
230  | 
The alternative parser combinator allows us to construct a parser that  | 
|
231  | 
parses either a character \texttt{a} or \texttt{b} using the
 | 
|
232  | 
\texttt{CharParser} shown above. For this we can write
 | 
|
233  | 
||
234  | 
\begin{center}
 | 
|
235  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
|
236  | 
new AltParser(CharParser('a'), CharParser('b'))
 | 
|
237  | 
\end{lstlisting}
 | 
|
238  | 
\end{center}
 | 
|
239  | 
||
240  | 
\noindent Later on we will use Scala mechanism for introducing some  | 
|
| 589 | 241  | 
more readable shorthand notation for this, like \texttt{"a" |
 | 
| 587 | 242  | 
"b"}. Let us look in detail at what this parser combinator produces  | 
243  | 
with some sample strings  | 
|
244  | 
||
245  | 
\begin{center}
 | 
|
246  | 
\begin{tabular}{rcl}
 | 
|
247  | 
input strings & & output\medskip\\  | 
|
248  | 
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
 | 
|
249  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
 | 
|
250  | 
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 | 
|
251  | 
\end{tabular}
 | 
|
252  | 
\end{center}
 | 
|
253  | 
||
254  | 
\noindent We receive in the first two cases a successful  | 
|
255  | 
output (that is a non-empty set). In each case, either  | 
|
256  | 
\pcode{a} or \pcode{b} is in the processed part, and
 | 
|
257  | 
\pcode{cde} in the unprocessed part. Clearly this parser cannot
 | 
|
258  | 
parse anything with \pcode{ccde}, therefore the empty
 | 
|
259  | 
set is returned.  | 
|
260  | 
||
261  | 
A bit more interesting is the \emph{sequence parser combinator}. Given
 | 
|
262  | 
two parsers, say again, $p$ and $q$, we want to apply first the input  | 
|
263  | 
to $p$ producing a set of pairs; then apply $q$ to all the unparsed  | 
|
264  | 
parts in the pairs; and then combine the results. Mathematically we would  | 
|
265  | 
write something like this for the expected set of pairs:  | 
|
266  | 
||
267  | 
\begin{center}
 | 
|
268  | 
\begin{tabular}{lcl}
 | 
|
269  | 
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
 | 
|
270  | 
$(\textit{output}_1, u_1) \in p(\text{input}) 
 | 
|
271  | 
\;\wedge\;$\\  | 
|
272  | 
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
 | 
|
273  | 
\end{tabular}
 | 
|
274  | 
\end{center}
 | 
|
275  | 
||
276  | 
\noindent Notice that the $p$ wil first be run on the input, producing  | 
|
277  | 
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
 | 
|
| 589 | 278  | 
for the unprocessed, or leftover, parts. We want that $q$ runs on all  | 
| 587 | 279  | 
these unprocessed parts $u_1$. This again will produce some  | 
280  | 
processed part , $p$ and  | 
|
281  | 
$q$, we apply both parsers to the input (remember parsers are  | 
|
| 584 | 282  | 
functions) and combine the output (remember they are sets of pairs)  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
283  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
284  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
285  | 
$p(\text{input}) \cup q(\text{input})$
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
286  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
287  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
288  | 
\noindent In Scala we would implement alternative parser  | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
289  | 
combinator as follows  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
290  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
291  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
292  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
293  | 
class AltParser[I, T]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
294  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
295  | 
        q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 587 | 296  | 
def parse(in: I) = p.parse(in) ++ q.parse(in)  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
297  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
298  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
299  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
300  | 
|
| 584 | 301  | 
\noindent The types of this parser combinator are again generic (we  | 
302  | 
just have \texttt{I} for the input type, and \texttt{T} for the output
 | 
|
303  | 
type). The alternative parser builds a new parser out of two existing  | 
|
304  | 
parsers \texttt{p} and \texttt{q}.  Both need to be able to process
 | 
|
305  | 
input of type \texttt{I} and return the same output type
 | 
|
306  | 
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
 | 
|
307  | 
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
 | 
|
308  | 
  and \texttt{q}. They will prevent the evaluation of the arguments
 | 
|
309  | 
  before they are used. This is often called \emph{lazy evaluation} of
 | 
|
310  | 
the arguments. We will explain this later.} Therefore the output  | 
|
311  | 
type of this parser is \texttt{T}. The alternative parser should run
 | 
|
312  | 
the input with the first parser \texttt{p} (producing a set of pairs)
 | 
|
313  | 
and then run the same input with \texttt{q} (producing another set of
 | 
|
314  | 
pairs). The result should be then just the union of both sets, which  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
315  | 
is the operation \texttt{++} in Scala.
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
316  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
317  | 
The alternative parser combinator already allows us to  | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
318  | 
construct a parser that parses either a character \texttt{a}
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
319  | 
or \texttt{b}, as
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
320  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
321  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
322  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
323  | 
new AltParser(CharParser('a'), CharParser('b'))
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
324  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
325  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
326  | 
|
| 584 | 327  | 
\noindent Later on we will use again Scala mechanism for introducing some  | 
| 589 | 328  | 
more readable shorthand notation for this, like \texttt{"a" | "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
329  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
330  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
331  | 
\begin{tabular}{rcl}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
332  | 
input strings & & output\medskip\\  | 
| 584 | 333  | 
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
 | 
334  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
 | 
|
335  | 
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
336  | 
\end{tabular}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
337  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
338  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
339  | 
\noindent We receive in the first two cases a successful  | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
340  | 
output (that is a non-empty set). In each case, either  | 
| 
392
 
2d0a59127694
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
386 
diff
changeset
 | 
341  | 
\pcode{a} or \pcode{b} is in the processed part, and
 | 
| 584 | 342  | 
\pcode{cde} in the unprocessed part. Clearly this parser cannot
 | 
343  | 
parse anything in the string \pcode{ccde}, therefore the empty
 | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
344  | 
set.  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
345  | 
|
| 584 | 346  | 
A bit more interesting is the \emph{sequence parser combinator}. Given
 | 
347  | 
two parsers, say again, $p$ and $q$, we want to apply first the input  | 
|
348  | 
to $p$ producing a set of pairs; then apply $q$ to all the unparsed  | 
|
349  | 
parts in the pairs; and then combine the results like  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
350  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
351  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
352  | 
\begin{tabular}{lcl}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
353  | 
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
354  | 
$(\textit{output}_1, u_1) \in p(\text{input}) 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
355  | 
\;\wedge\;$\\  | 
| 584 | 356  | 
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
357  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
358  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
359  | 
|
| 587 | 360  | 
\noindent Notice that the $p$ will first be run on the input,  | 
361  | 
producing pairs of the form $\textit{output}_1$ and unprocessed part
 | 
|
362  | 
$u_1$. This unprocessed part is fed into the second parser $q$. The  | 
|
| 584 | 363  | 
overall result of the sequence parser combinator is pairs of the form  | 
364  | 
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
 | 
|
| 587 | 365  | 
unprocessed part of both parsers is the unprocessed part the second  | 
| 589 | 366  | 
parser $q$ leaves as leftover. The processed parts of both  | 
| 587 | 367  | 
parsers is a pair consisting of the outputs of $p$ and $q$, namely  | 
| 586 | 368  | 
$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
 | 
| 584 | 369  | 
implemented in Scala as follows:  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
370  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
371  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
372  | 
\begin{lstlisting}[language=Scala,numbers=none]
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
373  | 
class SeqParser[I, T, S]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
374  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
375  | 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 587 | 376  | 
def parse(in: I) =  | 
377  | 
for ((output1, u1) <- p.parse(in);  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
378  | 
(output2, u2) <- q.parse(u1))  | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
379  | 
yield ((output1, output2), u2)  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
380  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
381  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
382  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
383  | 
|
| 587 | 384  | 
\noindent This parser takes again as arguments two parsers, \texttt{p}
 | 
385  | 
and \texttt{q}. It implements \texttt{parse} as follows: let first run
 | 
|
386  | 
the parser \texttt{p} on the input producing a set of pairs
 | 
|
387  | 
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
 | 
|
388  | 
unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
 | 
|
389  | 
unprocessed parts producing again a set of pairs. The output of the  | 
|
390  | 
sequence parser combinator is then a set containing pairs where the  | 
|
391  | 
first components are again pairs, namely what the first parser could  | 
|
392  | 
parse together with what the second parser could parse; the second  | 
|
393  | 
component is the unprocessed part left over after running the second  | 
|
394  | 
parser \texttt{q}. Therefore the input type of the sequence parser
 | 
|
395  | 
combinator is as usual \texttt{I}, but the output type is
 | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
396  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
397  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
398  | 
\texttt{Set[((T, S), I)]}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
399  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
400  | 
|
| 584 | 401  | 
\noindent  | 
402  | 
If any of the runs of \textit{p} and \textit{q} fail, that is produce
 | 
|
403  | 
the empty set, then \texttt{parse} will also produce the empty set.
 | 
|
404  | 
Notice that we have now two output types for the sequence parser  | 
|
405  | 
combinator, because in general \textit{p} and \textit{q} might produce
 | 
|
| 586 | 406  | 
different things (for example first we recognise a number and then a  | 
| 584 | 407  | 
string corresponding to an operator).  | 
408  | 
||
| 587 | 409  | 
With the shorthand notation we shall introduce later for the sequence  | 
410  | 
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
 | 
|
411  | 
is the parser combinator that first recognises the character  | 
|
412  | 
\texttt{a} from a string and then \texttt{b}. Let us look again at
 | 
|
413  | 
three examples of how this parser combinator processes some strings:  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
414  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
415  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
416  | 
\begin{tabular}{rcl}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
417  | 
input strings & & output\medskip\\  | 
| 584 | 418  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
 | 
419  | 
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
 | 
|
420  | 
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
421  | 
\end{tabular}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
422  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
423  | 
|
| 586 | 424  | 
\noindent In the first line we have a successful parse, because the  | 
| 587 | 425  | 
string starts with \texttt{ab}, which is the prefix we are looking
 | 
| 584 | 426  | 
for. But since the parsing combinator is constructed as sequence of  | 
427  | 
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
 | 
|
428  | 
result is a nested pair of the form \texttt{((a, b), cde)}. It is
 | 
|
| 586 | 429  | 
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
 | 
| 587 | 430  | 
expect. The parser returns the empty set in the other examples,  | 
| 584 | 431  | 
because they do not fit with what the parser is supposed to parse.  | 
432  | 
||
433  | 
||
| 589 | 434  | 
A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which
 | 
| 587 | 435  | 
parses as first character either an \texttt{a} or \texttt{b}, followed
 | 
436  | 
by a \texttt{c}. This parser produces the following outputs.
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
437  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
438  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
439  | 
\begin{tabular}{rcl}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
440  | 
input strings & & output\medskip\\  | 
| 585 | 441  | 
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | 
442  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | 
|
443  | 
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
444  | 
\end{tabular}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
445  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
446  | 
|
| 585 | 447  | 
\noindent  | 
448  | 
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
 | 
|
449  | 
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
 | 
|
450  | 
the following outputs.  | 
|
451  | 
||
452  | 
\begin{center}
 | 
|
453  | 
\begin{tabular}{rcl}
 | 
|
454  | 
input strings & & output\medskip\\  | 
|
455  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | 
|
456  | 
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
 | 
|
457  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
 | 
|
458  | 
\end{tabular}
 | 
|
459  | 
\end{center}
 | 
|
460  | 
||
461  | 
||
462  | 
\noindent The second and third example fail, because something is  | 
|
463  | 
``missing'' in the sequence we are looking for. Also notice how the  | 
|
464  | 
results nest with sequences: the parsed part is a nested pair of the  | 
|
| 589 | 465  | 
form \pcode{((a, b), c)}. If we nest the sequence parser differently,
 | 
466  | 
for example \pcode{"a" ~ ("b" ~ "c")}, then also our output pairs nest
 | 
|
467  | 
differently  | 
|
468  | 
||
469  | 
\begin{center}
 | 
|
470  | 
\begin{tabular}{rcl}
 | 
|
471  | 
input strings & & output\medskip\\  | 
|
472  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
 | 
|
473  | 
\end{tabular}
 | 
|
474  | 
\end{center}
 | 
|
475  | 
||
476  | 
\noindent  | 
|
477  | 
Two more examples: first consider the parser  | 
|
| 585 | 478  | 
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
479  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
480  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
481  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
482  | 
input string & & output\medskip\\  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
483  | 
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
484  | 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
485  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
486  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
487  | 
|
| 587 | 488  | 
\noindent Notice how again the results nest deeper and deeper as pairs (the  | 
| 585 | 489  | 
last \pcode{a} is in the unprocessed part). To consume everything of
 | 
490  | 
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
 | 
|
491  | 
"a"}. Then the output is as follows:  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
492  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
493  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
494  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
495  | 
input string & & output\medskip\\  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
496  | 
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
497  | 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
498  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
499  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
500  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
501  | 
\noindent This is an instance where the parser consumed  | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
502  | 
completely the input, meaning the unprocessed part is just the  | 
| 587 | 503  | 
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
 | 
| 585 | 504  | 
we would get back the result  | 
505  | 
||
506  | 
\[  | 
|
507  | 
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
 | 
|
508  | 
\]  | 
|
509  | 
||
510  | 
\noindent where the unprocessed (empty) parts have been stripped away  | 
|
511  | 
from the pairs; everything where the second part was not empty has  | 
|
| 587 | 512  | 
been thrown away as well, because they represent  | 
513  | 
ultimately-unsuccessful-parses.  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
514  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
515  | 
|
| 589 | 516  | 
Note carefully that constructing a parser such \pcode{"a" | ("a" ~
 | 
| 587 | 517  | 
"b")} will result in a typing error. The intention is that we want  | 
518  | 
to parse an \texttt{a}, or an \texttt{a} followed by a
 | 
|
519  | 
\texttt{b}. However, the first parser has as output type a single
 | 
|
520  | 
character (recall the type of \texttt{CharParser}), but the second
 | 
|
521  | 
parser produces a pair of characters as output. The alternative parser  | 
|
522  | 
is required to have both component parsers to have the same type---we  | 
|
523  | 
need to be able to build the union of two sets, which means in Scala  | 
|
| 589 | 524  | 
they need to be of the same type. Therefore the typing error.  | 
525  | 
We will see later how we can build  | 
|
| 587 | 526  | 
this parser without the typing error.  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
527  | 
|
| 587 | 528  | 
The next parser combinator, called \emph{semantic action}, does not
 | 
529  | 
actually combine smaller parsers, but applies a function to the result  | 
|
530  | 
of a parser. It is implemented in Scala as follows  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
531  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
532  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
533  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
534  | 
class FunParser[I, T, S]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
535  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
536  | 
          f: T => S) extends Parser[I, S] {
 | 
| 587 | 537  | 
def parse(in: I) =  | 
538  | 
for ((head, tail) <- p.parse(in)) yield (f(head), tail)  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
539  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
540  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
541  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
542  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
543  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
544  | 
\noindent This parser combinator takes a parser \texttt{p}
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
545  | 
with output type \texttt{T} as one argument as well as a
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
546  | 
function \texttt{f} with type \texttt{T => S}. The parser
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
547  | 
\texttt{p} produces sets of type \texttt{(T, I)}. The
 | 
| 589 | 548  | 
semantic action combinator then applies the function  | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
549  | 
\texttt{f} to all the parser outputs. Since this function is of
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
550  | 
type \texttt{T => S}, we obtain a parser with output type
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
551  | 
\texttt{S}. Again Scala lets us introduce some shorthand
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
552  | 
notation for this parser combinator. Therefore we will write  | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
553  | 
\texttt{p ==> f} for it.
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
554  | 
|
| 589 | 555  | 
What are semantic actions good for? Well, they allow you to transform  | 
556  | 
the parsed input into a datastructure you can use for further  | 
|
| 587 | 557  | 
processing. A simple example would be to transform parsed characters  | 
558  | 
into ASCII numbers. Suppose we define a function \texttt{f} (from
 | 
|
| 589 | 559  | 
characters to ints) and use a \texttt{CharParser} for parsing
 | 
560  | 
the character \texttt{c}.
 | 
|
| 587 | 561  | 
|
562  | 
\begin{center}
 | 
|
563  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
564  | 
val f = (c: Char) => c.toInt  | 
|
565  | 
val c = new CharParser('c')
 | 
|
566  | 
\end{lstlisting}
 | 
|
567  | 
\end{center}
 | 
|
568  | 
||
569  | 
\noindent  | 
|
| 589 | 570  | 
We then can run the following two parsers on the input \texttt{cbd}:
 | 
| 587 | 571  | 
|
572  | 
\begin{center}
 | 
|
573  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
574  | 
c.parse("cbd")
 | 
|
575  | 
(c ==> f).parse("cbd")
 | 
|
576  | 
\end{lstlisting}
 | 
|
577  | 
\end{center}
 | 
|
578  | 
||
579  | 
\noindent  | 
|
| 589 | 580  | 
In the first line we obtain the expected result \texttt{Set(('c',
 | 
581  | 
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
 | 
|
582  | 
character has been transformed into an ASCII number.  | 
|
| 588 | 583  | 
|
584  | 
A slightly less contrived example is about parsing numbers (recall  | 
|
585  | 
\texttt{NumParser} above). However, we want to do this here for strings.
 | 
|
586  | 
For this assume we have the following \texttt{RegexParser}.
 | 
|
587  | 
||
588  | 
\begin{center}
 | 
|
589  | 
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
 | 
|
590  | 
basicstyle=\small\ttfamily, numbers=none]  | 
|
591  | 
import scala.util.matching.Regex  | 
|
592  | 
||
593  | 
case class RegexParser(reg: Regex) extends Parser[String, String] {
 | 
|
594  | 
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
 | 
|
595  | 
case None => Set()  | 
|
596  | 
case Some(m) => Set((m.matched, m.after.toString))  | 
|
597  | 
}  | 
|
598  | 
}  | 
|
599  | 
\end{lstlisting}
 | 
|
600  | 
\end{center}
 | 
|
601  | 
||
602  | 
\noindent  | 
|
603  | 
This parser takes a regex as argument and splits up a string into a  | 
|
604  | 
prefix and the rest according to this regex  | 
|
605  | 
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
 | 
|
606  | 
case---and the corresponding strings can be extracted with  | 
|
| 589 | 607  | 
\texttt{matched} and \texttt{after}). Using this parser we can define a
 | 
| 588 | 608  | 
\texttt{NumParser} for strings as follows:
 | 
609  | 
||
610  | 
\begin{center}
 | 
|
611  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
612  | 
val NumParser = RegexParser("[0-9]+".r)
 | 
|
613  | 
\end{lstlisting}
 | 
|
614  | 
\end{center}
 | 
|
615  | 
||
616  | 
\noindent  | 
|
617  | 
This parser will recognise a number at the beginning of a string, for  | 
|
618  | 
example  | 
|
619  | 
||
620  | 
\begin{center}
 | 
|
621  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
622  | 
NumParser.parse("123abc")
 | 
|
623  | 
\end{lstlisting}
 | 
|
624  | 
\end{center}  
 | 
|
625  | 
||
626  | 
\noindent  | 
|
627  | 
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
 | 
|
| 589 | 628  | 
still a string (the double-quotes are not printed by Scala). We need  | 
629  | 
to convert it into the corresponding \texttt{Int}. We can do this as
 | 
|
630  | 
follows using a semantic action  | 
|
| 588 | 631  | 
|
632  | 
\begin{center}
 | 
|
633  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
634  | 
(NumParser ==> (s => s.toInt)).parse("123abc")
 | 
|
635  | 
\end{lstlisting}
 | 
|
636  | 
\end{center}  
 | 
|
637  | 
||
638  | 
\noindent  | 
|
| 589 | 639  | 
The function in the semantic action converts a string into an  | 
| 588 | 640  | 
\texttt{Int}. Let us come back to semantic actions when we are going
 | 
| 589 | 641  | 
to implement actual context-free gammars.  | 
| 587 | 642  | 
|
643  | 
\subsubsection*{Shorthand notation for parser combinators}
 | 
|
644  | 
||
645  | 
Before we proceed, let us just explain the shorthand notation for  | 
|
646  | 
parser combinators. Like for regular expressions, the shorthand notation  | 
|
647  | 
will make our life much easier when writing actual parsers.  | 
|
648  | 
||
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
649  | 
\subsubsection*{How to build parsers using parser combinators?}
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
650  | 
|
| 588 | 651  | 
The beauty of parser combinators is the ease with which they can be  | 
652  | 
implemented and how easy it is to translate context-free grammars into  | 
|
653  | 
code (though the grammars need to be non-left-recursive). To  | 
|
654  | 
demonstrate this recall our context-free grammar for palindromes:  | 
|
655  | 
||
656  | 
\begin{plstx}[margin=3cm]
 | 
|
657  | 
: \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
 | 
|
658  | 
\end{plstx}
 | 
|
659  | 
||
660  | 
\noindent  | 
|
661  | 
Given the parser combinators for alternatives and sequeneces, this grammar should be  | 
|
662  | 
straightforward to implement. The first idea would be  | 
|
663  | 
||
664  | 
\begin{center}
 | 
|
665  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
666  | 
lazy val Pal : Parser[String, String] =  | 
|
667  | 
  (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
 | 
|
668  | 
\end{lstlisting}
 | 
|
669  | 
\end{center}
 | 
|
670  | 
||
671  | 
\noindent  | 
|
672  | 
Unfortunately, this does not work as it produces a typing error. The  | 
|
673  | 
reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""}
 | 
|
674  | 
all produce strings and therefore can be put into an alternative  | 
|
675  | 
\texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and
 | 
|
676  | 
\pcode{"b" ~ Pal ~ "b"} produce pairs of the form
 | 
|
677  | 
$(((\_, \_), \_), \_)$---that is how the sequence parser combinator  | 
|
678  | 
nests results when \pcode{\~} is used between two components. The
 | 
|
679  | 
solution is to use a semantic action that ``flattens'' these pairs and  | 
|
680  | 
appends the corresponding strings, like  | 
|
681  | 
||
682  | 
\begin{center}
 | 
|
683  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
684  | 
lazy val Pal : Parser[String, String] =  | 
|
685  | 
  (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
 | 
|
686  | 
   ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
 | 
|
687  | 
"a" | "b" | "")  | 
|
688  | 
\end{lstlisting}
 | 
|
689  | 
\end{center}
 | 
|
690  | 
||
| 589 | 691  | 
\noindent  | 
692  | 
The semantic action  | 
|
693  | 
||
694  | 
||
695  | 
\noindent  | 
|
696  | 
Important to note is that we must define \texttt{Pal}-parser as a
 | 
|
697  | 
\emph{lazy} value.
 | 
|
| 588 | 698  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
699  | 
\subsubsection*{Implementing an Interpreter}
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
700  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
701  | 
%\bigskip  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
702  | 
%takes advantage of the full generality---have a look  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
703  | 
%what it produces if we call it with the string \texttt{abc}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
704  | 
%  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
705  | 
%\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
706  | 
%\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
707  | 
%input string & & output\medskip\\  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
708  | 
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
709  | 
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
710  | 
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
711  | 
%\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
712  | 
%\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
713  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
714  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
715  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
716  | 
\end{document}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
717  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
718  | 
%%% Local Variables:  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
719  | 
%%% mode: latex  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
720  | 
%%% TeX-master: t  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
721  | 
%%% End:  |