| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Fri, 18 Oct 2024 05:45:14 +0100 | |
| changeset 968 | 79cfee5e170b | 
| parent 940 | 1c1fbf45a03c | 
| permissions | -rw-r--r-- | 
| 584 | 1  | 
|
| 595 | 2  | 
% !TEX program = xelatex  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
3  | 
\documentclass{article}
 | 
| 
297
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
4  | 
\usepackage{../style}
 | 
| 
217
 
cd6066f1056a
updated handouts
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
183 
diff
changeset
 | 
5  | 
\usepackage{../langs}
 | 
| 588 | 6  | 
\usepackage{../grammar}
 | 
| 799 | 7  | 
\usepackage{../graphics}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
8  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
9  | 
\begin{document}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
10  | 
|
| 
292
 
7ed2a25dd115
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
217 
diff
changeset
 | 
11  | 
\section*{Handout 6 (Parser Combinators)}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
12  | 
|
| 584 | 13  | 
This handout explains how \emph{parser combinators} work and how they
 | 
| 587 | 14  | 
can be implemented in Scala. Their most distinguishing feature is that  | 
15  | 
they are very easy to implement (admittedly it is only easy in a  | 
|
16  | 
functional programming language). Another good point of parser  | 
|
17  | 
combinators is that they can deal with any kind of input as long as  | 
|
18  | 
this input is of ``sequence-kind'', for example a string or a list of  | 
|
19  | 
tokens. The only two properties of the input we need is to be able to  | 
|
20  | 
test when it is empty and ``sequentially'' take it apart. Strings and  | 
|
21  | 
lists fit this bill. However, parser combinators also have their  | 
|
22  | 
drawbacks. For example they require that the grammar to be parsed is  | 
|
23  | 
\emph{not} left-recursive and they are efficient only when the grammar
 | 
|
24  | 
is unambiguous. It is the responsibility of the grammar designer to  | 
|
| 591 | 25  | 
ensure these two properties hold.  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
26  | 
|
| 587 | 27  | 
The general idea behind parser combinators is to transform the input  | 
28  | 
into sets of pairs, like so  | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
29  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
30  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
31  | 
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
 | 
| 594 | 32  | 
$\quad\Rightarrow\quad$  | 
| 591 | 33  | 
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
34  | 
\end{center} 
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
35  | 
|
| 587 | 36  | 
\noindent  | 
| 590 | 37  | 
Given the extended effort we have spent implementing a lexer in order  | 
| 591 | 38  | 
to generate lists of tokens, it might be surprising that in what  | 
39  | 
follows we shall often use strings as input, rather than lists of  | 
|
| 936 | 40  | 
tokens. This is for making the explanation more lucid and ensure the  | 
41  | 
examples are simple. It does not make our previous work on lexers obsolete  | 
|
| 591 | 42  | 
(remember they transform a string into a list of tokens). Lexers will  | 
| 936 | 43  | 
still be needed for building a somewhat realistic compiler. See also  | 
44  | 
a question in the homework regarding this issue.  | 
|
| 584 | 45  | 
|
| 590 | 46  | 
As mentioned above, parser combinators are relatively agnostic about what  | 
| 587 | 47  | 
kind of input they process. In my Scala code I use the following  | 
48  | 
polymorphic types for parser combinators:  | 
|
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
49  | 
|
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
50  | 
\begin{center}
 | 
| 584 | 51  | 
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
52  | 
\end{center}
 | 
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
53  | 
|
| 587 | 54  | 
\noindent That is they take as input something of type \texttt{I} and
 | 
| 590 | 55  | 
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
 | 
56  | 
needs to be of ``sequence-kind'', I actually have to often write  | 
|
| 936 | 57  | 
\code{(using is: I => Seq[_])} for the input type. This ensures the
 | 
58  | 
input is a subtype of Scala sequences.\footnote{This is a new feature
 | 
|
59  | 
in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties  | 
|
60  | 
with running my code.} The first component of the generated pairs  | 
|
61  | 
corresponds to what the parser combinator was able to parse from the  | 
|
62  | 
input and the second is the unprocessed, or leftover, part of the  | 
|
63  | 
input (therefore the type of this unprocessed part is the same as the  | 
|
64  | 
input). A parser combinator might return more than one such pair; the  | 
|
65  | 
idea is that there are potentially several ways of how to parse the  | 
|
66  | 
input. As a concrete example, consider the string  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
67  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
68  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
69  | 
\tt\Grid{iffoo\VS testbar}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
70  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
71  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
72  | 
\noindent We might have a parser combinator which tries to  | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
73  | 
interpret this string as a keyword (\texttt{if}) or as an
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
74  | 
identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
75  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
76  | 
\begin{center}
 | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
77  | 
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
78  | 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
79  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
80  | 
|
| 587 | 81  | 
\noindent where the first pair means the parser could recognise  | 
| 590 | 82  | 
\texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
 | 
| 591 | 83  | 
unprocessed part; in the other case it could recognise  | 
| 587 | 84  | 
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
 | 
85  | 
parser cannot recognise anything from the input at all, then parser  | 
|
86  | 
combinators just return the empty set $\{\}$. This will indicate
 | 
|
87  | 
something ``went wrong''\ldots or more precisely, nothing could be  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
88  | 
parsed.  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
89  | 
|
| 594 | 90  | 
Also important to note is that the output type \texttt{T} for the
 | 
91  | 
processed part can potentially be different from the input type  | 
|
92  | 
\texttt{I} in the parser. In the example above is just happens to be
 | 
|
93  | 
the same. The reason for the difference is that in general we are  | 
|
94  | 
interested in transforming our input into something  | 
|
95  | 
``different''\ldots for example into a tree; or if we implement the  | 
|
96  | 
grammar for arithmetic expressions, we might be interested in the  | 
|
97  | 
actual integer number the arithmetic expression, say \texttt{1 + 2 *
 | 
|
98  | 
3}, stands for. In this way we can use parser combinators to  | 
|
99  | 
implement relatively easily a calculator, for instance (we shall do  | 
|
100  | 
this later on).  | 
|
| 584 | 101  | 
|
| 594 | 102  | 
The main driving force behind parser combinators is that we can easily  | 
103  | 
build parser combinators out of smaller components following very  | 
|
104  | 
closely the structure of a grammar. In order to implement this in a  | 
|
| 591 | 105  | 
functional/object-oriented programming language, like Scala, we need  | 
106  | 
to specify an abstract class for parser combinators. In the abstract  | 
|
107  | 
class we specify that \texttt{I} is the \emph{input type} of the
 | 
|
| 593 | 108  | 
parser combinator and that \texttt{T} is the \emph{output type}.  This
 | 
| 591 | 109  | 
implies that the function \texttt{parse} takes an argument of type
 | 
| 940 | 110  | 
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
 | 
111  | 
    I)]}}.\footnote{In the actual code on KEATS there is some
 | 
|
112  | 
kerfuffle about correct type-bounds and using clauses. Ignore this  | 
|
113  | 
part of the implementation for the moment---it is not needed for  | 
|
114  | 
understanding how the code works.}  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
115  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
116  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
117  | 
\begin{lstlisting}[language=Scala]
 | 
| 940 | 118  | 
abstract class Parser[I, T]  {
 | 
| 936 | 119  | 
def parse(in: I): Set[(T, I)]  | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
120  | 
|
| 590 | 121  | 
def parse_all(in: I) : Set[T] =  | 
| 936 | 122  | 
for ((hd, tl) <- parse(in);  | 
| 940 | 123  | 
if tl.isEmpty) yield hd  | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
124  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
125  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
126  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
127  | 
|
| 591 | 128  | 
\noindent It is the obligation in each instance of this class to  | 
| 584 | 129  | 
supply an implementation for \texttt{parse}.  From this function we
 | 
130  | 
can then ``centrally'' derive the function \texttt{parse\_all}, which
 | 
|
131  | 
just filters out all pairs whose second component is not empty (that  | 
|
132  | 
is has still some unprocessed part). The reason is that at the end of  | 
|
133  | 
the parsing we are only interested in the results where all the input  | 
|
134  | 
has been consumed and no unprocessed part is left over.  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
135  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
136  | 
One of the simplest parser combinators recognises just a  | 
| 584 | 137  | 
single character, say $c$, from the beginning of strings. Its  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
138  | 
behaviour can be described as follows:  | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
139  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
140  | 
\begin{itemize}
 | 
| 936 | 141  | 
\item If the head of the input string $s$ starts with a $c$, then return  | 
| 584 | 142  | 
the set  | 
| 936 | 143  | 
  \[\{(c, \textit{tail-of-}s)\}\]
 | 
| 584 | 144  | 
  where \textit{tail of} 
 | 
145  | 
$s$ is the unprocessed part of the input string.  | 
|
146  | 
\item Otherwise return the empty set $\{\}$.	
 | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
147  | 
\end{itemize}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
148  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
149  | 
\noindent  | 
| 590 | 150  | 
The input type of this simple parser combinator is \texttt{String} and
 | 
151  | 
the output type is \texttt{Char}. This means \texttt{parse} returns
 | 
|
152  | 
\mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
 | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
153  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
154  | 
\begin{center}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
155  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
156  | 
case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 936 | 157  | 
def parse(s: String) =  | 
158  | 
if (s != "" && s.head == c) Set((c, s.tail)) else Set()  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
159  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
160  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
161  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
162  | 
|
| 936 | 163  | 
\noindent You can see \texttt{parse} tests here whether the
 | 
164  | 
first character of the input string \texttt{s} is equal to
 | 
|
| 584 | 165  | 
\texttt{c}. If yes, then it splits the string into the recognised part
 | 
| 936 | 166  | 
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
 | 
167  | 
\texttt{s} does not start with \texttt{c} then the parser returns the
 | 
|
| 584 | 168  | 
empty set (in Scala \texttt{Set()}). Since this parser recognises
 | 
169  | 
characters and just returns characters as the processed part, the  | 
|
170  | 
output type of the parser is \texttt{Char}.
 | 
|
171  | 
||
172  | 
If we want to parse a list of tokens and interested in recognising a  | 
|
| 590 | 173  | 
number token, for example, we could write something like this  | 
| 584 | 174  | 
|
175  | 
\begin{center}
 | 
|
176  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
 | 
|
177  | 
case object NumParser extends Parser[List[Token], Int] {
 | 
|
178  | 
  def parse(ts: List[Token]) = ts match {
 | 
|
| 936 | 179  | 
case Num_token(s)::rest => Set((s.toInt, rest))  | 
| 584 | 180  | 
case _ => Set ()  | 
181  | 
}  | 
|
182  | 
}  | 
|
183  | 
\end{lstlisting}
 | 
|
184  | 
\end{center}
 | 
|
185  | 
||
186  | 
\noindent  | 
|
187  | 
In this parser the input is of type \texttt{List[Token]}. The function
 | 
|
188  | 
parse looks at the input \texttt{ts} and checks whether the first
 | 
|
| 589 | 189  | 
token is a \texttt{Num\_token} (let us assume our lexer generated
 | 
190  | 
these tokens for numbers). But this parser does not just return this  | 
|
| 584 | 191  | 
token (and the rest of the list), like the \texttt{CharParser} above,
 | 
| 590 | 192  | 
rather it extracts also the string \texttt{s} from the token and
 | 
193  | 
converts it into an integer. The hope is that the lexer did its work  | 
|
194  | 
well and this conversion always succeeds. The consequence of this is  | 
|
195  | 
that the output type for this parser is \texttt{Int}, not
 | 
|
| 936 | 196  | 
\texttt{Token}. Such a conversion would be needed in our parser,
 | 
197  | 
because when we encounter a number in our program, we want to do  | 
|
198  | 
some calculations based on integers, not strings (or tokens).  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
199  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
200  | 
|
| 584 | 201  | 
These simple parsers that just look at the input and do a simple  | 
202  | 
transformation are often called \emph{atomic} parser combinators.
 | 
|
203  | 
More interesting are the parser combinators that build larger parsers  | 
|
| 587 | 204  | 
out of smaller component parsers. There are three such parser  | 
205  | 
combinators that can be implemented generically. The \emph{alternative
 | 
|
| 584 | 206  | 
parser combinator} is as follows: given two parsers, say, $p$ and  | 
207  | 
$q$, we apply both parsers to the input (remember parsers are  | 
|
| 587 | 208  | 
functions) and combine the output (remember they are sets of pairs):  | 
209  | 
||
210  | 
\begin{center}
 | 
|
211  | 
$p(\text{input}) \cup q(\text{input})$
 | 
|
212  | 
\end{center}
 | 
|
213  | 
||
214  | 
\noindent In Scala we can implement alternative parser  | 
|
215  | 
combinator as follows  | 
|
216  | 
||
217  | 
\begin{center}
 | 
|
218  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
|
219  | 
class AltParser[I, T]  | 
|
| 936 | 220  | 
(p: => Parser[I, T],  | 
| 940 | 221  | 
          q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 936 | 222  | 
def parse(in: I) = p.parse(in) ++ q.parse(in)  | 
223  | 
}  | 
|
| 587 | 224  | 
\end{lstlisting}
 | 
225  | 
\end{center}
 | 
|
226  | 
||
227  | 
\noindent The types of this parser combinator are again generic (we  | 
|
228  | 
have \texttt{I} for the input type, and \texttt{T} for the output
 | 
|
229  | 
type). The alternative parser builds a new parser out of two existing  | 
|
| 590 | 230  | 
parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
 | 
231  | 
parsers need to be able to process input of type \texttt{I} and return
 | 
|
232  | 
in \texttt{parse} the same output type \texttt{Set[(T,
 | 
|
| 587 | 233  | 
  I)]}.\footnote{There is an interesting detail of Scala, namely the
 | 
| 936 | 234  | 
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
 | 
| 587 | 235  | 
will prevent the evaluation of the arguments before they are  | 
236  | 
  used. This is often called \emph{lazy evaluation} of the
 | 
|
| 590 | 237  | 
arguments. We will explain this later.} The alternative parser runs  | 
238  | 
the input with the first parser \texttt{p} (producing a set of pairs)
 | 
|
239  | 
and then runs the same input with \texttt{q} (producing another set of
 | 
|
240  | 
pairs). The result should be then just the union of both sets, which  | 
|
241  | 
is the operation \texttt{++} in Scala.
 | 
|
| 587 | 242  | 
|
243  | 
The alternative parser combinator allows us to construct a parser that  | 
|
244  | 
parses either a character \texttt{a} or \texttt{b} using the
 | 
|
| 936 | 245  | 
\texttt{CharParser} shown above. For this we can write\footnote{Note
 | 
246  | 
  that we cannot use a \texttt{case}-class for \texttt{AltParser}s
 | 
|
247  | 
because of the problem with laziness and Scala quirks. Hating  | 
|
248  | 
  \texttt{new} like the plague, we will work around this later with
 | 
|
249  | 
some syntax tricks. ;o)}  | 
|
| 587 | 250  | 
|
251  | 
\begin{center}
 | 
|
252  | 
\begin{lstlisting}[language=Scala, numbers=none]
 | 
|
253  | 
new AltParser(CharParser('a'), CharParser('b'))
 | 
|
254  | 
\end{lstlisting}
 | 
|
255  | 
\end{center}
 | 
|
256  | 
||
257  | 
\noindent Later on we will use Scala mechanism for introducing some  | 
|
| 799 | 258  | 
more readable shorthand notation for this, like \texttt{p"a" ||
 | 
| 936 | 259  | 
p"b"}. But first let us look in detail at what this parser combinator produces  | 
| 590 | 260  | 
with some sample strings.  | 
| 587 | 261  | 
|
262  | 
\begin{center}
 | 
|
263  | 
\begin{tabular}{rcl}
 | 
|
264  | 
input strings & & output\medskip\\  | 
|
| 936 | 265  | 
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
 | 
266  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
 | 
|
| 587 | 267  | 
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 | 
268  | 
\end{tabular}
 | 
|
269  | 
\end{center}
 | 
|
270  | 
||
| 936 | 271  | 
\noindent We receive in the first two cases a successful output (that  | 
272  | 
is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
 | 
|
273  | 
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
 | 
|
274  | 
parser cannot parse anything of the form \pcode{ccde}, therefore the
 | 
|
275  | 
empty set is returned in the last case. Observe that parser  | 
|
276  | 
combinators only look at the beginning of the given input: they do not  | 
|
277  | 
fish out something in the ``middle'' of the input.  | 
|
| 587 | 278  | 
|
279  | 
A bit more interesting is the \emph{sequence parser combinator}. Given
 | 
|
280  | 
two parsers, say again, $p$ and $q$, we want to apply first the input  | 
|
| 590 | 281  | 
to $p$ producing a set of pairs; then apply $q$ to all the unparsed  | 
| 587 | 282  | 
parts in the pairs; and then combine the results. Mathematically we would  | 
| 591 | 283  | 
write something like this for the set of pairs:  | 
| 587 | 284  | 
|
285  | 
\begin{center}
 | 
|
286  | 
\begin{tabular}{lcl}
 | 
|
287  | 
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
 | 
|
288  | 
$(\textit{output}_1, u_1) \in p(\text{input}) 
 | 
|
289  | 
\;\wedge\;$\\  | 
|
290  | 
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
 | 
|
291  | 
\end{tabular}
 | 
|
292  | 
\end{center}
 | 
|
293  | 
||
294  | 
\noindent Notice that the $p$ will first be run on the input,  | 
|
| 590 | 295  | 
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
 | 
| 591 | 296  | 
stands for the unprocessed, or leftover, parts of $p$. We want that  | 
| 590 | 297  | 
$q$ runs on all these unprocessed parts $u_1$. Therefore these  | 
298  | 
unprocessed parts are fed into the second parser $q$. The overall  | 
|
299  | 
result of the sequence parser combinator is pairs of the form  | 
|
| 584 | 300  | 
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
 | 
| 593 | 301  | 
unprocessed part of the sequence parser combinator is the unprocessed  | 
| 591 | 302  | 
part the second parser $q$ leaves as leftover. The parsed parts of the  | 
303  | 
component parsers are combined in a pair, namely  | 
|
304  | 
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
 | 
|
305  | 
know what $p$ and $q$ were able to parse. This behaviour can be  | 
|
306  | 
implemented in Scala as follows:  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
307  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
308  | 
\begin{center}
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
309  | 
\begin{lstlisting}[language=Scala,numbers=none]
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
310  | 
class SeqParser[I, T, S]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
311  | 
(p: => Parser[I, T],  | 
| 940 | 312  | 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 587 | 313  | 
def parse(in: I) =  | 
314  | 
for ((output1, u1) <- p.parse(in);  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
315  | 
(output2, u2) <- q.parse(u1))  | 
| 
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
316  | 
yield ((output1, output2), u2)  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
317  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
318  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
319  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
320  | 
|
| 587 | 321  | 
\noindent This parser takes again as arguments two parsers, \texttt{p}
 | 
| 591 | 322  | 
and \texttt{q}. It implements \texttt{parse} as follows: first run the
 | 
323  | 
parser \texttt{p} on the input producing a set of pairs
 | 
|
| 587 | 324  | 
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
 | 
| 591 | 325  | 
unprocessed parts left over by \texttt{p} (recall that there can be
 | 
| 936 | 326  | 
several such pairs).  Let then \texttt{q} run on these unprocessed
 | 
| 591 | 327  | 
parts producing again a set of pairs. The output of the sequence  | 
328  | 
parser combinator is then a set containing pairs where the first  | 
|
329  | 
components are again pairs, namely what the first parser could parse  | 
|
330  | 
together with what the second parser could parse; the second component  | 
|
331  | 
is the unprocessed part left over after running the second parser  | 
|
332  | 
\texttt{q}. Note that the input type of the sequence parser combinator
 | 
|
333  | 
is as usual \texttt{I}, but the output type is
 | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
334  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
335  | 
\begin{center}
 | 
| 590 | 336  | 
\texttt{(T, S)}
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
337  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
338  | 
|
| 584 | 339  | 
\noindent  | 
| 591 | 340  | 
Consequently, the function \texttt{parse} in the sequence parser
 | 
341  | 
combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
 | 
|
342  | 
we have essentially two output types for the sequence parser  | 
|
343  | 
combinator (packaged in a pair), because in general \textit{p} and
 | 
|
344  | 
\textit{q} might produce different things (for example we recognise a
 | 
|
345  | 
number with \texttt{p} and then with \texttt{q} a string corresponding
 | 
|
346  | 
to an operator).  If any of the runs of \textit{p} and \textit{q}
 | 
|
347  | 
fail, that is produce the empty set, then \texttt{parse} will also
 | 
|
348  | 
produce the empty set.  | 
|
| 584 | 349  | 
|
| 587 | 350  | 
With the shorthand notation we shall introduce later for the sequence  | 
| 799 | 351  | 
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
 | 
| 587 | 352  | 
is the parser combinator that first recognises the character  | 
| 936 | 353  | 
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
 | 
354  | 
able to write just \pcode{p"ab"} for such parsers, but it is good to
 | 
|
355  | 
understand first what happens behind the scenes.) Let us look again  | 
|
356  | 
at some examples of how the sequence parser combinator processes some  | 
|
357  | 
strings:  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
358  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
359  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
360  | 
\begin{tabular}{rcl}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
361  | 
input strings & & output\medskip\\  | 
| 936 | 362  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
 | 
| 584 | 363  | 
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
 | 
364  | 
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
365  | 
\end{tabular}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
366  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
367  | 
|
| 586 | 368  | 
\noindent In the first line we have a successful parse, because the  | 
| 587 | 369  | 
string starts with \texttt{ab}, which is the prefix we are looking
 | 
| 584 | 370  | 
for. But since the parsing combinator is constructed as sequence of  | 
371  | 
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
 | 
|
372  | 
result is a nested pair of the form \texttt{((a, b), cde)}. It is
 | 
|
| 586 | 373  | 
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
 | 
| 587 | 374  | 
expect. The parser returns the empty set in the other examples,  | 
| 584 | 375  | 
because they do not fit with what the parser is supposed to parse.  | 
376  | 
||
377  | 
||
| 799 | 378  | 
A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
 | 
| 587 | 379  | 
parses as first character either an \texttt{a} or \texttt{b}, followed
 | 
380  | 
by a \texttt{c}. This parser produces the following outputs.
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
381  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
382  | 
\begin{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
383  | 
\begin{tabular}{rcl}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
384  | 
input strings & & output\medskip\\  | 
| 936 | 385  | 
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | 
386  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | 
|
| 585 | 387  | 
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
 | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
388  | 
\end{tabular}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
389  | 
\end{center}
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
390  | 
|
| 585 | 391  | 
\noindent  | 
| 799 | 392  | 
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
 | 
| 585 | 393  | 
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
 | 
394  | 
the following outputs.  | 
|
395  | 
||
396  | 
\begin{center}
 | 
|
397  | 
\begin{tabular}{rcl}
 | 
|
398  | 
input strings & & output\medskip\\  | 
|
| 936 | 399  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | 
| 585 | 400  | 
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
 | 
401  | 
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
 | 
|
402  | 
\end{tabular}
 | 
|
403  | 
\end{center}
 | 
|
404  | 
||
405  | 
||
406  | 
\noindent The second and third example fail, because something is  | 
|
| 590 | 407  | 
``missing'' in the sequence we are looking for. The first succeeds but  | 
408  | 
notice how the results nest with sequences: the parsed part is a  | 
|
409  | 
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
 | 
|
| 799 | 410  | 
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
 | 
| 590 | 411  | 
our output pairs nest differently  | 
| 589 | 412  | 
|
413  | 
\begin{center}
 | 
|
414  | 
\begin{tabular}{rcl}
 | 
|
415  | 
input strings & & output\medskip\\  | 
|
| 936 | 416  | 
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
 | 
| 589 | 417  | 
\end{tabular}
 | 
418  | 
\end{center}
 | 
|
419  | 
||
420  | 
\noindent  | 
|
421  | 
Two more examples: first consider the parser  | 
|
| 799 | 422  | 
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
423  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
424  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
425  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
426  | 
input string & & output\medskip\\  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
427  | 
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
428  | 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
429  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
430  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
431  | 
|
| 591 | 432  | 
\noindent Notice again how the results nest deeper and deeper as pairs (the  | 
| 585 | 433  | 
last \pcode{a} is in the unprocessed part). To consume everything of
 | 
| 799 | 434  | 
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
 | 
435  | 
p"a"}. Then the output is as follows:  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
436  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
437  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
438  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
439  | 
input string & & output\medskip\\  | 
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
440  | 
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
441  | 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
442  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
443  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
444  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
445  | 
\noindent This is an instance where the parser consumed  | 
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
446  | 
completely the input, meaning the unprocessed part is just the  | 
| 587 | 447  | 
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
 | 
| 585 | 448  | 
we would get back the result  | 
449  | 
||
450  | 
\[  | 
|
451  | 
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
 | 
|
452  | 
\]  | 
|
453  | 
||
454  | 
\noindent where the unprocessed (empty) parts have been stripped away  | 
|
455  | 
from the pairs; everything where the second part was not empty has  | 
|
| 587 | 456  | 
been thrown away as well, because they represent  | 
| 590 | 457  | 
ultimately-unsuccessful-parses. The main point is that the sequence  | 
458  | 
parser combinator returns pairs that can nest according to the  | 
|
459  | 
nesting of the component parsers.  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
460  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
461  | 
|
| 936 | 462  | 
Consider also carefully that constructing a parser such  | 
463  | 
||
464  | 
\begin{center}
 | 
|
465  | 
\pcode{p"a" || (p"a" ~ p"b")}
 | 
|
466  | 
\end{center}
 | 
|
467  | 
||
468  | 
\noindent  | 
|
469  | 
will result in a typing error. The intention with this  | 
|
| 591 | 470  | 
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
 | 
| 590 | 471  | 
followed by a \texttt{b}. However, the first parser has as output type
 | 
472  | 
a single character (recall the type of \texttt{CharParser}), but the
 | 
|
473  | 
second parser produces a pair of characters as output. The alternative  | 
|
474  | 
parser is required to have both component parsers to have the same  | 
|
| 591 | 475  | 
type---the reason is that we need to be able to build the union of two  | 
476  | 
sets, which requires in Scala that the sets have the same type. Since  | 
|
477  | 
they are not in this case, there is a typing error. We will see later  | 
|
478  | 
how we can build this parser without the typing error.  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
479  | 
|
| 936 | 480  | 
The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
 | 
| 591 | 481  | 
actually combine two smaller parsers, but applies a function to the result  | 
| 587 | 482  | 
of a parser. It is implemented in Scala as follows  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
483  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
484  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
485  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 799 | 486  | 
class MapParser[I, T, S]  | 
| 936 | 487  | 
(p: => Parser[I, T],  | 
| 940 | 488  | 
         f: T => S) extends Parser[I, S] {
 | 
| 936 | 489  | 
def parse(in: I) =  | 
490  | 
for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
491  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
492  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
493  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
494  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
495  | 
|
| 590 | 496  | 
\noindent This parser combinator takes a parser \texttt{p} (with input
 | 
497  | 
type \texttt{I} and output type \texttt{T}) as one argument but also a
 | 
|
498  | 
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
 | 
|
| 936 | 499  | 
produces sets of type \texttt{Set[(S, I)]}. The semantic action
 | 
| 590 | 500  | 
combinator then applies the function \texttt{f} to all the `processed'
 | 
501  | 
parser outputs. Since this function is of type \texttt{T => S}, we
 | 
|
502  | 
obtain a parser with output type \texttt{S}. Again Scala lets us
 | 
|
503  | 
introduce some shorthand notation for this parser  | 
|
| 799 | 504  | 
combinator. Therefore we will write short \texttt{p.map(f)} for it.
 | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
505  | 
|
| 589 | 506  | 
What are semantic actions good for? Well, they allow you to transform  | 
| 590 | 507  | 
the parsed input into datastructures you can use for further  | 
| 591 | 508  | 
processing. A simple (contrived) example would be to transform parsed  | 
509  | 
characters into ASCII numbers. Suppose we define a function \texttt{f}
 | 
|
510  | 
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
 | 
|
| 589 | 511  | 
the character \texttt{c}.
 | 
| 587 | 512  | 
|
| 591 | 513  | 
|
| 587 | 514  | 
\begin{center}
 | 
515  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
516  | 
val f = (c: Char) => c.toInt  | 
|
517  | 
val c = new CharParser('c')
 | 
|
518  | 
\end{lstlisting}
 | 
|
519  | 
\end{center}
 | 
|
520  | 
||
521  | 
\noindent  | 
|
| 589 | 522  | 
We then can run the following two parsers on the input \texttt{cbd}:
 | 
| 587 | 523  | 
|
524  | 
\begin{center}
 | 
|
525  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
526  | 
c.parse("cbd")
 | 
|
| 799 | 527  | 
c.map(f).parse("cbd")
 | 
| 587 | 528  | 
\end{lstlisting}
 | 
529  | 
\end{center}
 | 
|
530  | 
||
531  | 
\noindent  | 
|
| 589 | 532  | 
In the first line we obtain the expected result \texttt{Set(('c',
 | 
533  | 
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
 | 
|
534  | 
character has been transformed into an ASCII number.  | 
|
| 588 | 535  | 
|
536  | 
A slightly less contrived example is about parsing numbers (recall  | 
|
| 591 | 537  | 
\texttt{NumParser} above). However, we want to do this here for
 | 
538  | 
strings, not for tokens. For this assume we have the following  | 
|
539  | 
(atomic) \texttt{RegexParser}.
 | 
|
| 588 | 540  | 
|
541  | 
\begin{center}
 | 
|
542  | 
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
 | 
|
543  | 
basicstyle=\small\ttfamily, numbers=none]  | 
|
544  | 
import scala.util.matching.Regex  | 
|
545  | 
||
546  | 
case class RegexParser(reg: Regex) extends Parser[String, String] {
 | 
|
547  | 
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
 | 
|
548  | 
case None => Set()  | 
|
549  | 
case Some(m) => Set((m.matched, m.after.toString))  | 
|
550  | 
}  | 
|
551  | 
}  | 
|
552  | 
\end{lstlisting}
 | 
|
553  | 
\end{center}
 | 
|
554  | 
||
555  | 
\noindent  | 
|
556  | 
This parser takes a regex as argument and splits up a string into a  | 
|
557  | 
prefix and the rest according to this regex  | 
|
558  | 
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
 | 
|
559  | 
case---and the corresponding strings can be extracted with  | 
|
| 591 | 560  | 
\texttt{matched} and \texttt{after}). The input and output type for
 | 
561  | 
this parser is \texttt{String}. Using \texttt{RegexParser} we can
 | 
|
562  | 
define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
 | 
|
563  | 
follows:  | 
|
| 588 | 564  | 
|
565  | 
\begin{center}
 | 
|
566  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
567  | 
val NumParser = RegexParser("[0-9]+".r)
 | 
|
568  | 
\end{lstlisting}
 | 
|
569  | 
\end{center}
 | 
|
570  | 
||
571  | 
\noindent  | 
|
| 591 | 572  | 
This parser will recognise a number at the beginning of a string. For  | 
| 588 | 573  | 
example  | 
574  | 
||
575  | 
\begin{center}
 | 
|
576  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
577  | 
NumParser.parse("123abc")
 | 
|
578  | 
\end{lstlisting}
 | 
|
579  | 
\end{center}  
 | 
|
580  | 
||
581  | 
\noindent  | 
|
582  | 
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
 | 
|
| 936 | 583  | 
still a string (the expected double-quotes are not printed by  | 
| 590 | 584  | 
Scala). We want to convert this string into the corresponding  | 
585  | 
\texttt{Int}. We can do this as follows using a semantic action
 | 
|
| 588 | 586  | 
|
587  | 
\begin{center}
 | 
|
588  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
| 799 | 589  | 
NumParser.map{s => s.toInt}.parse("123abc")
 | 
| 588 | 590  | 
\end{lstlisting}
 | 
591  | 
\end{center}  
 | 
|
592  | 
||
593  | 
\noindent  | 
|
| 589 | 594  | 
The function in the semantic action converts a string into an  | 
| 591 | 595  | 
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
 | 
| 936 | 596  | 
but this time \texttt{123} is an \texttt{Int}. Think carefully what
 | 
597  | 
the input and output type of the parser is without the semantic action  | 
|
| 940 | 598  | 
and what with the semantic action (the type of the function can  | 
| 936 | 599  | 
already tell you). Let us come back to semantic actions when we are  | 
600  | 
going to implement actual context-free grammars.  | 
|
| 587 | 601  | 
|
602  | 
\subsubsection*{Shorthand notation for parser combinators}
 | 
|
603  | 
||
604  | 
Before we proceed, let us just explain the shorthand notation for  | 
|
| 936 | 605  | 
parser combinators. Like for regular expressions, the shorthand  | 
606  | 
notation will make our life much easier when writing actual  | 
|
607  | 
parsers. We can define some extensions\footnote{In Scala 2 this was
 | 
|
608  | 
generically called as ``implicits''.} which allow us to write  | 
|
| 591 | 609  | 
|
610  | 
\begin{center}
 | 
|
611  | 
\begin{tabular}{ll}  
 | 
|
| 799 | 612  | 
  \pcode{p || q} & alternative parser\\
 | 
| 591 | 613  | 
  \pcode{p ~ q} & sequence parser\\ 
 | 
| 799 | 614  | 
  \pcode{p.map(f)} & semantic action parser
 | 
| 591 | 615  | 
\end{tabular}
 | 
616  | 
\end{center}
 | 
|
617  | 
||
618  | 
\noindent  | 
|
| 936 | 619  | 
We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
 | 
| 590 | 620  | 
|
621  | 
The idea is that this shorthand notation allows us to easily translate  | 
|
622  | 
context-free grammars into code. For example recall our context-free  | 
|
623  | 
grammar for palindromes:  | 
|
624  | 
||
625  | 
\begin{plstx}[margin=3cm]
 | 
|
| 591 | 626  | 
: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
 | 
| 590 | 627  | 
\end{plstx}
 | 
628  | 
||
629  | 
\noindent  | 
|
630  | 
Each alternative in this grammar translates into an alternative parser  | 
|
631  | 
combinator. The $\cdot$ can be translated to a sequence parser  | 
|
632  | 
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply  | 
|
| 799 | 633  | 
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
 | 
| 590 | 634  | 
|
| 587 | 635  | 
|
| 936 | 636  | 
\subsubsection*{How to build more interesting parsers using parser combinators?}
 | 
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
637  | 
|
| 588 | 638  | 
The beauty of parser combinators is the ease with which they can be  | 
639  | 
implemented and how easy it is to translate context-free grammars into  | 
|
640  | 
code (though the grammars need to be non-left-recursive). To  | 
|
| 591 | 641  | 
demonstrate this consider again the grammar for palindromes from above.  | 
| 590 | 642  | 
The first idea would be to translate it into the following code  | 
| 588 | 643  | 
|
644  | 
\begin{center}
 | 
|
645  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
646  | 
lazy val Pal : Parser[String, String] =  | 
|
| 799 | 647  | 
((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")  | 
| 588 | 648  | 
\end{lstlisting}
 | 
649  | 
\end{center}
 | 
|
650  | 
||
651  | 
\noindent  | 
|
| 590 | 652  | 
Unfortunately, this does not quite work yet as it produces a typing  | 
| 799 | 653  | 
error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
 | 
654  | 
\texttt{p""} all produce strings as output type and therefore can be
 | 
|
655  | 
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
 | 
|
656  | 
sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
 | 
|
| 591 | 657  | 
produce pairs of the form  | 
658  | 
||
659  | 
\begin{center}
 | 
|
660  | 
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
 | 
|
661  | 
\end{center}
 | 
|
662  | 
||
663  | 
\noindent That is how the  | 
|
664  | 
sequence parser combinator nests results when \pcode{\~} is used
 | 
|
665  | 
between two components. The solution is to use a semantic action that  | 
|
666  | 
``flattens'' these pairs and appends the corresponding strings, like  | 
|
| 588 | 667  | 
|
668  | 
\begin{center}
 | 
|
669  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
670  | 
lazy val Pal : Parser[String, String] =  | 
|
| 799 | 671  | 
  ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
 | 
672  | 
   (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
 | 
|
673  | 
p"a" || p"b" || p"")  | 
|
| 588 | 674  | 
\end{lstlisting}
 | 
675  | 
\end{center}
 | 
|
676  | 
||
| 589 | 677  | 
\noindent  | 
| 591 | 678  | 
How does this work? Well, recall again what the pairs look like for  | 
| 799 | 679  | 
the parser \pcode{p"a" ~ Pal ~ p"a"}.  The pattern in the semantic
 | 
| 591 | 680  | 
action matches the nested pairs (the \texttt{x} with the
 | 
681  | 
\texttt{a}-part and so on).  Unfortunately when we have such nested
 | 
|
682  | 
pairs, Scala requires us to define the function using the  | 
|
683  | 
\pcode{case}-syntax
 | 
|
684  | 
||
685  | 
\begin{center}
 | 
|
686  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
687  | 
{ case ((x, y), z) => ... }
 | 
|
688  | 
\end{lstlisting}
 | 
|
689  | 
\end{center}
 | 
|
690  | 
||
691  | 
\noindent  | 
|
692  | 
If we have more sequence parser combinators or have them differently nested,  | 
|
693  | 
then the pattern in the semantic action needs to be adjusted accordingly.  | 
|
694  | 
The action we implement above is to concatenate all three strings, which  | 
|
695  | 
means after the semantic action is applied the output type of the parser  | 
|
696  | 
is \texttt{String}, which means it fits with the alternative parsers
 | 
|
| 799 | 697  | 
\texttt{...|| p"a" || p"b" || p""}.
 | 
| 591 | 698  | 
|
699  | 
If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
 | 
|
| 593 | 700  | 
as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
 | 
| 591 | 701  | 
(an empty set would mean something is wrong). But also notice what the  | 
702  | 
intermediate results are generated by \pcode{Pal.parse("abaaaba")}
 | 
|
703  | 
||
704  | 
\begin{center}
 | 
|
705  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
706  | 
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
 | 
|
707  | 
\end{lstlisting}
 | 
|
708  | 
\end{center}
 | 
|
709  | 
||
710  | 
\noindent  | 
|
711  | 
That there are more than one output might be slightly unexpected, but  | 
|
712  | 
can be explained as follows: the pairs represent all possible  | 
|
713  | 
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
 | 
|
| 593 | 714  | 
corresponds to a complete parse (all output is consumed) and this is  | 
| 591 | 715  | 
what \pcode{Pal.parse_all} returns. The second pair is a small
 | 
716  | 
``sub-palindrome'' that can also be parsed, but the parse fails with  | 
|
717  | 
the rest \pcode{aaba}, which is therefore left as unprocessed. The
 | 
|
718  | 
third one is an attempt to parse the whole string with the  | 
|
719  | 
single-character parser \pcode{a}. That of course only partially
 | 
|
720  | 
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
 | 
|
| 593 | 721  | 
part. Finally, since we allow the empty string to be a palindrome we  | 
| 591 | 722  | 
also obtain the last pair, where actually nothing is consumed from the  | 
723  | 
input string. While all this works as intended, we need to be careful  | 
|
724  | 
with this (especially with including the \pcode{""} parser in our
 | 
|
725  | 
grammar): if during parsing the set of parsing attempts gets too big,  | 
|
726  | 
then the parsing process can become very slow as the potential  | 
|
727  | 
candidates for applying rules can snowball.  | 
|
| 589 | 728  | 
|
729  | 
||
| 591 | 730  | 
Important is also to note is that we must define the  | 
731  | 
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
 | 
|
732  | 
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
 | 
|
733  | 
just written  | 
|
734  | 
||
735  | 
\begin{center}
 | 
|
736  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
737  | 
val Pal : Parser[String, String] = ...rhs...  | 
|
738  | 
\end{lstlisting}
 | 
|
739  | 
\end{center}
 | 
|
740  | 
||
| 589 | 741  | 
\noindent  | 
| 593 | 742  | 
then Scala before making this assignment to \texttt{Pal} attempts to
 | 
| 591 | 743  | 
find out what the expression on the right-hand side evaluates to. This  | 
744  | 
is straightforward in case of simple expressions \texttt{2 + 3}, but
 | 
|
745  | 
the expression above contains \texttt{Pal} in the right-hand
 | 
|
746  | 
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
 | 
|
747  | 
evaluates to and start a new recursion, which means it falls into an  | 
|
748  | 
infinite loop. The definition of \texttt{Pal} is recursive and the
 | 
|
749  | 
\pcode{lazy} key-word prevents it from being fully evaluated. Therefore
 | 
|
750  | 
whenever we want to define a recursive parser we have to write  | 
|
751  | 
||
752  | 
\begin{center}
 | 
|
753  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
754  | 
lazy val SomeParser : Parser[...,...] = ...rhs...  | 
|
755  | 
\end{lstlisting}
 | 
|
756  | 
\end{center}
 | 
|
757  | 
||
758  | 
\noindent That was not necessary for our atomic parsers, like  | 
|
759  | 
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
 | 
|
760  | 
Note that this is also the reason why we had to write  | 
|
761  | 
||
762  | 
\begin{center}
 | 
|
763  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
764  | 
class AltParser[I, T]  | 
|
765  | 
(p: => Parser[I, T],  | 
|
| 936 | 766  | 
        q: => Parser[I, T]) ... extends Parser[I, T] {...}
 | 
| 591 | 767  | 
|
768  | 
class SeqParser[I, T, S]  | 
|
769  | 
(p: => Parser[I, T],  | 
|
| 936 | 770  | 
        q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
 | 
| 591 | 771  | 
\end{lstlisting}
 | 
772  | 
\end{center}
 | 
|
773  | 
||
774  | 
\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
 | 
|
775  | 
the argument types for \texttt{p} and \texttt{q} prevent Scala from
 | 
|
776  | 
evaluating the arguments. Normally, Scala would first evaluate what  | 
|
777  | 
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
 | 
|
| 593 | 778  | 
the alternative parser combinator, respectively sequence parser  | 
779  | 
combinator. Since the arguments can be recursive parsers, such as  | 
|
| 591 | 780  | 
\texttt{Pal}, this would lead again to an infinite loop.
 | 
781  | 
||
782  | 
As a final example in this section, let us consider the grammar for  | 
|
783  | 
well-nested parentheses:  | 
|
784  | 
||
785  | 
\begin{plstx}[margin=3cm]
 | 
|
786  | 
: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
 | 
|
787  | 
\end{plstx}
 | 
|
788  | 
||
789  | 
\noindent  | 
|
790  | 
Let us assume we want to not just recognise strings of  | 
|
| 593 | 791  | 
well-nested parentheses but also transform round parentheses  | 
| 591 | 792  | 
into curly braces. We can do this by using a semantic  | 
793  | 
action:  | 
|
794  | 
||
795  | 
\begin{center}
 | 
|
796  | 
  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
 | 
|
797  | 
xleftmargin=0mm, numbers=none]  | 
|
798  | 
lazy val P : Parser[String, String] =  | 
|
| 799 | 799  | 
  (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
 | 
| 591 | 800  | 
\end{lstlisting}
 | 
801  | 
\end{center}
 | 
|
802  | 
||
803  | 
\noindent  | 
|
804  | 
Here we define a function where which ignores the parentheses in the  | 
|
805  | 
pairs, but replaces them in the right places with curly braces when  | 
|
806  | 
assembling the new string in the right-hand side. If we run  | 
|
807  | 
\pcode{P.parse_all("(((()()))())")} we obtain
 | 
|
808  | 
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
 | 
|
809  | 
||
810  | 
||
| 588 | 811  | 
|
| 
386
 
31295bb945c6
update
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
385 
diff
changeset
 | 
812  | 
\subsubsection*{Implementing an Interpreter}
 | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
813  | 
|
| 593 | 814  | 
The first step before implementing an interpreter for a full-blown  | 
| 592 | 815  | 
language is to implement a simple calculator for arithmetic  | 
816  | 
expressions. Suppose our arithmetic expressions are given by the  | 
|
817  | 
grammar:  | 
|
818  | 
||
819  | 
\begin{plstx}[margin=3cm,one per line]
 | 
|
| 593 | 820  | 
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
 | 
| 592 | 821  | 
   | \meta{E} \cdot - \cdot \meta{E} 
 | 
822  | 
   | \meta{E} \cdot * \cdot \meta{E} 
 | 
|
823  | 
   | ( \cdot \meta{E} \cdot )
 | 
|
824  | 
| Number \\  | 
|
825  | 
\end{plstx}
 | 
|
826  | 
||
827  | 
\noindent  | 
|
828  | 
Naturally we want to implement the grammar in such a way that we can  | 
|
| 593 | 829  | 
calculate what the result of, for example, \texttt{4*2+3} is---we are
 | 
830  | 
interested in an \texttt{Int} rather than a string. This means every
 | 
|
831  | 
component parser needs to have as output type \texttt{Int} and when we
 | 
|
832  | 
assemble the intermediate results, strings like \texttt{"+"},
 | 
|
833  | 
\texttt{"*"} and so on, need to be translated into the appropriate
 | 
|
834  | 
Scala operation of adding, multiplying and so on. Being inspired by  | 
|
835  | 
the parser for well-nested parentheses above and ignoring the fact  | 
|
836  | 
that we want $*$ to take precedence over $+$ and $-$, we might want to  | 
|
837  | 
write something like  | 
|
| 592 | 838  | 
|
839  | 
\begin{center}
 | 
|
840  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
841  | 
lazy val E: Parser[String, Int] =  | 
|
| 799 | 842  | 
  ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
 | 
843  | 
   (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
 | 
|
844  | 
   (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
 | 
|
845  | 
   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
 | 
|
| 592 | 846  | 
NumParserInt)  | 
847  | 
\end{lstlisting}
 | 
|
848  | 
\end{center}
 | 
|
849  | 
||
850  | 
\noindent  | 
|
| 593 | 851  | 
Consider again carefully how the semantic actions pick out the correct  | 
852  | 
arguments for the calculation. In case of plus, we need \texttt{x} and
 | 
|
853  | 
\texttt{z}, because they correspond to the results of the component
 | 
|
854  | 
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
 | 
|
855  | 
an \texttt{Int} because the output type of \texttt{E} is
 | 
|
856  | 
\texttt{Int}.  Similarly with subtraction and multiplication. In
 | 
|
857  | 
contrast in the fourth clause we need to return \texttt{y}, because it
 | 
|
858  | 
is the result enclosed inside the parentheses. The information about  | 
|
859  | 
parentheses, roughly speaking, we just throw away.  | 
|
| 592 | 860  | 
|
861  | 
So far so good. The problem arises when we try to call \pcode{parse_all} with the
 | 
|
862  | 
expression \texttt{"1+2+3"}. Lets try it
 | 
|
863  | 
||
864  | 
\begin{center}
 | 
|
865  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
866  | 
E.parse_all("1+2+3")
 | 
|
867  | 
\end{lstlisting}
 | 
|
868  | 
\end{center}
 | 
|
869  | 
||
870  | 
\noindent  | 
|
| 593 | 871  | 
\ldots and we wait and wait and \ldots still wait. What is the  | 
872  | 
problem? Actually, the parser just fell into an infinite loop! The  | 
|
873  | 
reason is that the above grammar is left-recursive and recall that our  | 
|
874  | 
parser combinators cannot deal with such left-recursive  | 
|
875  | 
grammars. Fortunately, every left-recursive context-free grammar can be  | 
|
876  | 
transformed into a non-left-recursive grammars that still recognises  | 
|
877  | 
the same strings. This allows us to design the following grammar  | 
|
878  | 
||
879  | 
\begin{plstx}[margin=3cm]
 | 
|
880  | 
  : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
 | 
|
881  | 
: \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
 | 
|
882  | 
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
 | 
|
883  | 
\end{plstx}
 | 
|
884  | 
||
885  | 
\noindent  | 
|
886  | 
Recall what left-recursive means from Handout 5 and make sure you see  | 
|
887  | 
why this grammar is \emph{non} left-recursive. This version of the grammar
 | 
|
| 936 | 888  | 
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not  | 
| 593 | 889  | 
affect which strings this grammar can recognise, but in which order we are going  | 
890  | 
to evaluate any arithmetic expression. We can translate this grammar into  | 
|
891  | 
parsing combinators as follows:  | 
|
| 592 | 892  | 
|
893  | 
||
| 593 | 894  | 
\begin{center}
 | 
895  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
|
896  | 
lazy val E: Parser[String, Int] =  | 
|
| 799 | 897  | 
  (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
 | 
898  | 
  (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T 
 | 
|
| 593 | 899  | 
lazy val T: Parser[String, Int] =  | 
| 799 | 900  | 
  (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
 | 
| 593 | 901  | 
lazy val F: Parser[String, Int] =  | 
| 799 | 902  | 
  (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
 | 
| 593 | 903  | 
\end{lstlisting}
 | 
904  | 
\end{center}
 | 
|
| 592 | 905  | 
|
| 593 | 906  | 
\noindent  | 
| 594 | 907  | 
Let us try out some examples:  | 
| 592 | 908  | 
|
| 593 | 909  | 
\begin{center}
 | 
910  | 
\begin{tabular}{rcl}
 | 
|
911  | 
  input strings & & output of \pcode{parse_all}\medskip\\
 | 
|
912  | 
  \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
 | 
|
913  | 
  \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
 | 
|
914  | 
  \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
 | 
|
| 594 | 915  | 
  \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
 | 
| 593 | 916  | 
  \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
 | 
917  | 
  \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\                      
 | 
|
918  | 
\end{tabular}
 | 
|
919  | 
\end{center}
 | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
920  | 
|
| 593 | 921  | 
\noindent  | 
| 594 | 922  | 
Note that we call \pcode{parse_all}, not \pcode{parse}.  The examples
 | 
923  | 
should be quite self-explanatory. The last two example do not produce  | 
|
924  | 
any integer result because our parser does not define what to do in  | 
|
925  | 
case of division (could be easily added), but also has no idea what to  | 
|
| 595 | 926  | 
do with whitespaces. To deal with them is the task of the lexer! Yes,  | 
| 594 | 927  | 
we can deal with them inside the grammar, but that would render many  | 
928  | 
grammars becoming unintelligible, including this one.\footnote{If you
 | 
|
929  | 
think an easy solution is to extend the notion of what a number  | 
|
930  | 
should be, then think again---you still would have to deal with  | 
|
| 595 | 931  | 
  cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess 
 | 
932  | 
you would have in a grammar for a full-blown language where there are  | 
|
933  | 
numerous such cases.}  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
934  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
935  | 
\end{document}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
936  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
937  | 
%%% Local Variables:  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
938  | 
%%% mode: latex  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
939  | 
%%% TeX-master: t  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
940  | 
%%% End:  |