% author:    Christian Urban <urbanc@in.tum.de>
% date:      Fri, 26 Oct 2018 16:14:10 +0100
% changeset: 592 (6b62b697321d), parent 591 (e3d10383ae37), child 593 (f7c6512bb85a)
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\begin{document}
\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly, this is only easy in a
functional programming language). Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties we need of the input are that we can
test whether it is empty and that we can take it apart
``sequentially''. Strings and lists fit this bill. However, parser
combinators also have their drawbacks. For example, they require that
the grammar to be parsed is \emph{not} left-recursive, and they are
efficient only when the grammar is unambiguous. It is the
responsibility of the grammar designer to ensure these two properties
hold.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
$\Rightarrow$
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
\end{center}
\noindent
Given the extended effort we have spent implementing a lexer in order
to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
tokens. This makes the explanations more lucid and the examples
quicker to write. It does not make our previous work on lexers obsolete
(remember they transform a string into a list of tokens). Lexers will
still be needed for building a somewhat realistic compiler.

As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
\end{center}
\noindent That is, they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I often actually have to write
\texttt{I <\% Seq[\_]} for the input type. This ensures the
input can be viewed as a Scala sequence. The first component of the
generated pairs corresponds to what the parser combinator was able to
parse from the input and the second is the unprocessed, or
leftover, part of the input (therefore the type of this unprocessed part is
the same as the type of the input). A parser combinator might return
more than one such pair; the idea is that there are potentially several
ways to parse the input. As a concrete example, consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}
\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leaves \texttt{foo\VS testbar} as the
unprocessed part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This will indicate
something ``went wrong''\ldots or more precisely, nothing could be
parsed.

Also important to note is that the type \texttt{T} for the processed
part is in general different from the input type \texttt{I} of the
parser. In the example above it just happens to be the same. The
reason for the difference is that in general we are interested in
transforming our input into something ``different''\ldots for example
into a tree; or if we implement the grammar for arithmetic
expressions, we might be interested in the actual integer number the
arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
we can use parser combinators to implement a calculator relatively
easily, for instance.

The main idea of parser combinators is that we can easily build parser
combinators out of smaller components following very closely the
structure of a grammar. In order to implement this in a
functional/object-oriented programming language, like Scala, we need
to specify an abstract class for parser combinators. In the abstract
class we specify that \texttt{I} is the \emph{input type} of the
parser combinator and that \texttt{T} is the \emph{output type}.  This
implies that the function \texttt{parse} takes an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
  // to be supplied by every concrete parser
  def parse(in: I) : Set[(T, I)]

  // keeps only the results where no input is left over
  def parse_all(in: I) : Set[T] =
    for ((head, tail) <- parse(in); if (tail.isEmpty))
      yield head
}
\end{lstlisting}
\end{center}
\noindent It is the obligation of each instance of this class to
supply an implementation for \texttt{parse}.  From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is, pairs that still have some unprocessed part). The reason is that
at the end of the parsing we are only interested in the results where
all the input has been consumed and no unprocessed part is left over.

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the input string $s$ starts with the character $c$, then return
  the set
  \[\{(c, \textit{tail of}\; s)\}\]
  where \textit{tail of} $s$ is the unprocessed part of the input string.
\item Otherwise return the empty set $\{\}$.
\end{itemize}

\noindent
The input type of this simple parser combinator is \texttt{String} and
the output type is \texttt{Char}. This means \texttt{parse} returns
\mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String) = 
    // nonEmpty guards in.head, which would fail on the empty string
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see that \texttt{parse} tests whether the
first character of the input string \texttt{in} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{in.tail}. In case
\texttt{in} does not start with \texttt{c}, the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.
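
The behaviour of \texttt{parse} and \texttt{parse\_all} for such a
character parser can be tried out with the following minimal sketch. It
is a simplification made for this illustration and not the code used in
the rest of the handout: the name \texttt{StringParser} is made up, and
the input type is fixed to \texttt{String} so that \texttt{isEmpty},
\texttt{head} and \texttt{tail} are available without any type bounds.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// Assumption for this sketch: the input is always a String.
// StringParser is a hypothetical name used only here.
abstract class StringParser[T] {
  def parse(in: String): Set[(T, String)]

  // keep only the results where the whole input was consumed
  def parse_all(in: String): Set[T] =
    for ((head, tail) <- parse(in); if tail.isEmpty) yield head
}

case class CharParser(c: Char) extends StringParser[Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set.empty
}

val p = CharParser('a')
val r1 = p.parse("abc")      // Set((a,bc))
val r2 = p.parse("xyz")      // Set()
val r3 = p.parse("")         // Set(), empty input is handled too
val r4 = p.parse_all("a")    // Set(a)
val r5 = p.parse_all("abc")  // Set(), because "bc" is left over
\end{lstlisting}
\end{center}

\noindent The last two lines show the difference between the two
functions: \texttt{parse} is content with recognising a prefix of the
input, whereas \texttt{parse\_all} only keeps results where nothing is
left over.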

If we want to parse a list of tokens and are interested in recognising a
number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    case Num_token(s)::rest => Set((s.toInt, rest)) 
    case _ => Set()
  }
}
\end{lstlisting}
\end{center}
\noindent
In this parser the input is of type \texttt{List[Token]}. The function
\texttt{parse} looks at the input \texttt{ts} and checks whether the first
token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above;
rather it also extracts the string \texttt{s} from the token and
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
\texttt{Token}. Such a conversion would be needed if we want to
implement a simple calculator program, because string-numbers need to
be transformed into \texttt{Int}-numbers in order to do the
calculations.
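
To see the number parser at work, here is a small self-contained
sketch. The token type is an assumption made only for this example:
\texttt{Token}, \texttt{Num\_token} and \texttt{Op\_token} are
hypothetical stand-ins for whatever the lexer actually produces, and
\texttt{Parser} is reduced to the single function \texttt{parse}.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

// Hypothetical tokens; a real lexer may produce different ones.
sealed trait Token
case class Num_token(s: String) extends Token
case class Op_token(s: String) extends Token

case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]): Set[(Int, List[Token])] = ts match {
    case Num_token(s) :: rest => Set((s.toInt, rest))
    case _ => Set.empty
  }
}

val toks: List[Token] =
  List(Num_token("42"), Op_token("+"), Num_token("1"))

// the parsed part is the Int 42, not the token or the string "42"
val r1 = NumParser.parse(toks)
// the list now starts with an operator, so nothing can be parsed
val r2 = NumParser.parse(toks.tail)
\end{lstlisting}
\end{center}

\noindent Note that \texttt{s.toInt} throws an exception if the string
is not a number, which is why the parser has to rely on the lexer
producing only well-formed number tokens.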

These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
  parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):

\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement the alternative parser
combinator as follows
\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T] 
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}
\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type \texttt{Set[(T,
  I)]}.\footnote{There is an interesting detail of Scala, namely the
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
  will prevent the evaluation of the arguments before they are
  used. This is often called \emph{lazy evaluation} of the
  arguments (in Scala terminology, \texttt{p} and \texttt{q} are
  by-name parameters). We will explain this later.} The alternative
parser runs the input with the first parser \texttt{p} (producing a set
of pairs) and then runs the same input with \texttt{q} (producing
another set of pairs). The result should then be just the union of both
sets, which is the operation \texttt{++} in Scala.

The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}
\noindent Later on we will use Scala mechanisms for introducing some
more readable shorthand notation for this, like \texttt{"a" |
  "b"}. Let us look in detail at what this parser combinator produces
with some sample strings.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is, a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the parsed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything from \pcode{ccde}, therefore the empty
set is returned.
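
The three rows of this table can be reproduced with the following
self-contained sketch. It restates the classes from above in a reduced
form (\texttt{Parser} only has the function \texttt{parse}, and the
result types are written out explicitly); the names \texttt{aOrB} and
\texttt{r1} to \texttt{r3} are made up for the example.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set.empty
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  // union of everything p and q can make of the same input
  def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
}

val aOrB = new AltParser(CharParser('a'), CharParser('b'))
val r1 = aOrB.parse("acde")   // Set((a,cde))
val r2 = aOrB.parse("bcde")   // Set((b,cde))
val r3 = aOrB.parse("ccde")   // Set()
\end{lstlisting}
\end{center}

\noindent In each case at most one of the two component parsers
succeeds, so the union contains at most one pair.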

A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to first apply $p$ to the
input, producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we would
write something like this for the set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
$(\textit{output}_1, u_1) \in p(\text{input}) 
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that $p$ will first be run on the input,
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
stands for the unprocessed, or leftover, parts of $p$. We want
$q$ to run on all these unprocessed parts $u_1$. Therefore these
unprocessed parts are fed into the second parser $q$. The overall
result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of the sequence parser combinator is the unprocessed
part the second parser $q$ leaves as leftover. The parsed parts of the
component parsers are combined in a pair, namely
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
know what $p$ and $q$ were able to parse. This behaviour can be
implemented in Scala as follows:
\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S] 
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(in: I) = 
    for ((output1, u1) <- p.parse(in); 
         (output2, u2) <- q.parse(u1)) 
            yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}
\noindent This parser again takes two parsers as arguments, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: first run the
parser \texttt{p} on the input producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
several such pairs). Then let \texttt{q} run on these unprocessed
parts producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component
is the unprocessed part left over after running the second parser
\texttt{q}. Note that the input type of the sequence parser combinator
is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{(T, S)}
\end{center}

\noindent
Consequently, the function \texttt{parse} in the sequence parser
combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
we have essentially two output types for the sequence parser
combinator (packaged in a pair), because in general \textit{p} and
\textit{q} might produce different things (for example we recognise a
number with \texttt{p} and then with \texttt{q} a string corresponding
to an operator).  If \textit{p} fails, that is produces the empty set,
or if \textit{q} fails on every unprocessed part left over by
\textit{p}, then \texttt{parse} will also produce the empty set.

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. Let us look again at
some examples of how this parser combinator processes some strings:
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parser combinator is constructed as a sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect. The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.
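
The nested pair can be observed directly with the following
self-contained sketch, which builds the parser without the shorthand
notation. As before, \texttt{Parser} is reduced to the function
\texttt{parse} and result types are written out explicitly; the names
\texttt{ab} and \texttt{r1} to \texttt{r3} are made up for the example.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set.empty
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S])
    extends Parser[I, (T, S)] {
  // run q on every leftover u1 of p and pair up the parsed parts
  def parse(in: I): Set[((T, S), I)] =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
      yield ((output1, output2), u2)
}

// corresponds to "a" ~ "b"
val ab = new SeqParser(CharParser('a'), CharParser('b'))
val r1 = ab.parse("abcde")   // Set(((a,b),cde))
val r2 = ab.parse("bacde")   // Set(), already p fails
val r3 = ab.parse("cccde")   // Set()
\end{lstlisting}
\end{center}

\noindent The static type of \texttt{r1} is
\texttt{Set[((Char, Char), String)]}, which makes the nesting of the
pair visible in the types as well.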
359  | 
||
360  | 
||
| 589 | 361  | 
A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which
 | 
| 587 | 362  | 
parses as first character either an \texttt{a} or \texttt{b}, followed
 | 
363  | 
by a \texttt{c}. This parser produces the following outputs.
 | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
364  | 
|
| 
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
297 
diff
changeset
 | 
365  | 
\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}


\noindent The second and third examples fail, because something is
``missing'' in the sequence we are looking for. The first succeeds, but
notice how the results nest with sequences: the parsed part is a
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
parser differently, say \pcode{"a" ~ ("b" ~ "c")}, then
our output pairs also nest differently:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}

\noindent
Two more examples: first consider the parser
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice again how the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume the entire
string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
"a"}. Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
the input completely, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; everything where the second part was not empty has
been thrown away as well, because they represent
ultimately-unsuccessful-parses. The main point is that the sequence
parser combinator returns pairs that can nest according to the
nesting of the component parsers.

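
The nesting behaviour shown in the tables above can be reproduced with
a small self-contained sketch. It is not the handout's own code: the
object name \texttt{NestingDemo}, the \texttt{StringParser} stand-in for
the atomic parsers and the restriction of \pcode{parse_all} to string
input (via an evidence parameter) are assumptions made to keep the
example short, and the sequence parsers are built with explicit
constructor calls rather than with the infix shorthand.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
object NestingDemo {
  // simplified stand-ins for the handout's parser classes
  abstract class Parser[I, T] {
    def parse(in: I): Set[(T, I)]
    // keep only the results whose unprocessed part is empty
    // (assumption: this sketch only supports string input here)
    def parse_all(in: I)(implicit ev: I <:< String): Set[T] =
      for ((head, tail) <- parse(in); if ev(tail).isEmpty) yield head
  }

  case class StringParser(s: String) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }

  class SeqParser[I, T, S](p: => Parser[I, T],
                           q: => Parser[I, S])
      extends Parser[I, (T, S)] {
    def parse(in: I): Set[((T, S), I)] =
      for ((out1, rest1) <- p.parse(in);
           (out2, rest2) <- q.parse(rest1)) yield ((out1, out2), rest2)
  }

  val a = StringParser("a")
  val b = StringParser("b")
  val c = StringParser("c")

  val left  = new SeqParser(new SeqParser(a, b), c) // ("a" ~ "b") ~ "c"
  val right = new SeqParser(a, new SeqParser(b, c)) // "a" ~ ("b" ~ "c")
  val aaa   = new SeqParser(new SeqParser(a, a), a) // ("a" ~ "a") ~ "a"

  def main(args: Array[String]): Unit = {
    println(left.parse("abcde"))   // output part nests to the left
    println(right.parse("abcde"))  // output part nests to the right
    println(left.parse("abde"))    // empty set: the "c" is missing
    println(aaa.parse("aaaa"))     // one result, "a" left unprocessed
    println(aaa.parse_all("aaaa")) // empty set: input not used up
  }
}
\end{lstlisting}
\end{center}

\noindent The last two lines illustrate the difference between
\pcode{parse} and \pcode{parse_all}: the same parser has one partial
parse of \texttt{aaaa}, but no complete one.
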
 
Note carefully that constructing a parser such as \pcode{"a" |
  ("a" ~ "b")} will result in a typing error. The intention with this
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
followed by a \texttt{b}. However, the first parser has as output type
a single character (recall the type of \texttt{CharParser}), but the
second parser produces a pair of characters as output. The alternative
parser requires both of its component parsers to have the same output
type. The reason is that we need to be able to build the union of two
sets, which requires in Scala that the sets have the same type. Since
they do not in this case, there is a typing error. We will see later
how we can build this parser without the typing error.

The next parser combinator, called \emph{semantic action}, does not
actually combine two smaller parsers, but applies a function to the result
of a parser. It is implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(in: I) = 
    for ((head, tail) <- p.parse(in)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent This parser combinator takes a parser \texttt{p} (with input
type \texttt{I} and output type \texttt{T}) as one argument but also a
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
produces sets of type \texttt{Set[(T, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
introduce some shorthand notation for this parser
combinator: we will write \texttt{p ==> f} for it.

What are semantic actions good for? Well, they allow you to transform
the parsed input into data structures you can use for further
processing. A simple (contrived) example would be to transform parsed
characters into ASCII numbers. Suppose we define a function \texttt{f}
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
the character \texttt{c}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val f = (c: Char) => c.toInt
val c = new CharParser('c')
\end{lstlisting}
\end{center}

\noindent
We can then run the following two parsers on the input \texttt{cbd}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
c.parse("cbd")
(c ==> f).parse("cbd")
\end{lstlisting}
\end{center}

\noindent
In the first line we obtain the expected result \texttt{Set(('c',
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}: the
character has been transformed into an ASCII number.

A slightly less contrived example is about parsing numbers (recall
\texttt{NumParser} above). However, we want to do this here for
strings, not for tokens. For this assume we have the following
(atomic) \texttt{RegexParser}.

\begin{center}
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
    basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex
    
case class RegexParser(reg: Regex) extends Parser[String, String] {
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
    case None => Set()
    case Some(m) => Set((m.matched, m.after.toString))  
  }
}
\end{lstlisting}
\end{center}

\noindent
This parser takes a regex as argument and splits up a string into a
prefix and the rest according to this regex
(\texttt{reg.findPrefixMatchOf} generates a match in the successful
case, and the corresponding strings can be extracted with
\texttt{matched} and \texttt{after}). The input and output type for
this parser is \texttt{String}. Using \texttt{RegexParser} we can
define a \texttt{NumParser} for strings as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val NumParser = RegexParser("[0-9]+".r)
\end{lstlisting}
\end{center}

\noindent
This parser will recognise a number at the beginning of a string. For
example

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.parse("123abc")
\end{lstlisting}
\end{center}

\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
still a string (the required double-quotes are not printed by
Scala). We want to convert this string into the corresponding
\texttt{Int}. We can do this as follows using a semantic action:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
(NumParser ==> (s => s.toInt)).parse("123abc")
\end{lstlisting}
\end{center}

\noindent
The function in the semantic action converts a string into an
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
but this time \texttt{123} is an \texttt{Int}. We will come back to
semantic actions when we implement actual context-free
grammars.

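
The fragments above can be assembled into one self-contained sketch.
The minimal \texttt{Parser} class and the object name \texttt{NumDemo}
are assumptions of this sketch. The name \texttt{NumParserInt} is the
one used in the calculator code later in this handout, but the
definition given here is only one plausible way to obtain such a parser.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

object NumDemo {
  // minimal stand-in for the handout's Parser class
  abstract class Parser[I, T] {
    def parse(in: I): Set[(T, I)]
  }

  case class RegexParser(reg: Regex) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      reg.findPrefixMatchOf(in) match {
        case None    => Set()
        case Some(m) => Set((m.matched, m.after.toString))
      }
  }

  class FunParser[I, T, S](p: => Parser[I, T],
                           f: T => S) extends Parser[I, S] {
    def parse(in: I): Set[(S, I)] =
      for ((head, tail) <- p.parse(in)) yield (f(head), tail)
  }

  // output type String
  val NumParser = RegexParser("[0-9]+".r)
  // output type Int (assumed definition, see the text above)
  val NumParserInt = new FunParser(NumParser, (s: String) => s.toInt)

  def main(args: Array[String]): Unit = {
    println(NumParser.parse("123abc"))    // parsed part is a String
    println(NumParserInt.parse("123abc")) // parsed part is an Int
    println(NumParserInt.parse("abc"))    // empty set: no digits
  }
}
\end{lstlisting}
\end{center}

\noindent The semantic action only changes the first component of each
pair; the unprocessed part \texttt{abc} is passed through unchanged.
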

\subsubsection*{Shorthand notation for parser combinators}

Before we proceed, let us just explain the shorthand notation for
parser combinators. As with regular expressions, the shorthand notation
will make our life much easier when writing actual parsers. We can define
some implicits which allow us to write

\begin{center}
\begin{tabular}{ll}  
  \pcode{p | q} & alternative parser\\
  \pcode{p ~ q} & sequence parser\\ 
  \pcode{p ==> f} & semantic action parser
\end{tabular}
\end{center}

\noindent
as well as to use plain strings for specifying simple string parsers.

The idea is that this shorthand notation allows us to easily translate
context-free grammars into code. For example recall our context-free
grammar for palindromes:

\begin{plstx}[margin=3cm]
: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
\end{plstx}

\noindent
Each alternative in this grammar translates into an alternative parser
combinator. The $\cdot$ can be translated to a sequence parser
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
written as \texttt{"a"}, \texttt{"b"} and \texttt{""}.
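
The implicits themselves might look roughly as follows. This is a
sketch under assumptions rather than the handout's actual definitions:
the names \texttt{string2parser}, \texttt{ParserOps} and
\texttt{StringParserOps} are hypothetical, and the parser classes are
repeated in a minimal form so that the example is self-contained. A
separate wrapper for strings is needed because Scala does not chain two
implicit conversions.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.language.implicitConversions

object ShorthandDemo {
  abstract class Parser[I, T] {
    def parse(in: I): Set[(T, I)]
  }
  case class StringParser(s: String) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }
  class AltParser[I, T](p: => Parser[I, T],
                        q: => Parser[I, T]) extends Parser[I, T] {
    def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
  }
  class SeqParser[I, T, S](p: => Parser[I, T],
                           q: => Parser[I, S])
      extends Parser[I, (T, S)] {
    def parse(in: I): Set[((T, S), I)] =
      for ((out1, rest1) <- p.parse(in);
           (out2, rest2) <- q.parse(rest1)) yield ((out1, out2), rest2)
  }
  class FunParser[I, T, S](p: => Parser[I, T],
                           f: T => S) extends Parser[I, S] {
    def parse(in: I): Set[(S, I)] =
      for ((head, tail) <- p.parse(in)) yield (f(head), tail)
  }

  // a plain string can be used where a string parser is expected
  implicit def string2parser(s: String): Parser[String, String] =
    StringParser(s)

  // infix operators on arbitrary parsers
  implicit class ParserOps[I, T](p: Parser[I, T]) {
    def |(q: => Parser[I, T]): Parser[I, T] = new AltParser(p, q)
    def ~[S](q: => Parser[I, S]): Parser[I, (T, S)] =
      new SeqParser(p, q)
    def ==>[S](f: T => S): Parser[I, S] = new FunParser(p, f)
  }

  // the same operators when the left-hand side is a plain string
  implicit class StringParserOps(s: String) {
    def |(q: => Parser[String, String]): Parser[String, String] =
      new AltParser(StringParser(s), q)
    def ~[S](q: => Parser[String, S]): Parser[String, (String, S)] =
      new SeqParser(StringParser(s), q)
    def ==>[S](f: String => S): Parser[String, S] =
      new FunParser(StringParser(s), f)
  }

  val abOrC = ("a" | "b") ~ "c"
  // both alternatives now have output type String
  val aOrAb = "a" | ("a" ~ "b") ==> { case (x, y) => x + y }

  def main(args: Array[String]): Unit = {
    println(abOrC.parse("acde"))
    println(aOrAb.parse("abc"))
  }
}
\end{lstlisting}
\end{center}

\noindent With these definitions \pcode{("a" | "b") ~ "c"} type-checks,
and a parser for ``either an \texttt{a}, or an \texttt{a} followed by a
\texttt{b}'' can be given a single output type by flattening the pair
with a semantic action, as in \texttt{aOrAb}.
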


\subsubsection*{How to build parsers using parser combinators?}

The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
code (though the grammars need to be non-left-recursive). To
demonstrate this consider again the grammar for palindromes from above.
The first idea would be to translate it into the following code:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] = 
  (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
\end{lstlisting}
\end{center}

\noindent
Unfortunately, this does not quite work yet as it produces a typing
error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
\texttt{""} all produce strings as output type and therefore can be
put into an alternative \texttt{...| "a" | "b" | ""}. But both
sequence parsers \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"}
produce pairs of the form

\begin{center}
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
\end{center}

\noindent That is how the
sequence parser combinator nests results when \pcode{\~} is used
between two components. The solution is to use a semantic action that
``flattens'' these pairs and appends the corresponding strings, like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] = 
  (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
    "a" | "b" | "")
\end{lstlisting}
\end{center}

\noindent
How does this work? Well, recall again what the pairs look like for
the parser \pcode{"a" ~ Pal ~ "a"}.  The pattern in the semantic
action matches the nested pairs (the \texttt{x} with the
\texttt{a}-part and so on).  Unfortunately when we have such nested
pairs, Scala requires us to define the function using the
\pcode{case}-syntax

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
{ case ((x, y), z) => ... }
\end{lstlisting}
\end{center}

\noindent
If we have more sequence parser combinators or have them differently nested,
then the pattern in the semantic action needs to be adjusted accordingly.
The action we implement above is to concatenate all three strings. After
the semantic action is applied, the output type of the parser
is therefore \texttt{String}, which means it fits with the alternative parsers
\texttt{...| "a" | "b" | ""}.

If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
as result \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
(an empty set would mean that the string is not a palindrome). But also notice
what intermediate results are generated by \pcode{Pal.parse("abaaaba")}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
\end{lstlisting}
\end{center}

\noindent
That there is more than one output might be slightly unexpected, but
can be explained as follows: the pairs represent all possible
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
corresponds to a complete parse (all input is consumed) and this is
what \pcode{Pal.parse_all} returns. The second pair is a small
``sub-palindrome'' that can also be parsed, but the parse fails with
the rest \pcode{aaba}, which is therefore left as unprocessed. The
third one is an attempt to parse the whole string with the
single-character parser \pcode{a}. That of course only partially
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
part. Finally, since we allow the empty string to be a palindrome we
also obtain the last pair, where actually nothing is consumed from the
input string. While all this works as intended, we need to be careful
with this (especially with including the \pcode{""} parser in our
grammar): if during parsing the set of parsing attempts gets too big,
then the parsing process can become very slow as the potential
candidates for applying rules can snowball.

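
The behaviour described above can be checked with the following
self-contained sketch. The parser classes are minimal stand-ins
(assumptions of this sketch, including the restriction of
\pcode{parse_all} to string input), and the atomic parsers \texttt{a},
\texttt{b} and \texttt{eps} are spelled out explicitly instead of
relying on a conversion from plain strings.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
object PalDemo {
  abstract class Parser[I, T] {
    def parse(in: I): Set[(T, I)]
    // assumption: parse_all is only offered for string input
    def parse_all(in: I)(implicit ev: I <:< String): Set[T] =
      for ((head, tail) <- parse(in); if ev(tail).isEmpty) yield head
  }
  case class StringParser(s: String) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }
  class AltParser[I, T](p: => Parser[I, T],
                        q: => Parser[I, T]) extends Parser[I, T] {
    def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
  }
  class SeqParser[I, T, S](p: => Parser[I, T],
                           q: => Parser[I, S])
      extends Parser[I, (T, S)] {
    def parse(in: I): Set[((T, S), I)] =
      for ((out1, rest1) <- p.parse(in);
           (out2, rest2) <- q.parse(rest1)) yield ((out1, out2), rest2)
  }
  class FunParser[I, T, S](p: => Parser[I, T],
                           f: T => S) extends Parser[I, S] {
    def parse(in: I): Set[(S, I)] =
      for ((head, tail) <- p.parse(in)) yield (f(head), tail)
  }
  implicit class ParserOps[I, T](p: Parser[I, T]) {
    def |(q: => Parser[I, T]): Parser[I, T] = new AltParser(p, q)
    def ~[S](q: => Parser[I, S]): Parser[I, (T, S)] =
      new SeqParser(p, q)
    def ==>[S](f: T => S): Parser[I, S] = new FunParser(p, f)
  }

  val a   = StringParser("a")
  val b   = StringParser("b")
  val eps = StringParser("")

  lazy val Pal: Parser[String, String] =
    ((a ~ Pal ~ a) ==> { case ((x, y), z) => x + y + z } |
     (b ~ Pal ~ b) ==> { case ((x, y), z) => x + y + z } |
      a | b | eps)

  def main(args: Array[String]): Unit = {
    println(Pal.parse_all("abaaaba")) // only the complete parse
    println(Pal.parse("abaaaba"))     // all four (partial) parses
    println(Pal.parse_all("abb"))     // empty set: not a palindrome
  }
}
\end{lstlisting}
\end{center}
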

It is also important to note that we must define the
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
just written

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val Pal : Parser[String, String] = ...rhs...
\end{lstlisting}
\end{center}

\noindent
then Scala, before making this assignment to \texttt{Pal}, attempts to
find out what the expression on the right-hand side evaluates to. This
is straightforward in case of simple expressions such as \texttt{2 + 3}, but
the expression above contains \texttt{Pal} in the right-hand
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
evaluates to and start a new recursion, which means it falls into an
infinite loop. The definition of \texttt{Pal} is recursive and the
\pcode{lazy} keyword prevents it from being fully evaluated. Therefore
whenever we want to define a recursive parser we have to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val SomeParser : Parser[...,...] = ...rhs...
\end{lstlisting}
\end{center}

\noindent That was not necessary for our atomic parsers, like
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
Note that this is also the reason why we had to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {...}

class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}

\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
the argument types for \texttt{p} and \texttt{q} prevents Scala from
evaluating the arguments. Normally, Scala would first evaluate what
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
the alternative parser combinator or, respectively, the sequence parser
combinator. Since the arguments can be recursive parsers, such as
\texttt{Pal}, this would lead again to an infinite loop.

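
The difference between an ordinary argument and a by-name argument
(written with \texttt{=>}) can be observed in isolation. The classes
\texttt{ByValue} and \texttt{ByName}, the counter and the function
\texttt{argument} below are hypothetical names used only for this
illustration; they are not part of the parser code.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
object ByNameDemo {
  var evaluations = 0

  // counts how often it is called
  def argument(): Int = { evaluations += 1; 42 }

  // ordinary argument: evaluated before the constructor runs
  class ByValue(x: Int)

  // by-name argument: evaluated only when force is called
  class ByName(x: => Int) { def force: Int = x }

  def main(args: Array[String]): Unit = {
    new ByValue(argument())
    println(evaluations)   // 1: the argument was evaluated eagerly
    val n = new ByName(argument())
    println(evaluations)   // still 1: nothing was evaluated yet
    println(n.force)       // 42, the argument is evaluated only now
    println(evaluations)   // 2
  }
}
\end{lstlisting}
\end{center}

\noindent In the same way, constructing an \texttt{AltParser} or a
\texttt{SeqParser} does not evaluate the parsers passed as arguments;
this happens only once \texttt{parse} is called.
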

As a final example in this section, let us consider the grammar for
well-nested parentheses:

\begin{plstx}[margin=3cm]
: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
\end{plstx}

\noindent
Let us assume we want to not just recognise strings of
well-nested parentheses but also transform round parentheses
into curly braces. We can do this by using a semantic
action:

\begin{center}
  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
    xleftmargin=0mm, numbers=none]
lazy val P : Parser[String, String] = 
  "(" ~ P ~ ")" ~ P ==> { case (((_,x),_),y) => "{" + x + "}" + y } | ""
\end{lstlisting}
\end{center}

\noindent
Here we define a function which ignores the parentheses in the
pairs, but replaces them in the right places with curly braces when
assembling the new string in the right-hand side. If we run
\pcode{P.parse_all("(((()()))())")} we obtain
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
\subsubsection*{Implementing an Interpreter}

The first step before implementing an interpreter for a full-blown
language is to implement a simple calculator for arithmetic
expressions. Suppose our arithmetic expressions are given by the
grammar:

\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::= 
   | \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot )
   | Number \\
\end{plstx}

\noindent
Naturally we want to implement the grammar in such a way that we can
calculate what the result of \texttt{4*2+3} is; we are interested in
an \texttt{Int} rather than a string. This means every component
parser needs to have as output type \texttt{Int} and when we assemble
the intermediate results, strings like \texttt{"+"}, \texttt{"*"} and
so on, need to be translated into the appropriate Scala operation.
Being inspired by the parser for well-nested parentheses and ignoring
the fact that we want $*$ to take precedence, we might write something
like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] = 
  (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} |
   E ~ "-" ~ E ==> { case ((x, y), z) => x - z} |
   E ~ "*" ~ E ==> { case ((x, y), z) => x * z} |
   "(" ~ E ~ ")" ==> { case ((x, y), z) => y} |
   NumParserInt)
\end{lstlisting}
\end{center}

\noindent
Consider again carefully how the semantic actions pick out the correct
arguments. In case of plus, we need \texttt{x} and \texttt{z}, because
they correspond to the results of the parser \texttt{E}. We can just
add \texttt{x + z} in order to obtain an \texttt{Int} because the output
type of \texttt{E} is \texttt{Int}.  Similarly with subtraction and
multiplication. In contrast in the fourth clause we need to return
\texttt{y}, because it is the result enclosed inside the parentheses.

So far so good. The problem arises when we try to call \pcode{parse_all} with the
expression \texttt{"1+2+3"}. Let's try it:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
E.parse_all("1+2+3")
\end{lstlisting}
\end{center}

\noindent
\ldots and we wait and wait and \ldots wait. What is the problem? Actually,
the parser just fell into an infinite loop. The reason is that the above
grammar is left-recursive and recall that parser combinators cannot deal with
such grammars. Luckily every left-recursive context-free grammar can be
transformed into a non-left-recursive grammar that still recognises the
same strings. This allows us to design a grammar for arithmetic
expressions that parser combinators can handle.

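
One common way to remove the left recursion is to stratify the grammar
into expressions, terms and factors, as sketched below. This choice is
an assumption of the sketch and not necessarily the grammar this
handout has in mind. The names \texttt{Expr}, \texttt{Term},
\texttt{Factor} and \texttt{Num} are hypothetical, the parser classes
are minimal stand-ins, and subtraction is left out because a
right-recursive rule would make it associate to the right.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

object CalcDemo {
  abstract class Parser[I, T] {
    def parse(in: I): Set[(T, I)]
    // assumption: parse_all is only offered for string input
    def parse_all(in: I)(implicit ev: I <:< String): Set[T] =
      for ((head, tail) <- parse(in); if ev(tail).isEmpty) yield head
  }
  case class StringParser(s: String) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
  }
  case class RegexParser(reg: Regex) extends Parser[String, String] {
    def parse(in: String): Set[(String, String)] =
      reg.findPrefixMatchOf(in) match {
        case None    => Set()
        case Some(m) => Set((m.matched, m.after.toString))
      }
  }
  class AltParser[I, T](p: => Parser[I, T],
                        q: => Parser[I, T]) extends Parser[I, T] {
    def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
  }
  class SeqParser[I, T, S](p: => Parser[I, T],
                           q: => Parser[I, S])
      extends Parser[I, (T, S)] {
    def parse(in: I): Set[((T, S), I)] =
      for ((out1, rest1) <- p.parse(in);
           (out2, rest2) <- q.parse(rest1)) yield ((out1, out2), rest2)
  }
  class FunParser[I, T, S](p: => Parser[I, T],
                           f: T => S) extends Parser[I, S] {
    def parse(in: I): Set[(S, I)] =
      for ((head, tail) <- p.parse(in)) yield (f(head), tail)
  }
  implicit class ParserOps[I, T](p: Parser[I, T]) {
    def |(q: => Parser[I, T]): Parser[I, T] = new AltParser(p, q)
    def ~[S](q: => Parser[I, S]): Parser[I, (T, S)] =
      new SeqParser(p, q)
    def ==>[S](f: T => S): Parser[I, S] = new FunParser(p, f)
  }

  val plus  = StringParser("+")
  val times = StringParser("*")
  val lpar  = StringParser("(")
  val rpar  = StringParser(")")
  val Num: Parser[String, Int] =
    RegexParser("[0-9]+".r) ==> ((s: String) => s.toInt)

  // Expr   ::= Term + Expr | Term
  // Term   ::= Factor * Term | Factor
  // Factor ::= ( Expr ) | Number
  lazy val Expr: Parser[String, Int] =
    (Term ~ plus ~ Expr) ==> { case ((x, _), z) => x + z } | Term
  lazy val Term: Parser[String, Int] =
    (Factor ~ times ~ Term) ==> { case ((x, _), z) => x * z } | Factor
  lazy val Factor: Parser[String, Int] =
    (lpar ~ Expr ~ rpar) ==> { case ((_, y), _) => y } | Num

  def main(args: Array[String]): Unit = {
    println(Expr.parse_all("1+2+3"))
    println(Expr.parse_all("4*2+3"))
    println(Expr.parse_all("(1+2)*3"))
  }
}
\end{lstlisting}
\end{center}

\noindent Every recursive call in this grammar happens only after at
least one character has been consumed, so
\pcode{Expr.parse_all("1+2+3")} terminates. Splitting the rules into
three levels also makes $*$ bind more tightly than $+$.
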
\end{document}

%%% Local Variables:  
%%% mode: latex
%%% TeX-master: t
%%% End: 