% !TEX program = xelatex
% author: Christian Urban <urbanc@in.tum.de>
% date: Tue, 01 Oct 2019 00:29:48 +0100
% changeset 639 (d14dac77a866), parent 595 (d062fb6feefd), child 799 (c18b991eaad2)
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}

\begin{document}

\section*{Handout 6 (Parser Combinators)}

This handout explains how \emph{parser combinators} work and how they
can be implemented in Scala. Their most distinguishing feature is that
they are very easy to implement (admittedly, this is only easy in a
functional programming language). Another good point of parser
combinators is that they can deal with any kind of input as long as
this input is of ``sequence-kind'', for example a string or a list of
tokens. The only two properties of the input we need are that we can
test when it is empty and that we can ``sequentially'' take it apart.
Strings and lists fit this bill. However, parser combinators also have
their drawbacks. For example, they require that the grammar to be
parsed is \emph{not} left-recursive, and they are efficient only when
the grammar is unambiguous. It is the responsibility of the grammar
designer to ensure these two properties hold.

The general idea behind parser combinators is to transform the input
into sets of pairs, like so

\begin{center}
$\underbrace{\text{list of tokens}}_{\text{input}}$
$\quad\Rightarrow\quad$
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
\end{center}

\noindent
Given the considerable effort we have spent implementing a lexer in
order to generate lists of tokens, it might be surprising that in what
follows we shall often use strings as input, rather than lists of
tokens. This keeps the explanations lucid and the examples short. It
does not make our previous work on lexers obsolete (remember they
transform a string into a list of tokens). Lexers will still be needed
for building a somewhat realistic compiler.

As mentioned above, parser combinators are relatively agnostic about what
kind of input they process. In my Scala code I use the following
polymorphic types for parser combinators:

\begin{center}
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
\end{center}

\noindent That is, they take as input something of type \texttt{I} and
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
needs to be of ``sequence-kind'', I often actually have to write
\texttt{I <\% Seq[\_]} for the input type. This view bound ensures the
input is, or can be implicitly converted to, a Scala sequence. The
first component of the generated pairs corresponds to what the parser
combinator was able to parse from the input and the second is the
unprocessed, or leftover, part of the input (therefore the type of
this unprocessed part is the same as the input). A parser combinator
might return more than one such pair; the idea is that there are
potentially several ways to parse the input. As a concrete example,
consider the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or as an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could recognise
\texttt{if} from the input and leaves \texttt{foo\VS testbar} as the
unprocessed part; in the other case it could recognise
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
parser cannot recognise anything from the input at all, then parser
combinators just return the empty set $\{\}$. This indicates that
something ``went wrong''\ldots or more precisely, that nothing could be
parsed.

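\noindent To make the shape of such result sets concrete, here is a
minimal sketch that uses plain functions from strings to sets of
pairs. The names \texttt{keyword}, \texttt{identifier},
\texttt{keywordOrId} and \texttt{PrefixDemo} are made up for this
illustration and are not part of the code developed in this handout.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
object PrefixDemo {
  // hypothetical helper: recognise the keyword "if" at the start
  def keyword(in: String): Set[(String, String)] =
    if (in.startsWith("if")) Set(("if", in.drop(2))) else Set()

  // hypothetical helper: recognise a maximal run of letters
  def identifier(in: String): Set[(String, String)] = {
    val (id, rest) = in.span(_.isLetter)
    if (id.nonEmpty) Set((id, rest)) else Set()
  }

  // both readings of the input are kept in one result set
  def keywordOrId(in: String): Set[(String, String)] =
    keyword(in) ++ identifier(in)

  def main(args: Array[String]): Unit = {
    // two pairs: one for the keyword, one for the identifier
    println(keywordOrId("iffoo testbar"))
    // nothing can be recognised, so the empty set is returned
    println(keywordOrId("123"))
  }
}
\end{lstlisting}
\end{center}
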
Also important to note is that the output type \texttt{T} for the
processed part can potentially be different from the input type
\texttt{I} in the parser. In the example above it just happens to be
the same. The reason for the difference is that in general we are
interested in transforming our input into something
``different''\ldots for example into a tree; or if we implement the
grammar for arithmetic expressions, we might be interested in the
actual integer number the arithmetic expression, say
\texttt{1 + 2 * 3}, stands for. In this way we can use parser
combinators to implement relatively easily a calculator, for instance
(we shall do this later on).

The main driving force behind parser combinators is that we can easily
build larger parsers out of smaller components following very
closely the structure of a grammar. In order to implement this in a
functional/object-oriented programming language, like Scala, we need
to specify an abstract class for parser combinators. In the abstract
class we specify that \texttt{I} is the \emph{input type} of the
parser combinator and that \texttt{T} is the \emph{output type}.  This
implies that the function \texttt{parse} takes an argument of type
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
  // to be supplied by each concrete parser
  def parse(in: I) : Set[(T, I)]

  // keep only the results where all input has been consumed;
  // tail.isEmpty needs I to be of sequence-kind (I <% Seq[_])
  def parse_all(in: I) : Set[T] =
    for ((head, tail) <- parse(in); if (tail.isEmpty))
      yield head
}
\end{lstlisting}
\end{center}

\noindent It is the obligation of each instance of this class to
supply an implementation for \texttt{parse}.  From this function we
can then ``centrally'' derive the function \texttt{parse\_all}, which
just filters out all pairs whose second component is not empty (that
is, which still have some unprocessed part). The reason is that at the
end of the parsing we are only interested in the results where all the
input has been consumed and no unprocessed part is left over.

One of the simplest parser combinators recognises just a
single character, say $c$, from the beginning of strings. Its
behaviour can be described as follows:

\begin{itemize}
\item If the input string $s$ starts with the character $c$, then
  return the set
  \[\{(c, \textit{tail of}\; s)\}\]
  where \textit{tail of} $s$ is the unprocessed part of the input
  string.
\item Otherwise return the empty set $\{\}$.
\end{itemize}

\noindent
The input type of this simple parser combinator is \texttt{String} and
the output type is \texttt{Char}. This means \texttt{parse} returns
\mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String) : Set[(Char, String)] =
    // the non-emptiness test avoids an exception on empty input
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent You can see that \texttt{parse} tests whether the
first character of the input string \texttt{in} is equal to
\texttt{c}. If yes, then it splits the string into the recognised part
\texttt{c} and the unprocessed part \texttt{in.tail}. In case
\texttt{in} does not start with \texttt{c}, the parser returns the
empty set (in Scala \texttt{Set()}). Since this parser recognises
characters and just returns characters as the processed part, the
output type of the parser is \texttt{Char}.

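\noindent As a small usage sketch, the following self-contained
program applies \texttt{CharParser} to a few strings. It makes some
assumptions of its own: the implicit parameter \texttt{ev} stands in
for the \texttt{I <\% Seq[\_]} bound mentioned earlier (so that the
emptiness test in \texttt{parse\_all} typechecks), \texttt{CharParser}
checks that the input is non-empty before looking at its head, and
\texttt{CharParserDemo} is a made-up name.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
// assumption: ev plays the role of the bound I <% Seq[_]
abstract class Parser[I, T](implicit ev: I => Seq[_]) {
  def parse(in: I): Set[(T, I)]

  // keep only the results whose unprocessed part is empty
  def parse_all(in: I): Set[T] =
    for ((head, tail) <- parse(in); if ev(tail).isEmpty)
      yield head
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

object CharParserDemo {
  def main(args: Array[String]): Unit = {
    val p = CharParser('a')
    println(p.parse("abc"))      // Set((a,bc))
    println(p.parse("xyz"))      // Set()
    println(p.parse_all("a"))    // Set(a)
    println(p.parse_all("abc"))  // Set(), since bc is left over
  }
}
\end{lstlisting}
\end{center}

\noindent The last two lines show the difference between
\texttt{parse} and \texttt{parse\_all}: only the input that is
consumed completely survives the filter.
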
If we want to parse a list of tokens and are interested in recognising
a number token, for example, we could write something like this

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]) = ts match {
    // rest is the remainder of the token list
    case Num_token(s)::rest => Set((s.toInt, rest))
    case _ => Set()
  }
}
\end{lstlisting}
\end{center}

\noindent
In this parser the input is of type \texttt{List[Token]}. The function
\texttt{parse} looks at the input \texttt{ts} and checks whether the
first token is a \texttt{Num\_token} (let us assume our lexer generated
these tokens for numbers). But this parser does not just return this
token (and the rest of the list), like the \texttt{CharParser} above;
rather it also extracts the string \texttt{s} from the token and
converts it into an integer. The hope is that the lexer did its work
well and this conversion always succeeds. The consequence of this is
that the output type for this parser is \texttt{Int}, not
\texttt{Token}. Such a conversion is needed if we want to
implement a simple calculator program, because string-numbers need to
be transformed into \texttt{Int}-numbers in order to do the
calculations.


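\noindent To try this out without a lexer at hand, the following
self-contained sketch supplies a made-up token type. The names
\texttt{Token}, \texttt{Op\_token} and \texttt{NumParserDemo}, and the
exact shape of \texttt{Num\_token}, are assumptions for this
illustration; the tokens of a real lexer may look different.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

// hypothetical tokens, standing in for the output of a lexer
sealed trait Token
case class Num_token(s: String) extends Token
case class Op_token(s: String) extends Token

case object NumParser extends Parser[List[Token], Int] {
  def parse(ts: List[Token]): Set[(Int, List[Token])] = ts match {
    case Num_token(s) :: rest => Set((s.toInt, rest))
    case _ => Set()
  }
}

object NumParserDemo {
  def main(args: Array[String]): Unit = {
    val toks: List[Token] =
      List(Num_token("42"), Op_token("+"), Num_token("1"))
    // the number 42 together with the two remaining tokens
    println(NumParser.parse(toks))
    // the first token is an operator, so the empty set is returned
    println(NumParser.parse(toks.tail))
  }
}
\end{lstlisting}
\end{center}
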
These simple parsers that just look at the input and do a simple
transformation are often called \emph{atomic} parser combinators.
More interesting are the parser combinators that build larger parsers
out of smaller component parsers. There are three such parser
combinators that can be implemented generically. The \emph{alternative
parser combinator} is as follows: given two parsers, say, $p$ and
$q$, we apply both parsers to the input (remember parsers are
functions) and combine the output (remember they are sets of pairs):

\begin{center}
$p(\text{input}) \cup q(\text{input})$
\end{center}

\noindent In Scala we can implement the alternative parser
combinator as follows

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T],
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}
\end{lstlisting}
\end{center}

\noindent The types of this parser combinator are again generic (we
have \texttt{I} for the input type, and \texttt{T} for the output
type). The alternative parser builds a new parser out of two existing
parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
parsers need to be able to process input of type \texttt{I} and return
in \texttt{parse} the same output type
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
  and \texttt{q}. They will prevent the evaluation of the arguments
  before they are used. This is often called \emph{lazy evaluation} of
  the arguments (in Scala terminology these are \emph{by-name}
  parameters). We will explain this later.} The alternative parser runs
the input with the first parser \texttt{p} (producing a set of pairs)
and then runs the same input with \texttt{q} (producing another set of
pairs). The result is then just the union of both sets, which
is the operation \texttt{++} in Scala.

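\noindent The effect of the \texttt{=>} in front of a parameter type
can be seen in isolation in the following sketch. The names
\texttt{ByNameDemo}, \texttt{Delayed}, \texttt{expensive} and
\texttt{count} are made up for this illustration; the point is only
that the argument expression is not run when the object is
constructed, but each time the parameter is used.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
object ByNameDemo {
  var count = 0

  // side-effecting helper, so we can observe when it is evaluated
  def expensive(): Int = { count += 1; 42 }

  // x is a by-name parameter: its argument is run at each use
  class Delayed(x: => Int) {
    def value: Int = x
  }

  def main(args: Array[String]): Unit = {
    val d = new Delayed(expensive())
    println(count)    // 0: constructing d did not run the argument
    println(d.value)  // 42
    println(count)    // 1: the argument was run on first use
    println(d.value)  // 42
    println(count)    // 2: and run again on the second use
  }
}
\end{lstlisting}
\end{center}

\noindent Delaying arguments in this way becomes important once a
parser is defined in terms of itself.
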
The alternative parser combinator allows us to construct a parser that
parses either a character \texttt{a} or \texttt{b} using the
\texttt{CharParser} shown above. For this we can write

\begin{center}
\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent Later on we will use a Scala mechanism for introducing some
more readable shorthand notation for this, like
\texttt{"a" | "b"}. Let us look in detail at what this parser
combinator produces with some sample strings.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent We receive in the first two cases a successful
output (that is, a non-empty set). In each case, either
\pcode{a} or \pcode{b} is in the parsed part, and
\pcode{cde} in the unprocessed part. Clearly this parser cannot
parse anything from \pcode{ccde}, therefore the empty
set is returned.

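\noindent The table can be reproduced with the following
self-contained sketch. It repeats the classes from above in a
simplified form (no \texttt{parse\_all}, explicit result types, and a
non-emptiness check in \texttt{CharParser} as a precaution);
\texttt{AltParserDemo} is a made-up name.

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  // union of the results of both component parsers
  def parse(in: I): Set[(T, I)] = p.parse(in) ++ q.parse(in)
}

object AltParserDemo {
  val aOrB =
    new AltParser[String, Char](CharParser('a'), CharParser('b'))

  def main(args: Array[String]): Unit = {
    println(aOrB.parse("acde"))  // Set((a,cde))
    println(aOrB.parse("bcde"))  // Set((b,cde))
    println(aOrB.parse("ccde"))  // Set()
  }
}
\end{lstlisting}
\end{center}
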
A bit more interesting is the \emph{sequence parser combinator}. Given
two parsers, say again, $p$ and $q$, we want to first apply $p$ to the
input, producing a set of pairs; then apply $q$ to all the unparsed
parts in the pairs; and then combine the results. Mathematically we
would write something like this for the set of pairs:

\begin{center}
\begin{tabular}{lcl}
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ &
$(\textit{output}_1, u_1) \in p(\text{input})
\;\wedge\;$\\
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}

\noindent Notice that $p$ will first be run on the input,
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
stands for the unprocessed, or leftover, parts of $p$. We want
$q$ to run on all these unprocessed parts $u_1$. Therefore these
unprocessed parts are fed into the second parser $q$. The overall
result of the sequence parser combinator is pairs of the form
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
unprocessed part of the sequence parser combinator is the unprocessed
part the second parser $q$ leaves as leftover. The parsed parts of the
component parsers are combined in a pair, namely
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
know what $p$ and $q$ were able to parse. This behaviour can be
implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T],
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(in: I) =
    for ((output1, u1) <- p.parse(in);
         (output2, u2) <- q.parse(u1))
            yield ((output1, output2), u2)
}
\end{lstlisting}
\end{center}

\noindent This parser takes again as arguments two parsers, \texttt{p}
and \texttt{q}. It implements \texttt{parse} as follows: first run the
parser \texttt{p} on the input producing a set of pairs
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
unprocessed parts left over by \texttt{p} (recall that there can be
several such pairs). Let then \texttt{q} run on these unprocessed
parts producing again a set of pairs. The output of the sequence
parser combinator is then a set containing pairs where the first
components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component
is the unprocessed part left over after running the second parser
\texttt{q}. Note that the input type of the sequence parser combinator
is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{(T, S)}
\end{center}

\noindent
Consequently, the function \texttt{parse} in the sequence parser
combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
we have essentially two output types for the sequence parser
combinator (packaged in a pair), because in general \texttt{p} and
\texttt{q} might produce different things (for example we recognise a
number with \texttt{p} and then with \texttt{q} a string corresponding
to an operator).  If \texttt{p} fails, or if \texttt{q} fails on every
part left over by \texttt{p}, that is produces the empty set, then
\texttt{parse} will also produce the empty set.

With the shorthand notation we shall introduce later for the sequence
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
is the parser combinator that first recognises the character
\texttt{a} from a string and then \texttt{b}. Let us look again at
some examples of how this parser combinator processes some strings:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent In the first line we have a successful parse, because the
string starts with \texttt{ab}, which is the prefix we are looking
for. But since the parser combinator is constructed as a sequence of
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
result is a nested pair of the form \texttt{((a, b), cde)}. It is
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
expect. The parser returns the empty set in the other examples,
because they do not fit with what the parser is supposed to parse.


A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which
parses as first character either an \texttt{a} or \texttt{b}, followed
by a \texttt{c}. This parser produces the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
the following outputs.

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
\end{tabular}
\end{center}

\noindent The second and third examples fail, because something is
``missing'' in the sequence we are looking for. The first succeeds, but
notice how the results nest with sequences: the parsed part is a
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
parser differently, say \pcode{"a" ~ ("b" ~ "c")}, then
our output pairs also nest differently:

\begin{center}
\begin{tabular}{rcl}
input strings & & output\medskip\\
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
\end{tabular}
\end{center}
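
\noindent The nesting of results can be tried out with the following
small sketch. It is not the code of this handout: to stay
self-contained it assumes a simplified \texttt{Parser[T]} whose input
type is fixed to \texttt{String}, together with a hypothetical
\texttt{StrParser} and an implicit conversion \texttt{string2parser};
these names are introduced here only for illustration.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.language.implicitConversions

// Simplified stand-in for Parser[I, T]: input fixed to String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
  // p ~ q : run p, then q on what p left unprocessed
  def ~[S](q: => Parser[S]): Parser[(T, S)] =
    new SeqParser(this, q)
}

class SeqParser[T, S](p: => Parser[T],
                      q: => Parser[S]) extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((h1, t1) <- p.parse(in);
         (h2, t2) <- q.parse(t1)) yield ((h1, h2), t2)
}

// Hypothetical atomic parser for a fixed string.
case class StrParser(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length)))
    else Set.empty
}

implicit def string2parser(s: String): Parser[String] =
  StrParser(s)

val left  = ("a" ~ "b") ~ "c"
val right = "a" ~ ("b" ~ "c")

println(left.parse("abcde"))   // Set((((a,b),c),de))
println(right.parse("abcde"))  // Set(((a,(b,c)),de))
println(left.parse("abde"))    // Set()
\end{lstlisting}
\end{center}

\noindent Both parsers accept the same strings; only the shape of the
returned pairs differs, mirroring where the parentheses were placed.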

\noindent
Two more examples: first consider the parser
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}

\noindent Notice again how the results nest deeper and deeper as pairs (the
last \pcode{a} is in the unprocessed part). To consume the whole
string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~ "a"}.
Then the output is as follows:

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}

\noindent This is an instance where the parser consumed
the input completely, meaning the unprocessed part is just the
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
we would get back the result

\[
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
\]

\noindent where the unprocessed (empty) parts have been stripped away
from the pairs; every pair whose second part was not empty has
been thrown away as well, because such pairs represent
ultimately unsuccessful parses. The main point is that the sequence
parser combinator returns pairs that can nest according to the
nesting of the component parsers.
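
\noindent The difference between \pcode{parse} and \pcode{parse_all}
can be made concrete with a small sketch. As an assumption made only
for this illustration, the input type is fixed to \texttt{String}, so
the class below is a simplified stand-in for \texttt{Parser[I, T]};
\texttt{StrParser} and \texttt{string2parser} are hypothetical names.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.language.implicitConversions

// Simplified stand-in for Parser[I, T]: input fixed to String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
  // keep only parses that consumed everything, drop the rest part
  def parse_all(in: String): Set[T] =
    for ((head, tail) <- parse(in) if tail.isEmpty) yield head
  def ~[S](q: => Parser[S]): Parser[(T, S)] =
    new SeqParser(this, q)
}

class SeqParser[T, S](p: => Parser[T],
                      q: => Parser[S]) extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((h1, t1) <- p.parse(in);
         (h2, t2) <- q.parse(t1)) yield ((h1, h2), t2)
}

// Hypothetical atomic parser for a fixed string.
case class StrParser(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length)))
    else Set.empty
}

implicit def string2parser(s: String): Parser[String] =
  StrParser(s)

val three = ("a" ~ "a") ~ "a"
val four  = (("a" ~ "a") ~ "a") ~ "a"

println(three.parse("aaaa"))      // Set((((a,a),a),a))
println(three.parse_all("aaaa"))  // Set()
println(four.parse_all("aaaa"))   // Set((((a,a),a),a))
\end{lstlisting}
\end{center}

\noindent The first and the last line print the same text, but they
mean different things: in the first, the final \texttt{a} is the
unprocessed rest; in the last, it is part of the parsed result.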

Note also that constructing a parser such as \pcode{"a" | ("a" ~ "b")}
will result in a typing error. The intention with this
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
followed by a \texttt{b}. However, the first parser has as output type
a single character (recall the type of \texttt{CharParser}), but the
second parser produces a pair of characters as output. The alternative
parser requires both component parsers to have the same output
type. The reason is that we need to be able to build the union of two
sets, which requires in Scala that the sets have the same type. Since
they do not in this case, there is a typing error. We will see later
how we can build this parser without the typing error.

The next parser combinator, called \emph{semantic action}, does not
actually combine two smaller parsers, but applies a function to the result
of a parser. It is implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(in: I) = 
    for ((head, tail) <- p.parse(in)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}

\noindent This parser combinator takes a parser \texttt{p} (with input
type \texttt{I} and output type \texttt{T}) as one argument but also a
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
produces sets of type \texttt{Set[(T, I)]}. The semantic action
combinator then applies the function \texttt{f} to all the `processed'
parser outputs. Since this function is of type \texttt{T => S}, we
obtain a parser with output type \texttt{S}. Again Scala lets us
introduce some shorthand notation for this parser
combinator: we will write \texttt{p ==> f} for short.

What are semantic actions good for? Well, they allow you to transform
the parsed input into data structures you can use for further
processing. A simple (contrived) example would be to transform parsed
characters into ASCII numbers. Suppose we define a function \texttt{f}
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
the character \texttt{c}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val f = (c: Char) => c.toInt
val c = new CharParser('c')
\end{lstlisting}
\end{center}

\noindent
We can then run the following two parsers on the input \texttt{cbd}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
c.parse("cbd")
(c ==> f).parse("cbd")
\end{lstlisting}
\end{center}

\noindent
In the first line we obtain the expected result \texttt{Set(('c',
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}: the
character has been transformed into an ASCII number.

A slightly less contrived example is about parsing numbers (recall
\texttt{NumParser} above). However, we want to do this here for
strings, not for tokens. For this assume we have the following
(atomic) \texttt{RegexParser}.

\begin{center}
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
    basicstyle=\small\ttfamily, numbers=none]
import scala.util.matching.Regex

case class RegexParser(reg: Regex) extends Parser[String, String] {
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
    case None => Set()
    case Some(m) => Set((m.matched, m.after.toString))
  }
}
\end{lstlisting}
\end{center}

\noindent
This parser takes a regex as argument and splits up a string into a
prefix and the rest according to this regex
(\texttt{reg.findPrefixMatchOf} generates a match in the successful
case, and the corresponding strings can be extracted with
\texttt{matched} and \texttt{after}). The input and output type for
this parser is \texttt{String}. Using \texttt{RegexParser} we can
define a \texttt{NumParser} that works on strings as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val NumParser = RegexParser("[0-9]+".r)
\end{lstlisting}
\end{center}

\noindent
This parser will recognise a number at the beginning of a string. For
example

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
NumParser.parse("123abc")
\end{lstlisting}
\end{center}

\noindent
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
still a string (the required double-quotes are not printed by
Scala). We want to convert this string into the corresponding
\texttt{Int}. We can do this as follows using a semantic action:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
(NumParser ==> (s => s.toInt)).parse("123abc")
\end{lstlisting}
\end{center}

\noindent
The function in the semantic action converts a string into an
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
but this time \texttt{123} is an \texttt{Int}. We will come back to
semantic actions when we implement actual context-free
grammars.

\subsubsection*{Shorthand notation for parser combinators}

Before we proceed, let us just explain the shorthand notation for
parser combinators. Like for regular expressions, the shorthand notation
will make our life much easier when writing actual parsers. We can define
some implicits which allow us to write

\begin{center}
\begin{tabular}{ll}  
  \pcode{p | q} & alternative parser\\
  \pcode{p ~ q} & sequence parser\\ 
  \pcode{p ==> f} & semantic action parser
\end{tabular}
\end{center}

\noindent
as well as to use plain strings for specifying simple string parsers.

The idea is that this shorthand notation allows us to easily translate
context-free grammars into code. For example recall our context-free
grammar for palindromes:

\begin{plstx}[margin=3cm]
: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
\end{plstx}

\noindent
Each alternative in this grammar translates into an alternative parser
combinator. The $\cdot$ can be translated to a sequence parser
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
written as \texttt{"a"}, \texttt{"b"} and \texttt{""}.
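
\noindent One possible way to obtain this shorthand is sketched
below. It is an assumption-laden illustration rather than the
definitions used in this handout: the operators are plain methods of
a simplified \texttt{Parser[T]} (input type fixed to \texttt{String}),
and only the step from strings to parsers uses an implicit
conversion. The names \texttt{StrParser} and \texttt{string2parser}
are hypothetical.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.language.implicitConversions

// Simplified stand-in for Parser[I, T]: input fixed to String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
  def |(q: => Parser[T]): Parser[T] = new AltParser(this, q)
  def ~[S](q: => Parser[S]): Parser[(T, S)] =
    new SeqParser(this, q)
  def ==>[S](f: T => S): Parser[S] = new FunParser(this, f)
}

class AltParser[T](p: => Parser[T],
                   q: => Parser[T]) extends Parser[T] {
  def parse(in: String): Set[(T, String)] =
    p.parse(in) ++ q.parse(in)
}

class SeqParser[T, S](p: => Parser[T],
                      q: => Parser[S]) extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((h1, t1) <- p.parse(in);
         (h2, t2) <- q.parse(t1)) yield ((h1, h2), t2)
}

class FunParser[T, S](p: => Parser[T],
                      f: T => S) extends Parser[S] {
  def parse(in: String): Set[(S, String)] =
    for ((h, t) <- p.parse(in)) yield (f(h), t)
}

// Hypothetical atomic parser for a fixed string.
case class StrParser(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length)))
    else Set.empty
}

// Lets plain strings be used wherever a parser is expected.
implicit def string2parser(s: String): Parser[String] =
  StrParser(s)

val abc = ("a" | "b") ~ "c"
println(abc.parse("bcde"))   // Set(((b,c),de))

// A semantic action makes both alternatives produce strings.
val ab = "a" | ("a" ~ "b") ==> { case (x, y) => x + y }
println(ab.parse("abc"))     // Set((a,bc), (ab,c))
\end{lstlisting}
\end{center}

\noindent Scala's operator precedence makes \pcode{~} bind tighter
than \pcode{==>}, which in turn binds tighter than \pcode{|}, so the
expressions read much like grammar rules.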

\subsubsection*{How to build parsers using parser combinators?}

The beauty of parser combinators is the ease with which they can be
implemented and how easy it is to translate context-free grammars into
code (though the grammars need to be non-left-recursive). To
demonstrate this, consider again the grammar for palindromes from above.
The first idea would be to translate it into the following code:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] = 
  (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
\end{lstlisting}
\end{center}

\noindent
Unfortunately, this does not quite work yet as it produces a typing
error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
\texttt{""} all produce strings as output type and therefore can be
put into an alternative \texttt{...| "a" | "b" | ""}. But both
sequence parsers \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"}
produce pairs of the form

\begin{center}
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
\end{center}

\noindent That is how the
sequence parser combinator nests results when \pcode{\~} is used
between two components. The solution is to use a semantic action that
``flattens'' these pairs and appends the corresponding strings, like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val Pal : Parser[String, String] = 
  (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
    "a" | "b" | "")
\end{lstlisting}
\end{center}

\noindent
How does this work? Well, recall again what the pairs look like for
the parser \pcode{"a" ~ Pal ~ "a"}.  The pattern in the semantic
action matches the nested pairs (the \texttt{x} with the
\texttt{a}-part and so on).  Unfortunately when we have such nested
pairs, Scala requires us to define the function using the
\pcode{case}-syntax

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
{ case ((x, y), z) => ... }
\end{lstlisting}
\end{center}

\noindent
If we have more sequence parser combinators or have them differently nested,
then the pattern in the semantic action needs to be adjusted accordingly.
The action we implement above concatenates all three strings. After the
semantic action is applied, the output type of the parser is therefore
\texttt{String}, which fits with the alternative parsers
\texttt{...| "a" | "b" | ""}.

If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
(an empty set would mean something is wrong). But also notice which
intermediate results are generated by \pcode{Pal.parse("abaaaba")}:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
\end{lstlisting}
\end{center}

\noindent
That there is more than one output might be slightly unexpected, but
can be explained as follows: the pairs represent all possible
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
corresponds to a complete parse (all input is consumed) and this is
what \pcode{Pal.parse_all} returns. The second pair is a small
``sub-palindrome'' that can also be parsed, but the parse fails with
the rest \pcode{aaba}, which is therefore left as unprocessed. The
third one is an attempt to parse the whole string with the
single-character parser \pcode{a}. That of course only partially
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
part. Finally, since we allow the empty string to be a palindrome we
also obtain the last pair, where actually nothing is consumed from the
input string. While all this works as intended, we need to be careful
with this (especially with including the \pcode{""} parser in our
grammar): if during parsing the set of parsing attempts gets too big,
then the parsing process can become very slow as the potential
candidates for applying rules can snowball.

It is also important to note that we must define the
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
just written

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
val Pal : Parser[String, String] = ...rhs...
\end{lstlisting}
\end{center}

\noindent
then Scala, before making this assignment to \texttt{Pal}, attempts to
find out what the expression on the right-hand side evaluates to. This
is straightforward in case of simple expressions such as \texttt{2 + 3}, but
the expression above contains \texttt{Pal} on the right-hand
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
evaluates to and start a new recursion, which means it falls into an
infinite loop. The definition of \texttt{Pal} is recursive and the
\pcode{lazy} keyword prevents it from being fully evaluated. Therefore
whenever we want to define a recursive parser we have to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val SomeParser : Parser[...,...] = ...rhs...
\end{lstlisting}
\end{center}

\noindent That was not necessary for our atomic parsers, like
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
Note that this is also the reason why we had to write

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {...}

class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
\end{lstlisting}
\end{center}

\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
the argument types for \texttt{p} and \texttt{q} prevents Scala from
evaluating the arguments. Normally, Scala would first evaluate what
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
the alternative parser combinator or the sequence parser
combinator, respectively. Since the arguments can be recursive parsers, such as
\texttt{Pal}, this would lead again to an infinite loop.
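
\noindent The palindrome parser, including the role of \pcode{lazy}
and of the by-name arguments, can be tried out with the following
self-contained sketch. It rests on assumptions made only for this
illustration: the input type is fixed to \texttt{String} (so
\texttt{Parser[T]} is a simplified stand-in for
\texttt{Parser[I, T]}), the shorthand operators are ordinary methods,
and \texttt{StrParser} and \texttt{string2parser} are hypothetical
names.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
import scala.language.implicitConversions

// Simplified stand-in for Parser[I, T]: input fixed to String.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
  def parse_all(in: String): Set[T] =
    for ((head, tail) <- parse(in) if tail.isEmpty) yield head
  // by-name arguments (=> ...) delay evaluation of q
  def |(q: => Parser[T]): Parser[T] = new AltParser(this, q)
  def ~[S](q: => Parser[S]): Parser[(T, S)] =
    new SeqParser(this, q)
  def ==>[S](f: T => S): Parser[S] = new FunParser(this, f)
}

class AltParser[T](p: => Parser[T],
                   q: => Parser[T]) extends Parser[T] {
  def parse(in: String): Set[(T, String)] =
    p.parse(in) ++ q.parse(in)
}

class SeqParser[T, S](p: => Parser[T],
                      q: => Parser[S]) extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((h1, t1) <- p.parse(in);
         (h2, t2) <- q.parse(t1)) yield ((h1, h2), t2)
}

class FunParser[T, S](p: => Parser[T],
                      f: T => S) extends Parser[S] {
  def parse(in: String): Set[(S, String)] =
    for ((h, t) <- p.parse(in)) yield (f(h), t)
}

// Hypothetical atomic parser for a fixed string.
case class StrParser(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length)))
    else Set.empty
}

implicit def string2parser(s: String): Parser[String] =
  StrParser(s)

// lazy: Pal refers to itself on the right-hand side
lazy val Pal: Parser[String] =
  (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
   ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
    "a" | "b" | "")

println(Pal.parse_all("abaaaba"))    // Set(abaaaba)
println(Pal.parse("abaaaba").size)   // 4 (partial) parses
println(Pal.parse_all("abab"))       // Set()
\end{lstlisting}
\end{center}

\noindent In this sketch every recursive occurrence of \texttt{Pal}
sits in a by-name argument position, so building the parser
terminates; the recursion only unfolds, step by step, while an input
string is being parsed.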

As a final example in this section, let us consider the grammar for
well-nested parentheses:

\begin{plstx}[margin=3cm]
: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
\end{plstx}

\noindent
Let us assume we want to not just recognise strings of
well-nested parentheses but also transform round parentheses
into curly braces. We can do this by using a semantic
action:

\begin{center}
  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
    xleftmargin=0mm, numbers=none]
lazy val P : Parser[String, String] = 
  "(" ~ P ~ ")" ~ P ==> { case (((_,x),_),y) => "{" + x + "}" + y } | ""
\end{lstlisting}
\end{center}

\noindent
Here we define a function which ignores the parentheses in the
pairs, but replaces them in the right places with curly braces when
assembling the new string on the right-hand side. If we run
\pcode{P.parse_all("(((()()))())")} we obtain
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.

\subsubsection*{Implementing an Interpreter}

The first step before implementing an interpreter for a full-blown
language is to implement a simple calculator for arithmetic
expressions. Suppose our arithmetic expressions are given by the
grammar:

\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot )
   | Number \\
\end{plstx}

\noindent
Naturally we want to implement the grammar in such a way that we can
calculate what the result of, for example, \texttt{4*2+3} is; we are
interested in an \texttt{Int} rather than a string. This means every
component parser needs to have as output type \texttt{Int} and when we
assemble the intermediate results, strings like \texttt{"+"},
\texttt{"*"} and so on, need to be translated into the appropriate
Scala operation of adding, multiplying and so on. Being inspired by
the parser for well-nested parentheses above and ignoring the fact
that we want $*$ to take precedence over $+$ and $-$, we might want to
write something like

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] = 
  (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} |
   E ~ "-" ~ E ==> { case ((x, y), z) => x - z} |
   E ~ "*" ~ E ==> { case ((x, y), z) => x * z} |
   "(" ~ E ~ ")" ==> { case ((x, y), z) => y} |
   NumParserInt)
\end{lstlisting}
\end{center}

\noindent
Consider again carefully how the semantic actions pick out the correct
arguments for the calculation. In case of plus, we need \texttt{x} and
\texttt{z}, because they correspond to the results of the component
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
an \texttt{Int} because the output type of \texttt{E} is
\texttt{Int}.  Similarly with subtraction and multiplication. In
contrast, in the fourth clause we need to return \texttt{y}, because it
is the result enclosed inside the parentheses. The information about
parentheses, roughly speaking, we just throw away.

So far so good. The problem arises when we try to call \pcode{parse_all} with the
expression \texttt{"1+2+3"}. Let's try it:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
E.parse_all("1+2+3")
\end{lstlisting}
\end{center}

\noindent
\ldots and we wait and wait and \ldots still wait. What is the
problem? Actually, the parser just fell into an infinite loop! The
reason is that the above grammar is left-recursive, and recall that our
parser combinators cannot deal with such left-recursive
grammars. Fortunately, every left-recursive context-free grammar can be
transformed into a non-left-recursive grammar that still recognises
the same strings. This allows us to design the following grammar:

\begin{plstx}[margin=3cm]
  : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
  : \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
  : \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
\end{plstx}

\noindent
Recall what left-recursive means from Handout 5 and make sure you see
why this grammar is \emph{non} left-recursive. This version of the grammar
also deals with the fact that $*$ should have a higher precedence than
$+$ and $-$. This does not affect which strings the grammar can
recognise, but it does affect the order in which an arithmetic
expression is evaluated. We can translate this grammar into
parser combinators as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
lazy val E: Parser[String, Int] = 
  (T ~ "+" ~ E) ==> { case ((x, y), z) => x + z } |
  (T ~ "-" ~ E) ==> { case ((x, y), z) => x - z } | T 
lazy val T: Parser[String, Int] = 
  (F ~ "*" ~ T) ==> { case ((x, y), z) => x * z } | F
lazy val F: Parser[String, Int] = 
  ("(" ~ E ~ ")") ==> { case ((x, y), z) => y } | NumParserInt
\end{lstlisting}
\end{center}
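
\noindent
The three definitions above depend on the \texttt{Parser} class and the
combinators introduced earlier in this handout. For experimenting in
isolation, the following is a minimal self-contained sketch of the same
grammar. It is not the handout's implementation: the input type is fixed
to \texttt{String} (so \texttt{Parser} has only one type parameter),
string tokens are wrapped explicitly in a hypothetical \texttt{Tok}
parser instead of relying on an implicit conversion, and the names
\texttt{SeqParser}, \texttt{AltParser}, \texttt{MapParser}, \texttt{Num}
and \texttt{Calc} are assumptions made for this illustration.

\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Sketch only: all names below are illustrative assumptions,
// not the handout's actual definitions.
abstract class Parser[A] {
  // all (result, unparsed rest) pairs for the given input
  def parse(in: String): Set[(A, String)]

  // keep only the results that consumed the whole input
  def parse_all(in: String): Set[A] =
    parse(in).collect { case (res, rest) if rest.isEmpty => res }

  // by-name arguments delay evaluation, which permits recursive grammars
  def ~[B](q: => Parser[B]): Parser[(A, B)] = new SeqParser[A, B](this, q)
  def |(q: => Parser[A]): Parser[A] = new AltParser[A](this, q)
  def ==>[B](f: A => B): Parser[B] = new MapParser[A, B](this, f)
}

// run p, then run q on whatever p left over
class SeqParser[A, B](p: => Parser[A], q: => Parser[B])
    extends Parser[(A, B)] {
  def parse(in: String): Set[((A, B), String)] =
    for ((r1, rest1) <- p.parse(in); (r2, rest2) <- q.parse(rest1))
      yield ((r1, r2), rest2)
}

// collect the results of both alternatives
class AltParser[A](p: => Parser[A], q: => Parser[A]) extends Parser[A] {
  def parse(in: String): Set[(A, String)] = p.parse(in) ++ q.parse(in)
}

// apply the semantic action f to every result of p
class MapParser[A, B](p: => Parser[A], f: A => B) extends Parser[B] {
  def parse(in: String): Set[(B, String)] =
    for ((r, rest) <- p.parse(in)) yield (f(r), rest)
}

// recognises a fixed string at the start of the input
case class Tok(s: String) extends Parser[String] {
  def parse(in: String): Set[(String, String)] =
    if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()
}

// recognises a non-empty run of digits and converts it to an Int
object Num extends Parser[Int] {
  def parse(in: String): Set[(Int, String)] = {
    val digits = in.takeWhile(_.isDigit)
    if (digits.isEmpty) Set()
    else Set((digits.toInt, in.drop(digits.length)))
  }
}

object Calc {
  lazy val E: Parser[Int] =
    (T ~ Tok("+") ~ E) ==> { case ((x, _), z) => x + z } |
    (T ~ Tok("-") ~ E) ==> { case ((x, _), z) => x - z } | T
  lazy val T: Parser[Int] =
    (F ~ Tok("*") ~ T) ==> { case ((x, _), z) => x * z } | F
  lazy val F: Parser[Int] =
    (Tok("(") ~ E ~ Tok(")")) ==> { case ((_, y), _) => y } | Num
}
\end{lstlisting}

\noindent
With this sketch, calling \pcode{parse_all} on \texttt{Calc.E} with the
input \texttt{"4*2+3"} gives \texttt{Set(11)}. Every recursive call to
\texttt{E} or \texttt{T} happens only after at least one character has
been consumed, which is why the parser terminates.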

\noindent
Let us try out some examples:

\begin{center}
\begin{tabular}{rcl}
  input strings & & output of \pcode{parse_all}\medskip\\
  \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
  \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
  \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
  \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
  \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\
\end{tabular}
\end{center}
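
\noindent
To see why \texttt{4*2+3} evaluates to 11 rather than 20, it can help
to unfold the grammar by hand. The following leftmost derivation is only
a sketch of how the rules apply; plain letters stand for the
non-terminals of the grammar:

\[
\begin{array}{lcl}
E & \Rightarrow & T + E\\
  & \Rightarrow & F * T + E\\
  & \Rightarrow & 4 * T + E\\
  & \Rightarrow & 4 * F + E\\
  & \Rightarrow & 4 * 2 + E\\
  & \Rightarrow & 4 * 2 + T\\
  & \Rightarrow & 4 * 2 + F\\
  & \Rightarrow & 4 * 2 + 3\\
\end{array}
\]

\noindent
The product $4 * 2$ is produced entirely by the first $T$, so the
semantic action for $*$ yields 8 before the action for $+$ adds 3.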

\noindent
Note that we call \pcode{parse_all}, not \pcode{parse}. The examples
should be quite self-explanatory. The last two examples do not produce
any integer result because our parser does not define what to do in
the case of division (this could easily be added), and it also has no
idea what to do with whitespace. Dealing with whitespace is the task of
the lexer! Yes, we could deal with it inside the grammar, but that
would render many grammars unintelligible, including this
one.\footnote{If you think an easy solution is to extend the notion of
  what a number should be, then think again: you would still have to
  deal with cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think
  of the mess you would have in a grammar for a full-blown language
  where there are numerous such cases.}

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: 