author | Christian Urban <urbanc@in.tum.de> |
Thu, 25 Oct 2018 18:36:04 +0100 | |
changeset 590 | c6a1e19e9801 |
parent 589 | 0451b8b67f62 |
child 591 | 863e502f6a5c |
permissions | -rw-r--r-- |
584 | 1 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
2 |
\documentclass{article} |
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
3 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
183
diff
changeset
|
4 |
\usepackage{../langs} |
588 | 5 |
\usepackage{../grammar} |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
6 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
7 |
\begin{document} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
8 |
|
292
7ed2a25dd115
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
9 |
\section*{Handout 6 (Parser Combinators)} |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
10 |
|
584 | 11 |
This handout explains how \emph{parser combinators} work and how they |
587 | 12 |
can be implemented in Scala. Their most distinguishing feature is that |
13 |
they are very easy to implement (admittedly it is only easy in a |
|
14 |
functional programming language). Another good point of parser |
|
15 |
combinators is that they can deal with any kind of input as long as |
|
16 |
this input is of ``sequence-kind'', for example a string or a list of |
|
17 |
tokens. The only two properties of the input we need is to be able to |
|
18 |
test when it is empty and ``sequentially'' take it apart. Strings and |
|
19 |
lists fit this bill. However, parser combinators also have their |
|
20 |
drawbacks. For example they require that the grammar to be parsed is |
|
21 |
\emph{not} left-recursive and they are efficient only when the grammar |
|
22 |
is unambiguous. It is the responsibility of the grammar designer to |
|
23 |
ensure these two properties. |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
24 |
|
587 | 25 |
The general idea behind parser combinators is to transform the input |
26 |
into sets of pairs, like so |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
27 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
28 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
29 |
$\underbrace{\text{list of tokens}}_{\text{input}}$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
30 |
$\Rightarrow$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
31 |
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
32 |
\end{center} |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
33 |
|
587 | 34 |
\noindent |
590 | 35 |
Given the extended effort we have spent implementing a lexer in order |
36 |
to generate list of tokens, it might be surprising that in what |
|
37 |
follows we shall often use strings as input, rather than list of |
|
38 |
tokens. This is for making the explanation more lucid. It does not |
|
39 |
make our previous work on lexers obsolete (remember they transform a |
|
40 |
string into a list of tokens). Lexers will still be needed for |
|
41 |
building a somewhat realistic compiler. |
|
584 | 42 |
|
590 | 43 |
As mentioned above, parser combinators are relatively agnostic about what |
587 | 44 |
kind of input they process. In my Scala code I use the following |
45 |
polymorphic types for parser combinators: |
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
46 |
|
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
47 |
\begin{center} |
584 | 48 |
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
49 |
\end{center} |
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
50 |
|
587 | 51 |
\noindent That is they take as input something of type \texttt{I} and |
590 | 52 |
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
53 |
needs to be of ``sequence-kind'', I actually have to often write |
|
54 |
\texttt{I <\% Seq[\_]} for the input type. This ensures the input is a |
|
584 | 55 |
subtype of Scala sequences. The first component of the generated pairs |
56 |
corresponds to what the parser combinator was able to process from the |
|
590 | 57 |
input and the second is the unprocessed part, or leftover, of the |
58 |
input (therefore the type of this unprocessed part is the same as the |
|
59 |
input). A parser combinator might return more than one such pair; the |
|
60 |
idea is that there are potentially several ways of how to parse the |
|
61 |
input. As a concrete example, consider the string |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
62 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
63 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
64 |
\tt\Grid{iffoo\VS testbar} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
65 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
66 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
67 |
\noindent We might have a parser combinator which tries to |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
68 |
interpret this string as a keyword (\texttt{if}) or as an |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
69 |
identifier (\texttt{iffoo}). Then the output will be the set |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
70 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
71 |
\begin{center} |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
72 |
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
73 |
\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
74 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
75 |
|
587 | 76 |
\noindent where the first pair means the parser could recognise |
590 | 77 |
\texttt{if} from the input and leaves the \texttt{foo\VS testbar} as |
78 |
`unprocessed' part; in the other case it could recognise |
|
587 | 79 |
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
80 |
parser cannot recognise anything from the input at all, then parser |
|
81 |
combinators just return the empty set $\{\}$. This will indicate |
|
82 |
something ``went wrong''\ldots or more precisely, nothing could be |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
83 |
parsed. |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
84 |
|
584 | 85 |
Also important to note is that the type \texttt{T} for the processed |
590 | 86 |
part is different from the input type \texttt{I} in the parse. In the |
87 |
example above is just happens to be the same. The reason for the |
|
88 |
potential difference is that in general we are interested in |
|
89 |
transforming our input into something ``different''\ldots for example |
|
90 |
into a tree; or if we implement the grammar for arithmetic |
|
91 |
expressions, we might be interested in the actual integer number the |
|
92 |
arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way |
|
93 |
we can use parser combinators to implement relatively easily a |
|
94 |
calculator, for instance. |
|
584 | 95 |
|
96 |
The main idea of parser combinators is that we can easily build parser |
|
97 |
combinators out of smaller components following very closely the |
|
98 |
structure of a grammar. In order to implement this in an |
|
99 |
object-oriented programming language, like Scala, we need to specify |
|
100 |
an abstract class for parser combinators. This abstract class states |
|
101 |
that the function \texttt{parse} takes an argument of type \texttt{I} |
|
102 |
and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
103 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
104 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
105 |
\begin{lstlisting}[language=Scala] |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
106 |
abstract class Parser[I, T] { |
590 | 107 |
def parse(in: I) : Set[(T, I)] |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
108 |
|
590 | 109 |
def parse_all(in: I) : Set[T] = |
110 |
for ((head, tail) <- parse(in); if (tail.isEmpty)) |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
111 |
yield head |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
112 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
113 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
114 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
115 |
|
584 | 116 |
\noindent It is the obligation in each instance (parser combinator) to |
117 |
supply an implementation for \texttt{parse}. From this function we |
|
118 |
can then ``centrally'' derive the function \texttt{parse\_all}, which |
|
119 |
just filters out all pairs whose second component is not empty (that |
|
120 |
is has still some unprocessed part). The reason is that at the end of |
|
121 |
the parsing we are only interested in the results where all the input |
|
122 |
has been consumed and no unprocessed part is left over. |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
123 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
124 |
One of the simplest parser combinators recognises just a |
584 | 125 |
single character, say $c$, from the beginning of strings. Its |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
126 |
behaviour can be described as follows: |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
127 |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
128 |
\begin{itemize} |
584 | 129 |
\item If the head of the input string starts with a $c$, then return |
130 |
the set |
|
131 |
\[\{(c, \textit{tail of}\; s)\}\] |
|
132 |
where \textit{tail of} |
|
133 |
$s$ is the unprocessed part of the input string. |
|
134 |
\item Otherwise return the empty set $\{\}$. |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
135 |
\end{itemize} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
136 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
137 |
\noindent |
590 | 138 |
The input type of this simple parser combinator is \texttt{String} and |
139 |
the output type is \texttt{Char}. This means \texttt{parse} returns |
|
140 |
\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows: |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
141 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
142 |
\begin{center} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
143 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
144 |
case class CharParser(c: Char) extends Parser[String, Char] { |
587 | 145 |
def parse(in: String) = |
146 |
if (in.head == c) Set((c, in.tail)) else Set() |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
147 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
148 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
149 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
150 |
|
589 | 151 |
\noindent You can see \texttt{parse} tests whether the |
587 | 152 |
first character of the input string \texttt{in} is equal to |
584 | 153 |
\texttt{c}. If yes, then it splits the string into the recognised part |
587 | 154 |
\texttt{c} and the unprocessed part \texttt{in.tail}. In case |
155 |
\texttt{in} does not start with \texttt{c} then the parser returns the |
|
584 | 156 |
empty set (in Scala \texttt{Set()}). Since this parser recognises |
157 |
characters and just returns characters as the processed part, the |
|
158 |
output type of the parser is \texttt{Char}. |
|
159 |
||
160 |
If we want to parse a list of tokens and interested in recognising a |
|
590 | 161 |
number token, for example, we could write something like this |
584 | 162 |
|
163 |
\begin{center} |
|
164 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none] |
|
165 |
case object NumParser extends Parser[List[Token], Int] { |
|
166 |
def parse(ts: List[Token]) = ts match { |
|
167 |
case Num_token(s)::ts => Set((s.toInt, ts)) |
|
168 |
case _ => Set () |
|
169 |
} |
|
170 |
} |
|
171 |
\end{lstlisting} |
|
172 |
\end{center} |
|
173 |
||
174 |
\noindent |
|
175 |
In this parser the input is of type \texttt{List[Token]}. The function |
|
176 |
parse looks at the input \texttt{ts} and checks whether the first |
|
589 | 177 |
token is a \texttt{Num\_token} (let us assume our lexer generated |
178 |
these tokens for numbers). But this parser does not just return this |
|
584 | 179 |
token (and the rest of the list), like the \texttt{CharParser} above, |
590 | 180 |
rather it extracts also the string \texttt{s} from the token and |
181 |
converts it into an integer. The hope is that the lexer did its work |
|
182 |
well and this conversion always succeeds. The consequence of this is |
|
183 |
that the output type for this parser is \texttt{Int}, not |
|
184 |
\texttt{Token}. Such a conversion would be needed if we want to |
|
185 |
implement a simple calculator program, because string-numbers need to |
|
186 |
be transformed into \texttt{Int}-numbers in order to do the |
|
187 |
calculations. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
188 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
189 |
|
584 | 190 |
These simple parsers that just look at the input and do a simple |
191 |
transformation are often called \emph{atomic} parser combinators. |
|
192 |
More interesting are the parser combinators that build larger parsers |
|
587 | 193 |
out of smaller component parsers. There are three such parser |
194 |
combinators that can be implemented generically. The \emph{alternative |
|
584 | 195 |
parser combinator} is as follows: given two parsers, say, $p$ and |
196 |
$q$, we apply both parsers to the input (remember parsers are |
|
587 | 197 |
functions) and combine the output (remember they are sets of pairs): |
198 |
||
199 |
\begin{center} |
|
200 |
$p(\text{input}) \cup q(\text{input})$ |
|
201 |
\end{center} |
|
202 |
||
203 |
\noindent In Scala we can implement alternative parser |
|
204 |
combinator as follows |
|
205 |
||
206 |
\begin{center} |
|
207 |
\begin{lstlisting}[language=Scala, numbers=none] |
|
208 |
class AltParser[I, T] |
|
209 |
(p: => Parser[I, T], |
|
210 |
q: => Parser[I, T]) extends Parser[I, T] { |
|
211 |
def parse(in: I) = p.parse(in) ++ q.parse(in) |
|
212 |
} |
|
213 |
\end{lstlisting} |
|
214 |
\end{center} |
|
215 |
||
216 |
\noindent The types of this parser combinator are again generic (we |
|
217 |
have \texttt{I} for the input type, and \texttt{T} for the output |
|
218 |
type). The alternative parser builds a new parser out of two existing |
|
590 | 219 |
parsers \texttt{p} and \texttt{q} which are given as arguments. Both |
220 |
parsers need to be able to process input of type \texttt{I} and return |
|
221 |
in \texttt{parse} the same output type \texttt{Set[(T, |
|
587 | 222 |
I)]}.\footnote{There is an interesting detail of Scala, namely the |
223 |
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They |
|
224 |
will prevent the evaluation of the arguments before they are |
|
225 |
used. This is often called \emph{lazy evaluation} of the |
|
590 | 226 |
arguments. We will explain this later.} The alternative parser runs |
227 |
the input with the first parser \texttt{p} (producing a set of pairs) |
|
228 |
and then runs the same input with \texttt{q} (producing another set of |
|
229 |
pairs). The result should be then just the union of both sets, which |
|
230 |
is the operation \texttt{++} in Scala. |
|
587 | 231 |
|
232 |
The alternative parser combinator allows us to construct a parser that |
|
233 |
parses either a character \texttt{a} or \texttt{b} using the |
|
234 |
\texttt{CharParser} shown above. For this we can write |
|
235 |
||
236 |
\begin{center} |
|
237 |
\begin{lstlisting}[language=Scala, numbers=none] |
|
238 |
new AltParser(CharParser('a'), CharParser('b')) |
|
239 |
\end{lstlisting} |
|
240 |
\end{center} |
|
241 |
||
242 |
\noindent Later on we will use Scala mechanism for introducing some |
|
589 | 243 |
more readable shorthand notation for this, like \texttt{"a" | |
587 | 244 |
"b"}. Let us look in detail at what this parser combinator produces |
590 | 245 |
with some sample strings. |
587 | 246 |
|
247 |
\begin{center} |
|
248 |
\begin{tabular}{rcl} |
|
249 |
input strings & & output\medskip\\ |
|
250 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
|
251 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
252 |
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
253 |
\end{tabular} |
|
254 |
\end{center} |
|
255 |
||
256 |
\noindent We receive in the first two cases a successful |
|
257 |
output (that is a non-empty set). In each case, either |
|
258 |
\pcode{a} or \pcode{b} is in the processed part, and |
|
259 |
\pcode{cde} in the unprocessed part. Clearly this parser cannot |
|
260 |
parse anything with \pcode{ccde}, therefore the empty |
|
261 |
set is returned. |
|
262 |
||
263 |
A bit more interesting is the \emph{sequence parser combinator}. Given |
|
264 |
two parsers, say again, $p$ and $q$, we want to apply first the input |
|
590 | 265 |
to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
587 | 266 |
parts in the pairs; and then combine the results. Mathematically we would |
590 | 267 |
write something like this for the result set of pairs: |
587 | 268 |
|
269 |
\begin{center} |
|
270 |
\begin{tabular}{lcl} |
|
271 |
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
|
272 |
$(\textit{output}_1, u_1) \in p(\text{input}) |
|
273 |
\;\wedge\;$\\ |
|
274 |
&& $(\textit{output}_2, u_2) \in q(u_1)\}$ |
|
275 |
\end{tabular} |
|
276 |
\end{center} |
|
277 |
||
278 |
\noindent Notice that the $p$ will first be run on the input, |
|
590 | 279 |
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ |
280 |
stands for the unprocessed, or leftover, parts pf $p$. We want that |
|
281 |
$q$ runs on all these unprocessed parts $u_1$. Therefore these |
|
282 |
unprocessed parts are fed into the second parser $q$. The overall |
|
283 |
result of the sequence parser combinator is pairs of the form |
|
584 | 284 |
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
590 | 285 |
unprocessed part of the sequqnce p[arser combinator is the unprocessed |
286 |
part the second parser $q$ leaves as leftover. The processed parts of |
|
287 |
of the component parsers is a pair consisting of the outputs of $p$ |
|
288 |
and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This |
|
289 |
behaviour can be implemented in Scala as follows: |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
290 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
291 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
292 |
\begin{lstlisting}[language=Scala,numbers=none] |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
293 |
class SeqParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
294 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
295 |
q: => Parser[I, S]) extends Parser[I, (T, S)] { |
587 | 296 |
def parse(in: I) = |
297 |
for ((output1, u1) <- p.parse(in); |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
298 |
(output2, u2) <- q.parse(u1)) |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
299 |
yield ((output1, output2), u2) |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
300 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
301 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
302 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
303 |
|
587 | 304 |
\noindent This parser takes again as arguments two parsers, \texttt{p} |
305 |
and \texttt{q}. It implements \texttt{parse} as follows: let first run |
|
306 |
the parser \texttt{p} on the input producing a set of pairs |
|
307 |
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
|
308 |
unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these |
|
309 |
unprocessed parts producing again a set of pairs. The output of the |
|
310 |
sequence parser combinator is then a set containing pairs where the |
|
311 |
first components are again pairs, namely what the first parser could |
|
312 |
parse together with what the second parser could parse; the second |
|
313 |
component is the unprocessed part left over after running the second |
|
314 |
parser \texttt{q}. Therefore the input type of the sequence parser |
|
315 |
combinator is as usual \texttt{I}, but the output type is |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
316 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
317 |
\begin{center} |
590 | 318 |
\texttt{(T, S)} |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
319 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
320 |
|
584 | 321 |
\noindent |
590 | 322 |
This means \texttt{parse} in the sequence parser combinator returns |
323 |
sets of type \texttt{Set[((T, S), I)]}. Notice that we have |
|
324 |
essentially two output types for the sequence parser combinator |
|
325 |
(packaged in a pair), because in general \textit{p} and \textit{q} |
|
326 |
might produce different things (for example first we recognise a |
|
327 |
number and then a string corresponding to an operator). If any of the |
|
328 |
runs of \textit{p} and \textit{q} fail, that is produce the empty set, |
|
329 |
then \texttt{parse} will also produce the empty set. |
|
584 | 330 |
|
587 | 331 |
With the shorthand notation we shall introduce later for the sequence |
332 |
parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
|
333 |
is the parser combinator that first recognises the character |
|
334 |
\texttt{a} from a string and then \texttt{b}. Let us look again at |
|
335 |
three examples of how this parser combinator processes some strings: |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
336 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
337 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
338 |
\begin{tabular}{rcl} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
339 |
input strings & & output\medskip\\ |
584 | 340 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
341 |
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\ |
|
342 |
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
343 |
\end{tabular} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
344 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
345 |
|
586 | 346 |
\noindent In the first line we have a successful parse, because the |
587 | 347 |
string starts with \texttt{ab}, which is the prefix we are looking |
584 | 348 |
for. But since the parsing combinator is constructed as sequence of |
349 |
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the |
|
350 |
result is a nested pair of the form \texttt{((a, b), cde)}. It is |
|
586 | 351 |
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously |
587 | 352 |
expect. The parser returns the empty set in the other examples, |
584 | 353 |
because they do not fit with what the parser is supposed to parse. |
354 |
||
355 |
||
589 | 356 |
A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which |
587 | 357 |
parses as first character either an \texttt{a} or \texttt{b}, followed |
358 |
by a \texttt{c}. This parser produces the following outputs. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
359 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
360 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
361 |
\begin{tabular}{rcl} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
362 |
input strings & & output\medskip\\ |
585 | 363 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
364 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
|
365 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
366 |
\end{tabular} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
367 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
368 |
|
585 | 369 |
\noindent |
370 |
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses |
|
371 |
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces |
|
372 |
the following outputs. |
|
373 |
||
374 |
\begin{center} |
|
375 |
\begin{tabular}{rcl} |
|
376 |
input strings & & output\medskip\\ |
|
377 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
|
378 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\ |
|
379 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$ |
|
380 |
\end{tabular} |
|
381 |
\end{center} |
|
382 |
||
383 |
||
384 |
\noindent The second and third example fail, because something is |
|
590 | 385 |
``missing'' in the sequence we are looking for. The first succeeds but |
386 |
notice how the results nest with sequences: the parsed part is a |
|
387 |
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence |
|
388 |
parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also |
|
389 |
our output pairs nest differently |
|
589 | 390 |
|
391 |
\begin{center} |
|
392 |
\begin{tabular}{rcl} |
|
393 |
input strings & & output\medskip\\ |
|
394 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\ |
|
395 |
\end{tabular} |
|
396 |
\end{center} |
|
397 |
||
398 |
\noindent |
|
399 |
Two more examples: first consider the parser |
|
585 | 400 |
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}: |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
401 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
402 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
403 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
404 |
input string & & output\medskip\\ |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
405 |
\texttt{\Grid{aaaa}} & $\rightarrow$ & |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
406 |
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
407 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
408 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
409 |
|
587 | 410 |
\noindent Notice how again the results nest deeper and deeper as pairs (the |
585 | 411 |
last \pcode{a} is in the unprocessed part). To consume everything of |
412 |
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~ |
|
413 |
"a"}. Then the output is as follows: |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
414 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
415 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
416 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
417 |
input string & & output\medskip\\ |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
418 |
\texttt{\Grid{aaaa}} & $\rightarrow$ & |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
419 |
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
420 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
421 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
422 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
423 |
\noindent This is an instance where the parser consumed |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
424 |
completely the input, meaning the unprocessed part is just the |
587 | 425 |
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse}, |
585 | 426 |
we would get back the result |
427 |
||
428 |
\[ |
|
429 |
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\} |
|
430 |
\] |
|
431 |
||
432 |
\noindent where the unprocessed (empty) parts have been stripped away |
|
433 |
from the pairs; everything where the second part was not empty has |
|
587 | 434 |
been thrown away as well, because they represent |
590 | 435 |
ultimately-unsuccessful-parses. The main point is that the sequence |
436 |
parser combinator returns pairs that can nest according to the |
|
437 |
nesting of the component parsers. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
438 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
439 |
|
590 | 440 |
Consider also carefully that constructing a parser such \pcode{"a" | |
441 |
("a" ~ "b")} will result in a typing error. The intention with this |
|
442 |
parser is that we want to parse an \texttt{a}, or an \texttt{a} |
|
443 |
followed by a \texttt{b}. However, the first parser has as output type |
|
444 |
a single character (recall the type of \texttt{CharParser}), but the |
|
445 |
second parser produces a pair of characters as output. The alternative |
|
446 |
parser is required to have both component parsers to have the same |
|
447 |
type---we need to be able to build the union of two sets, which means |
|
448 |
in Scala they need to be of the same type. Since ther are not, there |
|
449 |
is a typing error in this example. We will see later how we can build |
|
587 | 450 |
this parser without the typing error. |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
451 |
|
587 | 452 |
The next parser combinator, called \emph{semantic action}, does not |
453 |
actually combine smaller parsers, but applies a function to the result |
|
454 |
of a parser. It is implemented in Scala as follows |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
455 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
456 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
457 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
458 |
class FunParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
459 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
460 |
f: T => S) extends Parser[I, S] { |
587 | 461 |
def parse(in: I) = |
462 |
for ((head, tail) <- p.parse(in)) yield (f(head), tail) |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
463 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
464 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
465 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
466 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
467 |
|
590 | 468 |
\noindent This parser combinator takes a parser \texttt{p} (with input |
469 |
type \texttt{I} and output type \texttt{T}) as one argument but also a |
|
470 |
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p} |
|
471 |
produces sets of type \texttt{Set[(T, I)]}. The semantic action |
|
472 |
combinator then applies the function \texttt{f} to all the `processed' |
|
473 |
parser outputs. Since this function is of type \texttt{T => S}, we |
|
474 |
obtain a parser with output type \texttt{S}. Again Scala lets us |
|
475 |
introduce some shorthand notation for this parser |
|
476 |
combinator. Therefore we will write \texttt{p ==> f} for it. |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
477 |
|
589 | 478 |
What are semantic actions good for? Well, they allow you to transform |
590 | 479 |
the parsed input into datastructures you can use for further |
587 | 480 |
processing. A simple example would be to transform parsed characters |
481 |
into ASCII numbers. Suppose we define a function \texttt{f} (from |
|
589 | 482 |
characters to ints) and use a \texttt{CharParser} for parsing |
483 |
the character \texttt{c}. |
|
587 | 484 |
|
485 |
\begin{center} |
|
486 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
487 |
val f = (c: Char) => c.toInt |
|
488 |
val c = new CharParser('c') |
|
489 |
\end{lstlisting} |
|
490 |
\end{center} |
|
491 |
||
492 |
\noindent |
|
589 | 493 |
We then can run the following two parsers on the input \texttt{cbd}: |
587 | 494 |
|
495 |
\begin{center} |
|
496 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
497 |
c.parse("cbd") |
|
498 |
(c ==> f).parse("cbd") |
|
499 |
\end{lstlisting} |
|
500 |
\end{center} |
|
501 |
||
502 |
\noindent |
|
589 | 503 |
In the first line we obtain the expected result \texttt{Set(('c', |
504 |
"bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the |
|
505 |
character has been transformed into an ASCII number. |
|
588 | 506 |
|
507 |
A slightly less contrived example is about parsing numbers (recall |
|
508 |
\texttt{NumParser} above). However, we want to do this here for strings. |
|
509 |
For this assume we have the following \texttt{RegexParser}. |
|
510 |
||
511 |
\begin{center} |
|
512 |
\begin{lstlisting}[language=Scala,xleftmargin=0mm, |
|
513 |
basicstyle=\small\ttfamily, numbers=none] |
|
514 |
import scala.util.matching.Regex |
|
515 |
||
516 |
case class RegexParser(reg: Regex) extends Parser[String, String] { |
|
517 |
def parse(in: String) = reg.findPrefixMatchOf(in) match { |
|
518 |
case None => Set() |
|
519 |
case Some(m) => Set((m.matched, m.after.toString)) |
|
520 |
} |
|
521 |
} |
|
522 |
\end{lstlisting} |
|
523 |
\end{center} |
|
524 |
||
525 |
\noindent |
|
526 |
This parser takes a regex as argument and splits up a string into a |
|
527 |
prefix and the rest according to this regex |
|
528 |
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful |
|
529 |
case---and the corresponding strings can be extracted with |
|
589 | 530 |
\texttt{matched} and \texttt{after}). Using this parser we can define a |
588 | 531 |
\texttt{NumParser} for strings as follows: |
532 |
||
533 |
\begin{center} |
|
534 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
535 |
val NumParser = RegexParser("[0-9]+".r) |
|
536 |
\end{lstlisting} |
|
537 |
\end{center} |
|
538 |
||
539 |
\noindent |
|
540 |
This parser will recognise a number at the beginning of a string, for |
|
541 |
example |
|
542 |
||
543 |
\begin{center} |
|
544 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
545 |
NumParser.parse("123abc") |
|
546 |
\end{lstlisting} |
|
547 |
\end{center} |
|
548 |
||
549 |
\noindent |
|
550 |
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is |
|
590 | 551 |
still a string (the required double-quotes are not printed by |
552 |
Scala). We want to convert this string into the corresponding |
|
553 |
\texttt{Int}. We can do this as follows using a semantic action |
|
588 | 554 |
|
555 |
\begin{center} |
|
556 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
557 |
(NumParser ==> (s => s.toInt)).parse("123abc") |
|
558 |
\end{lstlisting} |
|
559 |
\end{center} |
|
560 |
||
561 |
\noindent |
|
589 | 562 |
The function in the semantic action converts a string into an |
588 | 563 |
\texttt{Int}. Let us come back to semantic actions when we are going |
589 | 564 |
to implement actual context-free gammars. |
587 | 565 |
|
566 |
\subsubsection*{Shorthand notation for parser combinators} |
|
567 |
||
568 |
Before we proceed, let us just explain the shorthand notation for |
|
569 |
parser combinators. Like for regular expressions, the shorthand notation |
|
590 | 570 |
will make our life much easier when writing actual parsers. We can define |
571 |
some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and |
|
572 |
\pcode{p ==> f} as well as to use plain strings for specifying simple |
|
573 |
string parsers. |
|
574 |
||
575 |
The idea is that this shorthand notation allows us to easily translate |
|
576 |
context-free grammars into code. For example recall our context-free |
|
577 |
grammar for palindromes: |
|
578 |
||
579 |
\begin{plstx}[margin=3cm] |
|
580 |
: \meta{P} ::= a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\ |
|
581 |
\end{plstx} |
|
582 |
||
583 |
\noindent |
|
584 |
Each alternative in this grammar translates into an alternative parser |
|
585 |
combinator. The $\cdot$ can be translated to a sequence parser |
|
586 |
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply |
|
587 |
written as \texttt{"a"}, \texttt{"b"} and \texttt{""}. |
|
588 |
||
587 | 589 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
590 |
\subsubsection*{How to build parsers using parser combinators?} |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
591 |
|
588 | 592 |
The beauty of parser combinators is the ease with which they can be |
593 |
implemented and how easy it is to translate context-free grammars into |
|
594 |
code (though the grammars need to be non-left-recursive). To |
|
590 | 595 |
demonstrate this recall the grammar for palindromes from above. |
596 |
The first idea would be to translate it into the following code |
|
588 | 597 |
|
598 |
\begin{center} |
|
599 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
600 |
lazy val Pal : Parser[String, String] = |
|
601 |
(("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "") |
|
602 |
\end{lstlisting} |
|
603 |
\end{center} |
|
604 |
||
605 |
\noindent |
|
590 | 606 |
Unfortunately, this does not quite work yet as it produces a typing |
607 |
error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and |
|
608 |
\texttt{""} all produce strings as output type and therefore can be |
|
609 |
put into an alternative \texttt{...| "a" | "b" | ""}. But both |
|
610 |
\pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of |
|
611 |
the form $(((\_, \_), \_), \_)$---that is how the sequence parser |
|
612 |
combinator nests results when \pcode{\~} is used between two |
|
613 |
components. The solution is to use a semantic action that ``flattens'' |
|
614 |
these pairs and appends the corresponding strings, like |
|
588 | 615 |
|
616 |
\begin{center} |
|
617 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
618 |
lazy val Pal : Parser[String, String] = |
|
619 |
(("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } | |
|
620 |
("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } | |
|
621 |
"a" | "b" | "") |
|
622 |
\end{lstlisting} |
|
623 |
\end{center} |
|
624 |
||
589 | 625 |
\noindent |
590 | 626 |
Now in all cases we have strings as output type for the parser variants. |
589 | 627 |
The semantic action |
628 |
||
629 |
||
630 |
\noindent |
|
631 |
Important to note is that we must define \texttt{Pal}-parser as a |
|
632 |
\emph{lazy} value. |
|
588 | 633 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
634 |
\subsubsection*{Implementing an Interpreter} |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
635 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
636 |
%\bigskip |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
637 |
%takes advantage of the full generality---have a look |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
638 |
%what it produces if we call it with the string \texttt{abc} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
639 |
% |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
640 |
%\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
641 |
%\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
642 |
%input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
643 |
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
644 |
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
645 |
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
646 |
%\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
647 |
%\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
648 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
649 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
650 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
651 |
\end{document} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
652 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
653 |
%%% Local Variables: |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
654 |
%%% mode: latex |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
655 |
%%% TeX-master: t |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
656 |
%%% End: |