7 \begin{document} |
7 \begin{document} |
8 |
8 |
9 \section*{Handout 6 (Parser Combinators)} |
9 \section*{Handout 6 (Parser Combinators)} |
10 |
10 |
11 This handout explains how \emph{parser combinators} work and how they |
11 This handout explains how \emph{parser combinators} work and how they |
12 can be implemented in Scala. Their distinguishing feature is that they |
12 can be implemented in Scala. Their most distinguishing feature is that |
13 are very easy to implement (admittedly it is only easy in a functional |
13 they are very easy to implement (admittedly it is only easy in a |
14 programming language). However, parser combinators require that the |
14 functional programming language). Another good point of parser |
15 grammar to be parsed is \emph{not} left-recursive and they are |
15 combinators is that they can deal with any kind of input as long as |
16 efficient only when the grammar is unambiguous. It is the |
16 this input is of ``sequence-kind'', for example a string or a list of |
17 responsibility of the grammar designer to ensure these two properties. |
17 tokens. The only two properties of the input we need is to be able to |
18 |
18 test when it is empty and ``sequentially'' take it apart. Strings and |
19 Another good point of parser combinators is that they can deal with any kind |
19 lists fit this bill. However, parser combinators also have their |
20 of input as long as this input of ``sequence-kind'', for example a |
20 drawbacks. For example they require that the grammar to be parsed is |
21 string or a list of tokens. The general idea behind parser combinators |
21 \emph{not} left-recursive and they are efficient only when the grammar |
22 is to transform the input into sets of pairs, like so |
22 is unambiguous. It is the responsibility of the grammar designer to |
|
23 ensure these two properties. |
|
24 |
|
25 The general idea behind parser combinators is to transform the input |
|
26 into sets of pairs, like so |
23 |
27 |
24 \begin{center} |
28 \begin{center} |
25 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
29 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
26 $\Rightarrow$ |
30 $\Rightarrow$ |
27 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
31 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
28 \end{center} |
32 \end{center} |
29 |
33 |
30 \noindent As said, the input can be anything as long as it is a |
34 \noindent |
31 ``sequence''. The only property of the input we need is to be able to |
35 Given the extended effort we have spent in order to implement a lexer |
32 test when it is empty. Obviously we can do this for strings and lists. |
36 in order to generate list of tokens, it might be surprising that in |
33 For more lucidity we shall below often use strings as input in order |
37 what follows we shall often use strings as input. This is for making |
34 to illustrate matters. However, this does not make our previous work |
38 the explanation more lucid. It does not make our previous work on |
35 on lexers obsolete (remember they transform a string into a list of |
39 lexers obsolete (remember they transform a string into a list of |
36 tokens). Lexers will still be needed to build a somewhat realistic |
40 tokens). Lexers will still be needed for building a somewhat realistic |
37 compiler. |
41 compiler. |
38 |
42 |
39 |
43 But as said, parser combinators are relatively agnostic about what |
40 In my Scala code I use the following polymorphic types for parser combinators: |
44 kind of input they process. In my Scala code I use the following |
|
45 polymorphic types for parser combinators: |
41 |
46 |
42 \begin{center} |
47 \begin{center} |
43 input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
48 input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
44 \end{center} |
49 \end{center} |
45 |
50 |
46 \noindent that is they take as input something of type \texttt{I} and |
51 \noindent That is they take as input something of type \texttt{I} and |
47 return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs |
52 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs |
48 to be of ``sequence-kind'' I actually have to often write \texttt{I |
53 to be of ``sequence-kind'', I actually have to often write \texttt{I |
49 <\% Seq[\_]} for the input type in order to indicate it is a |
54 <\% Seq[\_]} for the input type. This ensures the input is a |
50 subtype of Scala sequences. The first component of the generated pairs |
55 subtype of Scala sequences. The first component of the generated pairs |
51 corresponds to what the parser combinator was able to process from the |
56 corresponds to what the parser combinator was able to process from the |
52 input and the second is the unprocessed part of the input (therefore |
57 input and the second is the unprocessed part of the input (therefore |
53 the type of this unprocessed part is the same as the input). As we |
58 the type of this unprocessed part is the same as the input). As we |
54 shall see shortly, a parser combinator might return more than one such |
59 shall see shortly, a parser combinator might return more than one such |
55 pair; the idea being that there are potentially several ways of how to |
60 pair; the idea is that there are potentially several ways of how to |
56 parse the input. As a concrete example, consider the string |
61 parse the input. As a concrete example, consider the string |
57 |
62 |
58 \begin{center} |
63 \begin{center} |
59 \tt\Grid{iffoo\VS testbar} |
64 \tt\Grid{iffoo\VS testbar} |
60 \end{center} |
65 \end{center} |
66 \begin{center} |
71 \begin{center} |
67 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
72 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
68 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
73 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
69 \end{center} |
74 \end{center} |
70 |
75 |
71 \noindent where the first pair means the parser could |
76 \noindent where the first pair means the parser could recognise |
72 recognise \texttt{if} from the input and leaves the rest as |
77 \texttt{if} from the input and leaves the rest as `unprocessed' as the |
73 `unprocessed' as the second component of the pair; in the |
78 second component of the pair; in the other case it could recognise |
74 other case it could recognise \texttt{iffoo} and leaves |
79 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the |
75 \texttt{\VS testbar} as unprocessed. If the parser cannot |
80 parser cannot recognise anything from the input at all, then parser |
76 recognise anything from the input, then parser combinators just |
81 combinators just return the empty set $\{\}$. This will indicate |
77 return the empty set $\{\}$. This will indicate something |
82 something ``went wrong''\ldots or more precisely, nothing could be |
78 ``went wrong''\ldots or more precisely, nothing could be |
|
79 parsed. |
83 parsed. |
80 |
84 |
81 Also important to note is that the type \texttt{T} for the processed |
85 Also important to note is that the type \texttt{T} for the processed |
82 part is different from the input type. The reason is that in general |
86 part is different from the input type. In the example above is just |
83 we are interested in transform our input into something |
87 happens to be the same. The reason for the difference is that in |
84 ``different''\ldots for example into a tree, or if we implement the |
88 general we are interested in transforming our input into something |
85 grammar for arithmetic expressions we might be interested in the |
89 ``different''\ldots for example into a tree; or if we implement the |
|
90 grammar for arithmetic expressions, we might be interested in the |
86 actual integer number the arithmetic expression, say \texttt{1 + 2 * |
91 actual integer number the arithmetic expression, say \texttt{1 + 2 * |
87 3}, stands for. In this way we can use parser combinators to |
92 3}, stands for. In this way we can use parser combinators to |
88 implement relatively easily a calculator. |
93 implement relatively easily a calculator, for instance. |
89 |
94 |
90 The main idea of parser combinators is that we can easily build parser |
95 The main idea of parser combinators is that we can easily build parser |
91 combinators out of smaller components following very closely the |
96 combinators out of smaller components following very closely the |
92 structure of a grammar. In order to implement this in an |
97 structure of a grammar. In order to implement this in an |
93 object-oriented programming language, like Scala, we need to specify |
98 object-oriented programming language, like Scala, we need to specify |
128 \item Otherwise return the empty set $\{\}$. |
133 \item Otherwise return the empty set $\{\}$. |
129 \end{itemize} |
134 \end{itemize} |
130 |
135 |
131 \noindent |
136 \noindent |
132 The input type of this simple parser combinator for characters is |
137 The input type of this simple parser combinator for characters is |
133 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
138 \texttt{String} and the output type is \texttt{Char}. This means |
|
139 \texttt{parse} returns \mbox{\texttt{Set[(Char, String)]}}. |
134 The code in Scala is as follows: |
140 The code in Scala is as follows: |
135 |
141 |
136 \begin{center} |
142 \begin{center} |
137 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
143 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
138 case class CharParser(c: Char) extends Parser[String, Char] { |
144 case class CharParser(c: Char) extends Parser[String, Char] { |
139 def parse(sb: String) = |
145 def parse(in: String) = |
140 if (sb.head == c) Set((c, sb.tail)) else Set() |
146 if (in.head == c) Set((c, in.tail)) else Set() |
141 } |
147 } |
142 \end{lstlisting} |
148 \end{lstlisting} |
143 \end{center} |
149 \end{center} |
144 |
150 |
145 \noindent You can see the \texttt{parse} function tests whether the |
151 \noindent You can see the \texttt{parse} tests whether the |
146 first character of the input string \texttt{sb} is equal to |
152 first character of the input string \texttt{in} is equal to |
147 \texttt{c}. If yes, then it splits the string into the recognised part |
153 \texttt{c}. If yes, then it splits the string into the recognised part |
148 \texttt{c} and the unprocessed part \texttt{sb.tail}. In case |
154 \texttt{c} and the unprocessed part \texttt{in.tail}. In case |
149 \texttt{sb} does not start with \texttt{c} then the parser returns the |
155 \texttt{in} does not start with \texttt{c} then the parser returns the |
150 empty set (in Scala \texttt{Set()}). Since this parser recognises |
156 empty set (in Scala \texttt{Set()}). Since this parser recognises |
151 characters and just returns characters as the processed part, the |
157 characters and just returns characters as the processed part, the |
152 output type of the parser is \texttt{Char}. |
158 output type of the parser is \texttt{Char}. |
153 |
159 |
154 If we want to parse a list of tokens and interested in recognising a |
160 If we want to parse a list of tokens and interested in recognising a |
172 these tokens for numbers. But this parser does not just return this |
178 these tokens for numbers. But this parser does not just return this |
173 token (and the rest of the list), like the \texttt{CharParser} above, |
179 token (and the rest of the list), like the \texttt{CharParser} above, |
174 rather extracts the string \texttt{s} from the token and converts it |
180 rather extracts the string \texttt{s} from the token and converts it |
175 into an integer. The hope is that the lexer did its work well and this |
181 into an integer. The hope is that the lexer did its work well and this |
176 conversion always succeeds. The consequence of this is that the output |
182 conversion always succeeds. The consequence of this is that the output |
177 type for this parser is \texttt{Int}. Such a conversion would be |
183 type for this parser is \texttt{Int}, not \texttt{Token}. Such a |
178 needed if we want to implement a simple calculator program. |
184 conversion would be needed if we want to implement a simple calculator |
|
185 program. |
179 |
186 |
180 |
187 |
181 These simple parsers that just look at the input and do a simple |
188 These simple parsers that just look at the input and do a simple |
182 transformation are often called \emph{atomic} parser combinators. |
189 transformation are often called \emph{atomic} parser combinators. |
183 More interesting are the parser combinators that build larger parsers |
190 More interesting are the parser combinators that build larger parsers |
184 out of smaller component parsers. For example the \emph{alternative |
191 out of smaller component parsers. There are three such parser |
|
192 combinators that can be implemented generically. The \emph{alternative |
185 parser combinator} is as follows: given two parsers, say, $p$ and |
193 parser combinator} is as follows: given two parsers, say, $p$ and |
186 $q$, we apply both parsers to the input (remember parsers are |
194 $q$, we apply both parsers to the input (remember parsers are |
187 functions) and combine the output (remember they are sets of pairs) |
195 functions) and combine the output (remember they are sets of pairs): |
188 |
196 |
189 \begin{center} |
197 \begin{center} |
190 $p(\text{input}) \cup q(\text{input})$ |
198 $p(\text{input}) \cup q(\text{input})$ |
191 \end{center} |
199 \end{center} |
192 |
200 |
193 \noindent In Scala we would implement alternative parser |
201 \noindent In Scala we can implement alternative parser |
194 combinator as follows |
202 combinator as follows |
195 |
203 |
196 \begin{center} |
204 \begin{center} |
197 \begin{lstlisting}[language=Scala, numbers=none] |
205 \begin{lstlisting}[language=Scala, numbers=none] |
198 class AltParser[I, T] |
206 class AltParser[I, T] |
199 (p: => Parser[I, T], |
207 (p: => Parser[I, T], |
200 q: => Parser[I, T]) extends Parser[I, T] { |
208 q: => Parser[I, T]) extends Parser[I, T] { |
201 def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
209 def parse(in: I) = p.parse(in) ++ q.parse(in) |
|
210 } |
|
211 \end{lstlisting} |
|
212 \end{center} |
|
213 |
|
214 \noindent The types of this parser combinator are again generic (we |
|
215 have \texttt{I} for the input type, and \texttt{T} for the output |
|
216 type). The alternative parser builds a new parser out of two existing |
|
217 parsers \texttt{p} and \texttt{q} given as arguments. Both parsers |
|
218 need to be able to process input of type \texttt{I} and return in |
|
219 \texttt{parse} the same output type \texttt{Set[(T, |
|
220 I)]}.\footnote{There is an interesting detail of Scala, namely the |
|
221 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They |
|
222 will prevent the evaluation of the arguments before they are |
|
223 used. This is often called \emph{lazy evaluation} of the |
|
224 arguments. We will explain this later.} The alternative parser |
|
225 should run the input with the first parser \texttt{p} (producing a set |
|
226 of pairs) and then run the same input with \texttt{q} (producing |
|
227 another set of pairs). The result should be then just the union of |
|
228 both sets, which is the operation \texttt{++} in Scala. |
|
229 |
|
230 The alternative parser combinator allows us to construct a parser that |
|
231 parses either a character \texttt{a} or \texttt{b} using the |
|
232 \texttt{CharParser} shown above. For this we can write |
|
233 |
|
234 \begin{center} |
|
235 \begin{lstlisting}[language=Scala, numbers=none] |
|
236 new AltParser(CharParser('a'), CharParser('b')) |
|
237 \end{lstlisting} |
|
238 \end{center} |
|
239 |
|
240 \noindent Later on we will use Scala mechanism for introducing some |
|
241 more readable shorthand notation for this, like \texttt{"a" || |
|
242 "b"}. Let us look in detail at what this parser combinator produces |
|
243 with some sample strings |
|
244 |
|
245 \begin{center} |
|
246 \begin{tabular}{rcl} |
|
247 input strings & & output\medskip\\ |
|
248 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
|
249 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
250 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
251 \end{tabular} |
|
252 \end{center} |
|
253 |
|
254 \noindent We receive in the first two cases a successful |
|
255 output (that is a non-empty set). In each case, either |
|
256 \pcode{a} or \pcode{b} is in the processed part, and |
|
257 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
|
258 parse anything with \pcode{ccde}, therefore the empty |
|
259 set is returned. |
|
260 |
|
261 A bit more interesting is the \emph{sequence parser combinator}. Given |
|
262 two parsers, say again, $p$ and $q$, we want to apply first the input |
|
263 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
|
264 parts in the pairs; and then combine the results. Mathematically we would |
|
265 write something like this for the expected set of pairs: |
|
266 |
|
267 \begin{center} |
|
268 \begin{tabular}{lcl} |
|
269 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & |
|
270 $(\textit{output}_1, u_1) \in p(\text{input}) |
|
271 \;\wedge\;$\\ |
|
272 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
|
273 \end{tabular} |
|
274 \end{center} |
|
275 |
|
276 \noindent Notice that the $p$ wil first be run on the input, producing |
|
277 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
|
278 for the unprocessed, or left-over, parts. We want that $q$ runs on all |
|
279 these unprocessed parts $u_1$. This again will produce some |
|
280 processed part , $p$ and |
|
281 $q$, we apply both parsers to the input (remember parsers are |
|
282 functions) and combine the output (remember they are sets of pairs) |
|
283 |
|
284 \begin{center} |
|
285 $p(\text{input}) \cup q(\text{input})$ |
|
286 \end{center} |
|
287 |
|
288 \noindent In Scala we would implement alternative parser |
|
289 combinator as follows |
|
290 |
|
291 \begin{center} |
|
292 \begin{lstlisting}[language=Scala, numbers=none] |
|
293 class AltParser[I, T] |
|
294 (p: => Parser[I, T], |
|
295 q: => Parser[I, T]) extends Parser[I, T] { |
|
296 def parse(in: I) = p.parse(in) ++ q.parse(in) |
202 } |
297 } |
203 \end{lstlisting} |
298 \end{lstlisting} |
204 \end{center} |
299 \end{center} |
205 |
300 |
206 \noindent The types of this parser combinator are again generic (we |
301 \noindent The types of this parser combinator are again generic (we |
260 \;\wedge\;$\\ |
355 \;\wedge\;$\\ |
261 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
356 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
262 \end{tabular} |
357 \end{tabular} |
263 \end{center} |
358 \end{center} |
264 |
359 |
265 \noindent Notice that the $p$ wil first be run on the input, producing |
360 \noindent Notice that the $p$ will first be run on the input, |
266 pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
361 producing pairs of the form $\textit{output}_1$ and unprocessed part |
267 for the unprocessed, or left-over, parts. We want that $q$ runs on all |
362 $u_1$. This unprocessed part is fed into the second parser $q$. The |
268 these unprocessed parts $u_1$. This again will produce some |
|
269 processed part , $p$ and |
|
270 $q$, we apply both parsers to the input (remember parsers are |
|
271 functions) and combine the output (remember they are sets of pairs) |
|
272 |
|
273 \begin{center} |
|
274 $p(\text{input}) \cup q(\text{input})$ |
|
275 \end{center} |
|
276 |
|
277 \noindent In Scala we would implement alternative parser |
|
278 combinator as follows |
|
279 |
|
280 \begin{center} |
|
281 \begin{lstlisting}[language=Scala, numbers=none] |
|
282 class AltParser[I, T] |
|
283 (p: => Parser[I, T], |
|
284 q: => Parser[I, T]) extends Parser[I, T] { |
|
285 def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
|
286 } |
|
287 \end{lstlisting} |
|
288 \end{center} |
|
289 |
|
290 \noindent The types of this parser combinator are again generic (we |
|
291 just have \texttt{I} for the input type, and \texttt{T} for the output |
|
292 type). The alternative parser builds a new parser out of two existing |
|
293 parsers \texttt{p} and \texttt{q}. Both need to be able to process |
|
294 input of type \texttt{I} and return the same output type |
|
295 \texttt{Set[(T, I)]}.\footnote{There is an interesting detail of |
|
296 Scala, namely the \texttt{=>} in front of the types of \texttt{p} |
|
297 and \texttt{q}. They will prevent the evaluation of the arguments |
|
298 before they are used. This is often called \emph{lazy evaluation} of |
|
299 the arguments. We will explain this later.} Therefore the output |
|
300 type of this parser is \texttt{T}. The alternative parser should run |
|
301 the input with the first parser \texttt{p} (producing a set of pairs) |
|
302 and then run the same input with \texttt{q} (producing another set of |
|
303 pairs). The result should be then just the union of both sets, which |
|
304 is the operation \texttt{++} in Scala. |
|
305 |
|
306 The alternative parser combinator already allows us to |
|
307 construct a parser that parses either a character \texttt{a} |
|
308 or \texttt{b}, as |
|
309 |
|
310 \begin{center} |
|
311 \begin{lstlisting}[language=Scala, numbers=none] |
|
312 new AltParser(CharParser('a'), CharParser('b')) |
|
313 \end{lstlisting} |
|
314 \end{center} |
|
315 |
|
316 \noindent Later on we will use again Scala mechanism for introducing some |
|
317 more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings |
|
318 |
|
319 \begin{center} |
|
320 \begin{tabular}{rcl} |
|
321 input strings & & output\medskip\\ |
|
322 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
|
323 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
324 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
325 \end{tabular} |
|
326 \end{center} |
|
327 |
|
328 \noindent We receive in the first two cases a successful |
|
329 output (that is a non-empty set). In each case, either |
|
330 \pcode{a} or \pcode{b} is in the processed part, and |
|
331 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
|
332 parse anything in the string \pcode{ccde}, therefore the empty |
|
333 set. |
|
334 |
|
335 A bit more interesting is the \emph{sequence parser combinator}. Given |
|
336 two parsers, say again, $p$ and $q$, we want to apply first the input |
|
337 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
|
338 parts in the pairs; and then combine the results like |
|
339 |
|
340 \begin{center} |
|
341 \begin{tabular}{lcl} |
|
342 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
|
343 $(\textit{output}_1, u_1) \in p(\text{input}) |
|
344 \;\wedge\;$\\ |
|
345 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
|
346 \end{tabular} |
|
347 \end{center} |
|
348 |
|
349 \noindent Notice that the $p$ wil first be run on the input, producing |
|
350 pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The |
|
351 overall result of the sequence parser combinator is pairs of the form |
363 overall result of the sequence parser combinator is pairs of the form |
352 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
364 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
353 unprocessed parts of both parsers is the unprocessed parts the second |
365 unprocessed part of both parsers is the unprocessed part the second |
354 parser $q$ produces as left-over. The processed parts of both parsers |
366 parser $q$ produces leaves as left-over. The processed parts of both |
355 is just the pair of the outputs |
367 parsers is a pair consisting of the outputs of $p$ and $q$, namely |
356 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be |
368 $(\textit{output}_1, \textit{output}_2)$. This behaviour can be |
357 implemented in Scala as follows: |
369 implemented in Scala as follows: |
358 |
370 |
359 \begin{center} |
371 \begin{center} |
360 \begin{lstlisting}[language=Scala,numbers=none] |
372 \begin{lstlisting}[language=Scala,numbers=none] |
361 class SeqParser[I, T, S] |
373 class SeqParser[I, T, S] |
362 (p: => Parser[I, T], |
374 (p: => Parser[I, T], |
363 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
375 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
364 def parse(sb: I) = |
376 def parse(in: I) = |
365 for ((output1, u1) <- p.parse(sb); |
377 for ((output1, u1) <- p.parse(in); |
366 (output2, u2) <- q.parse(u1)) |
378 (output2, u2) <- q.parse(u1)) |
367 yield ((output1, output2), u2) |
379 yield ((output1, output2), u2) |
368 } |
380 } |
369 \end{lstlisting} |
381 \end{lstlisting} |
370 \end{center} |
382 \end{center} |
371 |
383 |
372 \noindent This parser takes again as input two parsers, \texttt{p} |
384 \noindent This parser takes again as arguments two parsers, \texttt{p} |
373 and \texttt{q}. It implements \texttt{parse} as follows: let |
385 and \texttt{q}. It implements \texttt{parse} as follows: let first run |
374 first run the parser \texttt{p} on the input producing a set |
386 the parser \texttt{p} on the input producing a set of pairs |
375 of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1} |
387 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
376 stands for the unprocessed parts left over by \texttt{p}. Let |
388 unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these |
377 \texttt{q} run on these unprocessed parts producing again a |
389 unprocessed parts producing again a set of pairs. The output of the |
378 set of pairs. The output of the sequence parser combinator is |
390 sequence parser combinator is then a set containing pairs where the |
379 then a set containing pairs where the first components are |
391 first components are again pairs, namely what the first parser could |
380 again pairs, namely what the first parser could parse together |
392 parse together with what the second parser could parse; the second |
381 with what the second parser could parse; the second component |
393 component is the unprocessed part left over after running the second |
382 is the unprocessed part left over after running the second |
394 parser \texttt{q}. Therefore the input type of the sequence parser |
383 parser \texttt{q}. Therefore the input type of the sequence |
395 combinator is as usual \texttt{I}, but the output type is |
384 parser combinator is as usual \texttt{I}, but the output type |
|
385 is |
|
386 |
396 |
387 \begin{center} |
397 \begin{center} |
388 \texttt{Set[((T, S), I)]} |
398 \texttt{Set[((T, S), I)]} |
389 \end{center} |
399 \end{center} |
390 |
400 |
394 Notice that we have now two output types for the sequence parser |
404 Notice that we have now two output types for the sequence parser |
395 combinator, because in general \textit{p} and \textit{q} might produce |
405 combinator, because in general \textit{p} and \textit{q} might produce |
396 different things (for example first we recognise a number and then a |
406 different things (for example first we recognise a number and then a |
397 string corresponding to an operator). |
407 string corresponding to an operator). |
398 |
408 |
399 |
409 With the shorthand notation we shall introduce later for the sequence |
400 We have not yet looked at this in detail, but Scala allows us to |
410 parser combinator, we can write for example \pcode{"a" ~ "b"}, which |
401 provide some shorthand notation for the sequence parser combinator. We |
411 is the parser combinator that first recognises the character |
402 can write for example \pcode{"a" ~ "b"}, which is the parser |
412 \texttt{a} from a string and then \texttt{b}. Let us look again at |
403 combinator that first recognises the character \texttt{a} from a |
413 three examples of how this parser combinator processes some strings: |
404 string and then \texttt{b}. Let us look again at three examples of |
|
405 how this parser combinator processes strings: |
|
406 |
414 |
407 \begin{center} |
415 \begin{center} |
408 \begin{tabular}{rcl} |
416 \begin{tabular}{rcl} |
409 input strings & & output\medskip\\ |
417 input strings & & output\medskip\\ |
410 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
418 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
412 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
420 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
413 \end{tabular} |
421 \end{tabular} |
414 \end{center} |
422 \end{center} |
415 |
423 |
416 \noindent In the first line we have a successful parse, because the |
424 \noindent In the first line we have a successful parse, because the |
417 string started with \texttt{ab}, which is the prefix we are looking |
425 string starts with \texttt{ab}, which is the prefix we are looking |
418 for. But since the parsing combinator is constructed as sequence of |
426 for. But since the parsing combinator is constructed as sequence of |
419 the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the |
427 the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the |
420 result is a nested pair of the form \texttt{((a, b), cde)}. It is |
428 result is a nested pair of the form \texttt{((a, b), cde)}. It is |
421 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously |
429 \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously |
422 expects. The parser returns the empty set in the other examples, |
430 expect. The parser returns the empty set in the other examples, |
423 because they do not fit with what the parser is supposed to parse. |
431 because they do not fit with what the parser is supposed to parse. |
424 |
432 |
425 |
433 |
426 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} |
434 A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which |
427 which parses as first character either an \texttt{a} or \texttt{b} |
435 parses as first character either an \texttt{a} or \texttt{b}, followed |
428 followed by a \texttt{c}. This parser produces the following outputs. |
436 by a \texttt{c}. This parser produces the following outputs. |
429 |
437 |
430 \begin{center} |
438 \begin{center} |
431 \begin{tabular}{rcl} |
439 \begin{tabular}{rcl} |
432 input strings & & output\medskip\\ |
440 input strings & & output\medskip\\ |
433 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
441 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
478 \end{tabular} |
486 \end{tabular} |
479 \end{center} |
487 \end{center} |
480 |
488 |
481 \noindent This is an instance where the parser consumed |
489 \noindent This is an instance where the parser consumed |
482 completely the input, meaning the unprocessed part is just the |
490 completely the input, meaning the unprocessed part is just the |
483 empty string. So if we called \pcode{parse_all} instead of \pcode{parse} |
491 empty string. So if we called \pcode{parse_all}, instead of \pcode{parse}, |
484 we would get back the result |
492 we would get back the result |
485 |
493 |
486 \[ |
494 \[ |
487 \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\} |
495 \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\} |
488 \] |
496 \] |
489 |
497 |
490 \noindent where the unprocessed (empty) parts have been stripped away |
498 \noindent where the unprocessed (empty) parts have been stripped away |
491 from the pairs; everything where the second part was not empty has |
499 from the pairs; everything where the second part was not empty has |
492 been thrown away as ultimately-unsuccessful-parses. |
500 been thrown away as well, because they represent |
493 |
501 ultimately-unsuccessful-parses. |
494 |
502 |
495 Note carefully that constructing a parser such \pcode{'a' || |
503 |
496 ('a' ~ 'b')} will result in a typing error. The first |
504 Note carefully that constructing a parser such \pcode{"a" || ("a" ~ |
497 parser has as output type a single character (recall the type |
505 "b")} will result in a typing error. The intention is that we want |
498 of \texttt{CharParser}), but the second parser produces a pair |
506 to parse an \texttt{a}, or an \texttt{a} followed by a |
499 of characters as output. The alternative parser is however |
507 \texttt{b}. However, the first parser has as output type a single |
500 required to have both component parsers to have the same type. |
508 character (recall the type of \texttt{CharParser}), but the second |
501 We will see later how we can build this parser without the |
509 parser produces a pair of characters as output. The alternative parser |
502 typing error. |
510 is required to have both component parsers to have the same type---we |
503 |
511 need to be able to build the union of two sets, which means in Scala |
504 The next parser combinator does not actually combine smaller |
512 they need to be of the same type. We will see later how we can build |
505 parsers, but applies a function to the result of a parser. |
513 this parser without the typing error. |
506 It is implemented in Scala as follows |
514 |
|
515 The next parser combinator, called \emph{semantic action}, does not |
|
516 actually combine smaller parsers, but applies a function to the result |
|
517 of a parser. It is implemented in Scala as follows |
507 |
518 |
508 \begin{center} |
519 \begin{center} |
509 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
520 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
510 class FunParser[I, T, S] |
521 class FunParser[I, T, S] |
511 (p: => Parser[I, T], |
522 (p: => Parser[I, T], |
512 f: T => S) extends Parser[I, S] { |
523 f: T => S) extends Parser[I, S] { |
513 def parse(sb: I) = |
524 def parse(in: I) = |
514 for ((head, tail) <- p.parse(sb)) yield (f(head), tail) |
525 for ((head, tail) <- p.parse(in)) yield (f(head), tail) |
515 } |
526 } |
516 \end{lstlisting} |
527 \end{lstlisting} |
517 \end{center} |
528 \end{center} |
518 |
529 |
519 |
530 |
520 \noindent This parser combinator takes a parser \texttt{p} |
531 \noindent This parser combinator takes a parser \texttt{p} |
521 with output type \texttt{T} as one argument as well as a |
532 with output type \texttt{T} as one argument as well as a |
522 function \texttt{f} with type \texttt{T => S}. The parser |
533 function \texttt{f} with type \texttt{T => S}. The parser |
523 \texttt{p} produces sets of type \texttt{(T, I)}. The |
534 \texttt{p} produces sets of type \texttt{(T, I)}. The |
524 \texttt{FunParser} combinator then applies the function |
535 semantic action combinastor then applies the function |
525 \texttt{f} to all the parser outputs. Since this function is of |
536 \texttt{f} to all the parser outputs. Since this function is of |
526 type \texttt{T => S}, we obtain a parser with output type |
537 type \texttt{T => S}, we obtain a parser with output type |
527 \texttt{S}. Again Scala lets us introduce some shorthand |
538 \texttt{S}. Again Scala lets us introduce some shorthand |
528 notation for this parser combinator. Therefore we will write |
539 notation for this parser combinator. Therefore we will write |
529 \texttt{p ==> f} for it. |
540 \texttt{p ==> f} for it. |
|
541 |
|
542 What are semantic actions good for? Ultimately, they allow is to |
|
543 transform the parsed input into a datastructure we can use for further |
|
544 processing. A simple example would be to transform parsed characters |
|
545 into ASCII numbers. Suppose we define a function \texttt{f} (from |
|
546 characters to ints) and a \texttt{CharParser}. |
|
547 |
|
548 \begin{center} |
|
549 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
550 val f = (c: Char) => c.toInt |
|
551 val c = new CharParser('c') |
|
552 \end{lstlisting} |
|
553 \end{center} |
|
554 |
|
555 \noindent |
|
556 Then |
|
557 |
|
558 \begin{center} |
|
559 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
560 c.parse("cbd") |
|
561 (c ==> f).parse("cbd") |
|
562 \end{lstlisting} |
|
563 \end{center} |
|
564 |
|
565 \noindent |
|
566 the first line produces \texttt{Set(('c', "bd"))}, whereas the second |
|
567 produces \texttt{Set((99, "bd"))}. |
|
568 |
|
569 \subsubsection*{Shorthand notation for parser combinators} |
|
570 |
|
571 Before we proceed, let us just explain the shorthand notation for |
|
572 parser combinators. Like for regular expressions, the shorthand notation |
|
573 will make our life much easier when writing actual parsers. |
530 |
574 |
531 \subsubsection*{How to build parsers using parser combinators?} |
575 \subsubsection*{How to build parsers using parser combinators?} |
532 |
576 |
533 \subsubsection*{Implementing an Interpreter} |
577 \subsubsection*{Implementing an Interpreter} |
534 |
578 |