5 |
5 |
6 \begin{document} |
6 \begin{document} |
7 |
7 |
8 \section*{Handout 6 (Parser Combinators)} |
8 \section*{Handout 6 (Parser Combinators)} |
9 |
9 |
10 In what follows we explain \emph{parser combinators}, because |
10 In what follows we explain \emph{parser combinators}. Their |
11 they are very easy to implement. However, they only work when |
11 distinguishing feature is that they are very easy to |
12 the grammar to be parsed is \emph{not} left-recursive and they |
12 implement. However, they only work when the grammar to be |
13 are efficient only when the grammar is unambiguous. It is the |
13 parsed is \emph{not} left-recursive and they are efficient |
14 responsibility of the grammar designer to ensure these two |
14 only when the grammar is unambiguous. It is the responsibility |
15 properties. |
15 of the grammar designer to ensure these two properties. |
16 |
16 |
17 Parser combinators can deal with any kind of input as long as |
17 Parser combinators can deal with any kind of input as long as |
18 this input is a kind of sequence, for example a string or a |
18 this input is a kind of sequence, for example a string or a |
19 list of tokens. If the input are lists of tokens, then parser |
19 list of tokens. The general idea behind parser combinators is |
20 combinators transform them into sets of pairs, like so |
20 to transform the input into sets of pairs, like so |
21 |
21 |
22 \begin{center} |
22 \begin{center} |
23 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
23 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
24 $\Rightarrow$ |
24 $\Rightarrow$ |
25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
26 \end{center} |
26 \end{center} |
27 |
27 |
28 \noindent In Scala parser combinators will therefore be |
28 \noindent In Scala parser combinators are functions of type |
29 functions of type |
|
30 |
29 |
31 \begin{center} |
30 \begin{center} |
32 \texttt{I $\Rightarrow$ Set[(T, I)]} |
31 \texttt{I $\Rightarrow$ Set[(T, I)]} |
33 \end{center} |
32 \end{center} |
34 |
33 |
35 \noindent that is they take as input something of type |
34 \noindent that is they take as input something of type |
36 \texttt{I} and return a set of pairs. The first component of |
35 \texttt{I} and return a set of pairs. The first component of |
37 these pairs corresponds to what the parser combinator was able |
36 these pairs corresponds to what the parser combinator was able |
38 to process from the input and the second is the unprocessed |
37 to process from the input and the second is the unprocessed |
39 part of the input. As we shall see shortly, a parser |
38 part of the input. As we shall see shortly, a parser |
40 combinator might return more than one such pair, with the idea |
39 combinator might return more than one such pair, the idea |
41 that there are potentially several ways how to interpret the |
40 being that there are potentially several ways how to interpret |
42 input. To simplify matters we will first look at the case |
41 the input. To simplify matters we will first look at the case |
43 where the input to the parser combinators is just strings. As |
42 where the input to the parser combinators is just strings. As |
44 a concrete example, consider the case where the input is the |
43 a concrete example, consider the string |
45 string |
|
46 |
44 |
47 \begin{center} |
45 \begin{center} |
48 \tt\Grid{iffoo\VS testbar} |
46 \tt\Grid{iffoo\VS testbar} |
49 \end{center} |
47 \end{center} |
50 |
48 |
51 \noindent We might have a parser combinator which tries to |
49 \noindent We might have a parser combinator which tries to |
52 interpret this string as a keyword (\texttt{if}) or an |
50 interpret this string as a keyword (\texttt{if}) or as an |
53 identifier (\texttt{iffoo}). Then the output will be the set |
51 identifier (\texttt{iffoo}). Then the output will be the set |
54 |
52 |
55 \begin{center} |
53 \begin{center} |
56 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
54 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
57 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
55 \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
58 \end{center} |
56 \end{center} |
59 |
57 |
60 \noindent where the first pair means the parser could |
58 \noindent where the first pair means the parser could |
61 recognise \texttt{if} from the input and leaves the rest as |
59 recognise \texttt{if} from the input and leaves the rest as |
62 `unprocessed' as the second component of the pair; in the |
60 `unprocessed' as the second component of the pair; in the |
63 other case it could recognise \texttt{iffoo} and leaves |
61 other case it could recognise \texttt{iffoo} and leaves |
64 \texttt{\VS testbar} as unprocessed. If the parser cannot |
62 \texttt{\VS testbar} as unprocessed. If the parser cannot |
65 recognise anything from the input then parser combinators just |
63 recognise anything from the input, then parser combinators just |
66 return the empty set $\{\}$. This will indicate something |
64 return the empty set $\{\}$. This will indicate something |
67 ``went wrong''\ldots or more precisely, nothing could be |
65 ``went wrong''\ldots or more precisely, nothing could be |
68 parsed. |
66 parsed. |
69 |
67 |
70 The main attraction of parser combinators is that we can |
68 The main attraction of parser combinators is that we can |
91 |
89 |
92 \noindent From the function \texttt{parse} we can then |
90 \noindent From the function \texttt{parse} we can then |
93 ``centrally'' derive the function \texttt{parse\_all}, which |
91 ``centrally'' derive the function \texttt{parse\_all}, which |
94 just filters out all pairs whose second component is not empty |
92 just filters out all pairs whose second component is not empty |
95 (that is has still some unprocessed part). The reason is that |
93 (that is has still some unprocessed part). The reason is that |
96 at the end of parsing we are only interested in the results |
94 at the end of the parsing we are only interested in the results |
97 where all the input has been consumed and no unprocessed part |
95 where all the input has been consumed and no unprocessed part |
98 is left. |
96 is left over. |
99 |
97 |
100 One of the simplest parser combinators recognises just a |
98 One of the simplest parser combinators recognises just a |
101 character, say $c$, from the beginning of strings. Its |
99 character, say $c$, from the beginning of strings. Its |
102 behaviour can be described as follows: |
100 behaviour can be described as follows: |
103 |
101 |
104 \begin{itemize} |
102 \begin{itemize} |
105 \item if the head of the input string starts with a $c$, it returns |
103 \item if the head of the input string starts with a $c$, then returns |
106 the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} |
104 the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} |
107 $s$ is the unprocessed part of the input string |
105 $s$ is the unprocessed part of the input string |
108 \item otherwise it returns the empty set $\{\}$ |
106 \item otherwise return the empty set $\{\}$ |
109 \end{itemize} |
107 \end{itemize} |
110 |
108 |
111 \noindent |
109 \noindent |
112 The input type of this simple parser combinator for characters is |
110 The input type of this simple parser combinator for characters is |
113 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
111 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
133 |
131 |
134 More interesting are the parser combinators that build larger |
132 More interesting are the parser combinators that build larger |
135 parsers out of smaller component parsers. For example the |
133 parsers out of smaller component parsers. For example the |
136 alternative parser combinator is as follows: given two |
134 alternative parser combinator is as follows: given two |
137 parsers, say, $p$ and $q$, then we apply both parsers to the |
135 parsers, say, $p$ and $q$, then we apply both parsers to the |
138 input (remember parsers are functions) and combine the input |
136 input (remember parsers are functions) and combine the output |
|
137 (remember they are sets) |
139 |
138 |
140 \begin{center} |
139 \begin{center} |
141 $p(\text{input}) \cup q(\text{input})$ |
140 $p(\text{input}) \cup q(\text{input})$ |
142 \end{center} |
141 \end{center} |
143 |
142 |
144 \noindent |
143 \noindent In Scala we would implement alternative parser |
145 In Scala we would implement alternative parsers as follows |
144 combinator as follows |
146 |
145 |
147 \begin{center} |
146 \begin{center} |
148 \begin{lstlisting}[language=Scala, numbers=none] |
147 \begin{lstlisting}[language=Scala, numbers=none] |
149 class AltParser[I, T] |
148 class AltParser[I, T] |
150 (p: => Parser[I, T], |
149 (p: => Parser[I, T], |
155 \end{center} |
154 \end{center} |
156 |
155 |
157 \noindent The types of this parser combinator are polymorphic |
156 \noindent The types of this parser combinator are polymorphic |
158 (we just have \texttt{I} for the input type, and \texttt{T} |
157 (we just have \texttt{I} for the input type, and \texttt{T} |
159 for the output type). The alternative parser builds a new |
158 for the output type). The alternative parser builds a new |
160 parser out of two existing parser combinator \texttt{p} and |
159 parser out of two existing parsers \texttt{p} and \texttt{q}. |
161 \texttt{q}. Both need to be able to process input of type |
160 Both need to be able to process input of type \texttt{I} and |
162 \texttt{I} and return the same output type \texttt{Set[(T, |
161 return the same output type \texttt{Set[(T, |
163 I)]}. (There is an interesting detail of Scala, namely the |
162 I)]}.\footnote{There is an interesting detail of Scala, namely |
164 \texttt{=>} in front of the types of \texttt{p} and |
163 the \texttt{=>} in front of the types of \texttt{p} and |
165 \texttt{q}. They will prevent the evaluation of the arguments |
164 \texttt{q}. They will prevent the evaluation of the arguments |
166 before they are used. This is often called \emph{lazy |
165 before they are used. This is often called \emph{lazy |
167 evaluation} of the arguments.) The alternative parser should |
166 evaluation} of the arguments. We will explain this later.} The alternative parser should |
168 run the input with the first parser \texttt{p} (producing a |
167 run the input with the first parser \texttt{p} (producing a |
169 set of outputs) and then run the same input with \texttt{q}. |
168 set of outputs) and then run the same input with \texttt{q}. |
170 The result should be then just the union of both sets, which |
169 The result should be then just the union of both sets, which |
171 is the operation \texttt{++} in Scala. |
170 is the operation \texttt{++} in Scala. |
172 |
171 |
173 This parser combinator already allows us to construct a parser |
172 The alternative parser combinator already allows us to |
174 that either a character \texttt{a} or \texttt{b}, as |
173 construct a parser that parses either a character \texttt{a} |
|
174 or \texttt{b}, as |
175 |
175 |
176 \begin{center} |
176 \begin{center} |
177 \begin{lstlisting}[language=Scala, numbers=none] |
177 \begin{lstlisting}[language=Scala, numbers=none] |
178 new AltParser(CharParser('a'), CharParser('b')) |
178 new AltParser(CharParser('a'), CharParser('b')) |
179 \end{lstlisting} |
179 \end{lstlisting} |
192 \end{tabular} |
192 \end{tabular} |
193 \end{center} |
193 \end{center} |
194 |
194 |
195 \noindent We receive in the first two cases a successful |
195 \noindent We receive in the first two cases a successful |
196 output (that is a non-empty set). In each case, either |
196 output (that is a non-empty set). In each case, either |
197 \pcode{a} or \pcode{b} are the processed parts, and \pcode{c} |
197 \pcode{a} or \pcode{b} are in the processed parts, and |
198 is the unprocessed part. Clearly this parser cannot parse |
198 \pcode{c} in the unprocessed part. Clearly this parser cannot |
199 anything with the string \pcode{cc}. |
199 parse anything in the string \pcode{cc}, therefore the empty |
|
200 set. |
200 |
201 |
201 A bit more interesting is the \emph{sequence parser |
202 A bit more interesting is the \emph{sequence parser |
202 combinator}. Given two parsers, say, $p$ and $q$, |
203 combinator}. Given two parsers, say, $p$ and $q$, apply first |
203 apply first the input to $p$ producing a set of pairs; |
204 the input to $p$ producing a set of pairs; then apply $q$ to |
204 then apply $q$ to the unparsed parts; then combine the |
205 all the unparsed parts; then combine the results like |
205 results like |
|
206 |
206 |
207 \begin{center} |
207 \begin{center} |
208 \begin{tabular}{lcl} |
208 \begin{tabular}{lcl} |
209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
210 $(\textit{output}_1, u_1) \in p(\text{input}) |
210 $(\textit{output}_1, u_1) \in p(\text{input}) |
211 \;\wedge\;$\\ |
211 \;\wedge\;$\\ |
212 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
212 && $\;(\textit{output}_2, u_2) \in q(u_1)\}$ |
213 \end{tabular} |
213 \end{tabular} |
214 \end{center} |
214 \end{center} |
215 |
215 |
216 \noindent |
216 \noindent |
217 This can be implemented in Scala as follows: |
217 This can be implemented in Scala as follows: |
220 \begin{lstlisting}[language=Scala,numbers=none] |
220 \begin{lstlisting}[language=Scala,numbers=none] |
221 class SeqParser[I, T, S] |
221 class SeqParser[I, T, S] |
222 (p: => Parser[I, T], |
222 (p: => Parser[I, T], |
223 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
223 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
224 def parse(sb: I) = |
224 def parse(sb: I) = |
225 for ((head1, tail1) <- p.parse(sb); |
225 for ((output1, u1) <- p.parse(sb); |
226 (head2, tail2) <- q.parse(tail1)) |
226 (output2, u2) <- q.parse(u1)) |
227 yield ((head1, head2), tail2) |
227 yield ((output1, output2), u2) |
228 } |
228 } |
229 \end{lstlisting} |
229 \end{lstlisting} |
230 \end{center} |
230 \end{center} |
231 |
231 |
232 \noindent This parser takes as input two parsers, \texttt{p} |
232 \noindent This parser takes as input two parsers, \texttt{p} |
233 and \texttt{q}. It implements \texttt{parse} as follows: let |
233 and \texttt{q}. It implements \texttt{parse} as follows: let |
234 first run the parser \texttt{p} on the input producing a set |
234 first run the parser \texttt{p} on the input producing a set |
235 of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1} |
235 of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1} |
236 stands for the unprocessed parts left over by \texttt{p}. Let |
236 stands for the unprocessed parts left over by \texttt{p}. Let |
237 \texttt{q} run on these unprocessed parts producing again a |
237 \texttt{q} run on these unprocessed parts producing again a |
238 set of pairs. The output of the sequence parser combinator is |
238 set of pairs. The output of the sequence parser combinator is |
239 then a set containing pairs where the first components are |
239 then a set containing pairs where the first components are |
240 again pairs, namely what the first parser could parse together |
240 again pairs, namely what the first parser could parse together |
305 \noindent This is an instance where the parser consumed |
306 \noindent This is an instance where the parser consumed |
306 completely the input, meaning the unprocessed part is just the |
307 completely the input, meaning the unprocessed part is just the |
307 empty string. |
308 empty string. |
308 |
309 |
309 |
310 |
310 Note carefully that constructing a parser such \texttt{'a' || |
311 Note carefully that constructing a parser such \pcode{'a' || |
311 ('a' $\sim$ 'b')} will result in a tying error. The first |
312 ('a' ~ 'b')} will result in a typing error. The first |
312 parser has as output type a single character (recall the type |
313 parser has as output type a single character (recall the type |
313 of \texttt{CharParser}), but the second parser produces a pair |
314 of \texttt{CharParser}), but the second parser produces a pair |
314 of characters as output. The alternative parser is however |
315 of characters as output. The alternative parser is however |
315 required to have both component parsers to have the same type. |
316 required to have both component parsers to have the same type. |
316 We will see later how we can build this parser without the |
317 We will see later how we can build this parser without the |
317 typing error. |
318 typing error. |
318 |
319 |
319 The next parser combinator does not actually combine smaller |
320 The next parser combinator does not actually combine smaller |
320 parsers, but applies a function to the result of the parser. |
321 parsers, but applies a function to the result of a parser. |
321 It is implemented in Scala as follows |
322 It is implemented in Scala as follows |
322 |
323 |
323 \begin{center} |
324 \begin{center} |
324 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
325 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
325 class FunParser[I, T, S] |
326 class FunParser[I, T, S] |
330 } |
331 } |
331 \end{lstlisting} |
332 \end{lstlisting} |
332 \end{center} |
333 \end{center} |
333 |
334 |
334 |
335 |
335 \noindent |
336 \noindent This parser combinator takes a parser \texttt{p} |
336 This parser combinator takes a parser \texttt{p} with output type \texttt{T} as |
337 with output type \texttt{T} as one argument as well as a |
337 input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p} |
338 function \texttt{f} with type \texttt{T => S}. The parser |
338 produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then |
339 \texttt{p} produces sets of type \texttt{(T, I)}. The |
339 applies the function \texttt{f} to all the parer outputs. Since this function |
340 \texttt{FunParser} combinator then applies the function |
340 is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}. |
341 \texttt{f} to all the parser outputs. Since this function is of |
341 Again Scala lets us introduce some shorthand notation for this parser combinator. |
342 type \texttt{T => S}, we obtain a parser with output type |
342 Therefore we will write \texttt{p ==> f} for it. |
343 \texttt{S}. Again Scala lets us introduce some shorthand |
|
344 notation for this parser combinator. Therefore we will write |
|
345 \texttt{p ==> f} for it. |
|
346 |
|
347 \subsubsection*{How to build parsers using parser combinators?} |
|
348 |
|
349 \subsubsection*{Implementing an Interpreter} |
343 |
350 |
344 %\bigskip |
351 %\bigskip |
345 %takes advantage of the full generality---have a look |
352 %takes advantage of the full generality---have a look |
346 %what it produces if we call it with the string \texttt{abc} |
353 %what it produces if we call it with the string \texttt{abc} |
347 % |
354 % |