35 |
35 |
36 \noindent |
36 \noindent |
37 Given the extended effort we have spent implementing a lexer in order |
37 Given the extended effort we have spent implementing a lexer in order |
38 to generate lists of tokens, it might be surprising that in what |
38 to generate lists of tokens, it might be surprising that in what |
39 follows we shall often use strings as input, rather than lists of |
39 follows we shall often use strings as input, rather than lists of |
40 tokens. This is for making the explanation more lucid and for quick |
40 tokens. This is for making the explanation more lucid and ensure the |
41 examples. It does not make our previous work on lexers obsolete |
41 examples are simple. It does not make our previous work on lexers obsolete |
42 (remember they transform a string into a list of tokens). Lexers will |
42 (remember they transform a string into a list of tokens). Lexers will |
43 still be needed for building a somewhat realistic compiler. |
43 still be needed for building a somewhat realistic compiler. See also |
|
44 a question in the homework regarding this issue. |
44 |
45 |
45 As mentioned above, parser combinators are relatively agnostic about what |
46 As mentioned above, parser combinators are relatively agnostic about what |
46 kind of input they process. In my Scala code I use the following |
47 kind of input they process. In my Scala code I use the following |
47 polymorphic types for parser combinators: |
48 polymorphic types for parser combinators: |
48 |
49 |
51 \end{center} |
52 \end{center} |
52 |
53 |
53 \noindent That is they take as input something of type \texttt{I} and |
54 \noindent That is they take as input something of type \texttt{I} and |
54 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
55 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input |
55 needs to be of ``sequence-kind'', I actually have to often write |
56 needs to be of ``sequence-kind'', I actually have to often write |
56 \texttt{I <\% Seq[\_]} for the input type. This ensures the |
57 \code{(using is: I => Seq[_])} for the input type. This ensures the |
57 input is a subtype of Scala sequences. The first component of the |
58 input is a subtype of Scala sequences.\footnote{This is a new feature |
58 generated pairs corresponds to what the parser combinator was able to |
59 in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties |
59 parse from the input and the second is the unprocessed, or |
60 with running my code.} The first component of the generated pairs |
60 leftover, part of the input (therefore the type of this unprocessed part is |
61 corresponds to what the parser combinator was able to parse from the |
61 the same as the input). A parser combinator might return more than one |
62 input and the second is the unprocessed, or leftover, part of the |
62 such pair; the idea is that there are potentially several ways of how |
63 input (therefore the type of this unprocessed part is the same as the |
63 to parse the input. As a concrete example, consider the string |
64 input). A parser combinator might return more than one such pair; the |
|
65 idea is that there are potentially several ways of how to parse the |
|
66 input. As a concrete example, consider the string |
64 |
67 |
65 \begin{center} |
68 \begin{center} |
66 \tt\Grid{iffoo\VS testbar} |
69 \tt\Grid{iffoo\VS testbar} |
67 \end{center} |
70 \end{center} |
68 |
71 |
145 \mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows: |
148 \mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows: |
146 |
149 |
147 \begin{center} |
150 \begin{center} |
148 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
151 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
149 case class CharParser(c: Char) extends Parser[String, Char] { |
152 case class CharParser(c: Char) extends Parser[String, Char] { |
150 def parse(in: String) = |
153 def parse(s: String) = |
151 if (in.head == c) Set((c, in.tail)) else Set() |
154 if (s != "" && s.head == c) Set((c, s.tail)) else Set() |
152 } |
155 } |
153 \end{lstlisting} |
156 \end{lstlisting} |
154 \end{center} |
157 \end{center} |
155 |
158 |
156 \noindent You can see \texttt{parse} tests whether the |
159 \noindent You can see \texttt{parse} tests here whether the |
157 first character of the input string \texttt{in} is equal to |
160 first character of the input string \texttt{s} is equal to |
158 \texttt{c}. If yes, then it splits the string into the recognised part |
161 \texttt{c}. If yes, then it splits the string into the recognised part |
159 \texttt{c} and the unprocessed part \texttt{in.tail}. In case |
162 \texttt{c} and the unprocessed part \texttt{s.tail}. In case |
160 \texttt{in} does not start with \texttt{c} then the parser returns the |
163 \texttt{s} does not start with \texttt{c} then the parser returns the |
161 empty set (in Scala \texttt{Set()}). Since this parser recognises |
164 empty set (in Scala \texttt{Set()}). Since this parser recognises |
162 characters and just returns characters as the processed part, the |
165 characters and just returns characters as the processed part, the |
163 output type of the parser is \texttt{Char}. |
166 output type of the parser is \texttt{Char}. |
164 |
167 |
165 If we want to parse a list of tokens and interested in recognising a |
168 If we want to parse a list of tokens and interested in recognising a |
184 token (and the rest of the list), like the \texttt{CharParser} above, |
187 token (and the rest of the list), like the \texttt{CharParser} above, |
185 rather it extracts also the string \texttt{s} from the token and |
188 rather it extracts also the string \texttt{s} from the token and |
186 converts it into an integer. The hope is that the lexer did its work |
189 converts it into an integer. The hope is that the lexer did its work |
187 well and this conversion always succeeds. The consequence of this is |
190 well and this conversion always succeeds. The consequence of this is |
188 that the output type for this parser is \texttt{Int}, not |
191 that the output type for this parser is \texttt{Int}, not |
189 \texttt{Token}. Such a conversion would be needed if we want to |
192 \texttt{Token}. Such a conversion would be needed in our parser, |
190 implement a simple calculator program, because string-numbers need to |
193 because when we encounter a number in our program, we want to do |
191 be transformed into \texttt{Int}-numbers in order to do the |
194 some calculations based on integers, not strings (or tokens). |
192 calculations. |
|
193 |
195 |
194 |
196 |
195 These simple parsers that just look at the input and do a simple |
197 These simple parsers that just look at the input and do a simple |
196 transformation are often called \emph{atomic} parser combinators. |
198 transformation are often called \emph{atomic} parser combinators. |
197 More interesting are the parser combinators that build larger parsers |
199 More interesting are the parser combinators that build larger parsers |
209 combinator as follows |
211 combinator as follows |
210 |
212 |
211 \begin{center} |
213 \begin{center} |
212 \begin{lstlisting}[language=Scala, numbers=none] |
214 \begin{lstlisting}[language=Scala, numbers=none] |
213 class AltParser[I, T] |
215 class AltParser[I, T] |
214 (p: => Parser[I, T], |
216 (p: => Parser[I, T], |
215 q: => Parser[I, T]) extends Parser[I, T] { |
217 q: => Parser[I, T])(using I => Seq[_]) |
216 def parse(in: I) = p.parse(in) ++ q.parse(in) |
218 extends Parser[I, T] { |
217 } |
219 def parse(in: I) = p.parse(in) ++ q.parse(in) |
|
220 } |
218 \end{lstlisting} |
221 \end{lstlisting} |
219 \end{center} |
222 \end{center} |
220 |
223 |
221 \noindent The types of this parser combinator are again generic (we |
224 \noindent The types of this parser combinator are again generic (we |
222 have \texttt{I} for the input type, and \texttt{T} for the output |
225 have \texttt{I} for the input type, and \texttt{T} for the output |
223 type). The alternative parser builds a new parser out of two existing |
226 type). The alternative parser builds a new parser out of two existing |
224 parsers \texttt{p} and \texttt{q} which are given as arguments. Both |
227 parsers \texttt{p} and \texttt{q} which are given as arguments. Both |
225 parsers need to be able to process input of type \texttt{I} and return |
228 parsers need to be able to process input of type \texttt{I} and return |
226 in \texttt{parse} the same output type \texttt{Set[(T, |
229 in \texttt{parse} the same output type \texttt{Set[(T, |
227 I)]}.\footnote{There is an interesting detail of Scala, namely the |
230 I)]}.\footnote{There is an interesting detail of Scala, namely the |
228 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They |
231 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows |
229 will prevent the evaluation of the arguments before they are |
232 will prevent the evaluation of the arguments before they are |
230 used. This is often called \emph{lazy evaluation} of the |
233 used. This is often called \emph{lazy evaluation} of the |
231 arguments. We will explain this later.} The alternative parser runs |
234 arguments. We will explain this later.} The alternative parser runs |
232 the input with the first parser \texttt{p} (producing a set of pairs) |
235 the input with the first parser \texttt{p} (producing a set of pairs) |
233 and then runs the same input with \texttt{q} (producing another set of |
236 and then runs the same input with \texttt{q} (producing another set of |
234 pairs). The result should be then just the union of both sets, which |
237 pairs). The result should be then just the union of both sets, which |
235 is the operation \texttt{++} in Scala. |
238 is the operation \texttt{++} in Scala. |
236 |
239 |
237 The alternative parser combinator allows us to construct a parser that |
240 The alternative parser combinator allows us to construct a parser that |
238 parses either a character \texttt{a} or \texttt{b} using the |
241 parses either a character \texttt{a} or \texttt{b} using the |
239 \texttt{CharParser} shown above. For this we can write |
242 \texttt{CharParser} shown above. For this we can write\footnote{Note |
|
243 that we cannot use a \texttt{case}-class for \texttt{AltParser}s |
|
244 because of the problem with laziness and Scala quirks. Hating |
|
245 \texttt{new} like the plague, we will work around this later with |
|
246 some syntax tricks. ;o)} |
240 |
247 |
241 \begin{center} |
248 \begin{center} |
242 \begin{lstlisting}[language=Scala, numbers=none] |
249 \begin{lstlisting}[language=Scala, numbers=none] |
243 new AltParser(CharParser('a'), CharParser('b')) |
250 new AltParser(CharParser('a'), CharParser('b')) |
244 \end{lstlisting} |
251 \end{lstlisting} |
245 \end{center} |
252 \end{center} |
246 |
253 |
247 \noindent Later on we will use Scala mechanism for introducing some |
254 \noindent Later on we will use Scala mechanism for introducing some |
248 more readable shorthand notation for this, like \texttt{p"a" || |
255 more readable shorthand notation for this, like \texttt{p"a" || |
249 p"b"}. Let us look in detail at what this parser combinator produces |
256 p"b"}. But first let us look in detail at what this parser combinator produces |
250 with some sample strings. |
257 with some sample strings. |
251 |
258 |
252 \begin{center} |
259 \begin{center} |
253 \begin{tabular}{rcl} |
260 \begin{tabular}{rcl} |
254 input strings & & output\medskip\\ |
261 input strings & & output\medskip\\ |
255 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
262 \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\ |
256 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
263 \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\ |
257 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
264 \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
258 \end{tabular} |
265 \end{tabular} |
259 \end{center} |
266 \end{center} |
260 |
267 |
261 \noindent We receive in the first two cases a successful |
268 \noindent We receive in the first two cases a successful output (that |
262 output (that is a non-empty set). In each case, either |
269 is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in |
263 \pcode{a} or \pcode{b} is in the parsed part, and |
270 the parsed part, and \pcode{cde} in the unprocessed part. Clearly this |
264 \pcode{cde} in the unprocessed part. Clearly this parser cannot |
271 parser cannot parse anything of the form \pcode{ccde}, therefore the |
265 parse anything with \pcode{ccde}, therefore the empty |
272 empty set is returned in the last case. Observe that parser |
266 set is returned. |
273 combinators only look at the beginning of the given input: they do not |
|
274 fish out something in the ``middle'' of the input. |
267 |
275 |
268 A bit more interesting is the \emph{sequence parser combinator}. Given |
276 A bit more interesting is the \emph{sequence parser combinator}. Given |
269 two parsers, say again, $p$ and $q$, we want to apply first the input |
277 two parsers, say again, $p$ and $q$, we want to apply first the input |
270 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
278 to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
271 parts in the pairs; and then combine the results. Mathematically we would |
279 parts in the pairs; and then combine the results. Mathematically we would |
310 \noindent This parser takes again as arguments two parsers, \texttt{p} |
319 \noindent This parser takes again as arguments two parsers, \texttt{p} |
311 and \texttt{q}. It implements \texttt{parse} as follows: first run the |
320 and \texttt{q}. It implements \texttt{parse} as follows: first run the |
312 parser \texttt{p} on the input producing a set of pairs |
321 parser \texttt{p} on the input producing a set of pairs |
313 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
322 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the |
314 unprocessed parts left over by \texttt{p} (recall that there can be |
323 unprocessed parts left over by \texttt{p} (recall that there can be |
315 several such pairs). Let then \texttt{q} run on these unprocessed |
324 several such pairs). Let then \texttt{q} run on these unprocessed |
316 parts producing again a set of pairs. The output of the sequence |
325 parts producing again a set of pairs. The output of the sequence |
317 parser combinator is then a set containing pairs where the first |
326 parser combinator is then a set containing pairs where the first |
318 components are again pairs, namely what the first parser could parse |
327 components are again pairs, namely what the first parser could parse |
319 together with what the second parser could parse; the second component |
328 together with what the second parser could parse; the second component |
320 is the unprocessed part left over after running the second parser |
329 is the unprocessed part left over after running the second parser |
337 produce the empty set. |
346 produce the empty set. |
338 |
347 |
339 With the shorthand notation we shall introduce later for the sequence |
348 With the shorthand notation we shall introduce later for the sequence |
340 parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which |
349 parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which |
341 is the parser combinator that first recognises the character |
350 is the parser combinator that first recognises the character |
342 \texttt{a} from a string and then \texttt{b}. Let us look again at |
351 \texttt{a} from a string and then \texttt{b}. (Actually, we will be |
343 some examples of how this parser combinator processes some strings: |
352 able to write just \pcode{p"ab"} for such parsers, but it is good to |
|
353 understand first what happens behind the scenes.) Let us look again |
|
354 at some examples of how the sequence parser combinator processes some |
|
355 strings: |
344 |
356 |
345 \begin{center} |
357 \begin{center} |
346 \begin{tabular}{rcl} |
358 \begin{tabular}{rcl} |
347 input strings & & output\medskip\\ |
359 input strings & & output\medskip\\ |
348 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
360 \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\ |
349 \texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\ |
361 \texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\ |
350 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
362 \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
351 \end{tabular} |
363 \end{tabular} |
352 \end{center} |
364 \end{center} |
353 |
365 |
443 ultimately-unsuccessful-parses. The main point is that the sequence |
455 ultimately-unsuccessful-parses. The main point is that the sequence |
444 parser combinator returns pairs that can nest according to the |
456 parser combinator returns pairs that can nest according to the |
445 nesting of the component parsers. |
457 nesting of the component parsers. |
446 |
458 |
447 |
459 |
448 Consider also carefully that constructing a parser such \pcode{p"a" || |
460 Consider also carefully that constructing a parser such |
449 (p"a" ~ p"b")} will result in a typing error. The intention with this |
461 |
|
462 \begin{center} |
|
463 \pcode{p"a" || (p"a" ~ p"b")} |
|
464 \end{center} |
|
465 |
|
466 \noindent |
|
467 will result in a typing error. The intention with this |
450 parser is that we want to parse either an \texttt{a}, or an \texttt{a} |
468 parser is that we want to parse either an \texttt{a}, or an \texttt{a} |
451 followed by a \texttt{b}. However, the first parser has as output type |
469 followed by a \texttt{b}. However, the first parser has as output type |
452 a single character (recall the type of \texttt{CharParser}), but the |
470 a single character (recall the type of \texttt{CharParser}), but the |
453 second parser produces a pair of characters as output. The alternative |
471 second parser produces a pair of characters as output. The alternative |
454 parser is required to have both component parsers to have the same |
472 parser is required to have both component parsers to have the same |
455 type---the reason is that we need to be able to build the union of two |
473 type---the reason is that we need to be able to build the union of two |
456 sets, which requires in Scala that the sets have the same type. Since |
474 sets, which requires in Scala that the sets have the same type. Since |
457 they are not in this case, there is a typing error. We will see later |
475 they are not in this case, there is a typing error. We will see later |
458 how we can build this parser without the typing error. |
476 how we can build this parser without the typing error. |
459 |
477 |
460 The next parser combinator, called \emph{semantic action}, does not |
478 The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not |
461 actually combine two smaller parsers, but applies a function to the result |
479 actually combine two smaller parsers, but applies a function to the result |
462 of a parser. It is implemented in Scala as follows |
480 of a parser. It is implemented in Scala as follows |
463 |
481 |
464 \begin{center} |
482 \begin{center} |
465 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
483 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
466 class MapParser[I, T, S] |
484 class MapParser[I, T, S] |
467 (p: => Parser[I, T], |
485 (p: => Parser[I, T], |
468 f: T => S) extends Parser[I, S] { |
486 f: T => S)(using I => Seq[_]) extends Parser[I, S] { |
469 def parse(in: I) = |
487 def parse(in: I) = |
470 for ((head, tail) <- p.parse(in)) yield (f(head), tail) |
488 for ((hd, tl) <- p.parse(in)) yield (f(hd), tl) |
471 } |
489 } |
472 \end{lstlisting} |
490 \end{lstlisting} |
473 \end{center} |
491 \end{center} |
474 |
492 |
475 |
493 |
476 \noindent This parser combinator takes a parser \texttt{p} (with input |
494 \noindent This parser combinator takes a parser \texttt{p} (with input |
477 type \texttt{I} and output type \texttt{T}) as one argument but also a |
495 type \texttt{I} and output type \texttt{T}) as one argument but also a |
478 function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p} |
496 function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p} |
479 produces sets of type \texttt{Set[(T, I)]}. The semantic action |
497 produces sets of type \texttt{Set[(S, I)]}. The semantic action |
480 combinator then applies the function \texttt{f} to all the `processed' |
498 combinator then applies the function \texttt{f} to all the `processed' |
481 parser outputs. Since this function is of type \texttt{T => S}, we |
499 parser outputs. Since this function is of type \texttt{T => S}, we |
482 obtain a parser with output type \texttt{S}. Again Scala lets us |
500 obtain a parser with output type \texttt{S}. Again Scala lets us |
483 introduce some shorthand notation for this parser |
501 introduce some shorthand notation for this parser |
484 combinator. Therefore we will write short \texttt{p.map(f)} for it. |
502 combinator. Therefore we will write short \texttt{p.map(f)} for it. |
571 \end{center} |
589 \end{center} |
572 |
590 |
573 \noindent |
591 \noindent |
574 The function in the semantic action converts a string into an |
592 The function in the semantic action converts a string into an |
575 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))}, |
593 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))}, |
576 but this time \texttt{123} is an \texttt{Int}. Let us come back to |
594 but this time \texttt{123} is an \texttt{Int}. Think carefully what |
577 semantic actions when we are going to implement actual context-free |
595 the input and output type of the parser is without the semantic action |
578 grammars. |
596 adn what with the semantic action (the type of the function can |
|
597 already tell you). Let us come back to semantic actions when we are |
|
598 going to implement actual context-free grammars. |
579 |
599 |
580 \subsubsection*{Shorthand notation for parser combinators} |
600 \subsubsection*{Shorthand notation for parser combinators} |
581 |
601 |
582 Before we proceed, let us just explain the shorthand notation for |
602 Before we proceed, let us just explain the shorthand notation for |
583 parser combinators. Like for regular expressions, the shorthand notation |
603 parser combinators. Like for regular expressions, the shorthand |
584 will make our life much easier when writing actual parsers. We can define |
604 notation will make our life much easier when writing actual |
585 some implicits which allow us to write |
605 parsers. We can define some extensions\footnote{In Scala 2 this was |
|
606 generically called as ``implicits''.} which allow us to write |
586 |
607 |
587 \begin{center} |
608 \begin{center} |
588 \begin{tabular}{ll} |
609 \begin{tabular}{ll} |
589 \pcode{p || q} & alternative parser\\ |
610 \pcode{p || q} & alternative parser\\ |
590 \pcode{p ~ q} & sequence parser\\ |
611 \pcode{p ~ q} & sequence parser\\ |
591 \pcode{p.map(f)} & semantic action parser |
612 \pcode{p.map(f)} & semantic action parser |
592 \end{tabular} |
613 \end{tabular} |
593 \end{center} |
614 \end{center} |
594 |
615 |
595 \noindent |
616 \noindent |
596 as well as to use string interpolations for specifying simple string parsers. |
617 We will also use the \texttt{p}-string-interpolation for specifying simple string parsers. |
597 |
618 |
598 The idea is that this shorthand notation allows us to easily translate |
619 The idea is that this shorthand notation allows us to easily translate |
599 context-free grammars into code. For example recall our context-free |
620 context-free grammars into code. For example recall our context-free |
600 grammar for palindromes: |
621 grammar for palindromes: |
601 |
622 |