5 |
5 |
6 \begin{document} |
6 \begin{document} |
7 |
7 |
8 \section*{Handout 6 (Parser Combinators)} |
8 \section*{Handout 6 (Parser Combinators)} |
9 |
9 |
10 While regular expressions are very useful for lexing and for recognising |
10 In what follows we explain \emph{parser combinators}, because |
11 many patterns in strings (like email addresses), they have their limitations. For |
11 they are very easy to implement. However, they only work when |
12 example there is no regular expression that can recognise the language |
12 the grammar to be parsed is \emph{not} left-recursive and they |
13 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised |
13 are efficient only when the grammar is unambiguous. It is the |
14 expressions. In languages like Lisp, which use parentheses rather |
14 responsibility of the grammar designer to ensure these two |
15 extensively, it might be of interest whether the following two expressions |
15 properties. |
16 are well-parenthesised (the left one is, the right one is not): |
16 |
17 |
17 Parser combinators can deal with any kind of input as long as |
18 \begin{center} |
18 this input is a kind of sequence, for example a string or a |
19 $(((()()))())$ \hspace{10mm} $(((()()))()))$ |
19 list of tokens. If the input are lists of tokens, then parser |
20 \end{center} |
20 combinators transform them into sets of pairs, like so |
21 |
21 |
22 \noindent |
22 \begin{center} |
23 Not being able to solve such recognition problems is a serious limitation. |
23 $\underbrace{\text{list of tokens}}_{\text{input}}$ |
24 In order to solve such recognition problems, we need more powerful |
24 $\Rightarrow$ |
25 techniques than regular expressions. We will in particular look at \emph{context-free |
25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
26 languages}. They include the regular languages as the picture below shows: |
26 \end{center} |
27 |
27 |
28 |
28 \noindent In Scala parser combinators will therefore be |
29 \begin{center} |
29 functions of type |
30 \begin{tikzpicture} |
|
31 [rect/.style={draw=black!50, |
|
32 top color=white,bottom color=black!20, |
|
33 rectangle, very thick, rounded corners}] |
|
34 |
|
35 \draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages}; |
|
36 \draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages}; |
|
37 \draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages}; |
|
38 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages}; |
|
39 \draw (0,-1.05) node [rect] {\small regular languages}; |
|
40 \end{tikzpicture} |
|
41 \end{center} |
|
42 |
|
43 \noindent |
|
44 Context-free languages play an important role in `day-to-day' text processing and in |
|
45 programming languages. Context-free languages are usually specified by grammars. |
|
46 For example a grammar for well-parenthesised expressions is |
|
47 |
|
48 \begin{center} |
|
49 $P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$ |
|
50 \end{center} |
|
51 |
|
52 \noindent |
|
53 In general grammars consist of finitely many rules built up from \emph{terminal symbols} |
|
54 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules |
|
55 have the shape |
|
56 |
|
57 \begin{center} |
|
58 $NT \;\;\rightarrow\;\; \textit{rhs}$ |
|
59 \end{center} |
|
60 |
|
61 \noindent |
|
62 where on the left-hand side is a single non-terminal and on the right a string consisting |
|
63 of both terminals and non-terminals including the $\epsilon$-symbol for indicating the |
|
64 empty string. We use the convention to separate components on |
|
65 the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions. |
|
66 We also use the convention to use $|$ as a shorthand notation for several rules. For example |
|
67 |
|
68 \begin{center} |
|
69 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$ |
|
70 \end{center} |
|
71 |
|
72 \noindent |
|
73 means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$. |
|
74 If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate |
|
75 what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions |
|
76 can be given as follows |
|
77 |
|
78 \begin{center} |
|
79 \begin{tabular}{lcl} |
|
80 $E$ & $\rightarrow$ & $N$ \\ |
|
81 $E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\ |
|
82 $E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\ |
|
83 $E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\ |
|
84 $E$ & $\rightarrow$ & $( \cdot E \cdot )$\\ |
|
85 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ |
|
86 \end{tabular} |
|
87 \end{center} |
|
88 |
|
89 \noindent |
|
90 where $E$ is the starting symbol. A \emph{derivation} for a grammar |
|
91 starts with the staring symbol of the grammar and in each step replaces one |
|
92 non-terminal by a right-hand side of a rule. A derivation ends with a string |
|
93 in which only terminal symbols are left. For example a derivation for the |
|
94 string $(1 + 2) + 3$ is as follows: |
|
95 |
|
96 \begin{center} |
|
97 \begin{tabular}{lll} |
|
98 $E$ & $\rightarrow$ & $E+E$\\ |
|
99 & $\rightarrow$ & $(E)+E$\\ |
|
100 & $\rightarrow$ & $(E+E)+E$\\ |
|
101 & $\rightarrow$ & $(E+E)+N$\\ |
|
102 & $\rightarrow$ & $(E+E)+3$\\ |
|
103 & $\rightarrow$ & $(N+E)+3$\\ |
|
104 & $\rightarrow^+$ & $(1+2)+3$\\ |
|
105 \end{tabular} |
|
106 \end{center} |
|
107 |
|
108 \noindent |
|
109 The \emph{language} of a context-free grammar $G$ with start symbol $S$ |
|
110 is defined as the set of strings derivable by a derivation, that is |
|
111 |
|
112 \begin{center} |
|
113 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ |
|
114 \end{center} |
|
115 |
|
116 \noindent |
|
117 A \emph{parse-tree} encodes how a string is derived with the starting symbol on |
|
118 top and each non-terminal containing a subtree for how it is replaced in a derivation. |
|
119 The parse tree for the string $(1 + 23)+4$ is as follows: |
|
120 |
|
121 \begin{center} |
|
122 \begin{tikzpicture}[level distance=8mm, black] |
|
123 \node {$E$} |
|
124 child {node {$E$} |
|
125 child {node {$($}} |
|
126 child {node {$E$} |
|
127 child {node {$E$} child {node {$N$} child {node {$1$}}}} |
|
128 child {node {$+$}} |
|
129 child {node {$E$} |
|
130 child {node {$N$} child {node {$2$}}} |
|
131 child {node {$N$} child {node {$3$}}} |
|
132 } |
|
133 } |
|
134 child {node {$)$}} |
|
135 } |
|
136 child {node {$+$}} |
|
137 child {node {$E$} |
|
138 child {node {$N$} child {node {$4$}}} |
|
139 }; |
|
140 \end{tikzpicture} |
|
141 \end{center} |
|
142 |
|
143 \noindent |
|
144 We are often interested in these parse-trees since they encode the structure of |
|
145 how a string is derived by a grammar. Before we come to the problem of constructing |
|
146 such parse-trees, we need to consider the following two properties of grammars. |
|
147 A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say |
|
148 $NT$ which leads to a string which again starts with $NT$. This means a derivation of the |
|
149 form. |
|
150 |
|
151 \begin{center} |
|
152 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$ |
|
153 \end{center} |
|
154 |
|
155 \noindent |
|
156 It can be easily seen that the grammar above for arithmetic expressions is left-recursive: |
|
157 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ |
|
158 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive |
|
159 grammars. Fortunately every left-recursive grammar can be transformed into one that is |
|
160 not left-recursive, although this transformation might make the grammar less human-readable. |
|
161 For example if we want to give a non-left-recursive grammar for numbers we might |
|
162 specify |
|
163 |
|
164 \begin{center} |
|
165 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$ |
|
166 \end{center} |
|
167 |
|
168 \noindent |
|
169 Using this grammar we can still derive every number string, but we will never be able |
|
170 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$. |
|
171 |
|
172 The other property we have to watch out for is when a grammar is |
|
173 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees |
|
174 for one string. Again the grammar for arithmetic expressions shown above is ambiguous. |
|
175 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in |
|
176 general. For example there are two parse |
|
177 trees for the string $1 + 2 + 3$, namely |
|
178 |
|
179 \begin{center} |
|
180 \begin{tabular}{c@{\hspace{10mm}}c} |
|
181 \begin{tikzpicture}[level distance=8mm, black] |
|
182 \node {$E$} |
|
183 child {node {$E$} child {node {$N$} child {node {$1$}}}} |
|
184 child {node {$+$}} |
|
185 child {node {$E$} |
|
186 child {node {$E$} child {node {$N$} child {node {$2$}}}} |
|
187 child {node {$+$}} |
|
188 child {node {$E$} child {node {$N$} child {node {$3$}}}} |
|
189 } |
|
190 ; |
|
191 \end{tikzpicture} |
|
192 & |
|
193 \begin{tikzpicture}[level distance=8mm, black] |
|
194 \node {$E$} |
|
195 child {node {$E$} |
|
196 child {node {$E$} child {node {$N$} child {node {$1$}}}} |
|
197 child {node {$+$}} |
|
198 child {node {$E$} child {node {$N$} child {node {$2$}}}} |
|
199 } |
|
200 child {node {$+$}} |
|
201 child {node {$E$} child {node {$N$} child {node {$3$}}}} |
|
202 ; |
|
203 \end{tikzpicture} |
|
204 \end{tabular} |
|
205 \end{center} |
|
206 |
|
207 \noindent |
|
208 In particular in programming languages we will try to avoid ambiguous |
|
209 grammars because two different parse-trees for a string mean a program can |
|
210 be interpreted in two different ways. In such cases we have to somehow make sure |
|
211 the two different ways do not matter, or disambiguate the grammar in |
|
212 some other way (for example making the $+$ left-associative). Unfortunately already |
|
213 the problem of deciding whether a grammar |
|
214 is ambiguous or not is in general undecidable. |
|
215 |
|
216 Let us now turn to the problem of generating a parse-tree for a grammar and string. |
|
217 In what follows we explain \emph{parser combinators}, because they are easy |
|
218 to implement and closely resemble grammar rules. Imagine that a grammar |
|
219 describes the strings of natural numbers, such as the grammar $N$ shown above. |
|
220 For all such strings we want to generate the parse-trees or later on we actually |
|
221 want to extract the meaning of these strings, that is the concrete integers ``behind'' |
|
222 these strings. In Scala the parser combinators will be functions of type |
|
223 |
30 |
224 \begin{center} |
31 \begin{center} |
225 \texttt{I $\Rightarrow$ Set[(T, I)]} |
32 \texttt{I $\Rightarrow$ Set[(T, I)]} |
226 \end{center} |
33 \end{center} |
227 |
34 |
228 \noindent |
35 \noindent that is they take as input something of type |
229 that is they take as input something of type \texttt{I}, typically a list of tokens or a string, |
36 \texttt{I} and return a set of pairs. The first component of |
230 and return a set of pairs. The first component of these pairs corresponds to what the |
37 these pairs corresponds to what the parser combinator was able |
231 parser combinator was able to process from the input and the second is the unprocessed |
38 to process from the input and the second is the unprocessed |
232 part of the input. As we shall see shortly, a parser combinator might return more than one such pair, |
39 part of the input. As we shall see shortly, a parser |
233 with the idea that there are potentially several ways how to interpret the input. As a concrete |
40 combinator might return more than one such pair, with the idea |
234 example, consider the case where the input is of type string, say the string |
41 that there are potentially several ways how to interpret the |
|
42 input. To simplify matters we will first look at the case |
|
43 where the input to the parser combinators is just strings. As |
|
44 a concrete example, consider the case where the input is the |
|
45 string |
235 |
46 |
236 \begin{center} |
47 \begin{center} |
237 \tt\Grid{iffoo\VS testbar} |
48 \tt\Grid{iffoo\VS testbar} |
238 \end{center} |
49 \end{center} |
239 |
50 |
240 \noindent |
51 \noindent We might have a parser combinator which tries to |
241 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or |
52 interpret this string as a keyword (\texttt{if}) or an |
242 an identifier (\texttt{iffoo}). Then the output will be the set |
53 identifier (\texttt{iffoo}). Then the output will be the set |
243 |
54 |
244 \begin{center} |
55 \begin{center} |
245 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
56 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
246 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
57 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
247 \end{center} |
58 \end{center} |
248 |
59 |
249 \noindent |
60 \noindent where the first pair means the parser could |
250 where the first pair means the parser could recognise \texttt{if} from the input and leaves |
61 recognise \texttt{if} from the input and leaves the rest as |
251 the rest as `unprocessed' as the second component of the pair; in the other case |
62 `unprocessed' as the second component of the pair; in the |
252 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser |
63 other case it could recognise \texttt{iffoo} and leaves |
253 cannot recognise anything from the input then parser combinators just return the empty |
64 \texttt{\VS testbar} as unprocessed. If the parser cannot |
254 set $\varnothing$. This will indicate something ``went wrong''. |
65 recognise anything from the input then parser combinators just |
255 |
66 return the empty set $\{\}$. This will indicate something |
256 The main attraction is that we can easily build parser combinators out of smaller components |
67 ``went wrong''\ldots or more precisely, nothing could be |
257 following very closely the structure of a grammar. In order to implement this in an object |
68 parsed. |
258 oriented programming language, like Scala, we need to specify an abstract class for parser |
69 |
259 combinators. This abstract class requires the implementation of the function |
70 The main attraction of parser combinators is that we can |
260 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type |
71 easily build parser combinators out of smaller components |
261 \mbox{\texttt{Set[(T, I)]}}. |
72 following very closely the structure of a grammar. In order to |
262 |
73 implement this in an object-oriented programming language, |
263 \begin{center} |
74 like Scala, we need to specify an abstract class for parser |
264 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
75 combinators. This abstract class requires the implementation |
|
76 of the function \texttt{parse} taking an argument of type |
|
77 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, |
|
78 I)]}}. |
|
79 |
|
80 \begin{center} |
|
81 \begin{lstlisting}[language=Scala] |
265 abstract class Parser[I, T] { |
82 abstract class Parser[I, T] { |
266 def parse(ts: I): Set[(T, I)] |
83 def parse(ts: I): Set[(T, I)] |
267 |
84 |
268 def parse_all(ts: I): Set[T] = |
85 def parse_all(ts: I): Set[T] = |
269 for ((head, tail) <- parse(ts); if (tail.isEmpty)) |
86 for ((head, tail) <- parse(ts); if (tail.isEmpty)) |
270 yield head |
87 yield head |
271 } |
88 } |
272 \end{lstlisting} |
89 \end{lstlisting} |
273 \end{center} |
90 \end{center} |
274 |
91 |
275 \noindent |
92 \noindent From the function \texttt{parse} we can then |
276 From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all}, |
93 ``centrally'' derive the function \texttt{parse\_all}, which |
277 which just filters out all pairs whose second component is not empty (that is has still some |
94 just filters out all pairs whose second component is not empty |
278 unprocessed part). The reason is that at the end of parsing we are only interested in the |
95 (that is has still some unprocessed part). The reason is that |
279 results where all the input has been consumed and no unprocessed part is left. |
96 at the end of parsing we are only interested in the results |
280 |
97 where all the input has been consumed and no unprocessed part |
281 One of the simplest parser combinators recognises just a character, say $c$, |
98 is left. |
282 from the beginning of strings. Its behaviour is as follows: |
99 |
|
100 One of the simplest parser combinators recognises just a |
|
101 character, say $c$, from the beginning of strings. Its |
|
102 behaviour can be described as follows: |
283 |
103 |
284 \begin{itemize} |
104 \begin{itemize} |
285 \item if the head of the input string starts with a $c$, it returns |
105 \item if the head of the input string starts with a $c$, it returns |
286 the set $\{(c, \textit{tail of}\; s)\}$ |
106 the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} |
287 \item otherwise it returns the empty set $\varnothing$ |
107 $s$ is the unprocessed part of the input string |
|
108 \item otherwise it returns the empty set $\{\}$ |
288 \end{itemize} |
109 \end{itemize} |
289 |
110 |
290 \noindent |
111 \noindent |
291 The input type of this simple parser combinator for characters is |
112 The input type of this simple parser combinator for characters is |
292 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
113 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
299 if (sb.head == c) Set((c, sb.tail)) else Set() |
120 if (sb.head == c) Set((c, sb.tail)) else Set() |
300 } |
121 } |
301 \end{lstlisting} |
122 \end{lstlisting} |
302 \end{center} |
123 \end{center} |
303 |
124 |
|
125 \noindent The \texttt{parse} function tests whether the first |
|
126 character of the input string \texttt{sb} is equal to |
|
127 \texttt{c}. If yes, then it splits the string into the |
|
128 recognised part \texttt{c} and the unprocessed part |
|
129 \texttt{sb.tail}. In case \texttt{sb} does not start with |
|
130 \texttt{c} then the parser returns the empty set (in Scala |
|
131 \texttt{Set()}). |
|
132 |
|
133 |
|
134 More interesting are the parser combinators that build larger |
|
135 parsers out of smaller component parsers. For example the |
|
136 alternative parser combinator is as follows: given two |
|
137 parsers, say, $p$ and $q$, then we apply both parsers to the |
|
138 input (remember parsers are functions) and combine the input |
|
139 |
|
140 \begin{center} |
|
141 $p(\text{input}) \cup q(\text{input})$ |
|
142 \end{center} |
|
143 |
304 \noindent |
144 \noindent |
305 The \texttt{parse} function tests whether the first character of the |
145 In Scala we would implement alternative parsers as follows |
306 input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the |
146 |
307 string into the recognised part \texttt{c} and the unprocessed part |
147 \begin{center} |
308 \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then |
148 \begin{lstlisting}[language=Scala, numbers=none] |
309 the parser returns the empty set (in Scala \texttt{Set()}). |
|
310 |
|
311 More interesting are the parser combinators that build larger parsers |
|
312 out of smaller component parsers. For example the alternative |
|
313 parser combinator is as follows. |
|
314 |
|
315 \begin{center} |
|
316 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
317 class AltParser[I, T] |
149 class AltParser[I, T] |
318 (p: => Parser[I, T], |
150 (p: => Parser[I, T], |
319 q: => Parser[I, T]) extends Parser[I, T] { |
151 q: => Parser[I, T]) extends Parser[I, T] { |
320 def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
152 def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
321 } |
153 } |
322 \end{lstlisting} |
154 \end{lstlisting} |
323 \end{center} |
155 \end{center} |
324 |
156 |
325 \noindent |
157 \noindent The types of this parser combinator are polymorphic |
326 The types of this parser combinator are polymorphic (we just have \texttt{I} |
158 (we just have \texttt{I} for the input type, and \texttt{T} |
327 for the input type, and \texttt{T} for the output type). The alternative parser |
159 for the output type). The alternative parser builds a new |
328 builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}. |
160 parser out of two existing parser combinator \texttt{p} and |
329 Both need to be able to process input of type \texttt{I} and return the same |
161 \texttt{q}. Both need to be able to process input of type |
330 output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the |
162 \texttt{I} and return the same output type \texttt{Set[(T, |
331 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the |
163 I)]}. (There is an interesting detail of Scala, namely the |
332 evaluation of the arguments before they are used. This is often called |
164 \texttt{=>} in front of the types of \texttt{p} and |
333 \emph{lazy evaluation} of the arguments.) The alternative parser should run |
165 \texttt{q}. They will prevent the evaluation of the arguments |
334 the input with the first parser \texttt{p} (producing a set of outputs) and then |
166 before they are used. This is often called \emph{lazy |
335 run the same input with \texttt{q}. The result should be then just the union |
167 evaluation} of the arguments.) The alternative parser should |
336 of both sets, which is the operation \texttt{++} in Scala. |
168 run the input with the first parser \texttt{p} (producing a |
337 |
169 set of outputs) and then run the same input with \texttt{q}. |
338 This parser combinator already allows us to construct a parser that either |
170 The result should be then just the union of both sets, which |
339 a character \texttt{a} or \texttt{b}, as |
171 is the operation \texttt{++} in Scala. |
340 |
172 |
341 \begin{center} |
173 This parser combinator already allows us to construct a parser |
342 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
174 that either a character \texttt{a} or \texttt{b}, as |
|
175 |
|
176 \begin{center} |
|
177 \begin{lstlisting}[language=Scala, numbers=none] |
343 new AltParser(CharParser('a'), CharParser('b')) |
178 new AltParser(CharParser('a'), CharParser('b')) |
344 \end{lstlisting} |
179 \end{lstlisting} |
345 \end{center} |
180 \end{center} |
346 |
181 |
347 \noindent |
182 \noindent Scala allows us to introduce some more readable |
348 Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. |
183 shorthand notation for this, like \texttt{'a' || 'b'}. We can |
349 We can call this parser combinator with the strings |
184 call this parser combinator with the strings |
350 |
185 |
351 \begin{center} |
186 \begin{center} |
352 \begin{tabular}{rcl} |
187 \begin{tabular}{rcl} |
353 input string & & output\medskip\\ |
188 input strings & & output\medskip\\ |
354 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ |
189 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ |
355 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ |
190 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ |
356 \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$ |
191 \texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$ |
|
192 \end{tabular} |
|
193 \end{center} |
|
194 |
|
195 \noindent We receive in the first two cases a successful |
|
196 output (that is a non-empty set). In each case, either |
|
197 \pcode{a} or \pcode{b} are the processed parts, and \pcode{c} |
|
198 is the unprocessed part. Clearly this parser cannot parse |
|
199 anything with the string \pcode{cc}. |
|
200 |
|
201 A bit more interesting is the \emph{sequence parser |
|
202 combinator}. Given two parsers, say, $p$ and $q$, |
|
203 apply first the input to $p$ producing a set of pairs; |
|
204 then apply $q$ to the unparsed parts; then combine the |
|
205 results like |
|
206 |
|
207 \begin{center} |
|
208 \begin{tabular}{lcl} |
|
209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
|
210 $(\textit{output}_1, u_1) \in p(\text{input}) |
|
211 \;\wedge\;$\\ |
|
212 && $(\textit{output}_2, u_2) \in q(u_1)\}$ |
357 \end{tabular} |
213 \end{tabular} |
358 \end{center} |
214 \end{center} |
359 |
215 |
360 \noindent |
216 \noindent |
361 We receive in the first two cases a successful output (that is a non-empty set). |
217 This can be implemented in Scala as follows: |
362 |
218 |
363 A bit more interesting is the \emph{sequence parser combinator} implemented in |
219 \begin{center} |
364 Scala as follows: |
220 \begin{lstlisting}[language=Scala,numbers=none] |
365 |
|
366 \begin{center} |
|
367 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
368 class SeqParser[I, T, S] |
221 class SeqParser[I, T, S] |
369 (p: => Parser[I, T], |
222 (p: => Parser[I, T], |
370 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
223 q: => Parser[I, S]) extends Parser[I, (T, S)] { |
371 def parse(sb: I) = |
224 def parse(sb: I) = |
372 for ((head1, tail1) <- p.parse(sb); |
225 for ((head1, tail1) <- p.parse(sb); |
374 yield ((head1, head2), tail2) |
227 yield ((head1, head2), tail2) |
375 } |
228 } |
376 \end{lstlisting} |
229 \end{lstlisting} |
377 \end{center} |
230 \end{center} |
378 |
231 |
379 \noindent |
232 \noindent This parser takes as input two parsers, \texttt{p} |
380 This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} |
233 and \texttt{q}. It implements \texttt{parse} as follows: let |
381 as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}). |
234 first run the parser \texttt{p} on the input producing a set |
382 The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. |
235 of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1} |
383 Let \texttt{q} run on these unprocessed parts |
236 stands for the unprocessed parts left over by \texttt{p}. Let |
384 producing again a set of pairs. The output of the sequence parser combinator is then a set |
237 \texttt{q} run on these unprocessed parts producing again a |
385 containing pairs where the first components are again pairs, namely what the first parser could parse |
238 set of pairs. The output of the sequence parser combinator is |
386 together with what the second parser could parse; the second component is the unprocessed |
239 then a set containing pairs where the first components are |
387 part left over after running the second parser \texttt{q}. Therefore the input type of |
240 again pairs, namely what the first parser could parse together |
388 the sequence parser combinator is as usual \texttt{I}, but the output type is |
241 with what the second parser could parse; the second component |
|
242 is the unprocessed part left over after running the second |
|
243 parser \texttt{q}. Therefore the input type of the sequence |
|
244 parser combinator is as usual \texttt{I}, but the output type |
|
245 is |
389 |
246 |
390 \begin{center} |
247 \begin{center} |
391 \texttt{Set[((T, S), I)]} |
248 \texttt{Set[((T, S), I)]} |
392 \end{center} |
249 \end{center} |
393 |
250 |
394 Scala allows us to provide some |
251 Scala allows us to provide some shorthand notation for the |
395 shorthand notation for the sequence parser combinator. So we can write for |
252 sequence parser combinator. So we can write for example |
396 example \texttt{'a' $\sim$ 'b'}, which is the |
253 \texttt{'a' $\sim$ 'b'}, which is the parser combinator that |
397 parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}. |
254 first consumes the character \texttt{a} from a string and then |
398 Calling this parser combinator with the strings |
255 \texttt{b}. Three examples of this parser combinator are as |
399 |
256 follows: |
400 \begin{center} |
257 |
401 \begin{tabular}{rcl} |
258 \begin{center} |
402 input string & & output\medskip\\ |
259 \begin{tabular}{rcl} |
|
260 input strings & & output\medskip\\ |
403 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
261 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
404 \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\ |
262 \texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\ |
405 \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$ |
263 \texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$ |
406 \end{tabular} |
264 \end{tabular} |
407 \end{center} |
265 \end{center} |
408 |
266 |
409 \noindent |
267 \noindent A slightly more complicated parser is \texttt{('a' |
410 A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character either |
268 || 'b') $\sim$ 'b'} which parses as first character either an |
411 an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results. |
269 \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser |
412 |
270 produces the following results. |
413 \begin{center} |
271 |
414 \begin{tabular}{rcl} |
272 \begin{center} |
415 input string & & output\medskip\\ |
273 \begin{tabular}{rcl} |
|
274 input strings & & output\medskip\\ |
416 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
275 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
417 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
276 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
418 \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
277 \texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$ |
419 \end{tabular} |
278 \end{tabular} |
420 \end{center} |
279 \end{center} |
421 |
280 |
422 Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error. |
281 \noindent Two more examples: first consider the parser \pcode{('a' ~ |
423 The first parser has as output type a single character (recall the type of \texttt{CharParser}), |
282 'a') ~ 'a'} and the input \pcode{aaaa}: |
424 but the second parser produces a pair of characters as output. The alternative parser is however |
283 |
425 required to have both component parsers to have the same type. We will see later how we can |
284 \begin{center} |
426 build this parser without the typing error. |
285 \begin{tabular}{rcl} |
427 |
286 input string & & output\medskip\\ |
428 The next parser combinator does not actually combine smaller parsers, but applies |
287 \texttt{\Grid{aaaa}} & $\rightarrow$ & |
429 a function to the result of the parser. It is implemented in Scala as follows |
288 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\ |
|
289 \end{tabular} |
|
290 \end{center} |
|
291 |
|
292 \noindent Notice how the results nest deeper as pairs (the |
|
293 last \pcode{a} is in the unprocessed part). To consume |
|
294 everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~ |
|
295 'a'}. Then the output is as follows: |
|
296 |
|
297 \begin{center} |
|
298 \begin{tabular}{rcl} |
|
299 input string & & output\medskip\\ |
|
300 \texttt{\Grid{aaaa}} & $\rightarrow$ & |
|
301 $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\ |
|
302 \end{tabular} |
|
303 \end{center} |
|
304 |
|
305 \noindent This is an instance where the parser consumed |
|
306 completely the input, meaning the unprocessed part is just the |
|
307 empty string. |
|
308 |
|
309 |
|
310 Note carefully that constructing a parser such \texttt{'a' || |
|
311 ('a' $\sim$ 'b')} will result in a tying error. The first |
|
312 parser has as output type a single character (recall the type |
|
313 of \texttt{CharParser}), but the second parser produces a pair |
|
314 of characters as output. The alternative parser is however |
|
315 required to have both component parsers to have the same type. |
|
316 We will see later how we can build this parser without the |
|
317 typing error. |
|
318 |
|
319 The next parser combinator does not actually combine smaller |
|
320 parsers, but applies a function to the result of the parser. |
|
321 It is implemented in Scala as follows |
430 |
322 |
431 \begin{center} |
323 \begin{center} |
432 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
324 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
433 class FunParser[I, T, S] |
325 class FunParser[I, T, S] |
434 (p: => Parser[I, T], |
326 (p: => Parser[I, T], |