author | Christian Urban <urbanc@in.tum.de> |
Wed, 24 Oct 2018 13:07:13 +0100 | |
changeset 585 | 6ee22f196884 |
parent 584 | 529460073b24 |
child 586 | 451a95e1bc25 |
permissions | -rw-r--r-- |
584 | 1 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
2 |
\documentclass{article} |
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
3 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
183
diff
changeset
|
4 |
\usepackage{../langs} |
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
183
diff
changeset
|
5 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
6 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
7 |
\begin{document} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
8 |
|
292
7ed2a25dd115
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
9 |
\section*{Handout 6 (Parser Combinators)} |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
10 |
|
584 | 11 |
This handout explains how \emph{parser combinators} work and how they |
12 |
can be implemented in Scala. Their distinguishing feature is that they |
|
13 |
are very easy to implement (admittedly it is only easy in a functional |
|
14 |
programming language). However, parser combinators require that the |
|
15 |
grammar to be parsed is \emph{not} left-recursive and they are |
|
16 |
efficient only when the grammar is unambiguous. It is the |
|
17 |
responsibility of the grammar designer to ensure these two properties. |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
18 |
|
584 | 19 |
Another good point of parser combinators is that they can deal with any kind |
20 |
of input as long as this input of ``sequence-kind'', for example a |
|
21 |
string or a list of tokens. The general idea behind parser combinators |
|
22 |
is to transform the input into sets of pairs, like so |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
23 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
24 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
25 |
$\underbrace{\text{list of tokens}}_{\text{input}}$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
26 |
$\Rightarrow$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
27 |
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
28 |
\end{center} |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
29 |
|
584 | 30 |
\noindent As said, the input can be anything as long as it is a |
31 |
``sequence''. The only property of the input we need is to be able to |
|
32 |
test when it is empty. Obviously we can do this for strings and lists. |
|
33 |
For more lucidity we shall below often use strings as input in order |
|
34 |
to illustrate matters. However, this does not make our previous work |
|
35 |
on lexers obsolete (remember they transform a string into a list of |
|
36 |
tokens). Lexers will still be needed to build a somewhat realistic |
|
37 |
compiler. |
|
38 |
||
39 |
||
40 |
In my Scala code I use the following polymorphic types for parser combinators: |
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
41 |
|
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
42 |
\begin{center} |
584 | 43 |
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
44 |
\end{center} |
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
45 |
|
584 | 46 |
\noindent that is they take as input something of type \texttt{I} and |
47 |
return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs |
|
48 |
to be of ``sequence-kind'' I actually have to often write \texttt{I |
|
49 |
<\% Seq[\_]} for the input type in order to indicate it is a |
|
50 |
subtype of Scala sequences. The first component of the generated pairs |
|
51 |
corresponds to what the parser combinator was able to process from the |
|
52 |
input and the second is the unprocessed part of the input (therefore |
|
53 |
the type of this unprocessed part is the same as the input). As we |
|
54 |
shall see shortly, a parser combinator might return more than one such |
|
55 |
pair; the idea being that there are potentially several ways of how to |
|
56 |
parse the input. As a concrete example, consider the string |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
57 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
58 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
59 |
\tt\Grid{iffoo\VS testbar} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
60 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
61 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
62 |
\noindent We might have a parser combinator which tries to |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
63 |
interpret this string as a keyword (\texttt{if}) or as an |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
64 |
identifier (\texttt{iffoo}). Then the output will be the set |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
65 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
66 |
\begin{center} |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
67 |
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
68 |
\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
69 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
70 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
71 |
\noindent where the first pair means the parser could |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
72 |
recognise \texttt{if} from the input and leaves the rest as |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
73 |
`unprocessed' as the second component of the pair; in the |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
74 |
other case it could recognise \texttt{iffoo} and leaves |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
75 |
\texttt{\VS testbar} as unprocessed. If the parser cannot |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
76 |
recognise anything from the input, then parser combinators just |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
77 |
return the empty set $\{\}$. This will indicate something |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
78 |
``went wrong''\ldots or more precisely, nothing could be |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
79 |
parsed. |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
80 |
|
584 | 81 |
Also important to note is that the type \texttt{T} for the processed |
82 |
part is different from the input type. The reason is that in general |
|
83 |
we are interested in transform our input into something |
|
84 |
``different''\ldots for example into a tree, or if we implement the |
|
85 |
grammar for arithmetic expressions we might be interested in the |
|
86 |
actual integer number the arithmetic expression, say \texttt{1 + 2 * |
|
87 |
3}, stands for. In this way we can use parser combinators to |
|
88 |
implement relativaley easily a calculator. |
|
89 |
||
90 |
The main idea of parser combinators is that we can easily build parser |
|
91 |
combinators out of smaller components following very closely the |
|
92 |
structure of a grammar. In order to implement this in an |
|
93 |
object-oriented programming language, like Scala, we need to specify |
|
94 |
an abstract class for parser combinators. This abstract class states |
|
95 |
that the function \texttt{parse} takes an argument of type \texttt{I} |
|
96 |
and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
97 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
98 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
99 |
\begin{lstlisting}[language=Scala] |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
100 |
abstract class Parser[I, T] { |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
101 |
def parse(ts: I): Set[(T, I)] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
102 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
103 |
def parse_all(ts: I): Set[T] = |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
104 |
for ((head, tail) <- parse(ts); if (tail.isEmpty)) |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
105 |
yield head |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
106 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
107 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
108 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
109 |
|
584 | 110 |
\noindent It is the obligation in each instance (parser combinator) to |
111 |
supply an implementation for \texttt{parse}. From this function we |
|
112 |
can then ``centrally'' derive the function \texttt{parse\_all}, which |
|
113 |
just filters out all pairs whose second component is not empty (that |
|
114 |
is has still some unprocessed part). The reason is that at the end of |
|
115 |
the parsing we are only interested in the results where all the input |
|
116 |
has been consumed and no unprocessed part is left over. |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
117 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
118 |
One of the simplest parser combinators recognises just a |
584 | 119 |
single character, say $c$, from the beginning of strings. Its |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
120 |
behaviour can be described as follows: |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
121 |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
122 |
\begin{itemize} |
584 | 123 |
\item If the head of the input string starts with a $c$, then return |
124 |
the set |
|
125 |
\[\{(c, \textit{tail of}\; s)\}\] |
|
126 |
where \textit{tail of} |
|
127 |
$s$ is the unprocessed part of the input string. |
|
128 |
\item Otherwise return the empty set $\{\}$. |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
129 |
\end{itemize} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
130 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
131 |
\noindent |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
132 |
The input type of this simple parser combinator for characters is |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
133 |
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
134 |
The code in Scala is as follows: |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
135 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
136 |
\begin{center} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
137 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
138 |
case class CharParser(c: Char) extends Parser[String, Char] { |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
139 |
def parse(sb: String) = |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
140 |
if (sb.head == c) Set((c, sb.tail)) else Set() |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
141 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
142 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
143 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
144 |
|
584 | 145 |
\noindent You can see the \texttt{parse} function tests whether the |
146 |
first character of the input string \texttt{sb} is equal to |
|
147 |
\texttt{c}. If yes, then it splits the string into the recognised part |
|
148 |
\texttt{c} and the unprocessed part \texttt{sb.tail}. In case |
|
149 |
\texttt{sb} does not start with \texttt{c} then the parser returns the |
|
150 |
empty set (in Scala \texttt{Set()}). Since this parser recognises |
|
151 |
characters and just returns characters as the processed part, the |
|
152 |
output type of the parser is \texttt{Char}. |
|
153 |
||
154 |
If we want to parse a list of tokens and interested in recognising a |
|
155 |
number token, we could write something like this |
|
156 |
||
157 |
\begin{center} |
|
158 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none] |
|
159 |
case object NumParser extends Parser[List[Token], Int] { |
|
160 |
def parse(ts: List[Token]) = ts match { |
|
161 |
case Num_token(s)::ts => Set((s.toInt, ts)) |
|
162 |
case _ => Set () |
|
163 |
} |
|
164 |
} |
|
165 |
\end{lstlisting} |
|
166 |
\end{center} |
|
167 |
||
168 |
\noindent |
|
169 |
In this parser the input is of type \texttt{List[Token]}. The function |
|
170 |
parse looks at the input \texttt{ts} and checks whether the first |
|
171 |
token is a \texttt{Num\_token}. Let us assume our lexer generated |
|
172 |
these tokens for numbers. But this parser does not just return this |
|
173 |
token (and the rest of the list), like the \texttt{CharParser} above, |
|
174 |
rather extracts the string \texttt{s} from the token and converts it |
|
175 |
into an integer. The hope is that the lexer did its work well and this |
|
176 |
conversion always succeeds. The consequence of this is that the output |
|
177 |
type for this parser is \texttt{Int}. Such a conversion would be |
|
178 |
needed if we want to implement a simple calculator program. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
179 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
180 |
|
584 | 181 |
These simple parsers that just look at the input and do a simple |
182 |
transformation are often called \emph{atomic} parser combinators. |
|
183 |
More interesting are the parser combinators that build larger parsers |
|
184 |
out of smaller component parsers. For example the \emph{alternative |
|
185 |
parser combinator} is as follows: given two parsers, say, $p$ and |
|
186 |
$q$, we apply both parsers to the input (remember parsers are |
|
187 |
functions) and combine the output (remember they are sets of pairs) |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
188 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
189 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
190 |
$p(\text{input}) \cup q(\text{input})$ |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
191 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
192 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
193 |
\noindent In Scala we would implement alternative parser |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
194 |
combinator as follows |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
195 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
196 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
197 |
\begin{lstlisting}[language=Scala, numbers=none] |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
198 |
class AltParser[I, T] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
199 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
200 |
q: => Parser[I, T]) extends Parser[I, T] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
201 |
def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
202 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
203 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
204 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
205 |
|
584 | 206 |
\noindent The types of this parser combinator are again generic (we |
207 |
just have \texttt{I} for the input type, and \texttt{T} for the output |
|
208 |
type). The alternative parser builds a new parser out of two existing |
|
209 |
parsers \texttt{p} and \texttt{q}. Both need to be able to process |
|
210 |
input of type \texttt{I} and return the same output type |
|
211 |
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of |
|
212 |
Scala, namely the \texttt{=>} in front of the types of \texttt{p} |
|
213 |
and \texttt{q}. They will prevent the evaluation of the arguments |
|
214 |
before they are used. This is often called \emph{lazy evaluation} of |
|
215 |
the arguments. We will explain this later.} Therefore the output |
|
216 |
type of this parser is \texttt{T}. The alternative parser should run |
|
217 |
the input with the first parser \texttt{p} (producing a set of pairs) |
|
218 |
and then run the same input with \texttt{q} (producing another set of |
|
219 |
pairs). The result should be then just the union of both sets, which |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
220 |
is the operation \texttt{++} in Scala. |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
221 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
222 |
The alternative parser combinator already allows us to |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
223 |
construct a parser that parses either a character \texttt{a} |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
224 |
or \texttt{b}, as |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
225 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
226 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
227 |
\begin{lstlisting}[language=Scala, numbers=none] |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
228 |
new AltParser(CharParser('a'), CharParser('b')) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
229 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
230 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
231 |
|
584 | 232 |
\noindent Later on we will use again Scala mechanism for introducing some |
233 |
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
234 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
235 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
236 |
\begin{tabular}{rcl} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
237 |
input strings & & output\medskip\\ |
584 | 238 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
239 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
240 |
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
241 |
\end{tabular} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
242 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
243 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
244 |
\noindent We receive in the first two cases a successful |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
245 |
output (that is a non-empty set). In each case, either |
392
2d0a59127694
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
386
diff
changeset
|
246 |
\pcode{a} or \pcode{b} is in the processed part, and |
584 | 247 |
\pcode{cde} in the unprocessed part. Clearly this parser cannot |
248 |
parse anything in the string \pcode{ccde}, therefore the empty |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
249 |
set. |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
250 |
|
584 | 251 |
A bit more interesting is the \emph{sequence parser combinator}. Given |
252 |
two parsers, say again, $p$ and $q$, we want to apply first the input |
|
253 |
to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
|
254 |
parts in the pairs; and then combine the results like |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
255 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
256 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
257 |
\begin{tabular}{lcl} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
258 |
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
259 |
$(\textit{output}_1, u_1) \in p(\text{input}) |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
260 |
\;\wedge\;$\\ |
584 | 261 |
&& $(\textit{output}_2, u_2) \in q(u_1)\}$ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
262 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
263 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
264 |
|
584 | 265 |
\noindent Notice that the $p$ wil first be run on the input, producing |
266 |
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands |
|
267 |
for the unprocessed, or left-over, parts. We want that $q$ runs on all |
|
268 |
these unprocessed parts $u_1$. This again will produce some |
|
269 |
processed part , $p$ and |
|
270 |
$q$, we apply both parsers to the input (remember parsers are |
|
271 |
functions) and combine the output (remember they are sets of pairs) |
|
272 |
||
273 |
\begin{center} |
|
274 |
$p(\text{input}) \cup q(\text{input})$ |
|
275 |
\end{center} |
|
276 |
||
277 |
\noindent In Scala we would implement alternative parser |
|
278 |
combinator as follows |
|
279 |
||
280 |
\begin{center} |
|
281 |
\begin{lstlisting}[language=Scala, numbers=none] |
|
282 |
class AltParser[I, T] |
|
283 |
(p: => Parser[I, T], |
|
284 |
q: => Parser[I, T]) extends Parser[I, T] { |
|
285 |
def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
|
286 |
} |
|
287 |
\end{lstlisting} |
|
288 |
\end{center} |
|
289 |
||
290 |
\noindent The types of this parser combinator are again generic (we |
|
291 |
just have \texttt{I} for the input type, and \texttt{T} for the output |
|
292 |
type). The alternative parser builds a new parser out of two existing |
|
293 |
parsers \texttt{p} and \texttt{q}. Both need to be able to process |
|
294 |
input of type \texttt{I} and return the same output type |
|
295 |
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of |
|
296 |
Scala, namely the \texttt{=>} in front of the types of \texttt{p} |
|
297 |
and \texttt{q}. They will prevent the evaluation of the arguments |
|
298 |
before they are used. This is often called \emph{lazy evaluation} of |
|
299 |
the arguments. We will explain this later.} Therefore the output |
|
300 |
type of this parser is \texttt{T}. The alternative parser should run |
|
301 |
the input with the first parser \texttt{p} (producing a set of pairs) |
|
302 |
and then run the same input with \texttt{q} (producing another set of |
|
303 |
pairs). The result should be then just the union of both sets, which |
|
304 |
is the operation \texttt{++} in Scala. |
|
305 |
||
306 |
The alternative parser combinator already allows us to |
|
307 |
construct a parser that parses either a character \texttt{a} |
|
308 |
or \texttt{b}, as |
|
309 |
||
310 |
\begin{center} |
|
311 |
\begin{lstlisting}[language=Scala, numbers=none] |
|
312 |
new AltParser(CharParser('a'), CharParser('b')) |
|
313 |
\end{lstlisting} |
|
314 |
\end{center} |
|
315 |
||
316 |
\noindent Later on we will use again Scala mechanism for introducing some |
|
317 |
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings |
|
318 |
||
319 |
\begin{center} |
|
320 |
\begin{tabular}{rcl} |
|
321 |
input strings & & output\medskip\\ |
|
322 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\ |
|
323 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\ |
|
324 |
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$ |
|
325 |
\end{tabular} |
|
326 |
\end{center} |
|
327 |
||
328 |
\noindent We receive in the first two cases a successful |
|
329 |
output (that is a non-empty set). In each case, either |
|
330 |
\pcode{a} or \pcode{b} is in the processed part, and |
|
331 |
\pcode{cde} in the unprocessed part. Clearly this parser cannot |
|
332 |
parse anything in the string \pcode{ccde}, therefore the empty |
|
333 |
set. |
|
334 |
||
335 |
A bit more interesting is the \emph{sequence parser combinator}. Given |
|
336 |
two parsers, say again, $p$ and $q$, we want to apply first the input |
|
337 |
to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
|
338 |
parts in the pairs; and then combine the results like |
|
339 |
||
340 |
\begin{center} |
|
341 |
\begin{tabular}{lcl} |
|
342 |
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & |
|
343 |
$(\textit{output}_1, u_1) \in p(\text{input}) |
|
344 |
\;\wedge\;$\\ |
|
345 |
&& $(\textit{output}_2, u_2) \in q(u_1)\}$ |
|
346 |
\end{tabular} |
|
347 |
\end{center} |
|
348 |
||
349 |
\noindent Notice that the $p$ wil first be run on the input, producing |
|
350 |
pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The |
|
351 |
overall result of the sequence parser combinator is pairs of the form |
|
352 |
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
|
353 |
unprocessed parts of both parsers is the unprocessed parts the second |
|
354 |
parser $q$ produces as left-over. The processed parts of both parsers |
|
355 |
is just the pair of the outputs |
|
356 |
$(\textit{output}_1, \textit{output}_2)$. This behavious can be |
|
357 |
implemented in Scala as follows: |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
358 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
359 |
\begin{center} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
360 |
\begin{lstlisting}[language=Scala,numbers=none] |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
361 |
class SeqParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
362 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
363 |
q: => Parser[I, S]) extends Parser[I, (T, S)] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
364 |
def parse(sb: I) = |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
365 |
for ((output1, u1) <- p.parse(sb); |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
366 |
(output2, u2) <- q.parse(u1)) |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
367 |
yield ((output1, output2), u2) |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
368 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
369 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
370 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
371 |
|
584 | 372 |
\noindent This parser takes again as input two parsers, \texttt{p} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
373 |
and \texttt{q}. It implements \texttt{parse} as follows: let |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
374 |
first run the parser \texttt{p} on the input producing a set |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
375 |
of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1} |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
376 |
stands for the unprocessed parts left over by \texttt{p}. Let |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
377 |
\texttt{q} run on these unprocessed parts producing again a |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
378 |
set of pairs. The output of the sequence parser combinator is |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
379 |
then a set containing pairs where the first components are |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
380 |
again pairs, namely what the first parser could parse together |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
381 |
with what the second parser could parse; the second component |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
382 |
is the unprocessed part left over after running the second |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
383 |
parser \texttt{q}. Therefore the input type of the sequence |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
384 |
parser combinator is as usual \texttt{I}, but the output type |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
385 |
is |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
386 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
387 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
388 |
\texttt{Set[((T, S), I)]} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
389 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
390 |
|
584 | 391 |
\noindent |
392 |
If any of the runs of \textit{p} and \textit{q} fail, that is produce |
|
393 |
the empty set, then \texttt{parse} will also produce the empty set. |
|
394 |
Notice that we have now two output types for the sequence parser |
|
395 |
combinator, because in general \textit{p} and \textit{q} might produce |
|
396 |
differetn things (for example first we recognise a number and then a |
|
397 |
string corresponding to an operator). |
|
398 |
||
399 |
||
400 |
We have not yet looked at this in detail, but Scala allows us to |
|
401 |
provide some shorthand notation for the sequence parser combinator. We |
|
402 |
can write for example \pcode{"a" ~ "b"}, which is the parser |
|
403 |
combinator that first recognises the character \texttt{a} from a |
|
404 |
string and then \texttt{b}. Let us look again at three examples of |
|
405 |
how this parser combinator processes strings: |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
406 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
407 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
408 |
\begin{tabular}{rcl} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
409 |
input strings & & output\medskip\\ |
584 | 410 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\ |
411 |
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\ |
|
412 |
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
413 |
\end{tabular} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
414 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
415 |
|
584 | 416 |
\noindent In the first line we have a sucessful parse, because the |
417 |
string started with \texttt{ab}, which is the prefix we are looking |
|
418 |
for. But since the parsing combinator is constructed as sequence of |
|
419 |
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the |
|
420 |
result is a nested pair of the form \texttt{((a, b), cde)}. It is |
|
421 |
\emph{not} a simple pair \texttt{(ab, cde)} as one might errorneously |
|
422 |
expects. The parser returns the ampty set in the other examples, |
|
423 |
because they do not fit with what the parser is supposed to parse. |
|
424 |
||
425 |
||
585 | 426 |
A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} |
584 | 427 |
which parses as first character either an \texttt{a} or \texttt{b} |
428 |
followed by a \texttt{c}. This parser produces the following outputs. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
429 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
430 |
\begin{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
431 |
\begin{tabular}{rcl} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
432 |
input strings & & output\medskip\\ |
585 | 433 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
434 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
|
435 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
436 |
\end{tabular} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
437 |
\end{center} |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
438 |
|
585 | 439 |
\noindent |
440 |
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses |
|
441 |
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces |
|
442 |
the following outputs. |
|
443 |
||
444 |
\begin{center} |
|
445 |
\begin{tabular}{rcl} |
|
446 |
input strings & & output\medskip\\ |
|
447 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\ |
|
448 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\ |
|
449 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$ |
|
450 |
\end{tabular} |
|
451 |
\end{center} |
|
452 |
||
453 |
||
454 |
\noindent The second and third example fail, because something is |
|
455 |
``missing'' in the sequence we are looking for. Also notice how the |
|
456 |
results nest with sequences: the parsed part is a nested pair of the |
|
457 |
form \pcode{((a, b), c)}. Two more examples: first consider the parser |
|
458 |
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}: |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
459 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
460 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
461 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
462 |
input string & & output\medskip\\ |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
463 |
\texttt{\Grid{aaaa}} & $\rightarrow$ & |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
464 |
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
465 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
466 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
467 |
|
585 | 468 |
\noindent Notice how the results nests deeper and deeper as pairs (the |
469 |
last \pcode{a} is in the unprocessed part). To consume everything of |
|
470 |
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~ |
|
471 |
"a"}. Then the output is as follows: |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
472 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
473 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
474 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
475 |
input string & & output\medskip\\ |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
476 |
\texttt{\Grid{aaaa}} & $\rightarrow$ & |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
477 |
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\ |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
478 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
479 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
480 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
481 |
\noindent This is an instance where the parser consumed |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
482 |
completely the input, meaning the unprocessed part is just the |
585 | 483 |
empty string. So if we called \pcode{parse_all} instead of \pcode{parse} |
484 |
we would get back the result |
|
485 |
||
486 |
\[ |
|
487 |
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\} |
|
488 |
\] |
|
489 |
||
490 |
\noindent where the unprocessed (empty) parts have been stripped away |
|
491 |
from the pairs; everything where the second part was not empty has |
|
492 |
been thrown away as ultimately-unsuccessful-parses. |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
493 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
494 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
495 |
Note carefully that constructing a parser such \pcode{'a' || |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
496 |
('a' ~ 'b')} will result in a typing error. The first |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
497 |
parser has as output type a single character (recall the type |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
498 |
of \texttt{CharParser}), but the second parser produces a pair |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
499 |
of characters as output. The alternative parser is however |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
500 |
required to have both component parsers to have the same type. |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
501 |
We will see later how we can build this parser without the |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
502 |
typing error. |
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
503 |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
504 |
The next parser combinator does not actually combine smaller |
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
505 |
parsers, but applies a function to the result of a parser. |
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
506 |
It is implemented in Scala as follows |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
507 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
508 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
509 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
510 |
class FunParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
511 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
512 |
f: T => S) extends Parser[I, S] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
513 |
def parse(sb: I) = |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
514 |
for ((head, tail) <- p.parse(sb)) yield (f(head), tail) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
515 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
516 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
517 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
518 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
519 |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
520 |
\noindent This parser combinator takes a parser \texttt{p} |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
521 |
with output type \texttt{T} as one argument as well as a |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
522 |
function \texttt{f} with type \texttt{T => S}. The parser |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
523 |
\texttt{p} produces sets of type \texttt{(T, I)}. The |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
524 |
\texttt{FunParser} combinator then applies the function |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
525 |
\texttt{f} to all the parser outputs. Since this function is of |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
526 |
type \texttt{T => S}, we obtain a parser with output type |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
527 |
\texttt{S}. Again Scala lets us introduce some shorthand |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
528 |
notation for this parser combinator. Therefore we will write |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
529 |
\texttt{p ==> f} for it. |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
530 |
|
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
531 |
\subsubsection*{How to build parsers using parser combinators?} |
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
532 |
|
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
533 |
\subsubsection*{Implementing an Interpreter} |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
534 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
535 |
%\bigskip |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
536 |
%takes advantage of the full generality---have a look |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
537 |
%what it produces if we call it with the string \texttt{abc} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
538 |
% |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
539 |
%\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
540 |
%\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
541 |
%input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
542 |
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
543 |
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
544 |
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
545 |
%\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
546 |
%\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
547 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
548 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
549 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
550 |
\end{document} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
551 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
552 |
%%% Local Variables: |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
553 |
%%% mode: latex |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
554 |
%%% TeX-master: t |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
555 |
%%% End: |