| author | Christian Urban <christian.urban@kcl.ac.uk> |
| Wed, 08 Oct 2025 10:42:10 +0100 | |
| changeset 1003 | bae8c3eb51c7 |
| parent 941 | 66adcae6c762 |
| permissions | -rw-r--r-- |
| 584 | 1 |
|
| 595 | 2 |
% !TEX program = xelatex |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
3 |
\documentclass{article}
|
|
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
4 |
\usepackage{../style}
|
|
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
183
diff
changeset
|
5 |
\usepackage{../langs}
|
| 588 | 6 |
\usepackage{../grammar}
|
| 799 | 7 |
\usepackage{../graphics}
|
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
8 |
|
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
9 |
\begin{document}
|
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
10 |
|
|
292
7ed2a25dd115
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
217
diff
changeset
|
11 |
\section*{Handout 6 (Parser Combinators)}
|
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
12 |
|
| 584 | 13 |
This handout explains how \emph{parser combinators} work and how they
|
| 587 | 14 |
can be implemented in Scala. Their most distinguishing feature is that |
15 |
they are very easy to implement (admittedly it is only easy in a |
|
16 |
functional programming language). Another good point of parser |
|
17 |
combinators is that they can deal with any kind of input as long as |
|
18 |
this input is of ``sequence-kind'', for example a string or a list of |
|
19 |
tokens. The only two properties of the input we need is to be able to |
|
20 |
test when it is empty and ``sequentially'' take it apart. Strings and |
|
21 |
lists fit this bill. However, parser combinators also have their |
|
22 |
drawbacks. For example they require that the grammar to be parsed is |
|
23 |
\emph{not} left-recursive and they are efficient only when the grammar
|
|
24 |
is unambiguous. It is the responsibility of the grammar designer to |
|
| 591 | 25 |
ensure these two properties hold. |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
26 |
|
| 587 | 27 |
The general idea behind parser combinators is to transform the input |
28 |
into sets of pairs, like so |
|
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
29 |
|
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
30 |
\begin{center}
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
31 |
$\underbrace{\text{list of tokens}}_{\text{input}}$
|
| 594 | 32 |
$\quad\Rightarrow\quad$ |
| 591 | 33 |
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
34 |
\end{center}
|
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
35 |
|
| 587 | 36 |
\noindent |
| 590 | 37 |
Given the extended effort we have spent implementing a lexer in order |
| 591 | 38 |
to generate lists of tokens, it might be surprising that in what |
39 |
follows we shall often use strings as input, rather than lists of |
|
| 937 | 40 |
tokens. This is for making the explanation more lucid and ensure the |
41 |
examples are simple. It does not make our previous work on lexers obsolete |
|
| 591 | 42 |
(remember they transform a string into a list of tokens). Lexers will |
| 937 | 43 |
still be needed for building a somewhat realistic compiler. See also |
44 |
a question in the homework regarding this issue. |
|
| 584 | 45 |
|
| 590 | 46 |
As mentioned above, parser combinators are relatively agnostic about what |
| 587 | 47 |
kind of input they process. In my Scala code I use the following |
48 |
polymorphic types for parser combinators: |
|
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
49 |
|
|
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
50 |
\begin{center}
|
| 584 | 51 |
input:\;\; \texttt{I} \qquad output:\;\; \texttt{T}
|
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
52 |
\end{center}
|
|
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
53 |
|
| 587 | 54 |
\noindent That is they take as input something of type \texttt{I} and
|
| 590 | 55 |
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
|
56 |
needs to be of ``sequence-kind'', I actually have to often write |
|
| 937 | 57 |
\code{(using is: I => Seq[_])} for the input type. This ensures the
|
58 |
input is a subtype of Scala sequences.\footnote{This is a new feature
|
|
59 |
in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties |
|
60 |
with running my code.} The first component of the generated pairs |
|
61 |
corresponds to what the parser combinator was able to parse from the |
|
62 |
input and the second is the unprocessed, or leftover, part of the |
|
63 |
input (therefore the type of this unprocessed part is the same as the |
|
64 |
input). A parser combinator might return more than one such pair; the |
|
65 |
idea is that there are potentially several ways of how to parse the |
|
66 |
input. As a concrete example, consider the string |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
67 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
68 |
\begin{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
69 |
\tt\Grid{iffoo\VS testbar}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
70 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
71 |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
72 |
\noindent We might have a parser combinator which tries to |
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
73 |
interpret this string as a keyword (\texttt{if}) or as an
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
74 |
identifier (\texttt{iffoo}). Then the output will be the set
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
75 |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
76 |
\begin{center}
|
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
77 |
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right),
|
|
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
78 |
\left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
79 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
80 |
|
| 587 | 81 |
\noindent where the first pair means the parser could recognise |
| 590 | 82 |
\texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
|
| 591 | 83 |
unprocessed part; in the other case it could recognise |
| 587 | 84 |
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
|
85 |
parser cannot recognise anything from the input at all, then parser |
|
86 |
combinators just return the empty set $\{\}$. This will indicate
|
|
87 |
something ``went wrong''\ldots or more precisely, nothing could be |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
88 |
parsed. |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
89 |
|
| 594 | 90 |
Also important to note is that the output type \texttt{T} for the
|
91 |
processed part can potentially be different from the input type |
|
92 |
\texttt{I} in the parser. In the example above is just happens to be
|
|
93 |
the same. The reason for the difference is that in general we are |
|
94 |
interested in transforming our input into something |
|
95 |
``different''\ldots for example into a tree; or if we implement the |
|
96 |
grammar for arithmetic expressions, we might be interested in the |
|
97 |
actual integer number the arithmetic expression, say \texttt{1 + 2 *
|
|
98 |
3}, stands for. In this way we can use parser combinators to |
|
99 |
implement relatively easily a calculator, for instance (we shall do |
|
100 |
this later on). |
|
| 584 | 101 |
|
| 594 | 102 |
The main driving force behind parser combinators is that we can easily |
103 |
build parser combinators out of smaller components following very |
|
104 |
closely the structure of a grammar. In order to implement this in a |
|
| 591 | 105 |
functional/object-oriented programming language, like Scala, we need |
106 |
to specify an abstract class for parser combinators. In the abstract |
|
107 |
class we specify that \texttt{I} is the \emph{input type} of the
|
|
| 593 | 108 |
parser combinator and that \texttt{T} is the \emph{output type}. This
|
| 591 | 109 |
implies that the function \texttt{parse} takes an argument of type
|
| 941 | 110 |
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
|
111 |
I)]}}.\footnote{In the actual code on KEATS there is some
|
|
112 |
kerfuffle about correct type-bounds and using clauses. Ignore this |
|
113 |
part of the implementation for the moment---it is not needed for |
|
114 |
understanding how the code works.} |
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
115 |
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
116 |
\begin{center}
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
117 |
\begin{lstlisting}[language=Scala]
|
| 941 | 118 |
abstract class Parser[I, T] {
|
| 937 | 119 |
def parse(in: I): Set[(T, I)] |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
120 |
|
| 590 | 121 |
def parse_all(in: I) : Set[T] = |
| 937 | 122 |
for ((hd, tl) <- parse(in); |
| 941 | 123 |
if tl.isEmpty) yield hd |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
124 |
} |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
125 |
\end{lstlisting}
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
126 |
\end{center}
|
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
127 |
|
| 591 | 128 |
\noindent It is the obligation in each instance of this class to |
| 584 | 129 |
supply an implementation for \texttt{parse}. From this function we
|
130 |
can then ``centrally'' derive the function \texttt{parse\_all}, which
|
|
131 |
just filters out all pairs whose second component is not empty (that |
|
132 |
is has still some unprocessed part). The reason is that at the end of |
|
133 |
the parsing we are only interested in the results where all the input |
|
134 |
has been consumed and no unprocessed part is left over. |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
135 |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
136 |
One of the simplest parser combinators recognises just a |
| 584 | 137 |
single character, say $c$, from the beginning of strings. Its |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
138 |
behaviour can be described as follows: |
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
139 |
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
140 |
\begin{itemize}
|
| 937 | 141 |
\item If the head of the input string $s$ starts with a $c$, then return |
| 584 | 142 |
the set |
| 937 | 143 |
\[\{(c, \textit{tail-of-}s)\}\]
|
| 584 | 144 |
where \textit{tail of}
|
145 |
$s$ is the unprocessed part of the input string. |
|
146 |
\item Otherwise return the empty set $\{\}$.
|
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
147 |
\end{itemize}
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
148 |
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
149 |
\noindent |
| 590 | 150 |
The input type of this simple parser combinator is \texttt{String} and
|
151 |
the output type is \texttt{Char}. This means \texttt{parse} returns
|
|
152 |
\mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:
|
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
153 |
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
154 |
\begin{center}
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
155 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
156 |
case class CharParser(c: Char) extends Parser[String, Char] {
|
| 937 | 157 |
def parse(s: String) = |
158 |
if (s != "" && s.head == c) Set((c, s.tail)) else Set() |
|
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
159 |
} |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
160 |
\end{lstlisting}
|
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
161 |
\end{center}
|
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
162 |
|
| 937 | 163 |
\noindent You can see \texttt{parse} tests here whether the
|
164 |
first character of the input string \texttt{s} is equal to
|
|
| 584 | 165 |
\texttt{c}. If yes, then it splits the string into the recognised part
|
| 937 | 166 |
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
|
167 |
\texttt{s} does not start with \texttt{c} then the parser returns the
|
|
| 584 | 168 |
empty set (in Scala \texttt{Set()}). Since this parser recognises
|
169 |
characters and just returns characters as the processed part, the |
|
170 |
output type of the parser is \texttt{Char}.
|
|
171 |
||
172 |
If we want to parse a list of tokens and interested in recognising a |
|
| 590 | 173 |
number token, for example, we could write something like this |
| 584 | 174 |
|
175 |
\begin{center}
|
|
176 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
|
|
177 |
case object NumParser extends Parser[List[Token], Int] {
|
|
178 |
def parse(ts: List[Token]) = ts match {
|
|
| 937 | 179 |
case Num_token(s)::rest => Set((s.toInt, rest)) |
| 584 | 180 |
case _ => Set () |
181 |
} |
|
182 |
} |
|
183 |
\end{lstlisting}
|
|
184 |
\end{center}
|
|
185 |
||
186 |
\noindent |
|
187 |
In this parser the input is of type \texttt{List[Token]}. The function
|
|
188 |
parse looks at the input \texttt{ts} and checks whether the first
|
|
| 589 | 189 |
token is a \texttt{Num\_token} (let us assume our lexer generated
|
190 |
these tokens for numbers). But this parser does not just return this |
|
| 584 | 191 |
token (and the rest of the list), like the \texttt{CharParser} above,
|
| 590 | 192 |
rather it extracts also the string \texttt{s} from the token and
|
193 |
converts it into an integer. The hope is that the lexer did its work |
|
194 |
well and this conversion always succeeds. The consequence of this is |
|
195 |
that the output type for this parser is \texttt{Int}, not
|
|
| 937 | 196 |
\texttt{Token}. Such a conversion would be needed in our parser,
|
197 |
because when we encounter a number in our program, we want to do |
|
198 |
some calculations based on integers, not strings (or tokens). |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
199 |
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
200 |
|
| 584 | 201 |
These simple parsers that just look at the input and do a simple |
202 |
transformation are often called \emph{atomic} parser combinators.
|
|
203 |
More interesting are the parser combinators that build larger parsers |
|
| 587 | 204 |
out of smaller component parsers. There are three such parser |
205 |
combinators that can be implemented generically. The \emph{alternative
|
|
| 584 | 206 |
parser combinator} is as follows: given two parsers, say, $p$ and |
207 |
$q$, we apply both parsers to the input (remember parsers are |
|
| 587 | 208 |
functions) and combine the output (remember they are sets of pairs): |
209 |
||
210 |
\begin{center}
|
|
211 |
$p(\text{input}) \cup q(\text{input})$
|
|
212 |
\end{center}
|
|
213 |
||
214 |
\noindent In Scala we can implement alternative parser |
|
215 |
combinator as follows |
|
216 |
||
217 |
\begin{center}
|
|
218 |
\begin{lstlisting}[language=Scala, numbers=none]
|
|
219 |
class AltParser[I, T] |
|
| 937 | 220 |
(p: => Parser[I, T], |
| 941 | 221 |
q: => Parser[I, T]) extends Parser[I, T] {
|
| 937 | 222 |
def parse(in: I) = p.parse(in) ++ q.parse(in) |
223 |
} |
|
| 587 | 224 |
\end{lstlisting}
|
225 |
\end{center}
|
|
226 |
||
227 |
\noindent The types of this parser combinator are again generic (we |
|
228 |
have \texttt{I} for the input type, and \texttt{T} for the output
|
|
229 |
type). The alternative parser builds a new parser out of two existing |
|
| 590 | 230 |
parsers \texttt{p} and \texttt{q} which are given as arguments. Both
|
231 |
parsers need to be able to process input of type \texttt{I} and return
|
|
232 |
in \texttt{parse} the same output type \texttt{Set[(T,
|
|
| 587 | 233 |
I)]}.\footnote{There is an interesting detail of Scala, namely the
|
| 937 | 234 |
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
|
| 587 | 235 |
will prevent the evaluation of the arguments before they are |
236 |
used. This is often called \emph{lazy evaluation} of the
|
|
| 590 | 237 |
arguments. We will explain this later.} The alternative parser runs |
238 |
the input with the first parser \texttt{p} (producing a set of pairs)
|
|
239 |
and then runs the same input with \texttt{q} (producing another set of
|
|
240 |
pairs). The result should be then just the union of both sets, which |
|
241 |
is the operation \texttt{++} in Scala.
|
|
| 587 | 242 |
|
243 |
The alternative parser combinator allows us to construct a parser that |
|
244 |
parses either a character \texttt{a} or \texttt{b} using the
|
|
| 937 | 245 |
\texttt{CharParser} shown above. For this we can write\footnote{Note
|
246 |
that we cannot use a \texttt{case}-class for \texttt{AltParser}s
|
|
247 |
because of the problem with laziness and Scala quirks. Hating |
|
248 |
\texttt{new} like the plague, we will work around this later with
|
|
249 |
some syntax tricks. ;o)} |
|
| 587 | 250 |
|
251 |
\begin{center}
|
|
252 |
\begin{lstlisting}[language=Scala, numbers=none]
|
|
253 |
new AltParser(CharParser('a'), CharParser('b'))
|
|
254 |
\end{lstlisting}
|
|
255 |
\end{center}
|
|
256 |
||
257 |
\noindent Later on we will use Scala mechanism for introducing some |
|
| 799 | 258 |
more readable shorthand notation for this, like \texttt{p"a" ||
|
| 937 | 259 |
p"b"}. But first let us look in detail at what this parser combinator produces |
| 590 | 260 |
with some sample strings. |
| 587 | 261 |
|
262 |
\begin{center}
|
|
263 |
\begin{tabular}{rcl}
|
|
264 |
input strings & & output\medskip\\ |
|
| 937 | 265 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
|
266 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
|
|
| 587 | 267 |
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
|
268 |
\end{tabular}
|
|
269 |
\end{center}
|
|
270 |
||
| 937 | 271 |
\noindent We receive in the first two cases a successful output (that |
272 |
is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
|
|
273 |
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
|
|
274 |
parser cannot parse anything of the form \pcode{ccde}, therefore the
|
|
275 |
empty set is returned in the last case. Observe that parser |
|
276 |
combinators only look at the beginning of the given input: they do not |
|
277 |
fish out something in the ``middle'' of the input. |
|
| 587 | 278 |
|
279 |
A bit more interesting is the \emph{sequence parser combinator}. Given
|
|
280 |
two parsers, say again, $p$ and $q$, we want to apply first the input |
|
| 590 | 281 |
to $p$ producing a set of pairs; then apply $q$ to all the unparsed |
| 587 | 282 |
parts in the pairs; and then combine the results. Mathematically we would |
| 591 | 283 |
write something like this for the set of pairs: |
| 587 | 284 |
|
285 |
\begin{center}
|
|
286 |
\begin{tabular}{lcl}
|
|
287 |
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ &
|
|
288 |
$(\textit{output}_1, u_1) \in p(\text{input})
|
|
289 |
\;\wedge\;$\\ |
|
290 |
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
|
|
291 |
\end{tabular}
|
|
292 |
\end{center}
|
|
293 |
||
294 |
\noindent Notice that the $p$ will first be run on the input, |
|
| 590 | 295 |
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
|
| 591 | 296 |
stands for the unprocessed, or leftover, parts of $p$. We want that |
| 590 | 297 |
$q$ runs on all these unprocessed parts $u_1$. Therefore these |
298 |
unprocessed parts are fed into the second parser $q$. The overall |
|
299 |
result of the sequence parser combinator is pairs of the form |
|
| 584 | 300 |
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
|
| 593 | 301 |
unprocessed part of the sequence parser combinator is the unprocessed |
| 591 | 302 |
part the second parser $q$ leaves as leftover. The parsed parts of the |
303 |
component parsers are combined in a pair, namely |
|
304 |
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
|
|
305 |
know what $p$ and $q$ were able to parse. This behaviour can be |
|
306 |
implemented in Scala as follows: |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
307 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
308 |
\begin{center}
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
309 |
\begin{lstlisting}[language=Scala,numbers=none]
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
310 |
class SeqParser[I, T, S] |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
311 |
(p: => Parser[I, T], |
| 941 | 312 |
q: => Parser[I, S]) extends Parser[I, (T, S)] {
|
| 587 | 313 |
def parse(in: I) = |
314 |
for ((output1, u1) <- p.parse(in); |
|
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
315 |
(output2, u2) <- q.parse(u1)) |
|
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
316 |
yield ((output1, output2), u2) |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
317 |
} |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
318 |
\end{lstlisting}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
319 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
320 |
|
| 587 | 321 |
\noindent This parser takes again as arguments two parsers, \texttt{p}
|
| 591 | 322 |
and \texttt{q}. It implements \texttt{parse} as follows: first run the
|
323 |
parser \texttt{p} on the input producing a set of pairs
|
|
| 587 | 324 |
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
|
| 591 | 325 |
unprocessed parts left over by \texttt{p} (recall that there can be
|
| 937 | 326 |
several such pairs). Let then \texttt{q} run on these unprocessed
|
| 591 | 327 |
parts producing again a set of pairs. The output of the sequence |
328 |
parser combinator is then a set containing pairs where the first |
|
329 |
components are again pairs, namely what the first parser could parse |
|
330 |
together with what the second parser could parse; the second component |
|
331 |
is the unprocessed part left over after running the second parser |
|
332 |
\texttt{q}. Note that the input type of the sequence parser combinator
|
|
333 |
is as usual \texttt{I}, but the output type is
|
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
334 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
335 |
\begin{center}
|
| 590 | 336 |
\texttt{(T, S)}
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
337 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
338 |
|
| 584 | 339 |
\noindent |
| 591 | 340 |
Consequently, the function \texttt{parse} in the sequence parser
|
341 |
combinator returns sets of type \texttt{Set[((T, S), I)]}. That means
|
|
342 |
we have essentially two output types for the sequence parser |
|
343 |
combinator (packaged in a pair), because in general \textit{p} and
|
|
344 |
\textit{q} might produce different things (for example we recognise a
|
|
345 |
number with \texttt{p} and then with \texttt{q} a string corresponding
|
|
346 |
to an operator). If any of the runs of \textit{p} and \textit{q}
|
|
347 |
fail, that is produce the empty set, then \texttt{parse} will also
|
|
348 |
produce the empty set. |
|
| 584 | 349 |
|
| 587 | 350 |
With the shorthand notation we shall introduce later for the sequence |
| 799 | 351 |
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
|
| 587 | 352 |
is the parser combinator that first recognises the character |
| 937 | 353 |
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
|
354 |
able to write just \pcode{p"ab"} for such parsers, but it is good to
|
|
355 |
understand first what happens behind the scenes.) Let us look again |
|
356 |
at some examples of how the sequence parser combinator processes some |
|
357 |
strings: |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
358 |
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
359 |
\begin{center}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
360 |
\begin{tabular}{rcl}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
361 |
input strings & & output\medskip\\ |
| 937 | 362 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
|
| 584 | 363 |
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
|
364 |
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
|
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
365 |
\end{tabular}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
366 |
\end{center}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
367 |
|
| 586 | 368 |
\noindent In the first line we have a successful parse, because the |
| 587 | 369 |
string starts with \texttt{ab}, which is the prefix we are looking
|
| 584 | 370 |
for. But since the parsing combinator is constructed as sequence of |
371 |
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
|
|
372 |
result is a nested pair of the form \texttt{((a, b), cde)}. It is
|
|
| 586 | 373 |
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
|
| 587 | 374 |
expect. The parser returns the empty set in the other examples, |
| 584 | 375 |
because they do not fit with what the parser is supposed to parse. |
376 |
||
377 |
||
| 799 | 378 |
A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
|
| 587 | 379 |
parses as first character either an \texttt{a} or \texttt{b}, followed
|
380 |
by a \texttt{c}. This parser produces the following outputs.
|
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
381 |
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
382 |
\begin{center}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
383 |
\begin{tabular}{rcl}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
384 |
input strings & & output\medskip\\ |
| 937 | 385 |
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
|
386 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
|
|
| 585 | 387 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
388 |
\end{tabular}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
389 |
\end{center}
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
390 |
|
| 585 | 391 |
\noindent |
| 799 | 392 |
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
|
| 585 | 393 |
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
|
394 |
the following outputs. |
|
395 |
||
396 |
\begin{center}
|
|
397 |
\begin{tabular}{rcl}
|
|
398 |
input strings & & output\medskip\\ |
|
| 937 | 399 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
|
| 585 | 400 |
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
|
401 |
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
|
|
402 |
\end{tabular}
|
|
403 |
\end{center}
|
|
404 |
||
405 |
||
406 |
\noindent The second and third example fail, because something is |
|
| 590 | 407 |
``missing'' in the sequence we are looking for. The first succeeds but |
408 |
notice how the results nest with sequences: the parsed part is a |
|
409 |
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
|
|
| 799 | 410 |
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
|
| 590 | 411 |
our output pairs nest differently |
| 589 | 412 |
|
413 |
\begin{center}
|
|
414 |
\begin{tabular}{rcl}
|
|
415 |
input strings & & output\medskip\\ |
|
| 937 | 416 |
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
|
| 589 | 417 |
\end{tabular}
|
418 |
\end{center}
|
|
419 |
||
420 |
\noindent |
|
421 |
Two more examples: first consider the parser |
|
| 799 | 422 |
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
423 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
424 |
\begin{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
425 |
\begin{tabular}{rcl}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
426 |
input string & & output\medskip\\ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
427 |
\texttt{\Grid{aaaa}} & $\rightarrow$ &
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
428 |
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
429 |
\end{tabular}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
430 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
431 |
|
| 591 | 432 |
\noindent Notice again how the results nest deeper and deeper as pairs (the |
| 585 | 433 |
last \pcode{a} is in the unprocessed part). To consume everything of
|
| 799 | 434 |
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
|
435 |
p"a"}. Then the output is as follows: |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
436 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
437 |
\begin{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
438 |
\begin{tabular}{rcl}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
439 |
input string & & output\medskip\\ |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
440 |
\texttt{\Grid{aaaa}} & $\rightarrow$ &
|
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
441 |
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
442 |
\end{tabular}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
443 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
444 |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
445 |
\noindent This is an instance where the parser consumed |
|
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
446 |
completely the input, meaning the unprocessed part is just the |
| 587 | 447 |
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
|
| 585 | 448 |
we would get back the result |
449 |
||
450 |
\[ |
|
451 |
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
|
|
452 |
\] |
|
453 |
||
454 |
\noindent where the unprocessed (empty) parts have been stripped away |
|
455 |
from the pairs; everything where the second part was not empty has |
|
| 587 | 456 |
been thrown away as well, because they represent |
| 590 | 457 |
ultimately-unsuccessful-parses. The main point is that the sequence |
458 |
parser combinator returns pairs that can nest according to the |
|
459 |
nesting of the component parsers. |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
460 |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
461 |
|
| 937 | 462 |
Consider also carefully that constructing a parser such |
463 |
||
464 |
\begin{center}
|
|
465 |
\pcode{p"a" || (p"a" ~ p"b")}
|
|
466 |
\end{center}
|
|
467 |
||
468 |
\noindent |
|
469 |
will result in a typing error. The intention with this |
|
| 591 | 470 |
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
|
| 590 | 471 |
followed by a \texttt{b}. However, the first parser has as output type
|
472 |
a single character (recall the type of \texttt{CharParser}), but the
|
|
473 |
second parser produces a pair of characters as output. The alternative |
|
474 |
parser is required to have both component parsers to have the same |
|
| 591 | 475 |
type---the reason is that we need to be able to build the union of two |
476 |
sets, which requires in Scala that the sets have the same type. Since |
|
477 |
they are not in this case, there is a typing error. We will see later |
|
478 |
how we can build this parser without the typing error. |
|
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
297
diff
changeset
|
479 |
|
| 937 | 480 |
The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
|
| 591 | 481 |
actually combine two smaller parsers, but applies a function to the result |
| 587 | 482 |
of a parser. It is implemented in Scala as follows |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
483 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
484 |
\begin{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
485 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
| 799 | 486 |
class MapParser[I, T, S] |
| 937 | 487 |
(p: => Parser[I, T], |
| 941 | 488 |
f: T => S) extends Parser[I, S] {
|
| 937 | 489 |
def parse(in: I) = |
490 |
for ((hd, tl) <- p.parse(in)) yield (f(hd), tl) |
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
491 |
} |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
492 |
\end{lstlisting}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
493 |
\end{center}
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
494 |
|
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
495 |
|
| 590 | 496 |
\noindent This parser combinator takes a parser \texttt{p} (with input
|
497 |
type \texttt{I} and output type \texttt{T}) as one argument but also a
|
|
498 |
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
|
|
| 937 | 499 |
produces sets of type \texttt{Set[(S, I)]}. The semantic action
|
| 590 | 500 |
combinator then applies the function \texttt{f} to all the `processed'
|
501 |
parser outputs. Since this function is of type \texttt{T => S}, we
|
|
502 |
obtain a parser with output type \texttt{S}. Again Scala lets us
|
|
503 |
introduce some shorthand notation for this parser |
|
| 799 | 504 |
combinator. Therefore we will write short \texttt{p.map(f)} for it.
|
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
505 |
|
| 589 | 506 |
What are semantic actions good for? Well, they allow you to transform |
| 590 | 507 |
the parsed input into datastructures you can use for further |
| 591 | 508 |
processing. A simple (contrived) example would be to transform parsed |
509 |
characters into ASCII numbers. Suppose we define a function \texttt{f}
|
|
510 |
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
|
|
| 589 | 511 |
the character \texttt{c}.
|
| 587 | 512 |
|
| 591 | 513 |
|
| 587 | 514 |
\begin{center}
|
515 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
516 |
val f = (c: Char) => c.toInt |
|
517 |
val c = new CharParser('c')
|
|
518 |
\end{lstlisting}
|
|
519 |
\end{center}
|
|
520 |
||
521 |
\noindent |
|
| 589 | 522 |
We then can run the following two parsers on the input \texttt{cbd}:
|
| 587 | 523 |
|
524 |
\begin{center}
|
|
525 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
526 |
c.parse("cbd")
|
|
| 799 | 527 |
c.map(f).parse("cbd")
|
| 587 | 528 |
\end{lstlisting}
|
529 |
\end{center}
|
|
530 |
||
531 |
\noindent |
|
| 589 | 532 |
In the first line we obtain the expected result \texttt{Set(('c',
|
533 |
"bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
|
|
534 |
character has been transformed into an ASCII number. |
|
| 588 | 535 |
|
536 |
A slightly less contrived example is about parsing numbers (recall |
|
| 591 | 537 |
\texttt{NumParser} above). However, we want to do this here for
|
538 |
strings, not for tokens. For this assume we have the following |
|
539 |
(atomic) \texttt{RegexParser}.
|
|
| 588 | 540 |
|
541 |
\begin{center}
|
|
542 |
\begin{lstlisting}[language=Scala,xleftmargin=0mm,
|
|
543 |
basicstyle=\small\ttfamily, numbers=none] |
|
544 |
import scala.util.matching.Regex |
|
545 |
||
546 |
case class RegexParser(reg: Regex) extends Parser[String, String] {
|
|
547 |
def parse(in: String) = reg.findPrefixMatchOf(in) match {
|
|
548 |
case None => Set() |
|
549 |
case Some(m) => Set((m.matched, m.after.toString)) |
|
550 |
} |
|
551 |
} |
|
552 |
\end{lstlisting}
|
|
553 |
\end{center}
|
|
554 |
||
555 |
\noindent |
|
556 |
This parser takes a regex as argument and splits up a string into a |
|
557 |
prefix and the rest according to this regex |
|
558 |
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
|
|
559 |
case---and the corresponding strings can be extracted with |
|
| 591 | 560 |
\texttt{matched} and \texttt{after}). The input and output type for
|
561 |
this parser is \texttt{String}. Using \texttt{RegexParser} we can
|
|
562 |
define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
|
|
563 |
follows: |
|
| 588 | 564 |
|
565 |
\begin{center}
|
|
566 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
567 |
val NumParser = RegexParser("[0-9]+".r)
|
|
568 |
\end{lstlisting}
|
|
569 |
\end{center}
|
|
570 |
||
571 |
\noindent |
|
| 591 | 572 |
This parser will recognise a number at the beginning of a string. For |
| 588 | 573 |
example |
574 |
||
575 |
\begin{center}
|
|
576 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
577 |
NumParser.parse("123abc")
|
|
578 |
\end{lstlisting}
|
|
579 |
\end{center}
|
|
580 |
||
581 |
\noindent |
|
582 |
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
|
|
| 937 | 583 |
still a string (the expected double-quotes are not printed by |
| 590 | 584 |
Scala). We want to convert this string into the corresponding |
585 |
\texttt{Int}. We can do this as follows using a semantic action
|
|
| 588 | 586 |
|
587 |
\begin{center}
|
|
588 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
| 799 | 589 |
NumParser.map{s => s.toInt}.parse("123abc")
|
| 588 | 590 |
\end{lstlisting}
|
591 |
\end{center}
|
|
592 |
||
593 |
\noindent |
|
| 589 | 594 |
The function in the semantic action converts a string into an |
| 591 | 595 |
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
|
| 937 | 596 |
but this time \texttt{123} is an \texttt{Int}. Think carefully what
|
597 |
the input and output type of the parser is without the semantic action |
|
| 941 | 598 |
and what with the semantic action (the type of the function can |
| 937 | 599 |
already tell you). Let us come back to semantic actions when we are |
600 |
going to implement actual context-free grammars. |
|
| 587 | 601 |
|
602 |
\subsubsection*{Shorthand notation for parser combinators}
|
|
603 |
||
604 |
Before we proceed, let us just explain the shorthand notation for |
|
| 937 | 605 |
parser combinators. Like for regular expressions, the shorthand |
606 |
notation will make our life much easier when writing actual |
|
607 |
parsers. We can define some extensions\footnote{In Scala 2 this was
|
|
608 |
generically called as ``implicits''.} which allow us to write |
|
| 591 | 609 |
|
610 |
\begin{center}
|
|
611 |
\begin{tabular}{ll}
|
|
| 799 | 612 |
\pcode{p || q} & alternative parser\\
|
| 591 | 613 |
\pcode{p ~ q} & sequence parser\\
|
| 799 | 614 |
\pcode{p.map(f)} & semantic action parser
|
| 591 | 615 |
\end{tabular}
|
616 |
\end{center}
|
|
617 |
||
618 |
\noindent |
|
| 937 | 619 |
We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
|
| 590 | 620 |
|
621 |
The idea is that this shorthand notation allows us to easily translate |
|
622 |
context-free grammars into code. For example recall our context-free |
|
623 |
grammar for palindromes: |
|
624 |
||
625 |
\begin{plstx}[margin=3cm]
|
|
| 591 | 626 |
: \meta{Pal} ::= a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
|
| 590 | 627 |
\end{plstx}
|
628 |
||
629 |
\noindent |
|
630 |
Each alternative in this grammar translates into an alternative parser |
|
631 |
combinator. The $\cdot$ can be translated to a sequence parser |
|
632 |
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply |
|
| 799 | 633 |
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
|
| 590 | 634 |
|
| 587 | 635 |
|
| 937 | 636 |
\subsubsection*{How to build more interesting parsers using parser combinators?}
|
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
637 |
|
| 588 | 638 |
The beauty of parser combinators is the ease with which they can be |
639 |
implemented and how easy it is to translate context-free grammars into |
|
640 |
code (though the grammars need to be non-left-recursive). To |
|
| 591 | 641 |
demonstrate this consider again the grammar for palindromes from above. |
| 590 | 642 |
The first idea would be to translate it into the following code |
| 588 | 643 |
|
644 |
\begin{center}
|
|
645 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
646 |
lazy val Pal : Parser[String, String] = |
|
| 799 | 647 |
((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"") |
| 588 | 648 |
\end{lstlisting}
|
649 |
\end{center}
|
|
650 |
||
651 |
\noindent |
|
| 590 | 652 |
Unfortunately, this does not quite work yet as it produces a typing |
| 799 | 653 |
error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
|
654 |
\texttt{p""} all produce strings as output type and therefore can be
|
|
655 |
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
|
|
656 |
sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
|
|
| 591 | 657 |
produce pairs of the form |
658 |
||
659 |
\begin{center}
|
|
660 |
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
|
|
661 |
\end{center}
|
|
662 |
||
663 |
\noindent That is how the |
|
664 |
sequence parser combinator nests results when \pcode{\~} is used
|
|
665 |
between two components. The solution is to use a semantic action that |
|
666 |
``flattens'' these pairs and appends the corresponding strings, like |
|
| 588 | 667 |
|
668 |
\begin{center}
|
|
669 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
670 |
lazy val Pal : Parser[String, String] = |
|
| 799 | 671 |
((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
|
672 |
(p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
|
|
673 |
p"a" || p"b" || p"") |
|
| 588 | 674 |
\end{lstlisting}
|
675 |
\end{center}
|
|
676 |
||
| 589 | 677 |
\noindent |
| 591 | 678 |
How does this work? Well, recall again what the pairs look like for |
| 799 | 679 |
the parser \pcode{p"a" ~ Pal ~ p"a"}. The pattern in the semantic
|
| 591 | 680 |
action matches the nested pairs (the \texttt{x} with the
|
681 |
\texttt{a}-part and so on). Unfortunately when we have such nested
|
|
682 |
pairs, Scala requires us to define the function using the |
|
683 |
\pcode{case}-syntax
|
|
684 |
||
685 |
\begin{center}
|
|
686 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
687 |
{ case ((x, y), z) => ... }
|
|
688 |
\end{lstlisting}
|
|
689 |
\end{center}
|
|
690 |
||
691 |
\noindent |
|
692 |
If we have more sequence parser combinators or have them differently nested, |
|
693 |
then the pattern in the semantic action needs to be adjusted accordingly. |
|
694 |
The action we implement above is to concatenate all three strings, which |
|
695 |
means after the semantic action is applied the output type of the parser |
|
696 |
is \texttt{String}, which means it fits with the alternative parsers
|
|
| 799 | 697 |
\texttt{...|| p"a" || p"b" || p""}.
|
| 591 | 698 |
|
699 |
If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
|
|
| 593 | 700 |
as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
|
| 591 | 701 |
(an empty set would mean something is wrong). But also notice what the |
702 |
intermediate results are generated by \pcode{Pal.parse("abaaaba")}
|
|
703 |
||
704 |
\begin{center}
|
|
705 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
706 |
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
|
|
707 |
\end{lstlisting}
|
|
708 |
\end{center}
|
|
709 |
||
710 |
\noindent |
|
711 |
That there are more than one output might be slightly unexpected, but |
|
712 |
can be explained as follows: the pairs represent all possible |
|
713 |
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
|
|
| 593 | 714 |
corresponds to a complete parse (all output is consumed) and this is |
| 591 | 715 |
what \pcode{Pal.parse_all} returns. The second pair is a small
|
716 |
``sub-palindrome'' that can also be parsed, but the parse fails with |
|
717 |
the rest \pcode{aaba}, which is therefore left as unprocessed. The
|
|
718 |
third one is an attempt to parse the whole string with the |
|
719 |
single-character parser \pcode{a}. That of course only partially
|
|
720 |
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
|
|
| 593 | 721 |
part. Finally, since we allow the empty string to be a palindrome we |
| 591 | 722 |
also obtain the last pair, where actually nothing is consumed from the |
723 |
input string. While all this works as intended, we need to be careful |
|
724 |
with this (especially with including the \pcode{""} parser in our
|
|
725 |
grammar): if during parsing the set of parsing attempts gets too big, |
|
726 |
then the parsing process can become very slow as the potential |
|
727 |
candidates for applying rules can snowball. |
|
| 589 | 728 |
|
729 |
||
| 591 | 730 |
Important is also to note is that we must define the |
731 |
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
|
|
732 |
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
|
|
733 |
just written |
|
734 |
||
735 |
\begin{center}
|
|
736 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
737 |
val Pal : Parser[String, String] = ...rhs... |
|
738 |
\end{lstlisting}
|
|
739 |
\end{center}
|
|
740 |
||
| 589 | 741 |
\noindent |
| 593 | 742 |
then Scala before making this assignment to \texttt{Pal} attempts to
|
| 591 | 743 |
find out what the expression on the right-hand side evaluates to. This |
744 |
is straightforward in case of simple expressions \texttt{2 + 3}, but
|
|
745 |
the expression above contains \texttt{Pal} in the right-hand
|
|
746 |
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
|
|
747 |
evaluates to and start a new recursion, which means it falls into an |
|
748 |
infinite loop. The definition of \texttt{Pal} is recursive and the
|
|
749 |
\pcode{lazy} key-word prevents it from being fully evaluated. Therefore
|
|
750 |
whenever we want to define a recursive parser we have to write |
|
751 |
||
752 |
\begin{center}
|
|
753 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
754 |
lazy val SomeParser : Parser[...,...] = ...rhs... |
|
755 |
\end{lstlisting}
|
|
756 |
\end{center}
|
|
757 |
||
758 |
\noindent That was not necessary for our atomic parsers, like |
|
759 |
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
|
|
760 |
Note that this is also the reason why we had to write |
|
761 |
||
762 |
\begin{center}
|
|
763 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
764 |
class AltParser[I, T] |
|
765 |
(p: => Parser[I, T], |
|
| 937 | 766 |
q: => Parser[I, T]) ... extends Parser[I, T] {...}
|
| 591 | 767 |
|
768 |
class SeqParser[I, T, S] |
|
769 |
(p: => Parser[I, T], |
|
| 937 | 770 |
q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
|
| 591 | 771 |
\end{lstlisting}
|
772 |
\end{center}
|
|
773 |
||
774 |
\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
|
|
775 |
the argument types for \texttt{p} and \texttt{q} prevent Scala from
|
|
776 |
evaluating the arguments. Normally, Scala would first evaluate what |
|
777 |
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
|
|
| 593 | 778 |
the alternative parser combinator, respectively sequence parser |
779 |
combinator. Since the arguments can be recursive parsers, such as |
|
| 591 | 780 |
\texttt{Pal}, this would lead again to an infinite loop.
|
781 |
||
782 |
As a final example in this section, let us consider the grammar for |
|
783 |
well-nested parentheses: |
|
784 |
||
785 |
\begin{plstx}[margin=3cm]
|
|
786 |
: \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
|
|
787 |
\end{plstx}
|
|
788 |
||
789 |
\noindent |
|
790 |
Let us assume we want to not just recognise strings of |
|
| 593 | 791 |
well-nested parentheses but also transform round parentheses |
| 591 | 792 |
into curly braces. We can do this by using a semantic |
793 |
action: |
|
794 |
||
795 |
\begin{center}
|
|
796 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
|
|
797 |
xleftmargin=0mm, numbers=none] |
|
798 |
lazy val P : Parser[String, String] = |
|
| 799 | 799 |
(p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
|
| 591 | 800 |
\end{lstlisting}
|
801 |
\end{center}
|
|
802 |
||
803 |
\noindent |
|
804 |
Here we define a function where which ignores the parentheses in the |
|
805 |
pairs, but replaces them in the right places with curly braces when |
|
806 |
assembling the new string in the right-hand side. If we run |
|
807 |
\pcode{P.parse_all("(((()()))())")} we obtain
|
|
808 |
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
|
|
809 |
||
810 |
||
| 588 | 811 |
|
|
386
31295bb945c6
update
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
385
diff
changeset
|
812 |
\subsubsection*{Implementing an Interpreter}
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
813 |
|
| 593 | 814 |
The first step before implementing an interpreter for a full-blown |
| 592 | 815 |
language is to implement a simple calculator for arithmetic |
816 |
expressions. Suppose our arithmetic expressions are given by the |
|
817 |
grammar: |
|
818 |
||
819 |
\begin{plstx}[margin=3cm,one per line]
|
|
| 593 | 820 |
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}
|
| 592 | 821 |
| \meta{E} \cdot - \cdot \meta{E}
|
822 |
| \meta{E} \cdot * \cdot \meta{E}
|
|
823 |
| ( \cdot \meta{E} \cdot )
|
|
824 |
| Number \\ |
|
825 |
\end{plstx}
|
|
826 |
||
827 |
\noindent |
|
828 |
Naturally we want to implement the grammar in such a way that we can |
|
| 593 | 829 |
calculate what the result of, for example, \texttt{4*2+3} is---we are
|
830 |
interested in an \texttt{Int} rather than a string. This means every
|
|
831 |
component parser needs to have as output type \texttt{Int} and when we
|
|
832 |
assemble the intermediate results, strings like \texttt{"+"},
|
|
833 |
\texttt{"*"} and so on, need to be translated into the appropriate
|
|
834 |
Scala operation of adding, multiplying and so on. Being inspired by |
|
835 |
the parser for well-nested parentheses above and ignoring the fact |
|
836 |
that we want $*$ to take precedence over $+$ and $-$, we might want to |
|
837 |
write something like |
|
| 592 | 838 |
|
839 |
\begin{center}
|
|
840 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
841 |
lazy val E: Parser[String, Int] = |
|
| 799 | 842 |
((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
|
843 |
(E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
|
|
844 |
(E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
|
|
845 |
(p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
|
|
| 592 | 846 |
NumParserInt) |
847 |
\end{lstlisting}
|
|
848 |
\end{center}
|
|
849 |
||
850 |
\noindent |
|
| 593 | 851 |
Consider again carefully how the semantic actions pick out the correct |
852 |
arguments for the calculation. In case of plus, we need \texttt{x} and
|
|
853 |
\texttt{z}, because they correspond to the results of the component
|
|
854 |
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
|
|
855 |
an \texttt{Int} because the output type of \texttt{E} is
|
|
856 |
\texttt{Int}. Similarly with subtraction and multiplication. In
|
|
857 |
contrast in the fourth clause we need to return \texttt{y}, because it
|
|
858 |
is the result enclosed inside the parentheses. The information about |
|
859 |
parentheses, roughly speaking, we just throw away. |
|
| 592 | 860 |
|
861 |
So far so good. The problem arises when we try to call \pcode{parse_all} with the
|
|
862 |
expression \texttt{"1+2+3"}. Lets try it
|
|
863 |
||
864 |
\begin{center}
|
|
865 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
866 |
E.parse_all("1+2+3")
|
|
867 |
\end{lstlisting}
|
|
868 |
\end{center}
|
|
869 |
||
870 |
\noindent |
|
| 593 | 871 |
\ldots and we wait and wait and \ldots still wait. What is the |
872 |
problem? Actually, the parser just fell into an infinite loop! The |
|
873 |
reason is that the above grammar is left-recursive and recall that our |
|
874 |
parser combinators cannot deal with such left-recursive |
|
875 |
grammars. Fortunately, every left-recursive context-free grammar can be |
|
876 |
transformed into a non-left-recursive grammars that still recognises |
|
877 |
the same strings. This allows us to design the following grammar |
|
878 |
||
879 |
\begin{plstx}[margin=3cm]
|
|
880 |
: \meta{E} ::= \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
|
|
881 |
: \meta{T} ::= \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
|
|
882 |
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
|
|
883 |
\end{plstx}
|
|
884 |
||
885 |
\noindent |
|
886 |
Recall what left-recursive means from Handout 5 and make sure you see |
|
887 |
why this grammar is \emph{non} left-recursive. This version of the grammar
|
|
| 937 | 888 |
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not |
| 593 | 889 |
affect which strings this grammar can recognise, but in which order we are going |
890 |
to evaluate any arithmetic expression. We can translate this grammar into |
|
891 |
parsing combinators as follows: |
|
| 592 | 892 |
|
893 |
||
| 593 | 894 |
\begin{center}
|
895 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
|
|
896 |
lazy val E: Parser[String, Int] = |
|
| 799 | 897 |
(T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
|
898 |
(T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T
|
|
| 593 | 899 |
lazy val T: Parser[String, Int] = |
| 799 | 900 |
(F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
|
| 593 | 901 |
lazy val F: Parser[String, Int] = |
| 799 | 902 |
(p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
|
| 593 | 903 |
\end{lstlisting}
|
904 |
\end{center}
|
|
| 592 | 905 |
|
| 593 | 906 |
\noindent |
| 594 | 907 |
Let us try out some examples: |
| 592 | 908 |
|
| 593 | 909 |
\begin{center}
|
910 |
\begin{tabular}{rcl}
|
|
911 |
input strings & & output of \pcode{parse_all}\medskip\\
|
|
912 |
\texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
|
|
913 |
\texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
|
|
914 |
\texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
|
|
| 594 | 915 |
\texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
|
| 593 | 916 |
\texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
|
917 |
\texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\
|
|
918 |
\end{tabular}
|
|
919 |
\end{center}
|
|
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
920 |
|
| 593 | 921 |
\noindent |
| 594 | 922 |
Note that we call \pcode{parse_all}, not \pcode{parse}. The examples
|
923 |
should be quite self-explanatory. The last two example do not produce |
|
924 |
any integer result because our parser does not define what to do in |
|
925 |
case of division (could be easily added), but also has no idea what to |
|
| 595 | 926 |
do with whitespaces. To deal with them is the task of the lexer! Yes, |
| 594 | 927 |
we can deal with them inside the grammar, but that would render many |
928 |
grammars becoming unintelligible, including this one.\footnote{If you
|
|
929 |
think an easy solution is to extend the notion of what a number |
|
930 |
should be, then think again---you still would have to deal with |
|
| 595 | 931 |
cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess
|
932 |
you would have in a grammar for a full-blown language where there are |
|
933 |
numerous such cases.} |
|
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
934 |
|
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
935 |
\end{document}
|
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
936 |
|
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
937 |
%%% Local Variables: |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
938 |
%%% mode: latex |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
939 |
%%% TeX-master: t |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
940 |
%%% End: |