| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Fri, 24 Oct 2025 10:45:17 +0100 | |
| changeset 1017 | b0d44eb1ecc7 | 
| parent 941 | 66adcae6c762 | 
| permissions | -rw-r--r-- | 
| 584 | 1 | |
| 595 | 2 | % !TEX program = xelatex | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 3 | \documentclass{article}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 4 | \usepackage{../style}
 | 
| 217 
cd6066f1056a
updated handouts
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
183diff
changeset | 5 | \usepackage{../langs}
 | 
| 588 | 6 | \usepackage{../grammar}
 | 
| 799 | 7 | \usepackage{../graphics}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 8 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 9 | \begin{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 10 | |
| 292 
7ed2a25dd115
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
217diff
changeset | 11 | \section*{Handout 6 (Parser Combinators)}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 12 | |
| 584 | 13 | This handout explains how \emph{parser combinators} work and how they
 | 
| 587 | 14 | can be implemented in Scala. Their most distinguishing feature is that | 
| 15 | they are very easy to implement (admittedly it is only easy in a | |
| 16 | functional programming language). Another good point of parser | |
| 17 | combinators is that they can deal with any kind of input as long as | |
| 18 | this input is of ``sequence-kind'', for example a string or a list of | |
| 19 | tokens. The only two properties of the input we need is to be able to | |
| 20 | test when it is empty and ``sequentially'' take it apart. Strings and | |
| 21 | lists fit this bill. However, parser combinators also have their | |
| 22 | drawbacks. For example they require that the grammar to be parsed is | |
| 23 | \emph{not} left-recursive and they are efficient only when the grammar
 | |
| 24 | is unambiguous. It is the responsibility of the grammar designer to | |
| 591 | 25 | ensure these two properties hold. | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 26 | |
| 587 | 27 | The general idea behind parser combinators is to transform the input | 
| 28 | into sets of pairs, like so | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 29 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 30 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 31 | $\underbrace{\text{list of tokens}}_{\text{input}}$ 
 | 
| 594 | 32 | $\quad\Rightarrow\quad$ | 
| 591 | 33 | $\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 34 | \end{center} 
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 35 | |
| 587 | 36 | \noindent | 
| 590 | 37 | Given the extended effort we have spent implementing a lexer in order | 
| 591 | 38 | to generate lists of tokens, it might be surprising that in what | 
| 39 | follows we shall often use strings as input, rather than lists of | |
| 937 | 40 | tokens. This is for making the explanation more lucid and ensure the | 
| 41 | examples are simple. It does not make our previous work on lexers obsolete | |
| 591 | 42 | (remember they transform a string into a list of tokens). Lexers will | 
| 937 | 43 | still be needed for building a somewhat realistic compiler. See also | 
| 44 | a question in the homework regarding this issue. | |
| 584 | 45 | |
| 590 | 46 | As mentioned above, parser combinators are relatively agnostic about what | 
| 587 | 47 | kind of input they process. In my Scala code I use the following | 
| 48 | polymorphic types for parser combinators: | |
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 49 | |
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 50 | \begin{center}
 | 
| 584 | 51 | input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 52 | \end{center}
 | 
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 53 | |
| 587 | 54 | \noindent That is they take as input something of type \texttt{I} and
 | 
| 590 | 55 | return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
 | 
| 56 | needs to be of ``sequence-kind'', I actually have to often write | |
| 937 | 57 | \code{(using is: I => Seq[_])} for the input type. This ensures the
 | 
| 58 | input is a subtype of Scala sequences.\footnote{This is a new feature
 | |
| 59 | in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties | |
| 60 | with running my code.} The first component of the generated pairs | |
| 61 | corresponds to what the parser combinator was able to parse from the | |
| 62 | input and the second is the unprocessed, or leftover, part of the | |
| 63 | input (therefore the type of this unprocessed part is the same as the | |
| 64 | input). A parser combinator might return more than one such pair; the | |
| 65 | idea is that there are potentially several ways of how to parse the | |
| 66 | input. As a concrete example, consider the string | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 67 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 68 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 69 | \tt\Grid{iffoo\VS testbar}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 70 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 71 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 72 | \noindent We might have a parser combinator which tries to | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 73 | interpret this string as a keyword (\texttt{if}) or as an
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 74 | identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 75 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 76 | \begin{center}
 | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 77 | $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 78 |            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 79 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 80 | |
| 587 | 81 | \noindent where the first pair means the parser could recognise | 
| 590 | 82 | \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
 | 
| 591 | 83 | unprocessed part; in the other case it could recognise | 
| 587 | 84 | \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
 | 
| 85 | parser cannot recognise anything from the input at all, then parser | |
| 86 | combinators just return the empty set $\{\}$. This will indicate
 | |
| 87 | something ``went wrong''\ldots or more precisely, nothing could be | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 88 | parsed. | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 89 | |
| 594 | 90 | Also important to note is that the output type \texttt{T} for the
 | 
| 91 | processed part can potentially be different from the input type | |
| 92 | \texttt{I} in the parser. In the example above is just happens to be
 | |
| 93 | the same. The reason for the difference is that in general we are | |
| 94 | interested in transforming our input into something | |
| 95 | ``different''\ldots for example into a tree; or if we implement the | |
| 96 | grammar for arithmetic expressions, we might be interested in the | |
| 97 | actual integer number the arithmetic expression, say \texttt{1 + 2 *
 | |
| 98 | 3}, stands for. In this way we can use parser combinators to | |
| 99 | implement relatively easily a calculator, for instance (we shall do | |
| 100 | this later on). | |
| 584 | 101 | |
| 594 | 102 | The main driving force behind parser combinators is that we can easily | 
| 103 | build parser combinators out of smaller components following very | |
| 104 | closely the structure of a grammar. In order to implement this in a | |
| 591 | 105 | functional/object-oriented programming language, like Scala, we need | 
| 106 | to specify an abstract class for parser combinators. In the abstract | |
| 107 | class we specify that \texttt{I} is the \emph{input type} of the
 | |
| 593 | 108 | parser combinator and that \texttt{T} is the \emph{output type}.  This
 | 
| 591 | 109 | implies that the function \texttt{parse} takes an argument of type
 | 
| 941 | 110 | \texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
 | 
| 111 |     I)]}}.\footnote{In the actual code on KEATS there is some
 | |
| 112 | kerfuffle about correct type-bounds and using clauses. Ignore this | |
| 113 | part of the implementation for the moment---it is not needed for | |
| 114 | understanding how the code works.} | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 115 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 116 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 117 | \begin{lstlisting}[language=Scala]
 | 
| 941 | 118 | abstract class Parser[I, T]  {
 | 
| 937 | 119 | def parse(in: I): Set[(T, I)] | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 120 | |
| 590 | 121 | def parse_all(in: I) : Set[T] = | 
| 937 | 122 | for ((hd, tl) <- parse(in); | 
| 941 | 123 | if tl.isEmpty) yield hd | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 124 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 125 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 126 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 127 | |
| 591 | 128 | \noindent It is the obligation in each instance of this class to | 
| 584 | 129 | supply an implementation for \texttt{parse}.  From this function we
 | 
| 130 | can then ``centrally'' derive the function \texttt{parse\_all}, which
 | |
| 131 | just filters out all pairs whose second component is not empty (that | |
| 132 | is has still some unprocessed part). The reason is that at the end of | |
| 133 | the parsing we are only interested in the results where all the input | |
| 134 | has been consumed and no unprocessed part is left over. | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 135 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 136 | One of the simplest parser combinators recognises just a | 
| 584 | 137 | single character, say $c$, from the beginning of strings. Its | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 138 | behaviour can be described as follows: | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 139 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 140 | \begin{itemize}
 | 
| 937 | 141 | \item If the head of the input string $s$ starts with a $c$, then return | 
| 584 | 142 | the set | 
| 937 | 143 |   \[\{(c, \textit{tail-of-}s)\}\]
 | 
| 584 | 144 |   where \textit{tail of} 
 | 
| 145 | $s$ is the unprocessed part of the input string. | |
| 146 | \item Otherwise return the empty set $\{\}$.	
 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 147 | \end{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 148 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 149 | \noindent | 
| 590 | 150 | The input type of this simple parser combinator is \texttt{String} and
 | 
| 151 | the output type is \texttt{Char}. This means \texttt{parse} returns
 | |
| 152 | \mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 153 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 154 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 155 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 156 | case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 937 | 157 | def parse(s: String) = | 
| 158 | if (s != "" && s.head == c) Set((c, s.tail)) else Set() | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 159 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 160 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 161 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 162 | |
| 937 | 163 | \noindent You can see \texttt{parse} tests here whether the
 | 
| 164 | first character of the input string \texttt{s} is equal to
 | |
| 584 | 165 | \texttt{c}. If yes, then it splits the string into the recognised part
 | 
| 937 | 166 | \texttt{c} and the unprocessed part \texttt{s.tail}. In case
 | 
| 167 | \texttt{s} does not start with \texttt{c} then the parser returns the
 | |
| 584 | 168 | empty set (in Scala \texttt{Set()}). Since this parser recognises
 | 
| 169 | characters and just returns characters as the processed part, the | |
| 170 | output type of the parser is \texttt{Char}.
 | |
| 171 | ||
| 172 | If we want to parse a list of tokens and interested in recognising a | |
| 590 | 173 | number token, for example, we could write something like this | 
| 584 | 174 | |
| 175 | \begin{center}
 | |
| 176 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
 | |
| 177 | case object NumParser extends Parser[List[Token], Int] {
 | |
| 178 |   def parse(ts: List[Token]) = ts match {
 | |
| 937 | 179 | case Num_token(s)::rest => Set((s.toInt, rest)) | 
| 584 | 180 | case _ => Set () | 
| 181 | } | |
| 182 | } | |
| 183 | \end{lstlisting}
 | |
| 184 | \end{center}
 | |
| 185 | ||
| 186 | \noindent | |
| 187 | In this parser the input is of type \texttt{List[Token]}. The function
 | |
| 188 | parse looks at the input \texttt{ts} and checks whether the first
 | |
| 589 | 189 | token is a \texttt{Num\_token} (let us assume our lexer generated
 | 
| 190 | these tokens for numbers). But this parser does not just return this | |
| 584 | 191 | token (and the rest of the list), like the \texttt{CharParser} above,
 | 
| 590 | 192 | rather it extracts also the string \texttt{s} from the token and
 | 
| 193 | converts it into an integer. The hope is that the lexer did its work | |
| 194 | well and this conversion always succeeds. The consequence of this is | |
| 195 | that the output type for this parser is \texttt{Int}, not
 | |
| 937 | 196 | \texttt{Token}. Such a conversion would be needed in our parser,
 | 
| 197 | because when we encounter a number in our program, we want to do | |
| 198 | some calculations based on integers, not strings (or tokens). | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 199 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 200 | |
| 584 | 201 | These simple parsers that just look at the input and do a simple | 
| 202 | transformation are often called \emph{atomic} parser combinators.
 | |
| 203 | More interesting are the parser combinators that build larger parsers | |
| 587 | 204 | out of smaller component parsers. There are three such parser | 
| 205 | combinators that can be implemented generically. The \emph{alternative
 | |
| 584 | 206 | parser combinator} is as follows: given two parsers, say, $p$ and | 
| 207 | $q$, we apply both parsers to the input (remember parsers are | |
| 587 | 208 | functions) and combine the output (remember they are sets of pairs): | 
| 209 | ||
| 210 | \begin{center}
 | |
| 211 | $p(\text{input}) \cup q(\text{input})$
 | |
| 212 | \end{center}
 | |
| 213 | ||
| 214 | \noindent In Scala we can implement alternative parser | |
| 215 | combinator as follows | |
| 216 | ||
| 217 | \begin{center}
 | |
| 218 | \begin{lstlisting}[language=Scala, numbers=none]
 | |
| 219 | class AltParser[I, T] | |
| 937 | 220 | (p: => Parser[I, T], | 
| 941 | 221 |           q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 937 | 222 | def parse(in: I) = p.parse(in) ++ q.parse(in) | 
| 223 | } | |
| 587 | 224 | \end{lstlisting}
 | 
| 225 | \end{center}
 | |
| 226 | ||
| 227 | \noindent The types of this parser combinator are again generic (we | |
| 228 | have \texttt{I} for the input type, and \texttt{T} for the output
 | |
| 229 | type). The alternative parser builds a new parser out of two existing | |
| 590 | 230 | parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
 | 
| 231 | parsers need to be able to process input of type \texttt{I} and return
 | |
| 232 | in \texttt{parse} the same output type \texttt{Set[(T,
 | |
| 587 | 233 |   I)]}.\footnote{There is an interesting detail of Scala, namely the
 | 
| 937 | 234 |   \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
 | 
| 587 | 235 | will prevent the evaluation of the arguments before they are | 
| 236 |   used. This is often called \emph{lazy evaluation} of the
 | |
| 590 | 237 | arguments. We will explain this later.} The alternative parser runs | 
| 238 | the input with the first parser \texttt{p} (producing a set of pairs)
 | |
| 239 | and then runs the same input with \texttt{q} (producing another set of
 | |
| 240 | pairs). The result should be then just the union of both sets, which | |
| 241 | is the operation \texttt{++} in Scala.
 | |
| 587 | 242 | |
| 243 | The alternative parser combinator allows us to construct a parser that | |
| 244 | parses either a character \texttt{a} or \texttt{b} using the
 | |
| 937 | 245 | \texttt{CharParser} shown above. For this we can write\footnote{Note
 | 
| 246 |   that we cannot use a \texttt{case}-class for \texttt{AltParser}s
 | |
| 247 | because of the problem with laziness and Scala quirks. Hating | |
| 248 |   \texttt{new} like the plague, we will work around this later with
 | |
| 249 | some syntax tricks. ;o)} | |
| 587 | 250 | |
| 251 | \begin{center}
 | |
| 252 | \begin{lstlisting}[language=Scala, numbers=none]
 | |
| 253 | new AltParser(CharParser('a'), CharParser('b'))
 | |
| 254 | \end{lstlisting}
 | |
| 255 | \end{center}
 | |
| 256 | ||
| 257 | \noindent Later on we will use Scala mechanism for introducing some | |
| 799 | 258 | more readable shorthand notation for this, like \texttt{p"a" ||
 | 
| 937 | 259 | p"b"}. But first let us look in detail at what this parser combinator produces | 
| 590 | 260 | with some sample strings. | 
| 587 | 261 | |
| 262 | \begin{center}
 | |
| 263 | \begin{tabular}{rcl}
 | |
| 264 | input strings & & output\medskip\\ | |
| 937 | 265 | \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
 | 
| 266 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
 | |
| 587 | 267 | \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 | 
| 268 | \end{tabular}
 | |
| 269 | \end{center}
 | |
| 270 | ||
| 937 | 271 | \noindent We receive in the first two cases a successful output (that | 
| 272 | is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
 | |
| 273 | the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
 | |
| 274 | parser cannot parse anything of the form \pcode{ccde}, therefore the
 | |
| 275 | empty set is returned in the last case. Observe that parser | |
| 276 | combinators only look at the beginning of the given input: they do not | |
| 277 | fish out something in the ``middle'' of the input. | |
| 587 | 278 | |
| 279 | A bit more interesting is the \emph{sequence parser combinator}. Given
 | |
| 280 | two parsers, say again, $p$ and $q$, we want to apply first the input | |
| 590 | 281 | to $p$ producing a set of pairs; then apply $q$ to all the unparsed | 
| 587 | 282 | parts in the pairs; and then combine the results. Mathematically we would | 
| 591 | 283 | write something like this for the set of pairs: | 
| 587 | 284 | |
| 285 | \begin{center}
 | |
| 286 | \begin{tabular}{lcl}
 | |
| 287 | $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
 | |
| 288 | $(\textit{output}_1, u_1) \in p(\text{input}) 
 | |
| 289 | \;\wedge\;$\\ | |
| 290 | && $(\textit{output}_2, u_2) \in q(u_1)\}$
 | |
| 291 | \end{tabular}
 | |
| 292 | \end{center}
 | |
| 293 | ||
| 294 | \noindent Notice that the $p$ will first be run on the input, | |
| 590 | 295 | producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
 | 
| 591 | 296 | stands for the unprocessed, or leftover, parts of $p$. We want that | 
| 590 | 297 | $q$ runs on all these unprocessed parts $u_1$. Therefore these | 
| 298 | unprocessed parts are fed into the second parser $q$. The overall | |
| 299 | result of the sequence parser combinator is pairs of the form | |
| 584 | 300 | $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
 | 
| 593 | 301 | unprocessed part of the sequence parser combinator is the unprocessed | 
| 591 | 302 | part the second parser $q$ leaves as leftover. The parsed parts of the | 
| 303 | component parsers are combined in a pair, namely | |
| 304 | $(\textit{output}_1, \textit{output}_2)$. The reason is we want to
 | |
| 305 | know what $p$ and $q$ were able to parse. This behaviour can be | |
| 306 | implemented in Scala as follows: | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 307 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 308 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 309 | \begin{lstlisting}[language=Scala,numbers=none]
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 310 | class SeqParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 311 | (p: => Parser[I, T], | 
| 941 | 312 |         q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 587 | 313 | def parse(in: I) = | 
| 314 | for ((output1, u1) <- p.parse(in); | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 315 | (output2, u2) <- q.parse(u1)) | 
| 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 316 | yield ((output1, output2), u2) | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 317 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 318 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 319 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 320 | |
| 587 | 321 | \noindent This parser takes again as arguments two parsers, \texttt{p}
 | 
| 591 | 322 | and \texttt{q}. It implements \texttt{parse} as follows: first run the
 | 
| 323 | parser \texttt{p} on the input producing a set of pairs
 | |
| 587 | 324 | (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
 | 
| 591 | 325 | unprocessed parts left over by \texttt{p} (recall that there can be
 | 
| 937 | 326 | several such pairs).  Let then \texttt{q} run on these unprocessed
 | 
| 591 | 327 | parts producing again a set of pairs. The output of the sequence | 
| 328 | parser combinator is then a set containing pairs where the first | |
| 329 | components are again pairs, namely what the first parser could parse | |
| 330 | together with what the second parser could parse; the second component | |
| 331 | is the unprocessed part left over after running the second parser | |
| 332 | \texttt{q}. Note that the input type of the sequence parser combinator
 | |
| 333 | is as usual \texttt{I}, but the output type is
 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 334 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 335 | \begin{center}
 | 
| 590 | 336 | \texttt{(T, S)}
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 337 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 338 | |
| 584 | 339 | \noindent | 
| 591 | 340 | Consequently, the function \texttt{parse} in the sequence parser
 | 
| 341 | combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
 | |
| 342 | we have essentially two output types for the sequence parser | |
| 343 | combinator (packaged in a pair), because in general \textit{p} and
 | |
| 344 | \textit{q} might produce different things (for example we recognise a
 | |
| 345 | number with \texttt{p} and then with \texttt{q} a string corresponding
 | |
| 346 | to an operator).  If any of the runs of \textit{p} and \textit{q}
 | |
| 347 | fail, that is produce the empty set, then \texttt{parse} will also
 | |
| 348 | produce the empty set. | |
| 584 | 349 | |
| 587 | 350 | With the shorthand notation we shall introduce later for the sequence | 
| 799 | 351 | parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
 | 
| 587 | 352 | is the parser combinator that first recognises the character | 
| 937 | 353 | \texttt{a} from a string and then \texttt{b}. (Actually, we will be
 | 
| 354 | able to write just \pcode{p"ab"} for such parsers, but it is good to
 | |
| 355 | understand first what happens behind the scenes.) Let us look again | |
| 356 | at some examples of how the sequence parser combinator processes some | |
| 357 | strings: | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 358 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 359 | \begin{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 360 | \begin{tabular}{rcl}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 361 | input strings & & output\medskip\\ | 
| 937 | 362 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
 | 
| 584 | 363 | \texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
 | 
| 364 | \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 365 | \end{tabular}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 366 | \end{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 367 | |
| 586 | 368 | \noindent In the first line we have a successful parse, because the | 
| 587 | 369 | string starts with \texttt{ab}, which is the prefix we are looking
 | 
| 584 | 370 | for. But since the parsing combinator is constructed as sequence of | 
| 371 | the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
 | |
| 372 | result is a nested pair of the form \texttt{((a, b), cde)}. It is
 | |
| 586 | 373 | \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
 | 
| 587 | 374 | expect. The parser returns the empty set in the other examples, | 
| 584 | 375 | because they do not fit with what the parser is supposed to parse. | 
| 376 | ||
| 377 | ||
| 799 | 378 | A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
 | 
| 587 | 379 | parses as first character either an \texttt{a} or \texttt{b}, followed
 | 
| 380 | by a \texttt{c}. This parser produces the following outputs.
 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 381 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 382 | \begin{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 383 | \begin{tabular}{rcl}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 384 | input strings & & output\medskip\\ | 
| 937 | 385 | \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | 
| 386 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | |
| 585 | 387 | \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 388 | \end{tabular}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 389 | \end{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 390 | |
| 585 | 391 | \noindent | 
| 799 | 392 | Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
 | 
| 585 | 393 | \texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
 | 
| 394 | the following outputs. | |
| 395 | ||
| 396 | \begin{center}
 | |
| 397 | \begin{tabular}{rcl}
 | |
| 398 | input strings & & output\medskip\\ | |
| 937 | 399 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
 | 
| 585 | 400 | \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
 | 
| 401 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
 | |
| 402 | \end{tabular}
 | |
| 403 | \end{center}
 | |
| 404 | ||
| 405 | ||
| 406 | \noindent The second and third example fail, because something is | |
| 590 | 407 | ``missing'' in the sequence we are looking for. The first succeeds but | 
| 408 | notice how the results nest with sequences: the parsed part is a | |
| 409 | nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
 | |
| 799 | 410 | parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
 | 
| 590 | 411 | our output pairs nest differently | 
| 589 | 412 | |
| 413 | \begin{center}
 | |
| 414 | \begin{tabular}{rcl}
 | |
| 415 | input strings & & output\medskip\\ | |
| 937 | 416 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
 | 
| 589 | 417 | \end{tabular}
 | 
| 418 | \end{center}
 | |
| 419 | ||
| 420 | \noindent | |
| 421 | Two more examples: first consider the parser | |
| 799 | 422 | \pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 423 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 424 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 425 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 426 | input string & & output\medskip\\ | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 427 | \texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 428 | $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 429 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 430 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 431 | |
| 591 | 432 | \noindent Notice again how the results nest deeper and deeper as pairs (the | 
| 585 | 433 | last \pcode{a} is in the unprocessed part). To consume everything of
 | 
| 799 | 434 | this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
 | 
| 435 | p"a"}. Then the output is as follows: | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 436 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 437 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 438 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 439 | input string & & output\medskip\\ | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 440 | \texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 441 | $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 442 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 443 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 444 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 445 | \noindent This is an instance where the parser consumed | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 446 | completely the input, meaning the unprocessed part is just the | 
| 587 | 447 | empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
 | 
| 585 | 448 | we would get back the result | 
| 449 | ||
| 450 | \[ | |
| 451 | \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
 | |
| 452 | \] | |
| 453 | ||
| 454 | \noindent where the unprocessed (empty) parts have been stripped away | |
| 455 | from the pairs; everything where the second part was not empty has | |
| 587 | 456 | been thrown away as well, because they represent | 
| 590 | 457 | ultimately-unsuccessful-parses. The main point is that the sequence | 
| 458 | parser combinator returns pairs that can nest according to the | |
| 459 | nesting of the component parsers. | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 460 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 461 | |
| 937 | 462 | Consider also carefully that constructing a parser such | 
| 463 | ||
| 464 | \begin{center}
 | |
| 465 | \pcode{p"a" || (p"a" ~ p"b")}
 | |
| 466 | \end{center}
 | |
| 467 | ||
| 468 | \noindent | |
| 469 | will result in a typing error. The intention with this | |
| 591 | 470 | parser is that we want to parse either an \texttt{a}, or an \texttt{a}
 | 
| 590 | 471 | followed by a \texttt{b}. However, the first parser has as output type
 | 
| 472 | a single character (recall the type of \texttt{CharParser}), but the
 | |
| 473 | second parser produces a pair of characters as output. The alternative | |
| 474 | parser is required to have both component parsers to have the same | |
| 591 | 475 | type---the reason is that we need to be able to build the union of two | 
| 476 | sets, which requires in Scala that the sets have the same type. Since | |
| 477 | they are not in this case, there is a typing error. We will see later | |
| 478 | how we can build this parser without the typing error. | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 479 | |
| 937 | 480 | The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
 | 
| 591 | 481 | actually combine two smaller parsers, but applies a function to the result | 
| 587 | 482 | of a parser. It is implemented in Scala as follows | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 483 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 484 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 485 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 799 | 486 | class MapParser[I, T, S] | 
| 937 | 487 | (p: => Parser[I, T], | 
| 941 | 488 |          f: T => S) extends Parser[I, S] {
 | 
| 937 | 489 | def parse(in: I) = | 
| 490 | for ((hd, tl) <- p.parse(in)) yield (f(hd), tl) | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 491 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 492 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 493 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 494 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 495 | |
| 590 | 496 | \noindent This parser combinator takes a parser \texttt{p} (with input
 | 
| 497 | type \texttt{I} and output type \texttt{T}) as one argument but also a
 | |
| 498 | function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
 | |
| 937 | 499 | produces sets of type \texttt{Set[(S, I)]}. The semantic action
 | 
| 590 | 500 | combinator then applies the function \texttt{f} to all the `processed'
 | 
| 501 | parser outputs. Since this function is of type \texttt{T => S}, we
 | |
| 502 | obtain a parser with output type \texttt{S}. Again Scala lets us
 | |
| 503 | introduce some shorthand notation for this parser | |
| 799 | 504 | combinator. Therefore we will write short \texttt{p.map(f)} for it.
 | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 505 | |
| 589 | 506 | What are semantic actions good for? Well, they allow you to transform | 
| 590 | 507 | the parsed input into datastructures you can use for further | 
| 591 | 508 | processing. A simple (contrived) example would be to transform parsed | 
| 509 | characters into ASCII numbers. Suppose we define a function \texttt{f}
 | |
| 510 | (from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
 | |
| 589 | 511 | the character \texttt{c}.
 | 
| 587 | 512 | |
| 591 | 513 | |
| 587 | 514 | \begin{center}
 | 
| 515 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 516 | val f = (c: Char) => c.toInt | |
| 517 | val c = new CharParser('c')
 | |
| 518 | \end{lstlisting}
 | |
| 519 | \end{center}
 | |
| 520 | ||
| 521 | \noindent | |
| 589 | 522 | We then can run the following two parsers on the input \texttt{cbd}:
 | 
| 587 | 523 | |
| 524 | \begin{center}
 | |
| 525 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 526 | c.parse("cbd")
 | |
| 799 | 527 | c.map(f).parse("cbd")
 | 
| 587 | 528 | \end{lstlisting}
 | 
| 529 | \end{center}
 | |
| 530 | ||
| 531 | \noindent | |
| 589 | 532 | In the first line we obtain the expected result \texttt{Set(('c',
 | 
| 533 |   "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
 | |
| 534 | character has been transformed into an ASCII number. | |
| 588 | 535 | |
| 536 | A slightly less contrived example is about parsing numbers (recall | |
| 591 | 537 | \texttt{NumParser} above). However, we want to do this here for
 | 
| 538 | strings, not for tokens. For this assume we have the following | |
| 539 | (atomic) \texttt{RegexParser}.
 | |
| 588 | 540 | |
| 541 | \begin{center}
 | |
| 542 |   \begin{lstlisting}[language=Scala,xleftmargin=0mm,
 | |
| 543 | basicstyle=\small\ttfamily, numbers=none] | |
| 544 | import scala.util.matching.Regex | |
| 545 | ||
| 546 | case class RegexParser(reg: Regex) extends Parser[String, String] {
 | |
| 547 |   def parse(in: String) = reg.findPrefixMatchOf(in) match {
 | |
| 548 | case None => Set() | |
| 549 | case Some(m) => Set((m.matched, m.after.toString)) | |
| 550 | } | |
| 551 | } | |
| 552 | \end{lstlisting}
 | |
| 553 | \end{center}
 | |
| 554 | ||
| 555 | \noindent | |
| 556 | This parser takes a regex as argument and splits up a string into a | |
| 557 | prefix and the rest according to this regex | |
| 558 | (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
 | |
| 559 | case---and the corresponding strings can be extracted with | |
| 591 | 560 | \texttt{matched} and \texttt{after}). The input and output type for
 | 
| 561 | this parser is \texttt{String}. Using \texttt{RegexParser} we can
 | |
| 562 | define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
 | |
| 563 | follows: | |
| 588 | 564 | |
| 565 | \begin{center}
 | |
| 566 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 567 | val NumParser = RegexParser("[0-9]+".r)
 | |
| 568 | \end{lstlisting}
 | |
| 569 | \end{center}
 | |
| 570 | ||
| 571 | \noindent | |
| 591 | 572 | This parser will recognise a number at the beginning of a string. For | 
| 588 | 573 | example | 
| 574 | ||
| 575 | \begin{center}
 | |
| 576 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 577 | NumParser.parse("123abc")
 | |
| 578 | \end{lstlisting}
 | |
| 579 | \end{center}  
 | |
| 580 | ||
| 581 | \noindent | |
| 582 | produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
 | |
| 937 | 583 | still a string (the expected double-quotes are not printed by | 
| 590 | 584 | Scala). We want to convert this string into the corresponding | 
| 585 | \texttt{Int}. We can do this as follows using a semantic action
 | |
| 588 | 586 | |
| 587 | \begin{center}
 | |
| 588 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 799 | 589 | NumParser.map{s => s.toInt}.parse("123abc")
 | 
| 588 | 590 | \end{lstlisting}
 | 
| 591 | \end{center}  
 | |
| 592 | ||
| 593 | \noindent | |
| 589 | 594 | The function in the semantic action converts a string into an | 
| 591 | 595 | \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
 | 
| 937 | 596 | but this time \texttt{123} is an \texttt{Int}. Think carefully what
 | 
| 597 | the input and output type of the parser is without the semantic action | |
| 941 | 598 | and what with the semantic action (the type of the function can | 
| 937 | 599 | already tell you). Let us come back to semantic actions when we are | 
| 600 | going to implement actual context-free grammars. | |
| 587 | 601 | |
| 602 | \subsubsection*{Shorthand notation for parser combinators}
 | |
| 603 | ||
| 604 | Before we proceed, let us just explain the shorthand notation for | |
| 937 | 605 | parser combinators. Like for regular expressions, the shorthand | 
| 606 | notation will make our life much easier when writing actual | |
| 607 | parsers. We can define some extensions\footnote{In Scala 2 this was
 | |
| 608 | generically called as ``implicits''.} which allow us to write | |
| 591 | 609 | |
| 610 | \begin{center}
 | |
| 611 | \begin{tabular}{ll}  
 | |
| 799 | 612 |   \pcode{p || q} & alternative parser\\
 | 
| 591 | 613 |   \pcode{p ~ q} & sequence parser\\ 
 | 
| 799 | 614 |   \pcode{p.map(f)} & semantic action parser
 | 
| 591 | 615 | \end{tabular}
 | 
| 616 | \end{center}
 | |
| 617 | ||
| 618 | \noindent | |
| 937 | 619 | We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
 | 
| 590 | 620 | |
| 621 | The idea is that this shorthand notation allows us to easily translate | |
| 622 | context-free grammars into code. For example recall our context-free | |
| 623 | grammar for palindromes: | |
| 624 | ||
| 625 | \begin{plstx}[margin=3cm]
 | |
| 591 | 626 | : \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
 | 
| 590 | 627 | \end{plstx}
 | 
| 628 | ||
| 629 | \noindent | |
| 630 | Each alternative in this grammar translates into an alternative parser | |
| 631 | combinator. The $\cdot$ can be translated to a sequence parser | |
| 632 | combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply | |
| 799 | 633 | written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
 | 
| 590 | 634 | |
| 587 | 635 | |
| 937 | 636 | \subsubsection*{How to build more interesting parsers using parser combinators?}
 | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 637 | |
| 588 | 638 | The beauty of parser combinators is the ease with which they can be | 
| 639 | implemented and how easy it is to translate context-free grammars into | |
| 640 | code (though the grammars need to be non-left-recursive). To | |
| 591 | 641 | demonstrate this consider again the grammar for palindromes from above. | 
| 590 | 642 | The first idea would be to translate it into the following code | 
| 588 | 643 | |
| 644 | \begin{center}
 | |
| 645 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 646 | lazy val Pal : Parser[String, String] = | |
| 799 | 647 | ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"") | 
| 588 | 648 | \end{lstlisting}
 | 
| 649 | \end{center}
 | |
| 650 | ||
| 651 | \noindent | |
| 590 | 652 | Unfortunately, this does not quite work yet as it produces a typing | 
| 799 | 653 | error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
 | 
| 654 | \texttt{p""} all produce strings as output type and therefore can be
 | |
| 655 | put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
 | |
| 656 | sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
 | |
| 591 | 657 | produce pairs of the form | 
| 658 | ||
| 659 | \begin{center}
 | |
| 660 | (((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
 | |
| 661 | \end{center}
 | |
| 662 | ||
| 663 | \noindent That is how the | |
| 664 | sequence parser combinator nests results when \pcode{\~} is used
 | |
| 665 | between two components. The solution is to use a semantic action that | |
| 666 | ``flattens'' these pairs and appends the corresponding strings, like | |
| 588 | 667 | |
| 668 | \begin{center}
 | |
| 669 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 670 | lazy val Pal : Parser[String, String] = | |
| 799 | 671 |   ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
 | 
| 672 |    (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
 | |
| 673 | p"a" || p"b" || p"") | |
| 588 | 674 | \end{lstlisting}
 | 
| 675 | \end{center}
 | |
| 676 | ||
| 589 | 677 | \noindent | 
| 591 | 678 | How does this work? Well, recall again what the pairs look like for | 
| 799 | 679 | the parser \pcode{p"a" ~ Pal ~ p"a"}.  The pattern in the semantic
 | 
| 591 | 680 | action matches the nested pairs (the \texttt{x} with the
 | 
| 681 | \texttt{a}-part and so on).  Unfortunately when we have such nested
 | |
| 682 | pairs, Scala requires us to define the function using the | |
| 683 | \pcode{case}-syntax
 | |
| 684 | ||
| 685 | \begin{center}
 | |
| 686 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 687 | { case ((x, y), z) => ... }
 | |
| 688 | \end{lstlisting}
 | |
| 689 | \end{center}
 | |
| 690 | ||
| 691 | \noindent | |
| 692 | If we have more sequence parser combinators or have them differently nested, | |
| 693 | then the pattern in the semantic action needs to be adjusted accordingly. | |
| 694 | The action we implement above is to concatenate all three strings, which | |
| 695 | means after the semantic action is applied the output type of the parser | |
| 696 | is \texttt{String}, which means it fits with the alternative parsers
 | |
| 799 | 697 | \texttt{...|| p"a" || p"b" || p""}.
 | 
| 591 | 698 | |
| 699 | If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
 | |
| 593 | 700 | as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
 | 
| 591 | 701 | (an empty set would mean something is wrong). But also notice what the | 
| 702 | intermediate results are generated by \pcode{Pal.parse("abaaaba")}
 | |
| 703 | ||
| 704 | \begin{center}
 | |
| 705 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 706 | Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
 | |
| 707 | \end{lstlisting}
 | |
| 708 | \end{center}
 | |
| 709 | ||
| 710 | \noindent | |
| 711 | That there are more than one output might be slightly unexpected, but | |
| 712 | can be explained as follows: the pairs represent all possible | |
| 713 | (partial) parses of the string \pcode{"abaaaba"}. The first pair above
 | |
| 593 | 714 | corresponds to a complete parse (all output is consumed) and this is | 
| 591 | 715 | what \pcode{Pal.parse_all} returns. The second pair is a small
 | 
| 716 | ``sub-palindrome'' that can also be parsed, but the parse fails with | |
| 717 | the rest \pcode{aaba}, which is therefore left as unprocessed. The
 | |
| 718 | third one is an attempt to parse the whole string with the | |
| 719 | single-character parser \pcode{a}. That of course only partially
 | |
| 720 | succeeds, by leaving \pcode{"baaaba"} as the unprocessed
 | |
| 593 | 721 | part. Finally, since we allow the empty string to be a palindrome we | 
| 591 | 722 | also obtain the last pair, where actually nothing is consumed from the | 
| 723 | input string. While all this works as intended, we need to be careful | |
| 724 | with this (especially with including the \pcode{""} parser in our
 | |
| 725 | grammar): if during parsing the set of parsing attempts gets too big, | |
| 726 | then the parsing process can become very slow as the potential | |
| 727 | candidates for applying rules can snowball. | |
| 589 | 728 | |
| 729 | ||
| 591 | 730 | Important is also to note is that we must define the | 
| 731 | \texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
 | |
| 732 | code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
 | |
| 733 | just written | |
| 734 | ||
| 735 | \begin{center}
 | |
| 736 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 737 | val Pal : Parser[String, String] = ...rhs... | |
| 738 | \end{lstlisting}
 | |
| 739 | \end{center}
 | |
| 740 | ||
| 589 | 741 | \noindent | 
| 593 | 742 | then Scala before making this assignment to \texttt{Pal} attempts to
 | 
| 591 | 743 | find out what the expression on the right-hand side evaluates to. This | 
| 744 | is straightforward in case of simple expressions \texttt{2 + 3}, but
 | |
| 745 | the expression above contains \texttt{Pal} in the right-hand
 | |
| 746 | side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
 | |
| 747 | evaluates to and start a new recursion, which means it falls into an | |
| 748 | infinite loop. The definition of \texttt{Pal} is recursive and the
 | |
| 749 | \pcode{lazy} key-word prevents it from being fully evaluated. Therefore
 | |
| 750 | whenever we want to define a recursive parser we have to write | |
| 751 | ||
| 752 | \begin{center}
 | |
| 753 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 754 | lazy val SomeParser : Parser[...,...] = ...rhs... | |
| 755 | \end{lstlisting}
 | |
| 756 | \end{center}
 | |
| 757 | ||
| 758 | \noindent That was not necessary for our atomic parsers, like | |
| 759 | \texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
 | |
| 760 | Note that this is also the reason why we had to write | |
| 761 | ||
| 762 | \begin{center}
 | |
| 763 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 764 | class AltParser[I, T] | |
| 765 | (p: => Parser[I, T], | |
| 937 | 766 |         q: => Parser[I, T]) ... extends Parser[I, T] {...}
 | 
| 591 | 767 | |
| 768 | class SeqParser[I, T, S] | |
| 769 | (p: => Parser[I, T], | |
| 937 | 770 |         q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
 | 
| 591 | 771 | \end{lstlisting}
 | 
| 772 | \end{center}
 | |
| 773 | ||
| 774 | \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
 | |
| 775 | the argument types for \texttt{p} and \texttt{q} prevent Scala from
 | |
| 776 | evaluating the arguments. Normally, Scala would first evaluate what | |
| 777 | kind of parsers \texttt{p} and \texttt{q} are, and only then generate
 | |
| 593 | 778 | the alternative parser combinator, respectively sequence parser | 
| 779 | combinator. Since the arguments can be recursive parsers, such as | |
| 591 | 780 | \texttt{Pal}, this would lead again to an infinite loop.
 | 
| 781 | ||
| 782 | As a final example in this section, let us consider the grammar for | |
| 783 | well-nested parentheses: | |
| 784 | ||
| 785 | \begin{plstx}[margin=3cm]
 | |
| 786 | : \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
 | |
| 787 | \end{plstx}
 | |
| 788 | ||
| 789 | \noindent | |
| 790 | Let us assume we want to not just recognise strings of | |
| 593 | 791 | well-nested parentheses but also transform round parentheses | 
| 591 | 792 | into curly braces. We can do this by using a semantic | 
| 793 | action: | |
| 794 | ||
| 795 | \begin{center}
 | |
| 796 |   \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
 | |
| 797 | xleftmargin=0mm, numbers=none] | |
| 798 | lazy val P : Parser[String, String] = | |
| 799 | 799 |   (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
 | 
| 591 | 800 | \end{lstlisting}
 | 
| 801 | \end{center}
 | |
| 802 | ||
| 803 | \noindent | |
| 804 | Here we define a function where which ignores the parentheses in the | |
| 805 | pairs, but replaces them in the right places with curly braces when | |
| 806 | assembling the new string in the right-hand side. If we run | |
| 807 | \pcode{P.parse_all("(((()()))())")} we obtain
 | |
| 808 | \texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
 | |
| 809 | ||
| 810 | ||
| 588 | 811 | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 812 | \subsubsection*{Implementing an Interpreter}
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 813 | |
| 593 | 814 | The first step before implementing an interpreter for a full-blown | 
| 592 | 815 | language is to implement a simple calculator for arithmetic | 
| 816 | expressions. Suppose our arithmetic expressions are given by the | |
| 817 | grammar: | |
| 818 | ||
| 819 | \begin{plstx}[margin=3cm,one per line]
 | |
| 593 | 820 | : \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
 | 
| 592 | 821 |    | \meta{E} \cdot - \cdot \meta{E} 
 | 
| 822 |    | \meta{E} \cdot * \cdot \meta{E} 
 | |
| 823 |    | ( \cdot \meta{E} \cdot )
 | |
| 824 | | Number \\ | |
| 825 | \end{plstx}
 | |
| 826 | ||
| 827 | \noindent | |
| 828 | Naturally we want to implement the grammar in such a way that we can | |
| 593 | 829 | calculate what the result of, for example, \texttt{4*2+3} is---we are
 | 
| 830 | interested in an \texttt{Int} rather than a string. This means every
 | |
| 831 | component parser needs to have as output type \texttt{Int} and when we
 | |
| 832 | assemble the intermediate results, strings like \texttt{"+"},
 | |
| 833 | \texttt{"*"} and so on, need to be translated into the appropriate
 | |
| 834 | Scala operation of adding, multiplying and so on. Being inspired by | |
| 835 | the parser for well-nested parentheses above and ignoring the fact | |
| 836 | that we want $*$ to take precedence over $+$ and $-$, we might want to | |
| 837 | write something like | |
| 592 | 838 | |
| 839 | \begin{center}
 | |
| 840 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 841 | lazy val E: Parser[String, Int] = | |
| 799 | 842 |   ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
 | 
| 843 |    (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
 | |
| 844 |    (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
 | |
| 845 |    (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
 | |
| 592 | 846 | NumParserInt) | 
| 847 | \end{lstlisting}
 | |
| 848 | \end{center}
 | |
| 849 | ||
| 850 | \noindent | |
| 593 | 851 | Consider again carefully how the semantic actions pick out the correct | 
| 852 | arguments for the calculation. In case of plus, we need \texttt{x} and
 | |
| 853 | \texttt{z}, because they correspond to the results of the component
 | |
| 854 | parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
 | |
| 855 | an \texttt{Int} because the output type of \texttt{E} is
 | |
| 856 | \texttt{Int}.  Similarly with subtraction and multiplication. In
 | |
| 857 | contrast in the fourth clause we need to return \texttt{y}, because it
 | |
| 858 | is the result enclosed inside the parentheses. The information about | |
| 859 | parentheses, roughly speaking, we just throw away. | |
| 592 | 860 | |
| 861 | So far so good. The problem arises when we try to call \pcode{parse_all} with the
 | |
| 862 | expression \texttt{"1+2+3"}. Lets try it
 | |
| 863 | ||
| 864 | \begin{center}
 | |
| 865 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 866 | E.parse_all("1+2+3")
 | |
| 867 | \end{lstlisting}
 | |
| 868 | \end{center}
 | |
| 869 | ||
| 870 | \noindent | |
| 593 | 871 | \ldots and we wait and wait and \ldots still wait. What is the | 
| 872 | problem? Actually, the parser just fell into an infinite loop! The | |
| 873 | reason is that the above grammar is left-recursive and recall that our | |
| 874 | parser combinators cannot deal with such left-recursive | |
| 875 | grammars. Fortunately, every left-recursive context-free grammar can be | |
| 876 | transformed into a non-left-recursive grammars that still recognises | |
| 877 | the same strings. This allows us to design the following grammar | |
| 878 | ||
| 879 | \begin{plstx}[margin=3cm]
 | |
| 880 |   : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
 | |
| 881 | : \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
 | |
| 882 | : \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
 | |
| 883 | \end{plstx}
 | |
| 884 | ||
| 885 | \noindent | |
| 886 | Recall what left-recursive means from Handout 5 and make sure you see | |
| 887 | why this grammar is \emph{non} left-recursive. This version of the grammar
 | |
| 937 | 888 | also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not | 
| 593 | 889 | affect which strings this grammar can recognise, but in which order we are going | 
| 890 | to evaluate any arithmetic expression. We can translate this grammar into | |
| 891 | parsing combinators as follows: | |
| 592 | 892 | |
| 893 | ||
| 593 | 894 | \begin{center}
 | 
| 895 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 896 | lazy val E: Parser[String, Int] = | |
| 799 | 897 |   (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
 | 
| 898 |   (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T 
 | |
| 593 | 899 | lazy val T: Parser[String, Int] = | 
| 799 | 900 |   (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
 | 
| 593 | 901 | lazy val F: Parser[String, Int] = | 
| 799 | 902 |   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
 | 
| 593 | 903 | \end{lstlisting}
 | 
| 904 | \end{center}
 | |
| 592 | 905 | |
| 593 | 906 | \noindent | 
| 594 | 907 | Let us try out some examples: | 
| 592 | 908 | |
| 593 | 909 | \begin{center}
 | 
| 910 | \begin{tabular}{rcl}
 | |
| 911 |   input strings & & output of \pcode{parse_all}\medskip\\
 | |
| 912 |   \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
 | |
| 913 |   \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
 | |
| 914 |   \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
 | |
| 594 | 915 |   \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
 | 
| 593 | 916 |   \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
 | 
| 917 |   \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\                      
 | |
| 918 | \end{tabular}
 | |
| 919 | \end{center}
 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 920 | |
| 593 | 921 | \noindent | 
| 594 | 922 | Note that we call \pcode{parse_all}, not \pcode{parse}.  The examples
 | 
| 923 | should be quite self-explanatory. The last two example do not produce | |
| 924 | any integer result because our parser does not define what to do in | |
| 925 | case of division (could be easily added), but also has no idea what to | |
| 595 | 926 | do with whitespaces. To deal with them is the task of the lexer! Yes, | 
| 594 | 927 | we can deal with them inside the grammar, but that would render many | 
| 928 | grammars becoming unintelligible, including this one.\footnote{If you
 | |
| 929 | think an easy solution is to extend the notion of what a number | |
| 930 | should be, then think again---you still would have to deal with | |
| 595 | 931 |   cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess 
 | 
| 932 | you would have in a grammar for a full-blown language where there are | |
| 933 | numerous such cases.} | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 934 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 935 | \end{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 936 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 937 | %%% Local Variables: | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 938 | %%% mode: latex | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 939 | %%% TeX-master: t | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 940 | %%% End: |