| author | Christian Urban <urbanc@in.tum.de> | 
| Thu, 25 Oct 2018 18:36:04 +0100 | |
| changeset 590 | 5cdefb0e309e | 
| parent 589 | bdb9940be094 | 
| child 591 | e3d10383ae37 | 
| permissions | -rw-r--r-- | 
| 584 | 1 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \documentclass{article}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 3 | \usepackage{../style}
 | 
| 217 
cd6066f1056a
updated handouts
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
183diff
changeset | 4 | \usepackage{../langs}
 | 
| 588 | 5 | \usepackage{../grammar}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 6 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 7 | \begin{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 8 | |
| 292 
7ed2a25dd115
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
217diff
changeset | 9 | \section*{Handout 6 (Parser Combinators)}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 10 | |
| 584 | 11 | This handout explains how \emph{parser combinators} work and how they
 | 
| 587 | 12 | can be implemented in Scala. Their most distinguishing feature is that | 
| 13 | they are very easy to implement (admittedly it is only easy in a | |
| 14 | functional programming language). Another good point of parser | |
| 15 | combinators is that they can deal with any kind of input as long as | |
| 16 | this input is of ``sequence-kind'', for example a string or a list of | |
| 17 | tokens. The only two properties of the input we need is to be able to | |
| 18 | test when it is empty and ``sequentially'' take it apart. Strings and | |
| 19 | lists fit this bill. However, parser combinators also have their | |
| 20 | drawbacks. For example they require that the grammar to be parsed is | |
| 21 | \emph{not} left-recursive and they are efficient only when the grammar
 | |
| 22 | is unambiguous. It is the responsibility of the grammar designer to | |
| 23 | ensure these two properties. | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 24 | |
| 587 | 25 | The general idea behind parser combinators is to transform the input | 
| 26 | into sets of pairs, like so | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 27 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 28 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 29 | $\underbrace{\text{list of tokens}}_{\text{input}}$ 
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 30 | $\Rightarrow$ | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 31 | $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 32 | \end{center} 
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 33 | |
| 587 | 34 | \noindent | 
| 590 | 35 | Given the extended effort we have spent implementing a lexer in order | 
| 36 | to generate list of tokens, it might be surprising that in what | |
| 37 | follows we shall often use strings as input, rather than list of | |
| 38 | tokens. This is for making the explanation more lucid. It does not | |
| 39 | make our previous work on lexers obsolete (remember they transform a | |
| 40 | string into a list of tokens). Lexers will still be needed for | |
| 41 | building a somewhat realistic compiler. | |
| 584 | 42 | |
| 590 | 43 | As mentioned above, parser combinators are relatively agnostic about what | 
| 587 | 44 | kind of input they process. In my Scala code I use the following | 
| 45 | polymorphic types for parser combinators: | |
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 46 | |
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 47 | \begin{center}
 | 
| 584 | 48 | input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 49 | \end{center}
 | 
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 50 | |
| 587 | 51 | \noindent That is they take as input something of type \texttt{I} and
 | 
| 590 | 52 | return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
 | 
| 53 | needs to be of ``sequence-kind'', I actually have to often write | |
| 54 | \texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
 | |
| 584 | 55 | subtype of Scala sequences. The first component of the generated pairs | 
| 56 | corresponds to what the parser combinator was able to process from the | |
| 590 | 57 | input and the second is the unprocessed part, or leftover, of the | 
| 58 | input (therefore the type of this unprocessed part is the same as the | |
| 59 | input). A parser combinator might return more than one such pair; the | |
| 60 | idea is that there are potentially several ways of how to parse the | |
| 61 | input. As a concrete example, consider the string | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 62 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 63 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 64 | \tt\Grid{iffoo\VS testbar}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 65 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 66 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 67 | \noindent We might have a parser combinator which tries to | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 68 | interpret this string as a keyword (\texttt{if}) or as an
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 69 | identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 70 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 71 | \begin{center}
 | 
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 72 | $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 73 |            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 74 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 75 | |
| 587 | 76 | \noindent where the first pair means the parser could recognise | 
| 590 | 77 | \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
 | 
| 78 | `unprocessed' part; in the other case it could recognise | |
| 587 | 79 | \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
 | 
| 80 | parser cannot recognise anything from the input at all, then parser | |
| 81 | combinators just return the empty set $\{\}$. This will indicate
 | |
| 82 | something ``went wrong''\ldots or more precisely, nothing could be | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 83 | parsed. | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 84 | |
| 584 | 85 | Also important to note is that the type \texttt{T} for the processed
 | 
| 590 | 86 | part is different from the input type \texttt{I} in the parse. In the
 | 
| 87 | example above is just happens to be the same. The reason for the | |
| 88 | potential difference is that in general we are interested in | |
| 89 | transforming our input into something ``different''\ldots for example | |
| 90 | into a tree; or if we implement the grammar for arithmetic | |
| 91 | expressions, we might be interested in the actual integer number the | |
| 92 | arithmetic expression, say \texttt{1 + 2 * 3}, stands for. In this way
 | |
| 93 | we can use parser combinators to implement relatively easily a | |
| 94 | calculator, for instance. | |
| 584 | 95 | |
| 96 | The main idea of parser combinators is that we can easily build parser | |
| 97 | combinators out of smaller components following very closely the | |
| 98 | structure of a grammar. In order to implement this in an | |
| 99 | object-oriented programming language, like Scala, we need to specify | |
| 100 | an abstract class for parser combinators. This abstract class states | |
| 101 | that the function \texttt{parse} takes an argument of type \texttt{I}
 | |
| 102 | and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 103 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 104 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 105 | \begin{lstlisting}[language=Scala]
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 106 | abstract class Parser[I, T] {
 | 
| 590 | 107 | def parse(in: I) : Set[(T, I)] | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 108 | |
| 590 | 109 | def parse_all(in: I) : Set[T] = | 
| 110 | for ((head, tail) <- parse(in); if (tail.isEmpty)) | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 111 | yield head | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 112 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 113 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 114 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 115 | |
| 584 | 116 | \noindent It is the obligation in each instance (parser combinator) to | 
| 117 | supply an implementation for \texttt{parse}.  From this function we
 | |
| 118 | can then ``centrally'' derive the function \texttt{parse\_all}, which
 | |
| 119 | just filters out all pairs whose second component is not empty (that | |
| 120 | is has still some unprocessed part). The reason is that at the end of | |
| 121 | the parsing we are only interested in the results where all the input | |
| 122 | has been consumed and no unprocessed part is left over. | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 123 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 124 | One of the simplest parser combinators recognises just a | 
| 584 | 125 | single character, say $c$, from the beginning of strings. Its | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 126 | behaviour can be described as follows: | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 127 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 128 | \begin{itemize}
 | 
| 584 | 129 | \item If the head of the input string starts with a $c$, then return | 
| 130 | the set | |
| 131 |   \[\{(c, \textit{tail of}\; s)\}\]
 | |
| 132 |   where \textit{tail of} 
 | |
| 133 | $s$ is the unprocessed part of the input string. | |
| 134 | \item Otherwise return the empty set $\{\}$.	
 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 135 | \end{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 136 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 137 | \noindent | 
| 590 | 138 | The input type of this simple parser combinator is \texttt{String} and
 | 
| 139 | the output type is \texttt{Char}. This means \texttt{parse} returns
 | |
| 140 | \mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 141 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 142 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 143 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 144 | case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 587 | 145 | def parse(in: String) = | 
| 146 | if (in.head == c) Set((c, in.tail)) else Set() | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 147 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 148 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 149 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 150 | |
| 589 | 151 | \noindent You can see \texttt{parse} tests whether the
 | 
| 587 | 152 | first character of the input string \texttt{in} is equal to
 | 
| 584 | 153 | \texttt{c}. If yes, then it splits the string into the recognised part
 | 
| 587 | 154 | \texttt{c} and the unprocessed part \texttt{in.tail}. In case
 | 
| 155 | \texttt{in} does not start with \texttt{c} then the parser returns the
 | |
| 584 | 156 | empty set (in Scala \texttt{Set()}). Since this parser recognises
 | 
| 157 | characters and just returns characters as the processed part, the | |
| 158 | output type of the parser is \texttt{Char}.
 | |
| 159 | ||
| 160 | If we want to parse a list of tokens and interested in recognising a | |
| 590 | 161 | number token, for example, we could write something like this | 
| 584 | 162 | |
| 163 | \begin{center}
 | |
| 164 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
 | |
| 165 | case object NumParser extends Parser[List[Token], Int] {
 | |
| 166 |   def parse(ts: List[Token]) = ts match {
 | |
| 167 | case Num_token(s)::ts => Set((s.toInt, ts)) | |
| 168 | case _ => Set () | |
| 169 | } | |
| 170 | } | |
| 171 | \end{lstlisting}
 | |
| 172 | \end{center}
 | |
| 173 | ||
| 174 | \noindent | |
| 175 | In this parser the input is of type \texttt{List[Token]}. The function
 | |
| 176 | parse looks at the input \texttt{ts} and checks whether the first
 | |
| 589 | 177 | token is a \texttt{Num\_token} (let us assume our lexer generated
 | 
| 178 | these tokens for numbers). But this parser does not just return this | |
| 584 | 179 | token (and the rest of the list), like the \texttt{CharParser} above,
 | 
| 590 | 180 | rather it extracts also the string \texttt{s} from the token and
 | 
| 181 | converts it into an integer. The hope is that the lexer did its work | |
| 182 | well and this conversion always succeeds. The consequence of this is | |
| 183 | that the output type for this parser is \texttt{Int}, not
 | |
| 184 | \texttt{Token}. Such a conversion would be needed if we want to
 | |
| 185 | implement a simple calculator program, because string-numbers need to | |
| 186 | be transformed into \texttt{Int}-numbers in order to do the
 | |
| 187 | calculations. | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 188 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 189 | |
| 584 | 190 | These simple parsers that just look at the input and do a simple | 
| 191 | transformation are often called \emph{atomic} parser combinators.
 | |
| 192 | More interesting are the parser combinators that build larger parsers | |
| 587 | 193 | out of smaller component parsers. There are three such parser | 
| 194 | combinators that can be implemented generically. The \emph{alternative
 | |
| 584 | 195 | parser combinator} is as follows: given two parsers, say, $p$ and | 
| 196 | $q$, we apply both parsers to the input (remember parsers are | |
| 587 | 197 | functions) and combine the output (remember they are sets of pairs): | 
| 198 | ||
| 199 | \begin{center}
 | |
| 200 | $p(\text{input}) \cup q(\text{input})$
 | |
| 201 | \end{center}
 | |
| 202 | ||
| 203 | \noindent In Scala we can implement alternative parser | |
| 204 | combinator as follows | |
| 205 | ||
| 206 | \begin{center}
 | |
| 207 | \begin{lstlisting}[language=Scala, numbers=none]
 | |
| 208 | class AltParser[I, T] | |
| 209 | (p: => Parser[I, T], | |
| 210 |         q: => Parser[I, T]) extends Parser[I, T] {
 | |
| 211 | def parse(in: I) = p.parse(in) ++ q.parse(in) | |
| 212 | } | |
| 213 | \end{lstlisting}
 | |
| 214 | \end{center}
 | |
| 215 | ||
| 216 | \noindent The types of this parser combinator are again generic (we | |
| 217 | have \texttt{I} for the input type, and \texttt{T} for the output
 | |
| 218 | type). The alternative parser builds a new parser out of two existing | |
| 590 | 219 | parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
 | 
| 220 | parsers need to be able to process input of type \texttt{I} and return
 | |
| 221 | in \texttt{parse} the same output type \texttt{Set[(T,
 | |
| 587 | 222 |   I)]}.\footnote{There is an interesting detail of Scala, namely the
 | 
| 223 |   \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
 | |
| 224 | will prevent the evaluation of the arguments before they are | |
| 225 |   used. This is often called \emph{lazy evaluation} of the
 | |
| 590 | 226 | arguments. We will explain this later.} The alternative parser runs | 
| 227 | the input with the first parser \texttt{p} (producing a set of pairs)
 | |
| 228 | and then runs the same input with \texttt{q} (producing another set of
 | |
| 229 | pairs). The result should be then just the union of both sets, which | |
| 230 | is the operation \texttt{++} in Scala.
 | |
| 587 | 231 | |
| 232 | The alternative parser combinator allows us to construct a parser that | |
| 233 | parses either a character \texttt{a} or \texttt{b} using the
 | |
| 234 | \texttt{CharParser} shown above. For this we can write
 | |
| 235 | ||
| 236 | \begin{center}
 | |
| 237 | \begin{lstlisting}[language=Scala, numbers=none]
 | |
| 238 | new AltParser(CharParser('a'), CharParser('b'))
 | |
| 239 | \end{lstlisting}
 | |
| 240 | \end{center}
 | |
| 241 | ||
| 242 | \noindent Later on we will use Scala mechanism for introducing some | |
| 589 | 243 | more readable shorthand notation for this, like \texttt{"a" |
 | 
| 587 | 244 | "b"}. Let us look in detail at what this parser combinator produces | 
| 590 | 245 | with some sample strings. | 
| 587 | 246 | |
| 247 | \begin{center}
 | |
| 248 | \begin{tabular}{rcl}
 | |
| 249 | input strings & & output\medskip\\ | |
| 250 | \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
 | |
| 251 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
 | |
| 252 | \texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
 | |
| 253 | \end{tabular}
 | |
| 254 | \end{center}
 | |
| 255 | ||
| 256 | \noindent We receive in the first two cases a successful | |
| 257 | output (that is a non-empty set). In each case, either | |
| 258 | \pcode{a} or \pcode{b} is in the processed part, and
 | |
| 259 | \pcode{cde} in the unprocessed part. Clearly this parser cannot
 | |
| 260 | parse anything with \pcode{ccde}, therefore the empty
 | |
| 261 | set is returned. | |
| 262 | ||
| 263 | A bit more interesting is the \emph{sequence parser combinator}. Given
 | |
| 264 | two parsers, say again, $p$ and $q$, we want to apply first the input | |
| 590 | 265 | to $p$ producing a set of pairs; then apply $q$ to all the unparsed | 
| 587 | 266 | parts in the pairs; and then combine the results. Mathematically we would | 
| 590 | 267 | write something like this for the result set of pairs: | 
| 587 | 268 | |
| 269 | \begin{center}
 | |
| 270 | \begin{tabular}{lcl}
 | |
| 271 | $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
 | |
| 272 | $(\textit{output}_1, u_1) \in p(\text{input}) 
 | |
| 273 | \;\wedge\;$\\ | |
| 274 | && $(\textit{output}_2, u_2) \in q(u_1)\}$
 | |
| 275 | \end{tabular}
 | |
| 276 | \end{center}
 | |
| 277 | ||
| 278 | \noindent Notice that the $p$ will first be run on the input, | |
| 590 | 279 | producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
 | 
| 280 | stands for the unprocessed, or leftover, parts pf $p$. We want that | |
| 281 | $q$ runs on all these unprocessed parts $u_1$. Therefore these | |
| 282 | unprocessed parts are fed into the second parser $q$. The overall | |
| 283 | result of the sequence parser combinator is pairs of the form | |
| 584 | 284 | $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
 | 
| 590 | 285 | unprocessed part of the sequqnce p[arser combinator is the unprocessed | 
| 286 | part the second parser $q$ leaves as leftover. The processed parts of | |
| 287 | of the component parsers is a pair consisting of the outputs of $p$ | |
| 288 | and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This
 | |
| 289 | behaviour can be implemented in Scala as follows: | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 290 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 291 | \begin{center}
 | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 292 | \begin{lstlisting}[language=Scala,numbers=none]
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 293 | class SeqParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 294 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 295 |         q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 587 | 296 | def parse(in: I) = | 
| 297 | for ((output1, u1) <- p.parse(in); | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 298 | (output2, u2) <- q.parse(u1)) | 
| 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 299 | yield ((output1, output2), u2) | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 300 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 301 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 302 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 303 | |
| 587 | 304 | \noindent This parser takes again as arguments two parsers, \texttt{p}
 | 
| 305 | and \texttt{q}. It implements \texttt{parse} as follows: let first run
 | |
| 306 | the parser \texttt{p} on the input producing a set of pairs
 | |
| 307 | (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
 | |
| 308 | unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
 | |
| 309 | unprocessed parts producing again a set of pairs. The output of the | |
| 310 | sequence parser combinator is then a set containing pairs where the | |
| 311 | first components are again pairs, namely what the first parser could | |
| 312 | parse together with what the second parser could parse; the second | |
| 313 | component is the unprocessed part left over after running the second | |
| 314 | parser \texttt{q}. Therefore the input type of the sequence parser
 | |
| 315 | combinator is as usual \texttt{I}, but the output type is
 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 316 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 317 | \begin{center}
 | 
| 590 | 318 | \texttt{(T, S)}
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 319 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 320 | |
| 584 | 321 | \noindent | 
| 590 | 322 | This means \texttt{parse} in the sequence parser combinator returns
 | 
| 323 | sets of type \texttt{Set[((T, S), I)]}.  Notice that we have
 | |
| 324 | essentially two output types for the sequence parser combinator | |
| 325 | (packaged in a pair), because in general \textit{p} and \textit{q}
 | |
| 326 | might produce different things (for example first we recognise a | |
| 327 | number and then a string corresponding to an operator). If any of the | |
| 328 | runs of \textit{p} and \textit{q} fail, that is produce the empty set,
 | |
| 329 | then \texttt{parse} will also produce the empty set.
 | |
| 584 | 330 | |
| 587 | 331 | With the shorthand notation we shall introduce later for the sequence | 
| 332 | parser combinator, we can write for example \pcode{"a" ~ "b"}, which
 | |
| 333 | is the parser combinator that first recognises the character | |
| 334 | \texttt{a} from a string and then \texttt{b}. Let us look again at
 | |
| 335 | three examples of how this parser combinator processes some strings: | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 336 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 337 | \begin{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 338 | \begin{tabular}{rcl}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 339 | input strings & & output\medskip\\ | 
| 584 | 340 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
 | 
| 341 | \texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
 | |
| 342 | \texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 343 | \end{tabular}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 344 | \end{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 345 | |
| 586 | 346 | \noindent In the first line we have a successful parse, because the | 
| 587 | 347 | string starts with \texttt{ab}, which is the prefix we are looking
 | 
| 584 | 348 | for. But since the parsing combinator is constructed as sequence of | 
| 349 | the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
 | |
| 350 | result is a nested pair of the form \texttt{((a, b), cde)}. It is
 | |
| 586 | 351 | \emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
 | 
| 587 | 352 | expect. The parser returns the empty set in the other examples, | 
| 584 | 353 | because they do not fit with what the parser is supposed to parse. | 
| 354 | ||
| 355 | ||
| 589 | 356 | A slightly more complicated parser is \pcode{("a" | "b") ~ "c"} which
 | 
| 587 | 357 | parses as first character either an \texttt{a} or \texttt{b}, followed
 | 
| 358 | by a \texttt{c}. This parser produces the following outputs.
 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 359 | |
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 360 | \begin{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 361 | \begin{tabular}{rcl}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 362 | input strings & & output\medskip\\ | 
| 585 | 363 | \texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | 
| 364 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | |
| 365 | \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 366 | \end{tabular}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 367 | \end{center}
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 368 | |
| 585 | 369 | \noindent | 
| 370 | Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
 | |
| 371 | \texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
 | |
| 372 | the following outputs. | |
| 373 | ||
| 374 | \begin{center}
 | |
| 375 | \begin{tabular}{rcl}
 | |
| 376 | input strings & & output\medskip\\ | |
| 377 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
 | |
| 378 | \texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
 | |
| 379 | \texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
 | |
| 380 | \end{tabular}
 | |
| 381 | \end{center}
 | |
| 382 | ||
| 383 | ||
| 384 | \noindent The second and third example fail, because something is | |
| 590 | 385 | ``missing'' in the sequence we are looking for. The first succeeds but | 
| 386 | notice how the results nest with sequences: the parsed part is a | |
| 387 | nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
 | |
| 388 | parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
 | |
| 389 | our output pairs nest differently | |
| 589 | 390 | |
| 391 | \begin{center}
 | |
| 392 | \begin{tabular}{rcl}
 | |
| 393 | input strings & & output\medskip\\ | |
| 394 | \texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})), \texttt{\Grid{de}})\right\}$\\
 | |
| 395 | \end{tabular}
 | |
| 396 | \end{center}
 | |
| 397 | ||
| 398 | \noindent | |
| 399 | Two more examples: first consider the parser | |
| 585 | 400 | \pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 401 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 402 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 403 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 404 | input string & & output\medskip\\ | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 405 | \texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 406 | $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 407 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 408 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 409 | |
| 587 | 410 | \noindent Notice how again the results nest deeper and deeper as pairs (the | 
| 585 | 411 | last \pcode{a} is in the unprocessed part). To consume everything of
 | 
| 412 | this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
 | |
| 413 | "a"}. Then the output is as follows: | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 414 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 415 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 416 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 417 | input string & & output\medskip\\ | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 418 | \texttt{\Grid{aaaa}} & $\rightarrow$ & 
 | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 419 | $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 420 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 421 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 422 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 423 | \noindent This is an instance where the parser consumed | 
| 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 424 | completely the input, meaning the unprocessed part is just the | 
| 587 | 425 | empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
 | 
| 585 | 426 | we would get back the result | 
| 427 | ||
| 428 | \[ | |
| 429 | \left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
 | |
| 430 | \] | |
| 431 | ||
| 432 | \noindent where the unprocessed (empty) parts have been stripped away | |
| 433 | from the pairs; everything where the second part was not empty has | |
| 587 | 434 | been thrown away as well, because they represent | 
| 590 | 435 | ultimately-unsuccessful-parses. The main point is that the sequence | 
| 436 | parser combinator returns pairs that can nest according to the | |
| 437 | nesting of the component parsers. | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 438 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 439 | |
| 590 | 440 | Consider also carefully that constructing a parser such \pcode{"a" |
 | 
| 441 |   ("a" ~ "b")} will result in a typing error. The intention with this
 | |
| 442 | parser is that we want to parse an \texttt{a}, or an \texttt{a}
 | |
| 443 | followed by a \texttt{b}. However, the first parser has as output type
 | |
| 444 | a single character (recall the type of \texttt{CharParser}), but the
 | |
| 445 | second parser produces a pair of characters as output. The alternative | |
| 446 | parser is required to have both component parsers to have the same | |
| 447 | type---we need to be able to build the union of two sets, which means | |
| 448 | in Scala they need to be of the same type. Since ther are not, there | |
| 449 | is a typing error in this example. We will see later how we can build | |
| 587 | 450 | this parser without the typing error. | 
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
297diff
changeset | 451 | |
| 587 | 452 | The next parser combinator, called \emph{semantic action}, does not
 | 
| 453 | actually combine smaller parsers, but applies a function to the result | |
| 454 | of a parser. It is implemented in Scala as follows | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 455 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 456 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 457 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 458 | class FunParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 459 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 460 |           f: T => S) extends Parser[I, S] {
 | 
| 587 | 461 | def parse(in: I) = | 
| 462 | for ((head, tail) <- p.parse(in)) yield (f(head), tail) | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 463 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 464 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 465 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 466 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 467 | |
| 590 | 468 | \noindent This parser combinator takes a parser \texttt{p} (with input
 | 
| 469 | type \texttt{I} and output type \texttt{T}) as one argument but also a
 | |
| 470 | function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
 | |
| 471 | produces sets of type \texttt{Set[(T, I)]}. The semantic action
 | |
| 472 | combinator then applies the function \texttt{f} to all the `processed'
 | |
| 473 | parser outputs. Since this function is of type \texttt{T => S}, we
 | |
| 474 | obtain a parser with output type \texttt{S}. Again Scala lets us
 | |
| 475 | introduce some shorthand notation for this parser | |
| 476 | combinator. Therefore we will write \texttt{p ==> f} for it.
 | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 477 | |
| 589 | 478 | What are semantic actions good for? Well, they allow you to transform | 
| 590 | 479 | the parsed input into datastructures you can use for further | 
| 587 | 480 | processing. A simple example would be to transform parsed characters | 
| 481 | into ASCII numbers. Suppose we define a function \texttt{f} (from
 | |
| 589 | 482 | characters to ints) and use a \texttt{CharParser} for parsing
 | 
| 483 | the character \texttt{c}.
 | |
| 587 | 484 | |
| 485 | \begin{center}
 | |
| 486 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 487 | val f = (c: Char) => c.toInt | |
| 488 | val c = new CharParser('c')
 | |
| 489 | \end{lstlisting}
 | |
| 490 | \end{center}
 | |
| 491 | ||
| 492 | \noindent | |
| 589 | 493 | We then can run the following two parsers on the input \texttt{cbd}:
 | 
| 587 | 494 | |
| 495 | \begin{center}
 | |
| 496 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 497 | c.parse("cbd")
 | |
| 498 | (c ==> f).parse("cbd")
 | |
| 499 | \end{lstlisting}
 | |
| 500 | \end{center}
 | |
| 501 | ||
| 502 | \noindent | |
| 589 | 503 | In the first line we obtain the expected result \texttt{Set(('c',
 | 
| 504 |   "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
 | |
| 505 | character has been transformed into an ASCII number. | |
| 588 | 506 | |
| 507 | A slightly less contrived example is about parsing numbers (recall | |
| 508 | \texttt{NumParser} above). However, we want to do this here for strings.
 | |
| 509 | For this assume we have the following \texttt{RegexParser}.
 | |
| 510 | ||
| 511 | \begin{center}
 | |
| 512 |   \begin{lstlisting}[language=Scala,xleftmargin=0mm,
 | |
| 513 | basicstyle=\small\ttfamily, numbers=none] | |
| 514 | import scala.util.matching.Regex | |
| 515 | ||
| 516 | case class RegexParser(reg: Regex) extends Parser[String, String] {
 | |
| 517 |   def parse(in: String) = reg.findPrefixMatchOf(in) match {
 | |
| 518 | case None => Set() | |
| 519 | case Some(m) => Set((m.matched, m.after.toString)) | |
| 520 | } | |
| 521 | } | |
| 522 | \end{lstlisting}
 | |
| 523 | \end{center}
 | |
| 524 | ||
| 525 | \noindent | |
| 526 | This parser takes a regex as argument and splits up a string into a | |
| 527 | prefix and the rest according to this regex | |
| 528 | (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
 | |
| 529 | case---and the corresponding strings can be extracted with | |
| 589 | 530 | \texttt{matched} and \texttt{after}). Using this parser we can define a
 | 
| 588 | 531 | \texttt{NumParser} for strings as follows:
 | 
| 532 | ||
| 533 | \begin{center}
 | |
| 534 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 535 | val NumParser = RegexParser("[0-9]+".r)
 | |
| 536 | \end{lstlisting}
 | |
| 537 | \end{center}
 | |
| 538 | ||
| 539 | \noindent | |
| 540 | This parser will recognise a number at the beginning of a string, for | |
| 541 | example | |
| 542 | ||
| 543 | \begin{center}
 | |
| 544 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 545 | NumParser.parse("123abc")
 | |
| 546 | \end{lstlisting}
 | |
| 547 | \end{center}  
 | |
| 548 | ||
| 549 | \noindent | |
| 550 | produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
 | |
| 590 | 551 | still a string (the required double-quotes are not printed by | 
| 552 | Scala). We want to convert this string into the corresponding | |
| 553 | \texttt{Int}. We can do this as follows using a semantic action
 | |
| 588 | 554 | |
| 555 | \begin{center}
 | |
| 556 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 557 | (NumParser ==> (s => s.toInt)).parse("123abc")
 | |
| 558 | \end{lstlisting}
 | |
| 559 | \end{center}  
 | |
| 560 | ||
| 561 | \noindent | |
| 589 | 562 | The function in the semantic action converts a string into an | 
| 588 | 563 | \texttt{Int}. Let us come back to semantic actions when we are going
 | 
| 589 | 564 | to implement actual context-free gammars. | 
| 587 | 565 | |
| 566 | \subsubsection*{Shorthand notation for parser combinators}
 | |
| 567 | ||
| 568 | Before we proceed, let us just explain the shorthand notation for | |
| 569 | parser combinators. Like for regular expressions, the shorthand notation | |
| 590 | 570 | will make our life much easier when writing actual parsers. We can define | 
| 571 | some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and
 | |
| 572 | \pcode{p ==> f} as well as to use plain strings for specifying simple
 | |
| 573 | string parsers. | |
| 574 | ||
| 575 | The idea is that this shorthand notation allows us to easily translate | |
| 576 | context-free grammars into code. For example recall our context-free | |
| 577 | grammar for palindromes: | |
| 578 | ||
| 579 | \begin{plstx}[margin=3cm]
 | |
| 580 | : \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
 | |
| 581 | \end{plstx}
 | |
| 582 | ||
| 583 | \noindent | |
| 584 | Each alternative in this grammar translates into an alternative parser | |
| 585 | combinator. The $\cdot$ can be translated to a sequence parser | |
| 586 | combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply | |
| 587 | written as \texttt{"a"}, \texttt{"b"} and \texttt{""}.
 | |
| 588 | ||
| 587 | 589 | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 590 | \subsubsection*{How to build parsers using parser combinators?}
 | 
| 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 591 | |
| 588 | 592 | The beauty of parser combinators is the ease with which they can be | 
| 593 | implemented and how easy it is to translate context-free grammars into | |
| 594 | code (though the grammars need to be non-left-recursive). To | |
| 590 | 595 | demonstrate this recall the grammar for palindromes from above. | 
| 596 | The first idea would be to translate it into the following code | |
| 588 | 597 | |
| 598 | \begin{center}
 | |
| 599 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 600 | lazy val Pal : Parser[String, String] = | |
| 601 |   (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
 | |
| 602 | \end{lstlisting}
 | |
| 603 | \end{center}
 | |
| 604 | ||
| 605 | \noindent | |
| 590 | 606 | Unfortunately, this does not quite work yet as it produces a typing | 
| 607 | error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
 | |
| 608 | \texttt{""} all produce strings as output type and therefore can be
 | |
| 609 | put into an alternative \texttt{...| "a" | "b" | ""}. But both
 | |
| 610 | \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of
 | |
| 611 | the form $(((\_, \_), \_), \_)$---that is how the sequence parser | |
| 612 | combinator nests results when \pcode{\~} is used between two
 | |
| 613 | components. The solution is to use a semantic action that ``flattens'' | |
| 614 | these pairs and appends the corresponding strings, like | |
| 588 | 615 | |
| 616 | \begin{center}
 | |
| 617 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | |
| 618 | lazy val Pal : Parser[String, String] = | |
| 619 |   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
 | |
| 620 |    ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
 | |
| 621 | "a" | "b" | "") | |
| 622 | \end{lstlisting}
 | |
| 623 | \end{center}
 | |
| 624 | ||
| 589 | 625 | \noindent | 
| 590 | 626 | Now in all cases we have strings as output type for the parser variants. | 
| 589 | 627 | The semantic action | 
| 628 | ||
| 629 | ||
| 630 | \noindent | |
| 631 | Important to note is that we must define \texttt{Pal}-parser as a
 | |
| 632 | \emph{lazy} value.
 | |
| 588 | 633 | |
| 386 
31295bb945c6
update
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
385diff
changeset | 634 | \subsubsection*{Implementing an Interpreter}
 | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 635 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 636 | %\bigskip | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 637 | %takes advantage of the full generality---have a look | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 638 | %what it produces if we call it with the string \texttt{abc}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 639 | % | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 640 | %\begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 641 | %\begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 642 | %input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 643 | %\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 644 | %\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 645 | %\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 646 | %\end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 647 | %\end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 648 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 649 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 650 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 651 | \end{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 652 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 653 | %%% Local Variables: | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 654 | %%% mode: latex | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 655 | %%% TeX-master: t | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 656 | %%% End: |