handouts/ho06.tex
author Christian Urban <urbanc@in.tum.de>
Thu, 25 Oct 2018 00:50:58 +0100
changeset 588 a4646557016d
parent 587 5ddedcd92d84
child 589 0451b8b67f62
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
     1
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\documentclass{article}
297
5c51839c88fd updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 292
diff changeset
     3
\usepackage{../style}
217
cd6066f1056a updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 183
diff changeset
     4
\usepackage{../langs}
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
     5
\usepackage{../grammar}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
\begin{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
     9
\section*{Handout 6 (Parser Combinators)}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    10
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    11
This handout explains how \emph{parser combinators} work and how they
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    12
can be implemented in Scala. Their most distinguishing feature is that
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    13
they are very easy to implement (admittedly it is only easy in a
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    14
functional programming language).  Another good point of parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    15
combinators is that they can deal with any kind of input as long as
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    16
this input is of ``sequence-kind'', for example a string or a list of
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    17
tokens. The only two properties of the input we need is to be able to
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    18
test when it is empty and ``sequentially'' take it apart. Strings and
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    19
lists fit this bill. However, parser combinators also have their
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    20
drawbacks. For example they require that the grammar to be parsed is
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    21
\emph{not} left-recursive and they are efficient only when the grammar
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    22
is unambiguous. It is the responsibility of the grammar designer to
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    23
ensure these two properties.
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    24
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    25
The general idea behind parser combinators is to transform the input
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    26
into sets of pairs, like so
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    27
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    28
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    29
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    30
$\Rightarrow$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    31
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    32
\end{center} 
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    33
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    34
\noindent
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    35
Given the extended effort we have spent in order to implement a lexer
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    36
in order to generate list of tokens, it might be surprising that in
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    37
what follows we shall often use strings as input. This is for making
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    38
the explanation more lucid. It does not make our previous work on
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    39
lexers obsolete (remember they transform a string into a list of
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    40
tokens). Lexers will still be needed for building a somewhat realistic
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    41
compiler.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    42
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    43
But as said, parser combinators are relatively agnostic about what
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    44
kind of input they process. In my Scala code I use the following
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    45
polymorphic types for parser combinators:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    46
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    47
\begin{center}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    48
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    49
\end{center}
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    50
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    51
\noindent That is they take as input something of type \texttt{I} and
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    52
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input needs
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    53
to be of ``sequence-kind'', I actually have to often write \texttt{I
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    54
  <\% Seq[\_]} for the input type. This ensures the input is a
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    55
subtype of Scala sequences. The first component of the generated pairs
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    56
corresponds to what the parser combinator was able to process from the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    57
input and the second is the unprocessed part of the input (therefore
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    58
the type of this unprocessed part is the same as the input). As we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    59
shall see shortly, a parser combinator might return more than one such
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    60
pair; the idea is that there are potentially several ways of how to
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    61
parse the input.  As a concrete example, consider the string
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    62
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    63
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    64
\tt\Grid{iffoo\VS testbar}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    65
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    66
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    67
\noindent We might have a parser combinator which tries to
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    68
interpret this string as a keyword (\texttt{if}) or as an
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    69
identifier (\texttt{iffoo}). Then the output will be the set
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
    70
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    71
\begin{center}
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    72
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    73
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    74
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    75
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    76
\noindent where the first pair means the parser could recognise
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    77
\texttt{if} from the input and leaves the rest as `unprocessed' as the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    78
second component of the pair; in the other case it could recognise
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    79
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    80
parser cannot recognise anything from the input at all, then parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    81
combinators just return the empty set $\{\}$. This will indicate
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    82
something ``went wrong''\ldots or more precisely, nothing could be
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    83
parsed.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    84
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    85
Also important to note is that the type \texttt{T} for the processed
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    86
part is different from the input type. In the example above is just
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    87
happens to be the same. The reason for the difference is that in
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    88
general we are interested in transforming our input into something
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    89
``different''\ldots for example into a tree; or if we implement the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    90
grammar for arithmetic expressions, we might be interested in the
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    91
actual integer number the arithmetic expression, say \texttt{1 + 2 *
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    92
  3}, stands for. In this way we can use parser combinators to
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    93
implement relatively easily a calculator, for instance.
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    94
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    95
The main idea of parser combinators is that we can easily build parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    96
combinators out of smaller components following very closely the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    97
structure of a grammar. In order to implement this in an
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    98
object-oriented programming language, like Scala, we need to specify
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    99
an abstract class for parser combinators. This abstract class states
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   100
that the function \texttt{parse} takes an argument of type \texttt{I}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   101
and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   102
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   103
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   104
\begin{lstlisting}[language=Scala]
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   105
abstract class Parser[I, T] {
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   106
  def parse(ts: I): Set[(T, I)]
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   107
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   108
  def parse_all(ts: I): Set[T] =
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   109
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   110
      yield head
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   111
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   112
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   113
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   114
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   115
\noindent It is the obligation in each instance (parser combinator) to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   116
supply an implementation for \texttt{parse}.  From this function we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   117
can then ``centrally'' derive the function \texttt{parse\_all}, which
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   118
just filters out all pairs whose second component is not empty (that
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   119
is has still some unprocessed part). The reason is that at the end of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   120
the parsing we are only interested in the results where all the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   121
has been consumed and no unprocessed part is left over.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   122
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   123
One of the simplest parser combinators recognises just a
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   124
single character, say $c$, from the beginning of strings. Its
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   125
behaviour can be described as follows:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   126
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   127
\begin{itemize}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   128
\item If the head of the input string starts with a $c$, then return 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   129
  the set
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   130
  \[\{(c, \textit{tail of}\; s)\}\]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   131
  where \textit{tail of} 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   132
  $s$ is the unprocessed part of the input string.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   133
\item Otherwise return the empty set $\{\}$.	
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   134
\end{itemize}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   135
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   136
\noindent
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   137
The input type of this simple parser combinator for characters is
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   138
\texttt{String} and the output type is \texttt{Char}. This means
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   139
\texttt{parse} returns  \mbox{\texttt{Set[(Char, String)]}}. 
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   140
The code in Scala is as follows:
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   141
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   142
\begin{center}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   143
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   144
case class CharParser(c: Char) extends Parser[String, Char] {
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   145
  def parse(in: String) = 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   146
    if (in.head == c) Set((c, in.tail)) else Set()
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   147
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   148
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   149
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   150
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   151
\noindent You can see the \texttt{parse} tests whether the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   152
first character of the input string \texttt{in} is equal to
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   153
\texttt{c}. If yes, then it splits the string into the recognised part
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   154
\texttt{c} and the unprocessed part \texttt{in.tail}. In case
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   155
\texttt{in} does not start with \texttt{c} then the parser returns the
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   156
empty set (in Scala \texttt{Set()}). Since this parser recognises
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   157
characters and just returns characters as the processed part, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   158
output type of the parser is \texttt{Char}.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   159
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   160
If we want to parse a list of tokens and interested in recognising a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   161
number token, we could write something like this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   162
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   163
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   164
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   165
case object NumParser extends Parser[List[Token], Int] {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   166
  def parse(ts: List[Token]) = ts match {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   167
    case Num_token(s)::ts => Set((s.toInt, ts)) 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   168
    case _ => Set ()
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   169
  }
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   170
}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   171
\end{lstlisting}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   172
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   173
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   174
\noindent
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   175
In this parser the input is of type \texttt{List[Token]}. The function
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   176
parse looks at the input \texttt{ts} and checks whether the first
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   177
token is a \texttt{Num\_token}. Let us assume our lexer generated
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   178
these tokens for numbers. But this parser does not just return this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   179
token (and the rest of the list), like the \texttt{CharParser} above,
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   180
rather extracts the string \texttt{s} from the token and converts it
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   181
into an integer. The hope is that the lexer did its work well and this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   182
conversion always succeeds. The consequence of this is that the output
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   183
type for this parser is \texttt{Int}, not \texttt{Token}. Such a
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   184
conversion would be needed if we want to implement a simple calculator
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   185
program.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   186
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   187
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   188
These simple parsers that just look at the input and do a simple
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   189
transformation are often called \emph{atomic} parser combinators.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   190
More interesting are the parser combinators that build larger parsers
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   191
out of smaller component parsers. There are three such parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   192
combinators that can be implemented generically. The \emph{alternative
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   193
  parser combinator} is as follows: given two parsers, say, $p$ and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   194
$q$, we apply both parsers to the input (remember parsers are
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   195
functions) and combine the output (remember they are sets of pairs):
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   196
 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   197
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   198
$p(\text{input}) \cup q(\text{input})$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   199
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   200
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   201
\noindent In Scala we can implement alternative parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   202
combinator as follows
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   203
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   204
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   205
\begin{lstlisting}[language=Scala, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   206
class AltParser[I, T]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   207
       (p: => Parser[I, T], 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   208
        q: => Parser[I, T]) extends Parser[I, T] {
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   209
  def parse(in: I) = p.parse(in) ++ q.parse(in)
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   210
}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   211
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   212
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   213
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   214
\noindent The types of this parser combinator are again generic (we
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   215
have \texttt{I} for the input type, and \texttt{T} for the output
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   216
type). The alternative parser builds a new parser out of two existing
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   217
parsers \texttt{p} and \texttt{q} given as arguments.  Both parsers
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   218
need to be able to process input of type \texttt{I} and return in
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   219
\texttt{parse} the same output type \texttt{Set[(T,
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   220
  I)]}.\footnote{There is an interesting detail of Scala, namely the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   221
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   222
  will prevent the evaluation of the arguments before they are
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   223
  used. This is often called \emph{lazy evaluation} of the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   224
  arguments. We will explain this later.}  The alternative parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   225
should run the input with the first parser \texttt{p} (producing a set
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   226
of pairs) and then run the same input with \texttt{q} (producing
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   227
another set of pairs).  The result should be then just the union of
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   228
both sets, which is the operation \texttt{++} in Scala.
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   229
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   230
The alternative parser combinator allows us to construct a parser that
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   231
parses either a character \texttt{a} or \texttt{b} using the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   232
\texttt{CharParser} shown above. For this we can write
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   233
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   234
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   235
\begin{lstlisting}[language=Scala, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   236
new AltParser(CharParser('a'), CharParser('b'))
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   237
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   238
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   239
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   240
\noindent Later on we will use Scala mechanism for introducing some
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   241
more readable shorthand notation for this, like \texttt{"a" ||
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   242
  "b"}. Let us look in detail at what this parser combinator produces
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   243
with some sample strings
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   244
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   245
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   246
\begin{tabular}{rcl}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   247
input strings & & output\medskip\\
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   248
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   249
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   250
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   251
\end{tabular}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   252
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   253
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   254
\noindent We receive in the first two cases a successful
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   255
output (that is a non-empty set). In each case, either
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   256
\pcode{a} or \pcode{b} is in the processed part, and
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   257
\pcode{cde} in the unprocessed part. Clearly this parser cannot
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   258
parse anything with \pcode{ccde}, therefore the empty
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   259
set is returned.
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   260
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   261
A bit more interesting is the \emph{sequence parser combinator}. Given
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   262
two parsers, say again, $p$ and $q$, we want to apply first the input
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   263
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   264
parts in the pairs; and then combine the results. Mathematically we would
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   265
write something like this for the expected set of pairs:
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   266
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   267
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   268
\begin{tabular}{lcl}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   269
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   270
$(\textit{output}_1, u_1) \in p(\text{input}) 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   271
\;\wedge\;$\\
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   272
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   273
\end{tabular}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   274
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   275
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   276
\noindent Notice that the $p$ wil first be run on the input, producing
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   277
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   278
for the unprocessed, or left-over, parts. We want that $q$ runs on all
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   279
these unprocessed parts $u_1$. This again will produce some
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   280
processed part , $p$ and
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   281
$q$, we apply both parsers to the input (remember parsers are
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   282
functions) and combine the output (remember they are sets of pairs)
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   283
 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   284
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   285
$p(\text{input}) \cup q(\text{input})$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   286
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   287
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   288
\noindent In Scala we would implement alternative parser
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   289
combinator as follows
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   290
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   291
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   292
\begin{lstlisting}[language=Scala, numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   293
class AltParser[I, T]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   294
       (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   295
        q: => Parser[I, T]) extends Parser[I, T] {
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   296
  def parse(in: I) = p.parse(in) ++ q.parse(in)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   297
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   298
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   299
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   300
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   301
\noindent The types of this parser combinator are again generic (we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   302
just have \texttt{I} for the input type, and \texttt{T} for the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   303
type). The alternative parser builds a new parser out of two existing
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   304
parsers \texttt{p} and \texttt{q}.  Both need to be able to process
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   305
input of type \texttt{I} and return the same output type
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   306
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   307
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   308
  and \texttt{q}. They will prevent the evaluation of the arguments
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   309
  before they are used. This is often called \emph{lazy evaluation} of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   310
  the arguments. We will explain this later.}  Therefore the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   311
type of this parser is \texttt{T}. The alternative parser should run
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   312
the input with the first parser \texttt{p} (producing a set of pairs)
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   313
and then run the same input with \texttt{q} (producing another set of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   314
pairs).  The result should be then just the union of both sets, which
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   315
is the operation \texttt{++} in Scala.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   316
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   317
The alternative parser combinator already allows us to
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   318
construct a parser that parses either a character \texttt{a}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   319
or \texttt{b}, as
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   320
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   321
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   322
\begin{lstlisting}[language=Scala, numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   323
new AltParser(CharParser('a'), CharParser('b'))
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   324
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   325
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   326
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   327
\noindent Later on we will use again Scala mechanism for introducing some
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   328
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some sample strings 
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   329
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   330
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   331
\begin{tabular}{rcl}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   332
input strings & & output\medskip\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   333
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   334
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   335
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   336
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   337
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   338
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   339
\noindent We receive in the first two cases a successful
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   340
output (that is a non-empty set). In each case, either
392
2d0a59127694 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 386
diff changeset
   341
\pcode{a} or \pcode{b} is in the processed part, and
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   342
\pcode{cde} in the unprocessed part. Clearly this parser cannot
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   343
parse anything in the string \pcode{ccde}, therefore the empty
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   344
set.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   345
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   346
A bit more interesting is the \emph{sequence parser combinator}. Given
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   347
two parsers, say again, $p$ and $q$, we want to apply first the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   348
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   349
parts in the pairs; and then combine the results like
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   350
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   351
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   352
\begin{tabular}{lcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   353
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   354
$(\textit{output}_1, u_1) \in p(\text{input}) 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   355
\;\wedge\;$\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   356
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   357
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   358
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   359
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   360
\noindent Notice that the $p$ will first be run on the input,
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   361
producing pairs of the form $\textit{output}_1$ and unprocessed part
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   362
$u_1$. This unprocessed part is fed into the second parser $q$. The
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   363
overall result of the sequence parser combinator is pairs of the form
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   364
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   365
unprocessed part of both parsers is the unprocessed part the second
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   366
parser $q$ produces leaves as left-over. The processed parts of both
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   367
parsers is a pair consisting of the outputs of $p$ and $q$, namely
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   368
$(\textit{output}_1, \textit{output}_2)$. This behaviour can be
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   369
implemented in Scala as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   370
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   371
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   372
\begin{lstlisting}[language=Scala,numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   373
class SeqParser[I, T, S]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   374
       (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   375
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   376
  def parse(in: I) = 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   377
    for ((output1, u1) <- p.parse(in); 
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   378
         (output2, u2) <- q.parse(u1)) 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   379
            yield ((output1, output2), u2)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   380
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   381
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   382
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   383
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   384
\noindent This parser takes again as arguments two parsers, \texttt{p}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   385
and \texttt{q}. It implements \texttt{parse} as follows: let first run
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   386
the parser \texttt{p} on the input producing a set of pairs
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   387
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   388
unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   389
unprocessed parts producing again a set of pairs. The output of the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   390
sequence parser combinator is then a set containing pairs where the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   391
first components are again pairs, namely what the first parser could
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   392
parse together with what the second parser could parse; the second
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   393
component is the unprocessed part left over after running the second
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   394
parser \texttt{q}. Therefore the input type of the sequence parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   395
combinator is as usual \texttt{I}, but the output type is
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   396
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   397
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   398
\texttt{Set[((T, S), I)]}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   399
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   400
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   401
\noindent
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   402
If any of the runs of \textit{p} and \textit{q} fail, that is produce
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   403
the empty set, then \texttt{parse} will also produce the empty set.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   404
Notice that we have now two output types for the sequence parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   405
combinator, because in general \textit{p} and \textit{q} might produce
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   406
different things (for example first we recognise a number and then a
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   407
string corresponding to an operator).
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   408
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   409
With the shorthand notation we shall introduce later for the sequence
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   410
parser combinator, we can write for example \pcode{"a" ~ "b"}, which
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   411
is the parser combinator that first recognises the character
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   412
\texttt{a} from a string and then \texttt{b}. Let us look again at
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   413
three examples of how this parser combinator processes some strings:
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   414
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   415
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   416
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   417
input strings & & output\medskip\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   418
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   419
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   420
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   421
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   422
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   423
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   424
\noindent In the first line we have a successful parse, because the
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   425
string starts with \texttt{ab}, which is the prefix we are looking
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   426
for. But since the parsing combinator is constructed as sequence of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   427
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   428
result is a nested pair of the form \texttt{((a, b), cde)}. It is
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   429
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   430
expect.  The parser returns the empty set in the other examples,
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   431
because they do not fit with what the parser is supposed to parse.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   432
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   433
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   434
A slightly more complicated parser is \pcode{("a" || "b") ~ "c"} which
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   435
parses as first character either an \texttt{a} or \texttt{b}, followed
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   436
by a \texttt{c}. This parser produces the following outputs.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   437
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   438
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   439
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   440
input strings & & output\medskip\\
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   441
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   442
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   443
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   444
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   445
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   446
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   447
\noindent
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   448
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   449
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   450
the following outputs.
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   451
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   452
\begin{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   453
\begin{tabular}{rcl}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   454
input strings & & output\medskip\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   455
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   456
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   457
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   458
\end{tabular}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   459
\end{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   460
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   461
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   462
\noindent The second and third example fail, because something is
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   463
``missing'' in the sequence we are looking for. Also notice how the
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   464
results nest with sequences: the parsed part is a nested pair of the
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   465
form \pcode{((a, b), c)}. Two more examples: first consider the parser
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   466
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   467
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   468
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   469
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   470
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   471
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   472
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   473
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   474
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   475
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   476
\noindent Notice how again the results nest deeper and deeper as pairs (the
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   477
last \pcode{a} is in the unprocessed part). To consume everything of
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   478
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   479
  "a"}. Then the output is as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   480
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   481
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   482
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   483
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   484
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   485
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   486
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   487
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   488
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   489
\noindent This is an instance where the parser consumed
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   490
completely the input, meaning the unprocessed part is just the
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   491
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   492
we would get back the result
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   493
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   494
\[
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   495
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   496
\]
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   497
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   498
\noindent where the unprocessed (empty) parts have been stripped away
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   499
from the pairs; everything where the second part was not empty has
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   500
been thrown away as well, because they represent
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   501
ultimately-unsuccessful-parses.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   502
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   503
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   504
Note carefully that constructing a parser such \pcode{"a" || ("a" ~
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   505
  "b")} will result in a typing error. The intention is that we want
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   506
to parse an \texttt{a}, or an \texttt{a} followed by a
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   507
\texttt{b}. However, the first parser has as output type a single
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   508
character (recall the type of \texttt{CharParser}), but the second
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   509
parser produces a pair of characters as output. The alternative parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   510
is required to have both component parsers to have the same type---we
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   511
need to be able to build the union of two sets, which means in Scala
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   512
they need to be of the same type.  We will see later how we can build
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   513
this parser without the typing error.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   514
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   515
The next parser combinator, called \emph{semantic action}, does not
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   516
actually combine smaller parsers, but applies a function to the result
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   517
of a parser.  It is implemented in Scala as follows
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   518
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   519
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   520
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   521
class FunParser[I, T, S]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   522
         (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   523
          f: T => S) extends Parser[I, S] {
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   524
  def parse(in: I) = 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   525
    for ((head, tail) <- p.parse(in)) yield (f(head), tail)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   526
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   527
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   528
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   529
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   530
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   531
\noindent This parser combinator takes a parser \texttt{p}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   532
with output type \texttt{T} as one argument as well as a
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   533
function \texttt{f} with type \texttt{T => S}. The parser
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   534
\texttt{p} produces sets of type \texttt{(T, I)}. The
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   535
semantic action combinastor then applies the function
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   536
\texttt{f} to all the parser outputs. Since this function is of
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   537
type \texttt{T => S}, we obtain a parser with output type
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   538
\texttt{S}. Again Scala lets us introduce some shorthand
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   539
notation for this parser combinator. Therefore we will write
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   540
\texttt{p ==> f} for it.
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   541
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   542
What are semantic actions good for? Well, they allow is to transform
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   543
the parsed input into a datastructure we can use for further
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   544
processing. A simple example would be to transform parsed characters
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   545
into ASCII numbers. Suppose we define a function \texttt{f} (from
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   546
characters to ints) and use a \texttt{CharParser} for the character \texttt{c}.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   547
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   548
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   549
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   550
val f = (c: Char) => c.toInt
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   551
val c = new CharParser('c')
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   552
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   553
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   554
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   555
\noindent
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   556
Then we can run the following parsers on the input \texttt{cbd}:
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   557
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   558
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   559
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   560
c.parse("cbd")
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   561
(c ==> f).parse("cbd")
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   562
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   563
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   564
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   565
\noindent
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   566
The first line we obtain the result \texttt{Set(('c', "bd"))}, whereas the second
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   567
produces \texttt{Set((99, "bd"))}---the character has been transformed into
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   568
an ASCII number.
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   569
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   570
A slightly less contrived example is about parsing numbers (recall
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   571
\texttt{NumParser} above). However, we want to do this here for strings.
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   572
For this assume we have the following \texttt{RegexParser}.
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   573
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   574
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   575
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   576
    basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   577
import scala.util.matching.Regex
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   578
    
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   579
case class RegexParser(reg: Regex) extends Parser[String, String] {
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   580
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   581
    case None => Set()
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   582
    case Some(m) => Set((m.matched, m.after.toString))  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   583
  }
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   584
}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   585
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   586
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   587
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   588
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   589
This parser takes a regex as argument and splits up a string into a
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   590
prefix and the rest according to this regex
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   591
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   592
case---and the corresponding strings can be extracted with
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   593
\texttt{matched} and \texttt{after}). We can now define a
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   594
\texttt{NumParser} for strings as follows:
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   595
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   596
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   597
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   598
val NumParser = RegexParser("[0-9]+".r)
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   599
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   600
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   601
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   602
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   603
This parser will recognise a number at the beginning of a string, for
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   604
example
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   605
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   606
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   607
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   608
NumParser.parse("123abc")
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   609
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   610
\end{center}  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   611
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   612
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   613
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   614
still a string. We need to convert it into the corresponding
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   615
\texttt{Int}. We can do this as follows
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   616
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   617
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   618
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   619
(NumParser ==> (s => s.toInt)).parse("123abc")
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   620
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   621
\end{center}  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   622
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   623
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   624
The semantic action in form of a function converts a string into an
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   625
\texttt{Int}. Let us come back to semantic actions when we are going
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   626
to implement a simple calculator.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   627
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   628
\subsubsection*{Shorthand notation for parser combinators}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   629
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   630
Before we proceed, let us just explain the shorthand notation for
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   631
parser combinators. Like for regular expressions, the shorthand notation
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   632
will make our life much easier when writing actual parsers.
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   633
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   634
\subsubsection*{How to build parsers using parser combinators?}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   635
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   636
The beauty of parser combinators is the ease with which they can be
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   637
implemented and how easy it is to translate context-free grammars into
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   638
code (though the grammars need to be non-left-recursive). To
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   639
demonstrate this recall our context-free grammar for palindromes:
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   640
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   641
\begin{plstx}[margin=3cm]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   642
: \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   643
\end{plstx}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   644
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   645
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   646
Given the parser combinators for alternatives and sequeneces, this grammar should be
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   647
straightforward to implement. The first idea would be
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   648
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   649
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   650
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   651
lazy val Pal : Parser[String, String] = 
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   652
  (("a" ~ Pal ~ "a") | ("b" ~ Pal ~ "b") | "a" | "b" | "")
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   653
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   654
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   655
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   656
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   657
Unfortunately, this does not work as it produces a typing error. The
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   658
reason is that the parsers \texttt{"a"}, \texttt{"b"} and \texttt{""}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   659
all produce strings and therefore can be put into an alternative
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   660
\texttt{...| "a" | "b" | ""}. But both \pcode{"a" ~ Pal ~ "a"} and
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   661
\pcode{"b" ~ Pal ~ "b"} produce pairs of the form
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   662
$(((\_, \_), \_), \_)$---that is how the sequence parser combinator
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   663
nests results when \pcode{\~} is used between two components. The
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   664
solution is to use a semantic action that ``flattens'' these pairs and
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   665
appends the corresponding strings, like
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   666
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   667
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   668
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   669
lazy val Pal : Parser[String, String] =  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   670
  (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } |
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   671
   ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } |
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   672
    "a" | "b" | "")
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   673
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   674
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   675
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   676
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   677
\subsubsection*{Implementing an Interpreter}
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   678
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   679
%\bigskip
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   680
%takes advantage of the full generality---have a look
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   681
%what it produces if we call it with the string \texttt{abc}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   682
%
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   683
%\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   684
%\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   685
%input string & & output\medskip\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   686
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   687
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   688
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   689
%\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   690
%\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   691
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   692
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   693
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   694
\end{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   695
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   696
%%% Local Variables: 
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   697
%%% mode: latex  
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   698
%%% TeX-master: t
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   699
%%% End: