handouts/ho06.tex
author Christian Urban <urbanc@in.tum.de>
Wed, 24 Oct 2018 13:07:13 +0100
changeset 585 6ee22f196884
parent 584 529460073b24
child 586 451a95e1bc25
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
     1
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     2
\documentclass{article}
297
5c51839c88fd updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 292
diff changeset
     3
\usepackage{../style}
217
cd6066f1056a updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 183
diff changeset
     4
\usepackage{../langs}
cd6066f1056a updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 183
diff changeset
     5
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     6
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     7
\begin{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
     9
\section*{Handout 6 (Parser Combinators)}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    10
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    11
This handout explains how \emph{parser combinators} work and how they
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    12
can be implemented in Scala. Their distinguishing feature is that they
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    13
are very easy to implement (admittedly it is only easy in a functional
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    14
programming language). However, parser combinators require that the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    15
grammar to be parsed is \emph{not} left-recursive and they are
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    16
efficient only when the grammar is unambiguous. It is the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    17
responsibility of the grammar designer to ensure these two properties.
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    18
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    19
Another good point of parser combinators is that they can deal with any kind
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    20
of input as long as this input of ``sequence-kind'', for example a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    21
string or a list of tokens. The general idea behind parser combinators
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    22
is to transform the input into sets of pairs, like so
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    23
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    24
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    25
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    26
$\Rightarrow$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    27
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    28
\end{center} 
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    29
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    30
\noindent As said, the input can be anything as long as it is a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    31
``sequence''. The only property of the input we need is to be able to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    32
test when it is empty. Obviously we can do this for strings and lists.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    33
For more lucidity we shall below often use strings as input in order
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    34
to illustrate matters. However, this does not make our previous work
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    35
on lexers obsolete (remember they transform a string into a list of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    36
tokens). Lexers will still be needed to build a somewhat realistic
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    37
compiler.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    38
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    39
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    40
In my Scala code I use the following polymorphic types for parser combinators:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    41
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    42
\begin{center}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    43
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    44
\end{center}
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    45
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    46
\noindent that is they take as input something of type \texttt{I} and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    47
return a set of pairs of \texttt{Set[(T, I)]}. Since the input needs
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    48
to be of ``sequence-kind'' I actually have to often write \texttt{I
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    49
  <\% Seq[\_]} for the input type in order to indicate it is a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    50
subtype of Scala sequences. The first component of the generated pairs
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    51
corresponds to what the parser combinator was able to process from the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    52
input and the second is the unprocessed part of the input (therefore
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    53
the type of this unprocessed part is the same as the input). As we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    54
shall see shortly, a parser combinator might return more than one such
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    55
pair; the idea being that there are potentially several ways of how to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    56
parse the input.  As a concrete example, consider the string
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    57
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    58
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    59
\tt\Grid{iffoo\VS testbar}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    60
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    61
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    62
\noindent We might have a parser combinator which tries to
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    63
interpret this string as a keyword (\texttt{if}) or as an
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    64
identifier (\texttt{iffoo}). Then the output will be the set
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
    65
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    66
\begin{center}
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    67
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    68
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    69
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    70
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    71
\noindent where the first pair means the parser could
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    72
recognise \texttt{if} from the input and leaves the rest as
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    73
`unprocessed' as the second component of the pair; in the
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    74
other case it could recognise \texttt{iffoo} and leaves
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    75
\texttt{\VS testbar} as unprocessed. If the parser cannot
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    76
recognise anything from the input, then parser combinators just
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    77
return the empty set $\{\}$. This will indicate something
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    78
``went wrong''\ldots or more precisely, nothing could be
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    79
parsed.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    80
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    81
Also important to note is that the type \texttt{T} for the processed
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    82
part is different from the input type. The reason is that in general
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    83
we are interested in transform our input into something
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    84
``different''\ldots for example into a tree, or if we implement the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    85
grammar for arithmetic expressions we might be interested in the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    86
actual integer number the arithmetic expression, say \texttt{1 + 2 *
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    87
  3}, stands for. In this way we can use parser combinators to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    88
implement relativaley easily a calculator.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    89
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    90
The main idea of parser combinators is that we can easily build parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    91
combinators out of smaller components following very closely the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    92
structure of a grammar. In order to implement this in an
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    93
object-oriented programming language, like Scala, we need to specify
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    94
an abstract class for parser combinators. This abstract class states
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    95
that the function \texttt{parse} takes an argument of type \texttt{I}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    96
and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
    97
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
    98
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    99
\begin{lstlisting}[language=Scala]
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   100
abstract class Parser[I, T] {
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   101
  def parse(ts: I): Set[(T, I)]
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   102
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   103
  def parse_all(ts: I): Set[T] =
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   104
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   105
      yield head
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   106
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   107
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   108
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   109
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   110
\noindent It is the obligation in each instance (parser combinator) to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   111
supply an implementation for \texttt{parse}.  From this function we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   112
can then ``centrally'' derive the function \texttt{parse\_all}, which
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   113
just filters out all pairs whose second component is not empty (that
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   114
is has still some unprocessed part). The reason is that at the end of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   115
the parsing we are only interested in the results where all the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   116
has been consumed and no unprocessed part is left over.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   117
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   118
One of the simplest parser combinators recognises just a
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   119
single character, say $c$, from the beginning of strings. Its
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   120
behaviour can be described as follows:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   121
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   122
\begin{itemize}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   123
\item If the head of the input string starts with a $c$, then return 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   124
  the set
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   125
  \[\{(c, \textit{tail of}\; s)\}\]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   126
  where \textit{tail of} 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   127
  $s$ is the unprocessed part of the input string.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   128
\item Otherwise return the empty set $\{\}$.	
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   129
\end{itemize}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   130
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   131
\noindent
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   132
The input type of this simple parser combinator for characters is
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   133
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   134
The code in Scala is as follows:
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   135
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   136
\begin{center}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   137
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   138
case class CharParser(c: Char) extends Parser[String, Char] {
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   139
  def parse(sb: String) = 
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   140
    if (sb.head == c) Set((c, sb.tail)) else Set()
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   141
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   142
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   143
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   144
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   145
\noindent You can see the \texttt{parse} function tests whether the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   146
first character of the input string \texttt{sb} is equal to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   147
\texttt{c}. If yes, then it splits the string into the recognised part
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   148
\texttt{c} and the unprocessed part \texttt{sb.tail}. In case
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   149
\texttt{sb} does not start with \texttt{c} then the parser returns the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   150
empty set (in Scala \texttt{Set()}). Since this parser recognises
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   151
characters and just returns characters as the processed part, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   152
output type of the parser is \texttt{Char}.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   153
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   154
If we want to parse a list of tokens and interested in recognising a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   155
number token, we could write something like this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   156
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   157
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   158
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   159
case object NumParser extends Parser[List[Token], Int] {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   160
  def parse(ts: List[Token]) = ts match {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   161
    case Num_token(s)::ts => Set((s.toInt, ts)) 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   162
    case _ => Set ()
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   163
  }
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   164
}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   165
\end{lstlisting}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   166
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   167
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   168
\noindent
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   169
In this parser the input is of type \texttt{List[Token]}. The function
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   170
parse looks at the input \texttt{ts} and checks whether the first
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   171
token is a \texttt{Num\_token}. Let us assume our lexer generated
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   172
these tokens for numbers. But this parser does not just return this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   173
token (and the rest of the list), like the \texttt{CharParser} above,
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   174
rather extracts the string \texttt{s} from the token and converts it
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   175
into an integer. The hope is that the lexer did its work well and this
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   176
conversion always succeeds. The consequence of this is that the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   177
type for this parser is \texttt{Int}. Such a conversion would be
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   178
needed if we want to implement a simple calculator program.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   179
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   180
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   181
These simple parsers that just look at the input and do a simple
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   182
transformation are often called \emph{atomic} parser combinators.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   183
More interesting are the parser combinators that build larger parsers
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   184
out of smaller component parsers. For example the \emph{alternative
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   185
  parser combinator} is as follows: given two parsers, say, $p$ and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   186
$q$, we apply both parsers to the input (remember parsers are
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   187
functions) and combine the output (remember they are sets of pairs)
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   188
 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   189
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   190
$p(\text{input}) \cup q(\text{input})$
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   191
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   192
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   193
\noindent In Scala we would implement alternative parser
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   194
combinator as follows
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   195
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   196
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   197
\begin{lstlisting}[language=Scala, numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   198
class AltParser[I, T]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   199
       (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   200
        q: => Parser[I, T]) extends Parser[I, T] {
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   201
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   202
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   203
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   204
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   205
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   206
\noindent The types of this parser combinator are again generic (we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   207
just have \texttt{I} for the input type, and \texttt{T} for the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   208
type). The alternative parser builds a new parser out of two existing
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   209
parsers \texttt{p} and \texttt{q}.  Both need to be able to process
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   210
input of type \texttt{I} and return the same output type
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   211
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   212
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   213
  and \texttt{q}. They will prevent the evaluation of the arguments
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   214
  before they are used. This is often called \emph{lazy evaluation} of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   215
  the arguments. We will explain this later.}  Therefore the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   216
type of this parser is \texttt{T}. The alternative parser should run
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   217
the input with the first parser \texttt{p} (producing a set of pairs)
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   218
and then run the same input with \texttt{q} (producing another set of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   219
pairs).  The result should be then just the union of both sets, which
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   220
is the operation \texttt{++} in Scala.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   221
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   222
The alternative parser combinator already allows us to
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   223
construct a parser that parses either a character \texttt{a}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   224
or \texttt{b}, as
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   225
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   226
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   227
\begin{lstlisting}[language=Scala, numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   228
new AltParser(CharParser('a'), CharParser('b'))
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   229
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   230
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   231
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   232
\noindent Later on we will use again Scala mechanism for introducing some
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   233
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings 
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   234
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   235
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   236
\begin{tabular}{rcl}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   237
input strings & & output\medskip\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   238
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   239
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   240
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   241
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   242
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   243
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   244
\noindent We receive in the first two cases a successful
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   245
output (that is a non-empty set). In each case, either
392
2d0a59127694 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 386
diff changeset
   246
\pcode{a} or \pcode{b} is in the processed part, and
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   247
\pcode{cde} in the unprocessed part. Clearly this parser cannot
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   248
parse anything in the string \pcode{ccde}, therefore the empty
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   249
set.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   250
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   251
A bit more interesting is the \emph{sequence parser combinator}. Given
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   252
two parsers, say again, $p$ and $q$, we want to apply first the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   253
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   254
parts in the pairs; and then combine the results like
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   255
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   256
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   257
\begin{tabular}{lcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   258
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   259
$(\textit{output}_1, u_1) \in p(\text{input}) 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   260
\;\wedge\;$\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   261
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   262
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   263
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   264
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   265
\noindent Notice that the $p$ wil first be run on the input, producing
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   266
pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$ stands
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   267
for the unprocessed, or left-over, parts. We want that $q$ runs on all
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   268
these unprocessed parts $u_1$. This again will produce some
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   269
processed part , $p$ and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   270
$q$, we apply both parsers to the input (remember parsers are
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   271
functions) and combine the output (remember they are sets of pairs)
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   272
 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   273
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   274
$p(\text{input}) \cup q(\text{input})$
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   275
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   276
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   277
\noindent In Scala we would implement alternative parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   278
combinator as follows
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   279
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   280
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   281
\begin{lstlisting}[language=Scala, numbers=none]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   282
class AltParser[I, T]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   283
       (p: => Parser[I, T], 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   284
        q: => Parser[I, T]) extends Parser[I, T] {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   285
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   286
}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   287
\end{lstlisting}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   288
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   289
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   290
\noindent The types of this parser combinator are again generic (we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   291
just have \texttt{I} for the input type, and \texttt{T} for the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   292
type). The alternative parser builds a new parser out of two existing
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   293
parsers \texttt{p} and \texttt{q}.  Both need to be able to process
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   294
input of type \texttt{I} and return the same output type
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   295
\texttt{Set[(T, I)]}.\footnote{There is an interesting detail of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   296
  Scala, namely the \texttt{=>} in front of the types of \texttt{p}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   297
  and \texttt{q}. They will prevent the evaluation of the arguments
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   298
  before they are used. This is often called \emph{lazy evaluation} of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   299
  the arguments. We will explain this later.}  Therefore the output
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   300
type of this parser is \texttt{T}. The alternative parser should run
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   301
the input with the first parser \texttt{p} (producing a set of pairs)
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   302
and then run the same input with \texttt{q} (producing another set of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   303
pairs).  The result should be then just the union of both sets, which
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   304
is the operation \texttt{++} in Scala.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   305
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   306
The alternative parser combinator already allows us to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   307
construct a parser that parses either a character \texttt{a}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   308
or \texttt{b}, as
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   309
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   310
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   311
\begin{lstlisting}[language=Scala, numbers=none]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   312
new AltParser(CharParser('a'), CharParser('b'))
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   313
\end{lstlisting}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   314
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   315
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   316
\noindent Later on we will use again Scala mechanism for introducing some
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   317
more readable shorthand notation for this, like \texttt{"a" || "b"}. Let us look in detail at what this parser combinator produces with some somple strings 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   318
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   319
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   320
\begin{tabular}{rcl}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   321
input strings & & output\medskip\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   322
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   323
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   324
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   325
\end{tabular}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   326
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   327
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   328
\noindent We receive in the first two cases a successful
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   329
output (that is a non-empty set). In each case, either
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   330
\pcode{a} or \pcode{b} is in the processed part, and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   331
\pcode{cde} in the unprocessed part. Clearly this parser cannot
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   332
parse anything in the string \pcode{ccde}, therefore the empty
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   333
set.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   334
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   335
A bit more interesting is the \emph{sequence parser combinator}. Given
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   336
two parsers, say again, $p$ and $q$, we want to apply first the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   337
to $p$ producing a set of pairs; then apply $q$ to all the unparsed
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   338
parts in the pairs; and then combine the results like
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   339
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   340
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   341
\begin{tabular}{lcl}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   342
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   343
$(\textit{output}_1, u_1) \in p(\text{input}) 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   344
\;\wedge\;$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   345
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   346
\end{tabular}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   347
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   348
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   349
\noindent Notice that the $p$ wil first be run on the input, producing
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   350
pairs of the form $\textit{output}_1$ and unprocessed part $u_1$. The
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   351
overall result of the sequence parser combinator is pairs of the form
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   352
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   353
unprocessed parts of both parsers is the unprocessed parts the second
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   354
parser $q$ produces as left-over. The processed parts of both parsers
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   355
is just the pair of the outputs
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   356
$(\textit{output}_1, \textit{output}_2)$. This behavious can be
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   357
implemented in Scala as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   358
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   359
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   360
\begin{lstlisting}[language=Scala,numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   361
class SeqParser[I, T, S]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   362
       (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   363
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   364
  def parse(sb: I) = 
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   365
    for ((output1, u1) <- p.parse(sb); 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   366
         (output2, u2) <- q.parse(u1)) 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   367
            yield ((output1, output2), u2)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   368
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   369
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   370
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   371
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   372
\noindent This parser takes again as input two parsers, \texttt{p}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   373
and \texttt{q}. It implements \texttt{parse} as follows: let
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   374
first run the parser \texttt{p} on the input producing a set
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   375
of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   376
stands for the unprocessed parts left over by \texttt{p}. Let
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   377
\texttt{q} run on these unprocessed parts producing again a
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   378
set of pairs. The output of the sequence parser combinator is
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   379
then a set containing pairs where the first components are
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   380
again pairs, namely what the first parser could parse together
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   381
with what the second parser could parse; the second component
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   382
is the unprocessed part left over after running the second
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   383
parser \texttt{q}. Therefore the input type of the sequence
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   384
parser combinator is as usual \texttt{I}, but the output type
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   385
is
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   386
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   387
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   388
\texttt{Set[((T, S), I)]}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   389
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   390
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   391
\noindent
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   392
If any of the runs of \textit{p} and \textit{q} fail, that is produce
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   393
the empty set, then \texttt{parse} will also produce the empty set.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   394
Notice that we have now two output types for the sequence parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   395
combinator, because in general \textit{p} and \textit{q} might produce
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   396
differetn things (for example first we recognise a number and then a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   397
string corresponding to an operator).
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   398
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   399
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   400
We have not yet looked at this in detail, but Scala allows us to
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   401
provide some shorthand notation for the sequence parser combinator. We
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   402
can write for example \pcode{"a" ~ "b"}, which is the parser
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   403
combinator that first recognises the character \texttt{a} from a
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   404
string and then \texttt{b}. Let us look again at three examples of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   405
how this parser combinator processes strings:
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   406
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   407
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   408
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   409
input strings & & output\medskip\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   410
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{cde}})\right\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   411
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   412
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   413
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   414
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   415
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   416
\noindent In the first line we have a sucessful parse, because the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   417
string started with \texttt{ab}, which is the prefix we are looking
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   418
for. But since the parsing combinator is constructed as sequence of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   419
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   420
result is a nested pair of the form \texttt{((a, b), cde)}. It is
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   421
\emph{not} a simple pair \texttt{(ab, cde)} as one might errorneously
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   422
expects.  The parser returns the ampty set in the other examples,
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   423
because they do not fit with what the parser is supposed to parse.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   424
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   425
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   426
A slightly more complicated parser is \pcode{("a" || "b") ~ "c"}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   427
which parses as first character either an \texttt{a} or \texttt{b}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   428
followed by a \texttt{c}. This parser produces the following outputs.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   429
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   430
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   431
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   432
input strings & & output\medskip\\
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   433
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   434
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   435
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   436
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   437
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   438
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   439
\noindent
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   440
Now consider the parser \pcode{("a" ~ "b") ~ "c"} which parses
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   441
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   442
the following outputs.
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   443
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   444
\begin{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   445
\begin{tabular}{rcl}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   446
input strings & & output\medskip\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   447
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}), \texttt{\Grid{de}})\right\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   448
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   449
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   450
\end{tabular}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   451
\end{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   452
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   453
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   454
\noindent The second and third example fail, because something is
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   455
``missing'' in the sequence we are looking for. Also notice how the
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   456
results nest with sequences: the parsed part is a nested pair of the
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   457
form \pcode{((a, b), c)}. Two more examples: first consider the parser
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   458
\pcode{("a" ~ "a") ~ "a"} and the input \pcode{aaaa}:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   459
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   460
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   461
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   462
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   463
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   464
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   465
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   466
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   467
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   468
\noindent Notice how the results nests deeper and deeper as pairs (the
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   469
last \pcode{a} is in the unprocessed part). To consume everything of
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   470
this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   471
  "a"}. Then the output is as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   472
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   473
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   474
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   475
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   476
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   477
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   478
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   479
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   480
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   481
\noindent This is an instance where the parser consumed
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   482
completely the input, meaning the unprocessed part is just the
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   483
empty string. So if we called \pcode{parse_all} instead of \pcode{parse}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   484
we would get back the result
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   485
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   486
\[
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   487
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   488
\]
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   489
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   490
\noindent where the unprocessed (empty) parts have been stripped away
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   491
from the pairs; everything where the second part was not empty has
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   492
been thrown away as ultimately-unsuccessful-parses.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   493
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   494
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   495
Note carefully that constructing a parser such \pcode{'a' ||
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   496
('a' ~ 'b')} will result in a typing error. The first
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   497
parser has as output type a single character (recall the type
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   498
of \texttt{CharParser}), but the second parser produces a pair
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   499
of characters as output. The alternative parser is however
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   500
required to have both component parsers to have the same type.
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   501
We will see later how we can build this parser without the
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   502
typing error.
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   503
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   504
The next parser combinator does not actually combine smaller
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   505
parsers, but applies a function to the result of a parser.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   506
It is implemented in Scala as follows
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   507
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   508
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   509
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   510
class FunParser[I, T, S]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   511
         (p: => Parser[I, T], 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   512
          f: T => S) extends Parser[I, S] {
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   513
  def parse(sb: I) = 
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   514
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   515
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   516
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   517
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   518
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   519
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   520
\noindent This parser combinator takes a parser \texttt{p}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   521
with output type \texttt{T} as one argument as well as a
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   522
function \texttt{f} with type \texttt{T => S}. The parser
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   523
\texttt{p} produces sets of type \texttt{(T, I)}. The
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   524
\texttt{FunParser} combinator then applies the function
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   525
\texttt{f} to all the parser outputs. Since this function is of
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   526
type \texttt{T => S}, we obtain a parser with output type
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   527
\texttt{S}. Again Scala lets us introduce some shorthand
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   528
notation for this parser combinator. Therefore we will write
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   529
\texttt{p ==> f} for it.
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   530
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   531
\subsubsection*{How to build parsers using parser combinators?}
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   532
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   533
\subsubsection*{Implementing an Interpreter}
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   534
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   535
%\bigskip
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   536
%takes advantage of the full generality---have a look
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   537
%what it produces if we call it with the string \texttt{abc}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   538
%
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   539
%\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   540
%\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   541
%input string & & output\medskip\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   542
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   543
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   544
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   545
%\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   546
%\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   547
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   548
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   549
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   550
\end{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   551
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   552
%%% Local Variables: 
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   553
%%% mode: latex  
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   554
%%% TeX-master: t
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   555
%%% End: