handouts/ho06.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Thu, 19 Sep 2024 19:25:13 +0100
changeset 964 da1f8c033b8e
parent 941 66adcae6c762
permissions -rw-r--r--
updated
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
     1
595
4bf0096bc06b updated
Christian Urban <urbanc@in.tum.de>
parents: 594
diff changeset
     2
% !TEX program = xelatex
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     3
\documentclass{article}
297
5c51839c88fd updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 292
diff changeset
     4
\usepackage{../style}
217
cd6066f1056a updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 183
diff changeset
     5
\usepackage{../langs}
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
     6
\usepackage{../grammar}
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
     7
\usepackage{../graphics}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     8
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
     9
\begin{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    10
292
7ed2a25dd115 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 217
diff changeset
    11
\section*{Handout 6 (Parser Combinators)}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    12
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    13
This handout explains how \emph{parser combinators} work and how they
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    14
can be implemented in Scala. Their most distinguishing feature is that
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    15
they are very easy to implement (admittedly it is only easy in a
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    16
functional programming language).  Another good point of parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    17
combinators is that they can deal with any kind of input as long as
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    18
this input is of ``sequence-kind'', for example a string or a list of
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    19
tokens. The only two properties of the input we need is to be able to
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    20
test when it is empty and ``sequentially'' take it apart. Strings and
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    21
lists fit this bill. However, parser combinators also have their
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    22
drawbacks. For example they require that the grammar to be parsed is
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    23
\emph{not} left-recursive and they are efficient only when the grammar
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    24
is unambiguous. It is the responsibility of the grammar designer to
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    25
ensure these two properties hold.
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
    26
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    27
The general idea behind parser combinators is to transform the input
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    28
into sets of pairs, like so
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    29
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    30
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    31
$\underbrace{\text{list of tokens}}_{\text{input}}$ 
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    32
$\quad\Rightarrow\quad$
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    33
$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    34
\end{center} 
175
5801e8c0e528 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 173
diff changeset
    35
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    36
\noindent
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
    37
Given the extended effort we have spent implementing a lexer in order
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    38
to generate lists of tokens, it might be surprising that in what
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    39
follows we shall often use strings as input, rather than lists of
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    40
tokens. This is for making the explanation more lucid and ensure the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    41
examples are simple. It does not make our previous work on lexers obsolete
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    42
(remember they transform a string into a list of tokens). Lexers will
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    43
still be needed for building a somewhat realistic compiler. See also
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    44
a question in the homework regarding this issue.
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    45
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
    46
As mentioned above, parser combinators are relatively agnostic about what
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    47
kind of input they process. In my Scala code I use the following
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    48
polymorphic types for parser combinators:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    49
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    50
\begin{center}
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
    51
input:\;\; \texttt{I}  \qquad output:\;\; \texttt{T}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    52
\end{center}
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
    53
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    54
\noindent That is they take as input something of type \texttt{I} and
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
    55
return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
    56
needs to be of ``sequence-kind'', I actually have to often write
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    57
\code{(using is: I => Seq[_])} for the input type. This ensures the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    58
input is a subtype of Scala sequences.\footnote{This is a new feature
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    59
  in Scala 3 and is about type-classes, meaning if you use Scala 2 you will have difficulties
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    60
  with running my code.} The first component of the generated pairs
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    61
corresponds to what the parser combinator was able to parse from the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    62
input and the second is the unprocessed, or leftover, part of the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    63
input (therefore the type of this unprocessed part is the same as the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    64
input). A parser combinator might return more than one such pair; the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    65
idea is that there are potentially several ways of how to parse the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
    66
input.  As a concrete example, consider the string
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    67
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    68
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    69
\tt\Grid{iffoo\VS testbar}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    70
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    71
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    72
\noindent We might have a parser combinator which tries to
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    73
interpret this string as a keyword (\texttt{if}) or as an
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    74
identifier (\texttt{iffoo}). Then the output will be the set
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
    75
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    76
\begin{center}
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    77
$\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
    78
           \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    79
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    80
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    81
\noindent where the first pair means the parser could recognise
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
    82
\texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
    83
unprocessed part; in the other case it could recognise
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    84
\texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    85
parser cannot recognise anything from the input at all, then parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    86
combinators just return the empty set $\{\}$. This will indicate
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
    87
something ``went wrong''\ldots or more precisely, nothing could be
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
    88
parsed.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
    89
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    90
Also important to note is that the output type \texttt{T} for the
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    91
processed part can potentially be different from the input type
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    92
\texttt{I} in the parser. In the example above is just happens to be
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    93
the same. The reason for the difference is that in general we are
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    94
interested in transforming our input into something
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    95
``different''\ldots for example into a tree; or if we implement the
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    96
grammar for arithmetic expressions, we might be interested in the
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    97
actual integer number the arithmetic expression, say \texttt{1 + 2 *
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    98
  3}, stands for. In this way we can use parser combinators to
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
    99
implement relatively easily a calculator, for instance (we shall do
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   100
this later on).
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   101
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   102
The main driving force behind parser combinators is that we can easily
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   103
build parser combinators out of smaller components following very
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   104
closely the structure of a grammar. In order to implement this in a
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   105
functional/object-oriented programming language, like Scala, we need
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   106
to specify an abstract class for parser combinators. In the abstract
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   107
class we specify that \texttt{I} is the \emph{input type} of the
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   108
parser combinator and that \texttt{T} is the \emph{output type}.  This
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   109
implies that the function \texttt{parse} takes an argument of type
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   110
\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   111
    I)]}}.\footnote{In the actual code on KEATS there is some
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   112
  kerfuffle about correct type-bounds and using clauses. Ignore this
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   113
  part of the implementation for the moment---it is not needed for
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   114
  understanding how the code works.}
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   115
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   116
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   117
\begin{lstlisting}[language=Scala]
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   118
abstract class Parser[I, T]  {
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   119
  def parse(in: I): Set[(T, I)]  
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   120
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   121
  def parse_all(in: I) : Set[T] =
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   122
    for ((hd, tl) <- parse(in); 
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   123
        if tl.isEmpty) yield hd
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   124
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   125
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   126
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   127
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   128
\noindent It is the obligation in each instance of this class to
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   129
supply an implementation for \texttt{parse}.  From this function we
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   130
can then ``centrally'' derive the function \texttt{parse\_all}, which
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   131
just filters out all pairs whose second component is not empty (that
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   132
is has still some unprocessed part). The reason is that at the end of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   133
the parsing we are only interested in the results where all the input
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   134
has been consumed and no unprocessed part is left over.
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   135
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   136
One of the simplest parser combinators recognises just a
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   137
single character, say $c$, from the beginning of strings. Its
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   138
behaviour can be described as follows:
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   139
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   140
\begin{itemize}
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   141
\item If the head of the input string $s$ starts with a $c$, then return 
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   142
  the set
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   143
  \[\{(c, \textit{tail-of-}s)\}\]
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   144
  where \textit{tail of} 
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   145
  $s$ is the unprocessed part of the input string.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   146
\item Otherwise return the empty set $\{\}$.	
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   147
\end{itemize}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   148
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   149
\noindent
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   150
The input type of this simple parser combinator is \texttt{String} and
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   151
the output type is \texttt{Char}. This means \texttt{parse} returns
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   152
\mbox{\texttt{Set[(Char, String)]}}.  The code in Scala is as follows:
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   153
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   154
\begin{center}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   155
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   156
case class CharParser(c: Char) extends Parser[String, Char] {
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   157
  def parse(s: String) = 
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   158
    if (s != "" && s.head == c) Set((c, s.tail)) else Set()
177
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   159
}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   160
\end{lstlisting}
53def1fbf472 updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 176
diff changeset
   161
\end{center}
176
3c2653fc8b5a updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 175
diff changeset
   162
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   163
\noindent You can see \texttt{parse} tests here whether the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   164
first character of the input string \texttt{s} is equal to
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   165
\texttt{c}. If yes, then it splits the string into the recognised part
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   166
\texttt{c} and the unprocessed part \texttt{s.tail}. In case
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   167
\texttt{s} does not start with \texttt{c} then the parser returns the
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   168
empty set (in Scala \texttt{Set()}). Since this parser recognises
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   169
characters and just returns characters as the processed part, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   170
output type of the parser is \texttt{Char}.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   171
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   172
If we want to parse a list of tokens and interested in recognising a
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   173
number token, for example, we could write something like this
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   174
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   175
\begin{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   176
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,numbers=none]
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   177
case object NumParser extends Parser[List[Token], Int] {
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   178
  def parse(ts: List[Token]) = ts match {
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   179
    case Num_token(s)::rest => Set((s.toInt, rest)) 
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   180
    case _ => Set ()
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   181
  }
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   182
}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   183
\end{lstlisting}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   184
\end{center}
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   185
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   186
\noindent
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   187
In this parser the input is of type \texttt{List[Token]}. The function
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   188
parse looks at the input \texttt{ts} and checks whether the first
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   189
token is a \texttt{Num\_token} (let us assume our lexer generated
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   190
these tokens for numbers). But this parser does not just return this
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   191
token (and the rest of the list), like the \texttt{CharParser} above,
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   192
rather it extracts also the string \texttt{s} from the token and
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   193
converts it into an integer. The hope is that the lexer did its work
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   194
well and this conversion always succeeds. The consequence of this is
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   195
that the output type for this parser is \texttt{Int}, not
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   196
\texttt{Token}. Such a conversion would be needed in our parser,
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   197
because when we encounter a number in our program, we want to do
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   198
some calculations based on integers, not strings (or tokens). 
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   199
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   200
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   201
These simple parsers that just look at the input and do a simple
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   202
transformation are often called \emph{atomic} parser combinators.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   203
More interesting are the parser combinators that build larger parsers
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   204
out of smaller component parsers. There are three such parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   205
combinators that can be implemented generically. The \emph{alternative
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   206
  parser combinator} is as follows: given two parsers, say, $p$ and
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   207
$q$, we apply both parsers to the input (remember parsers are
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   208
functions) and combine the output (remember they are sets of pairs):
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   209
 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   210
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   211
$p(\text{input}) \cup q(\text{input})$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   212
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   213
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   214
\noindent In Scala we can implement alternative parser
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   215
combinator as follows
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   216
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   217
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   218
\begin{lstlisting}[language=Scala, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   219
class AltParser[I, T]
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   220
         (p: => Parser[I, T], 
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   221
          q: => Parser[I, T]) extends Parser[I, T] {
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   222
  def parse(in: I) = p.parse(in) ++ q.parse(in)   
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   223
}    
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   224
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   225
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   226
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   227
\noindent The types of this parser combinator are again generic (we
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   228
have \texttt{I} for the input type, and \texttt{T} for the output
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   229
type). The alternative parser builds a new parser out of two existing
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   230
parsers \texttt{p} and \texttt{q} which are given as arguments.  Both
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   231
parsers need to be able to process input of type \texttt{I} and return
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   232
in \texttt{parse} the same output type \texttt{Set[(T,
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   233
  I)]}.\footnote{There is an interesting detail of Scala, namely the
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   234
  \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. These arrows
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   235
  will prevent the evaluation of the arguments before they are
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   236
  used. This is often called \emph{lazy evaluation} of the
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   237
  arguments. We will explain this later.}  The alternative parser runs
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   238
the input with the first parser \texttt{p} (producing a set of pairs)
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   239
and then runs the same input with \texttt{q} (producing another set of
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   240
pairs).  The result should be then just the union of both sets, which
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   241
is the operation \texttt{++} in Scala.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   242
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   243
The alternative parser combinator allows us to construct a parser that
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   244
parses either a character \texttt{a} or \texttt{b} using the
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   245
\texttt{CharParser} shown above. For this we can write\footnote{Note
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   246
  that we cannot use a \texttt{case}-class for \texttt{AltParser}s
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   247
  because of the problem with laziness and Scala quirks. Hating
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   248
  \texttt{new} like the plague, we will work around this later with
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   249
  some syntax tricks. ;o)}
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   250
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   251
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   252
\begin{lstlisting}[language=Scala, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   253
new AltParser(CharParser('a'), CharParser('b'))
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   254
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   255
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   256
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   257
\noindent Later on we will use Scala mechanism for introducing some
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   258
more readable shorthand notation for this, like \texttt{p"a" ||
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   259
  p"b"}. But first let us look in detail at what this parser combinator produces
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   260
with some sample strings.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   261
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   262
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   263
\begin{tabular}{rcl}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   264
input strings & & output\medskip\\
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   265
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}},\; \texttt{\Grid{cde}})\right\}$\\
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   266
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}},\; \texttt{\Grid{cde}})\right\}$\\
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   267
\texttt{\Grid{ccde}} & $\rightarrow$ & $\{\}$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   268
\end{tabular}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   269
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   270
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   271
\noindent We receive in the first two cases a successful output (that
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   272
is a non-empty set). In each case, either \pcode{a} or \pcode{b} is in
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   273
the parsed part, and \pcode{cde} in the unprocessed part. Clearly this
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   274
parser cannot parse anything of the form \pcode{ccde}, therefore the
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   275
empty set is returned in the last case. Observe that parser
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   276
combinators only look at the beginning of the given input: they do not
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   277
fish out something in the ``middle'' of the input.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   278
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   279
A bit more interesting is the \emph{sequence parser combinator}. Given
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   280
two parsers, say again, $p$ and $q$, we want to apply first the input
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   281
to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   282
parts in the pairs; and then combine the results. Mathematically we would
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   283
write something like this for the set of pairs:
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   284
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   285
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   286
\begin{tabular}{lcl}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   287
$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\,|\,$ & 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   288
$(\textit{output}_1, u_1) \in p(\text{input}) 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   289
\;\wedge\;$\\
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   290
&& $(\textit{output}_2, u_2) \in q(u_1)\}$
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   291
\end{tabular}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   292
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   293
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   294
\noindent Notice that the $p$ will first be run on the input,
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   295
producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   296
stands for the unprocessed, or leftover, parts of $p$. We want that
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   297
$q$ runs on all these unprocessed parts $u_1$. Therefore these
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   298
unprocessed parts are fed into the second parser $q$. The overall
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   299
result of the sequence parser combinator is pairs of the form
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   300
$((\textit{output}_1, \textit{output}_2), u_2)$. This means the
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   301
unprocessed part of the sequence parser combinator is the unprocessed
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   302
part the second parser $q$ leaves as leftover. The parsed parts of the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   303
component parsers are combined in a pair, namely
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   304
$(\textit{output}_1, \textit{output}_2)$. The reason is we want to
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   305
know what $p$ and $q$ were able to parse. This behaviour can be
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   306
implemented in Scala as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   307
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   308
\begin{center}
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   309
\begin{lstlisting}[language=Scala,numbers=none]
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   310
class SeqParser[I, T, S]
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   311
       (p: => Parser[I, T], 
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   312
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   313
  def parse(in: I) = 
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   314
    for ((output1, u1) <- p.parse(in); 
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   315
         (output2, u2) <- q.parse(u1)) 
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   316
            yield ((output1, output2), u2)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   317
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   318
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   319
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   320
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   321
\noindent This parser takes again as arguments two parsers, \texttt{p}
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   322
and \texttt{q}. It implements \texttt{parse} as follows: first run the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   323
parser \texttt{p} on the input producing a set of pairs
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   324
(\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   325
unprocessed parts left over by \texttt{p} (recall that there can be
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   326
several such pairs).  Let then \texttt{q} run on these unprocessed
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   327
parts producing again a set of pairs. The output of the sequence
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   328
parser combinator is then a set containing pairs where the first
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   329
components are again pairs, namely what the first parser could parse
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   330
together with what the second parser could parse; the second component
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   331
is the unprocessed part left over after running the second parser
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   332
\texttt{q}. Note that the input type of the sequence parser combinator
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   333
is as usual \texttt{I}, but the output type is
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   334
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   335
\begin{center}
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   336
\texttt{(T, S)}
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   337
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   338
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   339
\noindent
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   340
Consequently, the function \texttt{parse} in the sequence parser
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   341
combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   342
we have essentially two output types for the sequence parser
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   343
combinator (packaged in a pair), because in general \textit{p} and
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   344
\textit{q} might produce different things (for example we recognise a
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   345
number with \texttt{p} and then with \texttt{q} a string corresponding
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   346
to an operator).  If any of the runs of \textit{p} and \textit{q}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   347
fail, that is produce the empty set, then \texttt{parse} will also
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   348
produce the empty set.
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   349
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   350
With the shorthand notation we shall introduce later for the sequence
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   351
parser combinator, we can write for example \pcode{p"a" ~ p"b"}, which
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   352
is the parser combinator that first recognises the character
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   353
\texttt{a} from a string and then \texttt{b}. (Actually, we will be
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   354
able to write just \pcode{p"ab"} for such parsers, but it is good to
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   355
  understand first what happens behind the scenes.) Let us look again
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   356
  at some examples of how the sequence parser combinator processes some
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   357
  strings:
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   358
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   359
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   360
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   361
input strings & & output\medskip\\
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   362
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}),\; \texttt{\Grid{cde}})\right\}$\\
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   363
\texttt{\Grid{bacde}} & $\rightarrow$ & $\{\}$\\
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   364
\texttt{\Grid{cccde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   365
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   366
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   367
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   368
\noindent In the first line we have a successful parse, because the
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   369
string starts with \texttt{ab}, which is the prefix we are looking
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   370
for. But since the parsing combinator is constructed as sequence of
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   371
the two simple (atomic) parsers for \texttt{a} and \texttt{b}, the
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   372
result is a nested pair of the form \texttt{((a, b), cde)}. It is
586
Christian Urban <urbanc@in.tum.de>
parents: 585
diff changeset
   373
\emph{not} a simple pair \texttt{(ab, cde)} as one might erroneously
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   374
expect.  The parser returns the empty set in the other examples,
584
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   375
because they do not fit with what the parser is supposed to parse.
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   376
529460073b24 updated
Christian Urban <urbanc@in.tum.de>
parents: 392
diff changeset
   377
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   378
A slightly more complicated parser is \pcode{(p"a" || p"b") ~ p"c"} which
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   379
parses as first character either an \texttt{a} or \texttt{b}, followed
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   380
by a \texttt{c}. This parser produces the following outputs.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   381
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   382
\begin{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   383
\begin{tabular}{rcl}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   384
input strings & & output\medskip\\
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   385
\texttt{\Grid{acde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   386
\texttt{\Grid{bcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   387
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   388
\end{tabular}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   389
\end{center}
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   390
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   391
\noindent
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   392
Now consider the parser \pcode{(p"a" ~ p"b") ~ p"c"} which parses
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   393
\texttt{a}, \texttt{b}, \texttt{c} in sequence. This parser produces
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   394
the following outputs.
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   395
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   396
\begin{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   397
\begin{tabular}{rcl}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   398
input strings & & output\medskip\\
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   399
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{(((\texttt{\Grid{a}},\texttt{\Grid{b}}), \texttt{\Grid{c}}),\; \texttt{\Grid{de}})\right\}$\\
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   400
\texttt{\Grid{abde}} & $\rightarrow$ & $\{\}$\\
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   401
\texttt{\Grid{bcde}} & $\rightarrow$ & $\{\}$
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   402
\end{tabular}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   403
\end{center}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   404
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   405
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   406
\noindent The second and third example fail, because something is
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   407
``missing'' in the sequence we are looking for. The first succeeds but
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   408
notice how the results nest with sequences: the parsed part is a
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   409
nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   410
parser differently, say \pcode{p"a" ~ (p"b" ~ p"c")}, then also
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   411
our output pairs nest differently
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   412
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   413
\begin{center}
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   414
\begin{tabular}{rcl}
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   415
input strings & & output\medskip\\
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   416
\texttt{\Grid{abcde}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}},(\texttt{\Grid{b}}, \texttt{\Grid{c}})),\; \texttt{\Grid{de}})\right\}$\\
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   417
\end{tabular}
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   418
\end{center}
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   419
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   420
\noindent
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   421
Two more examples: first consider the parser
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   422
\pcode{(p"a" ~ p"a") ~ p"a"} and the input \pcode{aaaa}:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   423
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   424
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   425
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   426
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   427
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   428
$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   429
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   430
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   431
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   432
\noindent Notice again how the results nest deeper and deeper as pairs (the
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   433
last \pcode{a} is in the unprocessed part). To consume everything of
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   434
this string we can use the parser \pcode{((p"a" ~ p"a") ~ p"a") ~
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   435
  p"a"}. Then the output is as follows:
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   436
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   437
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   438
\begin{tabular}{rcl}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   439
input string & & output\medskip\\
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   440
\texttt{\Grid{aaaa}} & $\rightarrow$ & 
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   441
$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   442
\end{tabular}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   443
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   444
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   445
\noindent This is an instance where the parser consumed
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   446
completely the input, meaning the unprocessed part is just the
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   447
empty string. So if we called \pcode{parse_all}, instead of \pcode{parse},
585
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   448
we would get back the result
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   449
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   450
\[
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   451
\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   452
\]
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   453
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   454
\noindent where the unprocessed (empty) parts have been stripped away
6ee22f196884 updated
Christian Urban <urbanc@in.tum.de>
parents: 584
diff changeset
   455
from the pairs; everything where the second part was not empty has
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   456
been thrown away as well, because they represent
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   457
ultimately-unsuccessful-parses. The main point is that the sequence
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   458
parser combinator returns pairs that can nest according to the
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   459
nesting of the component parsers.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   460
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   461
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   462
Consider also carefully that constructing a parser such
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   463
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   464
\begin{center}
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   465
\pcode{p"a" || (p"a" ~ p"b")}
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   466
\end{center}
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   467
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   468
\noindent
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   469
will result in a typing error. The intention with this
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   470
parser is that we want to parse either an \texttt{a}, or an \texttt{a}
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   471
followed by a \texttt{b}. However, the first parser has as output type
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   472
a single character (recall the type of \texttt{CharParser}), but the
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   473
second parser produces a pair of characters as output. The alternative
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   474
parser is required to have both component parsers to have the same
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   475
type---the reason is that we need to be able to build the union of two
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   476
sets, which requires in Scala that the sets have the same type.  Since
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   477
they are not in this case, there is a typing error.  We will see later
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   478
how we can build this parser without the typing error.
385
7f8516ff408d updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 297
diff changeset
   479
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   480
The next parser combinator, called \emph{semantic action} or \emph{map-parser}, does not
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   481
actually combine two smaller parsers, but applies a function to the result
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   482
of a parser.  It is implemented in Scala as follows
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   483
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   484
\begin{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   485
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   486
class MapParser[I, T, S]
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   487
        (p: => Parser[I, T], 
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   488
         f: T => S) extends Parser[I, S] {
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   489
  def parse(in: I) =
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   490
       for ((hd, tl) <- p.parse(in)) yield (f(hd), tl)
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   491
}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   492
\end{lstlisting}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   493
\end{center}
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   494
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   495
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   496
\noindent This parser combinator takes a parser \texttt{p} (with input
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   497
type \texttt{I} and output type \texttt{T}) as one argument but also a
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   498
function \texttt{f} (with type \texttt{T => S}). The parser \texttt{p}
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   499
produces sets of type \texttt{Set[(S, I)]}. The semantic action
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   500
combinator then applies the function \texttt{f} to all the `processed'
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   501
parser outputs. Since this function is of type \texttt{T => S}, we
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   502
obtain a parser with output type \texttt{S}. Again Scala lets us
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   503
introduce some shorthand notation for this parser
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   504
combinator. Therefore we will write short \texttt{p.map(f)} for it.
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   505
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   506
What are semantic actions good for? Well, they allow you to transform
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   507
the parsed input into datastructures you can use for further
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   508
processing. A simple (contrived) example would be to transform parsed
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   509
characters into ASCII numbers. Suppose we define a function \texttt{f}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   510
(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   511
the character \texttt{c}.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   512
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   513
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   514
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   515
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   516
val f = (c: Char) => c.toInt
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   517
val c = new CharParser('c')
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   518
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   519
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   520
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   521
\noindent
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   522
We then can run the following two parsers on the input \texttt{cbd}:
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   523
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   524
\begin{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   525
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   526
c.parse("cbd")
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   527
c.map(f).parse("cbd")
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   528
\end{lstlisting}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   529
\end{center}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   530
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   531
\noindent
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   532
In the first line we obtain the expected result \texttt{Set(('c',
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   533
  "bd"))}, whereas the second produces \texttt{Set((99, "bd"))}---the
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   534
character has been transformed into an ASCII number.
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   535
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   536
A slightly less contrived example is about parsing numbers (recall
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   537
\texttt{NumParser} above). However, we want to do this here for
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   538
strings, not for tokens.  For this assume we have the following
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   539
(atomic) \texttt{RegexParser}.
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   540
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   541
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   542
  \begin{lstlisting}[language=Scala,xleftmargin=0mm,
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   543
    basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   544
import scala.util.matching.Regex
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   545
    
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   546
case class RegexParser(reg: Regex) extends Parser[String, String] {
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   547
  def parse(in: String) = reg.findPrefixMatchOf(in) match {
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   548
    case None => Set()
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   549
    case Some(m) => Set((m.matched, m.after.toString))  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   550
  }
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   551
}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   552
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   553
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   554
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   555
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   556
This parser takes a regex as argument and splits up a string into a
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   557
prefix and the rest according to this regex
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   558
(\texttt{reg.findPrefixMatchOf} generates a match---in the successful
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   559
case---and the corresponding strings can be extracted with
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   560
\texttt{matched} and \texttt{after}). The input and output type for
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   561
this parser is \texttt{String}. Using \texttt{RegexParser} we can
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   562
define a \texttt{NumParser} for \texttt{Strings} to \texttt{Int} as
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   563
follows:
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   564
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   565
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   566
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   567
val NumParser = RegexParser("[0-9]+".r)
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   568
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   569
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   570
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   571
\noindent
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   572
This parser will recognise a number at the beginning of a string. For
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   573
example
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   574
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   575
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   576
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   577
NumParser.parse("123abc")
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   578
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   579
\end{center}  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   580
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   581
\noindent
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   582
produces \texttt{Set((123,abc))}. The problem is that \texttt{123} is
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   583
still a string (the expected double-quotes are not printed by
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   584
Scala). We want to convert this string into the corresponding
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   585
\texttt{Int}. We can do this as follows using a semantic action
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   586
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   587
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   588
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   589
NumParser.map{s => s.toInt}.parse("123abc")
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   590
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   591
\end{center}  
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   592
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   593
\noindent
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   594
The function in the semantic action converts a string into an
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   595
\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   596
but this time \texttt{123} is an \texttt{Int}. Think carefully what
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   597
the input and output type of the parser is without the semantic action
941
66adcae6c762 updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 937
diff changeset
   598
and what with the semantic action (the type of the function can
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   599
already tell you). Let us come back to semantic actions when we are
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   600
going to implement actual context-free grammars.
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   601
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   602
\subsubsection*{Shorthand notation for parser combinators}
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   603
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   604
Before we proceed, let us just explain the shorthand notation for
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   605
parser combinators. Like for regular expressions, the shorthand
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   606
notation will make our life much easier when writing actual
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   607
parsers. We can define some extensions\footnote{In Scala 2 this was
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   608
  generically called as ``implicits''.} which allow us to write
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   609
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   610
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   611
\begin{tabular}{ll}  
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   612
  \pcode{p || q} & alternative parser\\
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   613
  \pcode{p ~ q} & sequence parser\\ 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   614
  \pcode{p.map(f)} & semantic action parser
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   615
\end{tabular}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   616
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   617
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   618
\noindent
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   619
We will also use the \texttt{p}-string-interpolation for specifying simple string parsers.
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   620
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   621
The idea is that this shorthand notation allows us to easily translate
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   622
context-free grammars into code. For example recall our context-free
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   623
grammar for palindromes:
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   624
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   625
\begin{plstx}[margin=3cm]
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   626
: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   627
\end{plstx}
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   628
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   629
\noindent
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   630
Each alternative in this grammar translates into an alternative parser
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   631
combinator.  The $\cdot$ can be translated to a sequence parser
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   632
combinator. The parsers for $a$, $b$ and $\epsilon$ can be simply
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   633
written as \texttt{p"a"}, \texttt{p"b"} and \texttt{p""}.
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   634
587
5ddedcd92d84 updated
Christian Urban <urbanc@in.tum.de>
parents: 586
diff changeset
   635
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   636
\subsubsection*{How to build more interesting parsers using parser combinators?}
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   637
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   638
The beauty of parser combinators is the ease with which they can be
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   639
implemented and how easy it is to translate context-free grammars into
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   640
code (though the grammars need to be non-left-recursive). To
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   641
demonstrate this consider again the grammar for palindromes from above.
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   642
The first idea would be to translate it into the following code
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   643
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   644
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   645
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   646
lazy val Pal : Parser[String, String] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   647
  ((p"a" ~ Pal ~ p"a") || (p"b" ~ Pal ~ p"b") || p"a" || p"b" || p"")
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   648
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   649
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   650
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   651
\noindent
590
c6a1e19e9801 updated
Christian Urban <urbanc@in.tum.de>
parents: 589
diff changeset
   652
Unfortunately, this does not quite work yet as it produces a typing
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   653
error. The reason is that the parsers \texttt{p"a"}, \texttt{p"b"} and
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   654
\texttt{p""} all produce strings as output type and therefore can be
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   655
put into an alternative \texttt{...|| p"a" || p"b" || p""}. But both
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   656
sequence parsers \pcode{p"a" ~ Pal ~ p"a"} and \pcode{p"b" ~ Pal ~ p"b"}
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   657
produce pairs of the form
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   658
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   659
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   660
(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   661
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   662
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   663
\noindent That is how the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   664
sequence parser combinator nests results when \pcode{\~} is used
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   665
between two components. The solution is to use a semantic action that
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   666
``flattens'' these pairs and appends the corresponding strings, like
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   667
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   668
\begin{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   669
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   670
lazy val Pal : Parser[String, String] =  
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   671
  ((p"a" ~ Pal ~ p"a").map{ case ((x, y), z) => x + y + z } ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   672
   (p"b" ~ Pal ~ p"b").map{ case ((x, y), z) => x + y + z } ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   673
    p"a" || p"b" || p"")
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   674
\end{lstlisting}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   675
\end{center}
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   676
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   677
\noindent
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   678
How does this work? Well, recall again what the pairs look like for
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   679
the parser \pcode{p"a" ~ Pal ~ p"a"}.  The pattern in the semantic
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   680
action matches the nested pairs (the \texttt{x} with the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   681
\texttt{a}-part and so on).  Unfortunately when we have such nested
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   682
pairs, Scala requires us to define the function using the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   683
\pcode{case}-syntax
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   684
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   685
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   686
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   687
{ case ((x, y), z) => ... }
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   688
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   689
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   690
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   691
\noindent
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   692
If we have more sequence parser combinators or have them differently nested,
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   693
then the pattern in the semantic action needs to be adjusted accordingly.
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   694
The action we implement above is to concatenate all three strings, which
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   695
means after the semantic action is applied the output type of the parser 
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   696
is \texttt{String}, which means it fits with the alternative parsers
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   697
\texttt{...|| p"a" || p"b" || p""}.
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   698
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   699
If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   700
as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   701
(an empty set would mean something is wrong). But also notice what the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   702
intermediate results are generated by \pcode{Pal.parse("abaaaba")}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   703
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   704
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   705
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   706
Set((abaaaba,""),(aba,aaba), (a,baaaba), ("",abaaaba))
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   707
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   708
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   709
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   710
\noindent
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   711
That there are more than one output might be slightly unexpected, but
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   712
can be explained as follows: the pairs represent all possible
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   713
(partial) parses of the string \pcode{"abaaaba"}. The first pair above
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   714
corresponds to a complete parse (all output is consumed) and this is
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   715
what \pcode{Pal.parse_all} returns. The second pair is a small
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   716
``sub-palindrome'' that can also be parsed, but the parse fails with
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   717
the rest \pcode{aaba}, which is therefore left as unprocessed. The
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   718
third one is an attempt to parse the whole string with the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   719
single-character parser \pcode{a}. That of course only partially
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   720
succeeds, by leaving \pcode{"baaaba"} as the unprocessed
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   721
part. Finally, since we allow the empty string to be a palindrome we
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   722
also obtain the last pair, where actually nothing is consumed from the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   723
input string. While all this works as intended, we need to be careful
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   724
with this (especially with including the \pcode{""} parser in our
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   725
grammar): if during parsing the set of parsing attempts gets too big,
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   726
then the parsing process can become very slow as the potential
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   727
candidates for applying rules can snowball.
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   728
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   729
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   730
Important is also to note is that we must define the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   731
\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   732
code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   733
just written
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   734
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   735
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   736
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   737
val Pal : Parser[String, String] =  ...rhs...
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   738
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   739
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   740
589
0451b8b67f62 updated
Christian Urban <urbanc@in.tum.de>
parents: 588
diff changeset
   741
\noindent
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   742
then Scala before making this assignment to \texttt{Pal} attempts to
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   743
find out what the expression on the right-hand side evaluates to. This
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   744
is straightforward in case of simple expressions \texttt{2 + 3}, but
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   745
the expression above contains \texttt{Pal} in the right-hand
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   746
side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   747
evaluates to and start a new recursion, which means it falls into an
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   748
infinite loop. The definition of \texttt{Pal} is recursive and the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   749
\pcode{lazy} key-word prevents it from being fully evaluated. Therefore
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   750
whenever we want to define a recursive parser we have to write
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   751
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   752
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   753
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   754
lazy val SomeParser : Parser[...,...] =  ...rhs...
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   755
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   756
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   757
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   758
\noindent That was not necessary for our atomic parsers, like
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   759
\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   760
Note that this is also the reason why we had to write
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   761
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   762
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   763
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   764
class AltParser[I, T]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   765
       (p: => Parser[I, T], 
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   766
        q: => Parser[I, T]) ... extends Parser[I, T] {...}
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   767
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   768
class SeqParser[I, T, S]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   769
       (p: => Parser[I, T], 
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   770
        q: => Parser[I, S]) ... extends Parser[I, (T, S)] {...}
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   771
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   772
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   773
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   774
\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   775
the argument types for \texttt{p} and \texttt{q} prevent Scala from
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   776
evaluating the arguments. Normally, Scala would first evaluate what
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   777
kind of parsers \texttt{p} and \texttt{q} are, and only then generate
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   778
the alternative parser combinator, respectively sequence parser
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   779
combinator. Since the arguments can be recursive parsers, such as
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   780
\texttt{Pal}, this would lead again to an infinite loop.
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   781
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   782
As a final example in this section, let us consider the grammar for
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   783
well-nested parentheses:
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   784
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   785
\begin{plstx}[margin=3cm]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   786
: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   787
\end{plstx}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   788
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   789
\noindent
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   790
Let us assume we want to not just recognise strings of
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   791
well-nested parentheses but also transform round parentheses
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   792
into curly braces. We can do this by using a semantic
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   793
action:
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   794
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   795
\begin{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   796
  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   797
    xleftmargin=0mm, numbers=none]
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   798
lazy val P : Parser[String, String] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   799
  (p"(" ~ P ~ p")" ~ P).map{ case (((_,x),_),y) => "{" + x + "}" + y } || p""
591
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   800
\end{lstlisting}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   801
\end{center}
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   802
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   803
\noindent
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   804
Here we define a function where which ignores the parentheses in the
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   805
pairs, but replaces them in the right places with curly braces when
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   806
assembling the new string in the right-hand side. If we run
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   807
\pcode{P.parse_all("(((()()))())")} we obtain
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   808
\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   809
863e502f6a5c updated
Christian Urban <urbanc@in.tum.de>
parents: 590
diff changeset
   810
588
a4646557016d updated
Christian Urban <urbanc@in.tum.de>
parents: 587
diff changeset
   811
386
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 385
diff changeset
   812
\subsubsection*{Implementing an Interpreter}
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   813
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   814
The first step before implementing an interpreter for a full-blown
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   815
language is to implement a simple calculator for arithmetic
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   816
expressions. Suppose our arithmetic expressions are given by the
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   817
grammar:
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   818
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   819
\begin{plstx}[margin=3cm,one per line]
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   820
: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   821
   | \meta{E} \cdot - \cdot \meta{E} 
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   822
   | \meta{E} \cdot * \cdot \meta{E} 
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   823
   | ( \cdot \meta{E} \cdot )
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   824
   | Number \\
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   825
\end{plstx}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   826
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   827
\noindent
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   828
Naturally we want to implement the grammar in such a way that we can
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   829
calculate what the result of, for example, \texttt{4*2+3} is---we are
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   830
interested in an \texttt{Int} rather than a string. This means every
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   831
component parser needs to have as output type \texttt{Int} and when we
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   832
assemble the intermediate results, strings like \texttt{"+"},
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   833
\texttt{"*"} and so on, need to be translated into the appropriate
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   834
Scala operation of adding, multiplying and so on.  Being inspired by
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   835
the parser for well-nested parentheses above and ignoring the fact
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   836
that we want $*$ to take precedence over $+$ and $-$, we might want to
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   837
write something like
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   838
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   839
\begin{center}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   840
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   841
lazy val E: Parser[String, Int] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   842
  ((E ~ p"+" ~ E).map{ case ((x, y), z) => x + z} ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   843
   (E ~ p"-" ~ E).map{ case ((x, y), z) => x - z} ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   844
   (E ~ p"*" ~ E).map{ case ((x, y), z) => x * z} ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   845
   (p"(" ~ E ~ p")").map{ case ((x, y), z) => y} ||
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   846
   NumParserInt)
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   847
\end{lstlisting}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   848
\end{center}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   849
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   850
\noindent
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   851
Consider again carefully how the semantic actions pick out the correct
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   852
arguments for the calculation. In case of plus, we need \texttt{x} and
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   853
\texttt{z}, because they correspond to the results of the component
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   854
parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   855
an \texttt{Int} because the output type of \texttt{E} is
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   856
\texttt{Int}.  Similarly with subtraction and multiplication. In
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   857
contrast in the fourth clause we need to return \texttt{y}, because it
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   858
is the result enclosed inside the parentheses. The information about
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   859
parentheses, roughly speaking, we just throw away.
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   860
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   861
So far so good. The problem arises when we try to call \pcode{parse_all} with the
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   862
expression \texttt{"1+2+3"}. Lets try it
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   863
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   864
\begin{center}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   865
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   866
E.parse_all("1+2+3")
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   867
\end{lstlisting}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   868
\end{center}
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   869
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   870
\noindent
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   871
\ldots and we wait and wait and \ldots still wait. What is the
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   872
problem? Actually, the parser just fell into an infinite loop! The
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   873
reason is that the above grammar is left-recursive and recall that our
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   874
parser combinators cannot deal with such left-recursive
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   875
grammars. Fortunately, every left-recursive context-free grammar can be
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   876
transformed into a non-left-recursive grammars that still recognises
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   877
the same strings. This allows us to design the following grammar
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   878
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   879
\begin{plstx}[margin=3cm]
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   880
  : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   881
: \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   882
: \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   883
\end{plstx}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   884
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   885
\noindent
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   886
Recall what left-recursive means from Handout 5 and make sure you see
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   887
why this grammar is \emph{non} left-recursive. This version of the grammar
937
dc5ab66b11cc updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 799
diff changeset
   888
also deals with the fact that $*$ should have a higher precedence than $+$ and $-$. This does not
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   889
affect which strings this grammar can recognise, but in which order we are going
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   890
to evaluate any arithmetic expression. We can translate this grammar into
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   891
parsing combinators as follows:
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   892
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   893
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   894
\begin{center}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   895
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   896
lazy val E: Parser[String, Int] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   897
  (T ~ p"+" ~ E).map{ case ((x, y), z) => x + z } ||
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   898
  (T ~ p"-" ~ E).map{ case ((x, y), z) => x - z } || T 
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   899
lazy val T: Parser[String, Int] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   900
  (F ~ p"*" ~ T).map{ case ((x, y), z) => x * z } || F
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   901
lazy val F: Parser[String, Int] = 
799
85267be9a5ed updated
Christian Urban <christian.urban@kcl.ac.uk>
parents: 595
diff changeset
   902
  (p"(" ~ E ~ p")").map{ case ((x, y), z) => y } || NumParserInt
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   903
\end{lstlisting}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   904
\end{center}
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   905
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   906
\noindent
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   907
Let us try out some examples:
592
0d309fafa9f0 updated
Christian Urban <urbanc@in.tum.de>
parents: 591
diff changeset
   908
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   909
\begin{center}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   910
\begin{tabular}{rcl}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   911
  input strings & & output of \pcode{parse_all}\medskip\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   912
  \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   913
  \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   914
  \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   915
  \texttt{\Grid{(4)*((2+3))}} & $\rightarrow$ & \texttt{Set(20)}\\
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   916
  \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   917
  \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\                      
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   918
\end{tabular}
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   919
\end{center}
183
b17eff695c7f added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents: 177
diff changeset
   920
593
bb24d4e207b6 updated
Christian Urban <urbanc@in.tum.de>
parents: 592
diff changeset
   921
\noindent
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   922
Note that we call \pcode{parse_all}, not \pcode{parse}.  The examples
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   923
should be quite self-explanatory. The last two example do not produce
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   924
any integer result because our parser does not define what to do in
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   925
case of division (could be easily added), but also has no idea what to
595
4bf0096bc06b updated
Christian Urban <urbanc@in.tum.de>
parents: 594
diff changeset
   926
do with whitespaces. To deal with them is the task of the lexer! Yes,
594
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   927
we can deal with them inside the grammar, but that would render many
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   928
grammars becoming unintelligible, including this one.\footnote{If you
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   929
  think an easy solution is to extend the notion of what a number
d40d7d7b85bc updated
Christian Urban <urbanc@in.tum.de>
parents: 593
diff changeset
   930
  should be, then think again---you still would have to deal with
595
4bf0096bc06b updated
Christian Urban <urbanc@in.tum.de>
parents: 594
diff changeset
   931
  cases like \texttt{\Grid{(\VS (\VS 2+3)\VS )}}. Just think of the mess 
4bf0096bc06b updated
Christian Urban <urbanc@in.tum.de>
parents: 594
diff changeset
   932
  you would have in a grammar for a full-blown language where there are 
4bf0096bc06b updated
Christian Urban <urbanc@in.tum.de>
parents: 594
diff changeset
   933
  numerous such cases.}
173
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   934
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   935
\end{document}
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   936
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   937
%%% Local Variables: 
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   938
%%% mode: latex  
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   939
%%% TeX-master: t
7cfb7a6f7c99 added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff changeset
   940
%%% End: