handouts/ho06.tex
changeset 386 31295bb945c6
parent 385 7f8516ff408d
child 392 2d0a59127694
equal deleted inserted replaced
385:7f8516ff408d 386:31295bb945c6
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 \section*{Handout 6 (Parser Combinators)}
     8 \section*{Handout 6 (Parser Combinators)}
     9 
     9 
    10 In what follows we explain \emph{parser combinators}, because
    10 In what follows we explain \emph{parser combinators}. Their
    11 they are very easy to implement. However, they only work when
    11 distinguishing feature is that they are very easy to
    12 the grammar to be parsed is \emph{not} left-recursive and they
    12 implement. However, they only work when the grammar to be
    13 are efficient only when the grammar is unambiguous. It is the
    13 parsed is \emph{not} left-recursive and they are efficient
    14 responsibility of the grammar designer to ensure these two
    14 only when the grammar is unambiguous. It is the responsibility
    15 properties.
    15 of the grammar designer to ensure these two properties.
    16 
    16 
    17 Parser combinators can deal with any kind of input as long as
    17 Parser combinators can deal with any kind of input as long as
    18 this input is a kind of sequence, for example a string or a
    18 this input is a kind of sequence, for example a string or a
    19 list of tokens. If the input are lists of tokens, then parser
    19 list of tokens. The general idea behind parser combinators is
    20 combinators transform them into sets of pairs, like so
    20 to transform the input into sets of pairs, like so
    21 
    21 
    22 \begin{center}
    22 \begin{center}
    23 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    23 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    24 $\Rightarrow$
    24 $\Rightarrow$
    25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    26 \end{center} 
    26 \end{center} 
    27 
    27 
    28 \noindent In Scala parser combinators will therefore be
    28 \noindent In Scala parser combinators are functions of type
    29 functions of type
       
    30 
    29 
    31 \begin{center}
    30 \begin{center}
    32 \texttt{I $\Rightarrow$ Set[(T, I)]}
    31 \texttt{I $\Rightarrow$ Set[(T, I)]}
    33 \end{center}
    32 \end{center}
    34 
    33 
    35 \noindent that is they take as input something of type
    34 \noindent that is they take as input something of type
    36 \texttt{I} and return a set of pairs. The first component of
    35 \texttt{I} and return a set of pairs. The first component of
    37 these pairs corresponds to what the parser combinator was able
    36 these pairs corresponds to what the parser combinator was able
    38 to process from the input and the second is the unprocessed
    37 to process from the input and the second is the unprocessed
    39 part of the input. As we shall see shortly, a parser
    38 part of the input. As we shall see shortly, a parser
    40 combinator might return more than one such pair, with the idea
    39 combinator might return more than one such pair, the idea
    41 that there are potentially several ways how to interpret the
    40 being that there are potentially several ways how to interpret
    42 input. To simplify matters we will first look at the case
    41 the input. To simplify matters we will first look at the case
    43 where the input to the parser combinators is just strings. As
    42 where the input to the parser combinators is just strings. As
    44 a concrete example, consider the case where the input is the
    43 a concrete example, consider the string
    45 string
       
    46 
    44 
    47 \begin{center}
    45 \begin{center}
    48 \tt\Grid{iffoo\VS testbar}
    46 \tt\Grid{iffoo\VS testbar}
    49 \end{center}
    47 \end{center}
    50 
    48 
    51 \noindent We might have a parser combinator which tries to
    49 \noindent We might have a parser combinator which tries to
    52 interpret this string as a keyword (\texttt{if}) or an
    50 interpret this string as a keyword (\texttt{if}) or as an
    53 identifier (\texttt{iffoo}). Then the output will be the set
    51 identifier (\texttt{iffoo}). Then the output will be the set
    54 
    52 
    55 \begin{center}
    53 \begin{center}
    56 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
    54 $\left\{ \left(\texttt{\Grid{if}}\;,\; \texttt{\Grid{foo\VS testbar}}\right), 
    57            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
    55            \left(\texttt{\Grid{iffoo}}\;,\; \texttt{\Grid{\VS testbar}}\right) \right\}$
    58 \end{center}
    56 \end{center}
    59 
    57 
    60 \noindent where the first pair means the parser could
    58 \noindent where the first pair means the parser could
    61 recognise \texttt{if} from the input and leaves the rest as
    59 recognise \texttt{if} from the input and leaves the rest as
    62 `unprocessed' as the second component of the pair; in the
    60 `unprocessed' as the second component of the pair; in the
    63 other case it could recognise \texttt{iffoo} and leaves
    61 other case it could recognise \texttt{iffoo} and leaves
    64 \texttt{\VS testbar} as unprocessed. If the parser cannot
    62 \texttt{\VS testbar} as unprocessed. If the parser cannot
    65 recognise anything from the input then parser combinators just
    63 recognise anything from the input, then parser combinators just
    66 return the empty set $\{\}$. This will indicate something
    64 return the empty set $\{\}$. This will indicate something
    67 ``went wrong''\ldots or more precisely, nothing could be
    65 ``went wrong''\ldots or more precisely, nothing could be
    68 parsed.
    66 parsed.
    69 
    67 
    70 The main attraction of parser combinators is that we can
    68 The main attraction of parser combinators is that we can
    91 
    89 
    92 \noindent From the function \texttt{parse} we can then
    90 \noindent From the function \texttt{parse} we can then
    93 ``centrally'' derive the function \texttt{parse\_all}, which
    91 ``centrally'' derive the function \texttt{parse\_all}, which
    94 just filters out all pairs whose second component is not empty
    92 just filters out all pairs whose second component is not empty
    95 (that is has still some unprocessed part). The reason is that
    93 (that is has still some unprocessed part). The reason is that
    96 at the end of parsing we are only interested in the results
    94 at the end of the parsing we are only interested in the results
    97 where all the input has been consumed and no unprocessed part
    95 where all the input has been consumed and no unprocessed part
    98 is left.
    96 is left over.
    99 
    97 
   100 One of the simplest parser combinators recognises just a
    98 One of the simplest parser combinators recognises just a
   101 character, say $c$, from the beginning of strings. Its
    99 character, say $c$, from the beginning of strings. Its
   102 behaviour can be described as follows:
   100 behaviour can be described as follows:
   103 
   101 
   104 \begin{itemize}
   102 \begin{itemize}
   105 \item if the head of the input string starts with a $c$, it returns 
   103 \item if the head of the input string starts with a $c$, then returns 
   106 	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} 
   104 	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} 
   107 	$s$ is the unprocessed part of the input string
   105 	$s$ is the unprocessed part of the input string
   108 \item otherwise it returns the empty set $\{\}$	
   106 \item otherwise return the empty set $\{\}$	
   109 \end{itemize}
   107 \end{itemize}
   110 
   108 
   111 \noindent
   109 \noindent
   112 The input type of this simple parser combinator for characters is
   110 The input type of this simple parser combinator for characters is
   113 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
   111 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
   133 
   131 
   134 More interesting are the parser combinators that build larger
   132 More interesting are the parser combinators that build larger
   135 parsers out of smaller component parsers. For example the
   133 parsers out of smaller component parsers. For example the
   136 alternative parser combinator is as follows: given two
   134 alternative parser combinator is as follows: given two
   137 parsers, say, $p$ and $q$, then we apply both parsers to the
   135 parsers, say, $p$ and $q$, then we apply both parsers to the
   138 input (remember parsers are functions) and combine the input
   136 input (remember parsers are functions) and combine the output
       
   137 (remember they are sets)
   139  
   138  
   140 \begin{center}
   139 \begin{center}
   141 $p(\text{input}) \cup q(\text{input})$
   140 $p(\text{input}) \cup q(\text{input})$
   142 \end{center}
   141 \end{center}
   143 
   142 
   144 \noindent
   143 \noindent In Scala we would implement alternative parser
   145 In Scala we would implement alternative parsers as follows
   144 combinator as follows
   146 
   145 
   147 \begin{center}
   146 \begin{center}
   148 \begin{lstlisting}[language=Scala, numbers=none]
   147 \begin{lstlisting}[language=Scala, numbers=none]
   149 class AltParser[I, T]
   148 class AltParser[I, T]
   150        (p: => Parser[I, T], 
   149        (p: => Parser[I, T], 
   155 \end{center}
   154 \end{center}
   156 
   155 
   157 \noindent The types of this parser combinator are polymorphic
   156 \noindent The types of this parser combinator are polymorphic
   158 (we just have \texttt{I} for the input type, and \texttt{T}
   157 (we just have \texttt{I} for the input type, and \texttt{T}
   159 for the output type). The alternative parser builds a new
   158 for the output type). The alternative parser builds a new
   160 parser out of two existing parser combinator \texttt{p} and
   159 parser out of two existing parsers \texttt{p} and \texttt{q}.
   161 \texttt{q}. Both need to be able to process input of type
   160 Both need to be able to process input of type \texttt{I} and
   162 \texttt{I} and return the same output type \texttt{Set[(T,
   161 return the same output type \texttt{Set[(T,
   163 I)]}. (There is an interesting detail of Scala, namely the
   162 I)]}.\footnote{There is an interesting detail of Scala, namely
   164 \texttt{=>} in front of the types of \texttt{p} and
   163 the \texttt{=>} in front of the types of \texttt{p} and
   165 \texttt{q}. They will prevent the evaluation of the arguments
   164 \texttt{q}. They will prevent the evaluation of the arguments
   166 before they are used. This is often called \emph{lazy
   165 before they are used. This is often called \emph{lazy
   167 evaluation} of the arguments.) The alternative parser should
   166 evaluation} of the arguments. We will explain this later.} The alternative parser should
   168 run the input with the first parser \texttt{p} (producing a
   167 run the input with the first parser \texttt{p} (producing a
   169 set of outputs) and then run the same input with \texttt{q}.
   168 set of outputs) and then run the same input with \texttt{q}.
   170 The result should be then just the union of both sets, which
   169 The result should be then just the union of both sets, which
   171 is the operation \texttt{++} in Scala.
   170 is the operation \texttt{++} in Scala.
   172 
   171 
   173 This parser combinator already allows us to construct a parser
   172 The alternative parser combinator already allows us to
   174 that either a character \texttt{a} or \texttt{b}, as
   173 construct a parser that parses either a character \texttt{a}
       
   174 or \texttt{b}, as
   175 
   175 
   176 \begin{center}
   176 \begin{center}
   177 \begin{lstlisting}[language=Scala, numbers=none]
   177 \begin{lstlisting}[language=Scala, numbers=none]
   178 new AltParser(CharParser('a'), CharParser('b'))
   178 new AltParser(CharParser('a'), CharParser('b'))
   179 \end{lstlisting}
   179 \end{lstlisting}
   192 \end{tabular}
   192 \end{tabular}
   193 \end{center}
   193 \end{center}
   194 
   194 
   195 \noindent We receive in the first two cases a successful
   195 \noindent We receive in the first two cases a successful
   196 output (that is a non-empty set). In each case, either
   196 output (that is a non-empty set). In each case, either
   197 \pcode{a} or \pcode{b} are the processed parts, and \pcode{c}
   197 \pcode{a} or \pcode{b} are in the processed parts, and
   198 is the unprocessed part. Clearly this parser cannot parse
   198 \pcode{c} in the unprocessed part. Clearly this parser cannot
   199 anything with the string \pcode{cc}.
   199 parse anything in the string \pcode{cc}, therefore the empty
       
   200 set.
   200 
   201 
   201 A bit more interesting is the \emph{sequence parser
   202 A bit more interesting is the \emph{sequence parser
   202 combinator}. Given two parsers, say, $p$ and $q$,
   203 combinator}. Given two parsers, say, $p$ and $q$, apply first
   203 apply first the input to $p$ producing a set of pairs;
   204 the input to $p$ producing a set of pairs; then apply $q$ to
   204 then apply $q$ to the unparsed parts; then combine the 
   205 all the unparsed parts; then combine the results like
   205 results like
       
   206 
   206 
   207 \begin{center}
   207 \begin{center}
   208 \begin{tabular}{lcl}
   208 \begin{tabular}{lcl}
   209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
   209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
   210 $(\textit{output}_1, u_1) \in p(\text{input}) 
   210 $(\textit{output}_1, u_1) \in p(\text{input}) 
   211 \;\wedge\;$\\
   211 \;\wedge\;$\\
   212 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   212 && $\;(\textit{output}_2, u_2) \in q(u_1)\}$
   213 \end{tabular}
   213 \end{tabular}
   214 \end{center}
   214 \end{center}
   215 
   215 
   216 \noindent
   216 \noindent
   217 This can be implemented in Scala as follows:
   217 This can be implemented in Scala as follows:
   220 \begin{lstlisting}[language=Scala,numbers=none]
   220 \begin{lstlisting}[language=Scala,numbers=none]
   221 class SeqParser[I, T, S]
   221 class SeqParser[I, T, S]
   222        (p: => Parser[I, T], 
   222        (p: => Parser[I, T], 
   223         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   223         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   224   def parse(sb: I) = 
   224   def parse(sb: I) = 
   225     for ((head1, tail1) <- p.parse(sb); 
   225     for ((output1, u1) <- p.parse(sb); 
   226          (head2, tail2) <- q.parse(tail1)) 
   226          (output2, u2) <- q.parse(u1)) 
   227             yield ((head1, head2), tail2)
   227             yield ((output1, output2), u2)
   228 }
   228 }
   229 \end{lstlisting}
   229 \end{lstlisting}
   230 \end{center}
   230 \end{center}
   231 
   231 
   232 \noindent This parser takes as input two parsers, \texttt{p}
   232 \noindent This parser takes as input two parsers, \texttt{p}
   233 and \texttt{q}. It implements \texttt{parse} as follows: let
   233 and \texttt{q}. It implements \texttt{parse} as follows: let
   234 first run the parser \texttt{p} on the input producing a set
   234 first run the parser \texttt{p} on the input producing a set
   235 of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1}
   235 of pairs (\texttt{output1}, \texttt{u1}). The \texttt{u1}
   236 stands for the unprocessed parts left over by \texttt{p}. Let
   236 stands for the unprocessed parts left over by \texttt{p}. Let
   237 \texttt{q} run on these unprocessed parts producing again a
   237 \texttt{q} run on these unprocessed parts producing again a
   238 set of pairs. The output of the sequence parser combinator is
   238 set of pairs. The output of the sequence parser combinator is
   239 then a set containing pairs where the first components are
   239 then a set containing pairs where the first components are
   240 again pairs, namely what the first parser could parse together
   240 again pairs, namely what the first parser could parse together
   247 \begin{center}
   247 \begin{center}
   248 \texttt{Set[((T, S), I)]}
   248 \texttt{Set[((T, S), I)]}
   249 \end{center}
   249 \end{center}
   250 
   250 
   251 Scala allows us to provide some shorthand notation for the
   251 Scala allows us to provide some shorthand notation for the
   252 sequence parser combinator. So we can write for example
   252 sequence parser combinator. We can write for example
   253 \texttt{'a' $\sim$ 'b'}, which is the parser combinator that
   253 \pcode{'a' ~ 'b'}, which is the parser combinator that
   254 first consumes the character \texttt{a} from a string and then
   254 first consumes the character \texttt{a} from a string and then
   255 \texttt{b}. Three examples of this parser combinator are as
   255 \texttt{b}. Three examples of this parser combinator are as
   256 follows:
   256 follows:
   257 
   257 
   258 \begin{center}
   258 \begin{center}
   262 \texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
   262 \texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
   263 \texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
   263 \texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
   264 \end{tabular}
   264 \end{tabular}
   265 \end{center}
   265 \end{center}
   266 
   266 
   267 \noindent A slightly more complicated parser is \texttt{('a'
   267 \noindent A slightly more complicated parser is \pcode{('a'
   268 || 'b') $\sim$ 'b'} which parses as first character either an
   268 || 'b') ~ 'b'} which parses as first character either an
   269 \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
   269 \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
   270 produces the following results.
   270 produces the following outputs.
   271 
   271 
   272 \begin{center}
   272 \begin{center}
   273 \begin{tabular}{rcl}
   273 \begin{tabular}{rcl}
   274 input strings & & output\medskip\\
   274 input strings & & output\medskip\\
   275 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   275 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   287 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   287 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   288 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   288 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
   289 \end{tabular}
   289 \end{tabular}
   290 \end{center}
   290 \end{center}
   291 
   291 
   292 \noindent Notice how the results nest deeper as pairs (the
   292 \noindent Notice how the results nest deeper and deeper as
   293 last \pcode{a} is in the unprocessed part). To consume
   293 pairs (the last \pcode{a} is in the unprocessed part). To
   294 everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~
   294 consume everything of this string we can use the parser
   295 'a'}. Then the output is as follows:
   295 \pcode{(('a' ~'a') ~ 'a') ~ 'a'}. Then the output is as
       
   296 follows:
   296 
   297 
   297 \begin{center}
   298 \begin{center}
   298 \begin{tabular}{rcl}
   299 \begin{tabular}{rcl}
   299 input string & & output\medskip\\
   300 input string & & output\medskip\\
   300 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   301 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   305 \noindent This is an instance where the parser consumed
   306 \noindent This is an instance where the parser consumed
   306 completely the input, meaning the unprocessed part is just the
   307 completely the input, meaning the unprocessed part is just the
   307 empty string.
   308 empty string.
   308 
   309 
   309 
   310 
   310 Note carefully that constructing a parser such \texttt{'a' ||
   311 Note carefully that constructing a parser such \pcode{'a' ||
   311 ('a' $\sim$ 'b')} will result in a tying error. The first
   312 ('a' ~ 'b')} will result in a typing error. The first
   312 parser has as output type a single character (recall the type
   313 parser has as output type a single character (recall the type
   313 of \texttt{CharParser}), but the second parser produces a pair
   314 of \texttt{CharParser}), but the second parser produces a pair
   314 of characters as output. The alternative parser is however
   315 of characters as output. The alternative parser is however
   315 required to have both component parsers to have the same type.
   316 required to have both component parsers to have the same type.
   316 We will see later how we can build this parser without the
   317 We will see later how we can build this parser without the
   317 typing error.
   318 typing error.
   318 
   319 
   319 The next parser combinator does not actually combine smaller
   320 The next parser combinator does not actually combine smaller
   320 parsers, but applies a function to the result of the parser.
   321 parsers, but applies a function to the result of a parser.
   321 It is implemented in Scala as follows
   322 It is implemented in Scala as follows
   322 
   323 
   323 \begin{center}
   324 \begin{center}
   324 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   325 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   325 class FunParser[I, T, S]
   326 class FunParser[I, T, S]
   330 }
   331 }
   331 \end{lstlisting}
   332 \end{lstlisting}
   332 \end{center}
   333 \end{center}
   333 
   334 
   334 
   335 
   335 \noindent
   336 \noindent This parser combinator takes a parser \texttt{p}
   336 This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
   337 with output type \texttt{T} as one argument as well as a
   337 input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
   338 function \texttt{f} with type \texttt{T => S}. The parser
   338 produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
   339 \texttt{p} produces sets of type \texttt{(T, I)}. The
   339 applies the function \texttt{f} to all the parer outputs. Since this function
   340 \texttt{FunParser} combinator then applies the function
   340 is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
   341 \texttt{f} to all the parser outputs. Since this function is of
   341 Again Scala lets us introduce some shorthand notation for this parser combinator. 
   342 type \texttt{T => S}, we obtain a parser with output type
   342 Therefore we will write \texttt{p ==> f} for it.
   343 \texttt{S}. Again Scala lets us introduce some shorthand
       
   344 notation for this parser combinator. Therefore we will write
       
   345 \texttt{p ==> f} for it.
       
   346 
       
   347 \subsubsection*{How to build parsers using parser combinators?}
       
   348 
       
   349 \subsubsection*{Implementing an Interpreter}
   343 
   350 
   344 %\bigskip
   351 %\bigskip
   345 %takes advantage of the full generality---have a look
   352 %takes advantage of the full generality---have a look
   346 %what it produces if we call it with the string \texttt{abc}
   353 %what it produces if we call it with the string \texttt{abc}
   347 %
   354 %