handouts/ho06.tex
changeset 385 7f8516ff408d
parent 297 5c51839c88fd
child 386 31295bb945c6
equal deleted inserted replaced
384:4629448c1bd9 385:7f8516ff408d
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 \section*{Handout 6 (Parser Combinators)}
     8 \section*{Handout 6 (Parser Combinators)}
     9 
     9 
    10 While regular expressions are very useful for lexing and for recognising
    10 In what follows we explain \emph{parser combinators}, because
    11 many patterns in strings (like email addresses), they have their limitations. For
    11 they are very easy to implement. However, they only work when
    12 example there is no regular expression that can recognise the language 
    12 the grammar to be parsed is \emph{not} left-recursive and they
    13 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
    13 are efficient only when the grammar is unambiguous. It is the
    14 expressions.  In languages like Lisp, which use parentheses rather
    14 responsibility of the grammar designer to ensure these two
    15 extensively, it might be of interest whether the following two expressions
    15 properties.
    16 are well-parenthesised (the left one is, the right one is not):
    16 
    17 
    17 Parser combinators can deal with any kind of input as long as
    18 \begin{center}
    18 this input is a kind of sequence, for example a string or a
    19 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    19 list of tokens. If the input are lists of tokens, then parser
    20 \end{center}
    20 combinators transform them into sets of pairs, like so
    21 
    21 
    22 \noindent
    22 \begin{center}
    23 Not being able to solve such recognition problems is a serious limitation.
    23 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
    24 In order to solve such recognition problems, we need more powerful 
    24 $\Rightarrow$
    25 techniques than regular expressions. We will in particular look at \emph{context-free
    25 $\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
    26 languages}. They include the regular languages as the picture below shows:
    26 \end{center} 
    27 
    27 
    28 
    28 \noindent In Scala parser combinators will therefore be
    29 \begin{center}
    29 functions of type
    30 \begin{tikzpicture}
       
    31 [rect/.style={draw=black!50, 
       
    32  top color=white,bottom color=black!20, 
       
    33  rectangle, very thick, rounded corners}]
       
    34 
       
    35 \draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
       
    36 \draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
       
    37 \draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
       
    38 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
       
    39 \draw (0,-1.05) node [rect] {\small regular languages};
       
    40 \end{tikzpicture}
       
    41 \end{center}
       
    42 
       
    43 \noindent
       
    44 Context-free languages play an important role in `day-to-day' text processing and in
       
    45 programming languages. Context-free languages are usually specified by grammars.
       
    46 For example a grammar for well-parenthesised  expressions is
       
    47 
       
    48 \begin{center}
       
    49 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
       
    50 \end{center}
       
    51  
       
    52 \noindent
       
    53 In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
       
    54 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
       
    55 have the shape
       
    56 
       
    57 \begin{center}
       
    58 $NT \;\;\rightarrow\;\; \textit{rhs}$
       
    59 \end{center}
       
    60  
       
    61 \noindent
       
    62 where on the left-hand side is a single non-terminal and on the right a string consisting
       
    63 of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
       
    64 empty string. We use the convention  to separate components on
       
    65 the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised  expressions.
       
    66 We also use the convention to use $|$ as a shorthand notation for several rules. For example
       
    67 
       
    68 \begin{center}
       
    69 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
       
    70 \end{center}
       
    71 
       
    72 \noindent
       
    73 means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
       
    74 If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
       
    75 what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
       
    76 can be given as follows
       
    77 
       
    78 \begin{center}
       
    79 \begin{tabular}{lcl}
       
    80 $E$ & $\rightarrow$ &  $N$ \\
       
    81 $E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
       
    82 $E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
       
    83 $E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
       
    84 $E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
       
    85 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ 
       
    86 \end{tabular}
       
    87 \end{center}
       
    88 
       
    89 \noindent
       
    90 where $E$ is the starting symbol. A \emph{derivation} for a grammar
       
    91 starts with the staring symbol of the grammar and in each step replaces one
       
    92 non-terminal by a right-hand side of a rule. A derivation ends with a string
       
    93 in which only terminal symbols are left. For example a derivation for the
       
    94 string $(1 + 2) + 3$ is as follows:
       
    95 
       
    96 \begin{center}
       
    97 \begin{tabular}{lll}
       
    98 $E$ & $\rightarrow$ & $E+E$\\
       
    99        & $\rightarrow$ & $(E)+E$\\
       
   100        & $\rightarrow$ & $(E+E)+E$\\
       
   101        & $\rightarrow$ & $(E+E)+N$\\
       
   102        & $\rightarrow$ & $(E+E)+3$\\
       
   103        & $\rightarrow$ & $(N+E)+3$\\	
       
   104        & $\rightarrow^+$ & $(1+2)+3$\\
       
   105 \end{tabular} 
       
   106 \end{center}
       
   107 
       
   108 \noindent
       
   109 The \emph{language} of a context-free grammar $G$ with start symbol $S$ 
       
   110 is defined as the set of strings derivable by a derivation, that is
       
   111 
       
   112 \begin{center}
       
   113 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
       
   114 \end{center}
       
   115 
       
   116 \noindent
       
   117 A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
       
   118 top and each non-terminal containing a subtree for how it is replaced in a derivation.
       
   119 The parse tree for the string $(1 + 23)+4$ is as follows:
       
   120 
       
   121 \begin{center}
       
   122 \begin{tikzpicture}[level distance=8mm, black]
       
   123   \node {$E$}
       
   124     child {node {$E$} 
       
   125        child {node {$($}}
       
   126        child {node {$E$}       
       
   127          child {node {$E$} child {node {$N$} child {node {$1$}}}}
       
   128          child {node {$+$}}
       
   129          child {node {$E$} 
       
   130             child {node {$N$} child {node {$2$}}}
       
   131             child {node {$N$} child {node {$3$}}}
       
   132             } 
       
   133         }
       
   134        child {node {$)$}}
       
   135      }
       
   136      child {node {$+$}}
       
   137      child {node {$E$}
       
   138         child {node {$N$} child {node {$4$}}}
       
   139      };
       
   140 \end{tikzpicture}
       
   141 \end{center}
       
   142 
       
   143 \noindent
       
   144 We are often interested in these parse-trees since they encode the structure of
       
   145 how a string is derived by a grammar. Before we come to the problem of constructing
       
   146 such parse-trees, we need to consider the following two properties of grammars.
       
   147 A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
       
   148 $NT$ which leads to a string which again starts with $NT$. This means a derivation of the
       
   149 form.
       
   150 
       
   151 \begin{center}
       
   152 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
       
   153 \end{center}
       
   154 
       
   155 \noindent
       
   156 It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
       
   157 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
       
   158 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
       
   159 grammars. Fortunately every left-recursive grammar can be transformed into one that is
       
   160 not left-recursive, although this transformation might make the grammar less human-readable.
       
   161 For example if we want to give a non-left-recursive grammar for numbers we might
       
   162 specify
       
   163 
       
   164 \begin{center}
       
   165 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
       
   166 \end{center}
       
   167 
       
   168 \noindent
       
   169 Using this grammar we can still derive every number string, but we will never be able 
       
   170 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
       
   171 
       
   172 The other property we have to watch out for is when a grammar is
       
   173 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
       
   174 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
       
   175 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
       
   176 general. For example there are two parse
       
   177 trees for the string $1 + 2 + 3$, namely
       
   178 
       
   179 \begin{center}
       
   180 \begin{tabular}{c@{\hspace{10mm}}c}
       
   181 \begin{tikzpicture}[level distance=8mm, black]
       
   182   \node {$E$}
       
   183     child {node {$E$} child {node {$N$} child {node {$1$}}}}
       
   184     child {node {$+$}}
       
   185     child {node {$E$}
       
   186        child {node {$E$} child {node {$N$} child {node {$2$}}}}
       
   187        child {node {$+$}}
       
   188        child {node {$E$} child {node {$N$} child {node {$3$}}}}
       
   189     }
       
   190     ;
       
   191 \end{tikzpicture} 
       
   192 &
       
   193 \begin{tikzpicture}[level distance=8mm, black]
       
   194   \node {$E$}
       
   195     child {node {$E$}
       
   196        child {node {$E$} child {node {$N$} child {node {$1$}}}}
       
   197        child {node {$+$}}
       
   198        child {node {$E$} child {node {$N$} child {node {$2$}}}} 
       
   199     }
       
   200     child {node {$+$}}
       
   201     child {node {$E$} child {node {$N$} child {node {$3$}}}}
       
   202     ;
       
   203 \end{tikzpicture}
       
   204 \end{tabular} 
       
   205 \end{center}
       
   206 
       
   207 \noindent
       
   208 In particular in programming languages we will try to avoid ambiguous
       
   209 grammars because two different parse-trees for a string mean a program can
       
   210 be interpreted in two different ways. In such cases we have to somehow make sure
       
   211 the two different ways do not matter, or disambiguate the grammar in
       
   212 some other way (for example making the $+$ left-associative). Unfortunately already 
       
   213 the problem of deciding whether a grammar
       
   214 is ambiguous or not is in general undecidable. 
       
   215 
       
   216 Let us now turn to the problem of generating a parse-tree for a grammar and string.
       
   217 In what follows we explain \emph{parser combinators}, because they are easy
       
   218 to implement and closely resemble grammar rules. Imagine that a grammar
       
   219 describes the strings of natural numbers, such as the grammar $N$ shown above.
       
   220 For all such strings we want to generate the parse-trees or later on we actually 
       
   221 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
       
   222 these strings. In Scala the parser combinators will be functions of type
       
   223 
    30 
   224 \begin{center}
    31 \begin{center}
   225 \texttt{I $\Rightarrow$ Set[(T, I)]}
    32 \texttt{I $\Rightarrow$ Set[(T, I)]}
   226 \end{center}
    33 \end{center}
   227 
    34 
   228 \noindent 
    35 \noindent that is they take as input something of type
   229 that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
    36 \texttt{I} and return a set of pairs. The first component of
   230 and return a set of pairs. The first component of these pairs corresponds to what the
    37 these pairs corresponds to what the parser combinator was able
   231 parser combinator was able to process from the input and the second is the unprocessed 
    38 to process from the input and the second is the unprocessed
   232 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
    39 part of the input. As we shall see shortly, a parser
   233 with the idea that there are potentially several ways how to interpret the input. As a concrete
    40 combinator might return more than one such pair, with the idea
   234 example, consider the case where the input is of type string, say the string
    41 that there are potentially several ways how to interpret the
       
    42 input. To simplify matters we will first look at the case
       
    43 where the input to the parser combinators is just strings. As
       
    44 a concrete example, consider the case where the input is the
       
    45 string
   235 
    46 
   236 \begin{center}
    47 \begin{center}
   237 \tt\Grid{iffoo\VS testbar}
    48 \tt\Grid{iffoo\VS testbar}
   238 \end{center}
    49 \end{center}
   239 
    50 
   240 \noindent
    51 \noindent We might have a parser combinator which tries to
   241 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
    52 interpret this string as a keyword (\texttt{if}) or an
   242 an identifier (\texttt{iffoo}). Then the output will be the set
    53 identifier (\texttt{iffoo}). Then the output will be the set
   243 
    54 
   244 \begin{center}
    55 \begin{center}
   245 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
    56 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
   246            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
    57            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
   247 \end{center}
    58 \end{center}
   248 
    59 
   249 \noindent
    60 \noindent where the first pair means the parser could
   250 where the first pair means the parser could recognise \texttt{if} from the input and leaves 
    61 recognise \texttt{if} from the input and leaves the rest as
   251 the rest as `unprocessed' as the second component of the pair; in the other case
    62 `unprocessed' as the second component of the pair; in the
   252 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
    63 other case it could recognise \texttt{iffoo} and leaves
   253 cannot recognise anything from the input then parser combinators just return the empty 
    64 \texttt{\VS testbar} as unprocessed. If the parser cannot
   254 set $\varnothing$. This will indicate something ``went wrong''.
    65 recognise anything from the input then parser combinators just
   255 
    66 return the empty set $\{\}$. This will indicate something
   256 The main attraction is that we can easily build parser combinators out of smaller components
    67 ``went wrong''\ldots or more precisely, nothing could be
   257 following very closely the structure of a grammar. In order to implement this in an object
    68 parsed.
   258 oriented programming language, like Scala, we need to specify an abstract class for parser 
    69 
   259 combinators. This abstract class requires the implementation of the function
    70 The main attraction of parser combinators is that we can
   260 \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
    71 easily build parser combinators out of smaller components
   261 \mbox{\texttt{Set[(T, I)]}}.
    72 following very closely the structure of a grammar. In order to
   262 
    73 implement this in an object-oriented programming language,
   263 \begin{center}
    74 like Scala, we need to specify an abstract class for parser
   264 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
    75 combinators. This abstract class requires the implementation
       
    76 of the function \texttt{parse} taking an argument of type
       
    77 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
       
    78 I)]}}.
       
    79 
       
    80 \begin{center}
       
    81 \begin{lstlisting}[language=Scala]
   265 abstract class Parser[I, T] {
    82 abstract class Parser[I, T] {
   266   def parse(ts: I): Set[(T, I)]
    83   def parse(ts: I): Set[(T, I)]
   267 
    84 
   268   def parse_all(ts: I): Set[T] =
    85   def parse_all(ts: I): Set[T] =
   269     for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
    86     for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
   270       yield head
    87       yield head
   271 }
    88 }
   272 \end{lstlisting}
    89 \end{lstlisting}
   273 \end{center}
    90 \end{center}
   274 
    91 
   275 \noindent
    92 \noindent From the function \texttt{parse} we can then
   276 From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
    93 ``centrally'' derive the function \texttt{parse\_all}, which
   277 which just filters out all pairs whose second component is not empty (that is has still some
    94 just filters out all pairs whose second component is not empty
   278 unprocessed part). The reason is that at the end of parsing we are only interested in the
    95 (that is has still some unprocessed part). The reason is that
   279 results where all the input has been consumed and no unprocessed part is left.
    96 at the end of parsing we are only interested in the results
   280 
    97 where all the input has been consumed and no unprocessed part
   281 One of the simplest parser combinators recognises just a character, say $c$, 
    98 is left.
   282 from the beginning of strings. Its behaviour is as follows:
    99 
       
   100 One of the simplest parser combinators recognises just a
       
   101 character, say $c$, from the beginning of strings. Its
       
   102 behaviour can be described as follows:
   283 
   103 
   284 \begin{itemize}
   104 \begin{itemize}
   285 \item if the head of the input string starts with a $c$, it returns 
   105 \item if the head of the input string starts with a $c$, it returns 
   286 	the set $\{(c, \textit{tail of}\; s)\}$
   106 	the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} 
   287 \item otherwise it returns the empty set $\varnothing$	
   107 	$s$ is the unprocessed part of the input string
       
   108 \item otherwise it returns the empty set $\{\}$	
   288 \end{itemize}
   109 \end{itemize}
   289 
   110 
   290 \noindent
   111 \noindent
   291 The input type of this simple parser combinator for characters is
   112 The input type of this simple parser combinator for characters is
   292 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
   113 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
   299     if (sb.head == c) Set((c, sb.tail)) else Set()
   120     if (sb.head == c) Set((c, sb.tail)) else Set()
   300 }
   121 }
   301 \end{lstlisting}
   122 \end{lstlisting}
   302 \end{center}
   123 \end{center}
   303 
   124 
       
   125 \noindent The \texttt{parse} function tests whether the first
       
   126 character of the input string \texttt{sb} is equal to
       
   127 \texttt{c}. If yes, then it splits the string into the
       
   128 recognised part \texttt{c} and the unprocessed part
       
   129 \texttt{sb.tail}. In case \texttt{sb} does not start with
       
   130 \texttt{c} then the parser returns the empty set (in Scala
       
   131 \texttt{Set()}).
       
   132 
       
   133 
       
   134 More interesting are the parser combinators that build larger
       
   135 parsers out of smaller component parsers. For example the
       
   136 alternative parser combinator is as follows: given two
       
   137 parsers, say, $p$ and $q$, then we apply both parsers to the
       
   138 input (remember parsers are functions) and combine the input
       
   139  
       
   140 \begin{center}
       
   141 $p(\text{input}) \cup q(\text{input})$
       
   142 \end{center}
       
   143 
   304 \noindent
   144 \noindent
   305 The \texttt{parse} function tests whether the first character of the 
   145 In Scala we would implement alternative parsers as follows
   306 input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
   146 
   307 string into the recognised part \texttt{c} and the unprocessed part
   147 \begin{center}
   308 \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
   148 \begin{lstlisting}[language=Scala, numbers=none]
   309 the parser returns the empty set (in Scala \texttt{Set()}).
       
   310 
       
   311 More interesting are the parser combinators that build larger parsers
       
   312 out of smaller component parsers. For example the alternative 
       
   313 parser combinator is as follows.
       
   314 
       
   315 \begin{center}
       
   316 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   317 class AltParser[I, T]
   149 class AltParser[I, T]
   318        (p: => Parser[I, T], 
   150        (p: => Parser[I, T], 
   319         q: => Parser[I, T]) extends Parser[I, T] {
   151         q: => Parser[I, T]) extends Parser[I, T] {
   320   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
   152   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
   321 }
   153 }
   322 \end{lstlisting}
   154 \end{lstlisting}
   323 \end{center}
   155 \end{center}
   324 
   156 
   325 \noindent
   157 \noindent The types of this parser combinator are polymorphic
   326 The types of this parser combinator are polymorphic (we just have \texttt{I}
   158 (we just have \texttt{I} for the input type, and \texttt{T}
   327 for the input type, and \texttt{T} for the output type). The alternative parser
   159 for the output type). The alternative parser builds a new
   328 builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
   160 parser out of two existing parser combinator \texttt{p} and
   329 Both need to be able to process input of type \texttt{I} and return the same
   161 \texttt{q}. Both need to be able to process input of type
   330 output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
   162 \texttt{I} and return the same output type \texttt{Set[(T,
   331 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
   163 I)]}. (There is an interesting detail of Scala, namely the
   332 evaluation of the arguments before they are used. This is often called 
   164 \texttt{=>} in front of the types of \texttt{p} and
   333 \emph{lazy evaluation} of the arguments.) The alternative parser should run
   165 \texttt{q}. They will prevent the evaluation of the arguments
   334 the input with the first parser \texttt{p} (producing a set of outputs) and then
   166 before they are used. This is often called \emph{lazy
   335 run the same input with \texttt{q}. The result should be then just the union
   167 evaluation} of the arguments.) The alternative parser should
   336 of both sets, which is the operation \texttt{++} in Scala.
   168 run the input with the first parser \texttt{p} (producing a
   337 
   169 set of outputs) and then run the same input with \texttt{q}.
   338 This parser combinator already allows us to construct a parser that either 
   170 The result should be then just the union of both sets, which
   339 a character \texttt{a} or \texttt{b}, as
   171 is the operation \texttt{++} in Scala.
   340 
   172 
   341 \begin{center}
   173 This parser combinator already allows us to construct a parser
   342 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   174 that either a character \texttt{a} or \texttt{b}, as
       
   175 
       
   176 \begin{center}
       
   177 \begin{lstlisting}[language=Scala, numbers=none]
   343 new AltParser(CharParser('a'), CharParser('b'))
   178 new AltParser(CharParser('a'), CharParser('b'))
   344 \end{lstlisting}
   179 \end{lstlisting}
   345 \end{center}
   180 \end{center}
   346 
   181 
   347 \noindent
   182 \noindent Scala allows us to introduce some more readable
   348 Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
   183 shorthand notation for this, like \texttt{'a' || 'b'}. We can
   349 We can call this parser combinator with the strings
   184 call this parser combinator with the strings
   350 
   185 
   351 \begin{center}
   186 \begin{center}
   352 \begin{tabular}{rcl}
   187 \begin{tabular}{rcl}
   353 input string & & output\medskip\\
   188 input strings & & output\medskip\\
   354 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
   189 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
   355 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
   190 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
   356 \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
   191 \texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$
       
   192 \end{tabular}
       
   193 \end{center}
       
   194 
       
   195 \noindent We receive in the first two cases a successful
       
   196 output (that is a non-empty set). In each case, either
       
   197 \pcode{a} or \pcode{b} are the processed parts, and \pcode{c}
       
   198 is the unprocessed part. Clearly this parser cannot parse
       
   199 anything with the string \pcode{cc}.
       
   200 
       
   201 A bit more interesting is the \emph{sequence parser
       
   202 combinator}. Given two parsers, say, $p$ and $q$,
       
   203 apply first the input to $p$ producing a set of pairs;
       
   204 then apply $q$ to the unparsed parts; then combine the 
       
   205 results like
       
   206 
       
   207 \begin{center}
       
   208 \begin{tabular}{lcl}
       
   209 $\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & 
       
   210 $(\textit{output}_1, u_1) \in p(\text{input}) 
       
   211 \;\wedge\;$\\
       
   212 && $(\textit{output}_2, u_2) \in q(u_1)\}$
   357 \end{tabular}
   213 \end{tabular}
   358 \end{center}
   214 \end{center}
   359 
   215 
   360 \noindent
   216 \noindent
   361 We receive in the first two cases a successful output (that is a non-empty set).
   217 This can be implemented in Scala as follows:
   362 
   218 
   363 A bit more interesting is the \emph{sequence parser combinator} implemented in
   219 \begin{center}
   364 Scala as follows:
   220 \begin{lstlisting}[language=Scala,numbers=none]
   365 
       
   366 \begin{center}
       
   367 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   368 class SeqParser[I, T, S]
   221 class SeqParser[I, T, S]
   369        (p: => Parser[I, T], 
   222        (p: => Parser[I, T], 
   370         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   223         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   371   def parse(sb: I) = 
   224   def parse(sb: I) = 
   372     for ((head1, tail1) <- p.parse(sb); 
   225     for ((head1, tail1) <- p.parse(sb); 
   374             yield ((head1, head2), tail2)
   227             yield ((head1, head2), tail2)
   375 }
   228 }
   376 \end{lstlisting}
   229 \end{lstlisting}
   377 \end{center}
   230 \end{center}
   378 
   231 
   379 \noindent
   232 \noindent This parser takes as input two parsers, \texttt{p}
   380 This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
   233 and \texttt{q}. It implements \texttt{parse} as follows: let
   381 as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
   234 first run the parser \texttt{p} on the input producing a set
   382 The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
   235 of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1}
   383 Let \texttt{q} run on these unprocessed parts
   236 stands for the unprocessed parts left over by \texttt{p}. Let
   384 producing again a set of pairs. The output of the sequence parser combinator is then a set
   237 \texttt{q} run on these unprocessed parts producing again a
   385 containing pairs where the first components are again pairs, namely what the first parser could parse
   238 set of pairs. The output of the sequence parser combinator is
   386 together with what the second parser could parse; the second component is the unprocessed
   239 then a set containing pairs where the first components are
   387 part left over after running the second parser \texttt{q}. Therefore the input type of
   240 again pairs, namely what the first parser could parse together
   388 the sequence parser combinator is as usual \texttt{I}, but the output type is
   241 with what the second parser could parse; the second component
       
   242 is the unprocessed part left over after running the second
       
   243 parser \texttt{q}. Therefore the input type of the sequence
       
   244 parser combinator is as usual \texttt{I}, but the output type
       
   245 is
   389 
   246 
   390 \begin{center}
   247 \begin{center}
   391 \texttt{Set[((T, S), I)]}
   248 \texttt{Set[((T, S), I)]}
   392 \end{center}
   249 \end{center}
   393 
   250 
   394 Scala allows us to provide some
   251 Scala allows us to provide some shorthand notation for the
   395 shorthand notation for the sequence parser combinator. So we can write for 
   252 sequence parser combinator. So we can write for example
   396 example \texttt{'a'  $\sim$ 'b'}, which is the
   253 \texttt{'a' $\sim$ 'b'}, which is the parser combinator that
   397 parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
   254 first consumes the character \texttt{a} from a string and then
   398 Calling this parser combinator with the strings
   255 \texttt{b}. Three examples of this parser combinator are as
   399 
   256 follows:
   400 \begin{center}
   257 
   401 \begin{tabular}{rcl}
   258 \begin{center}
   402 input string & & output\medskip\\
   259 \begin{tabular}{rcl}
       
   260 input strings & & output\medskip\\
   403 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   261 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   404 \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
   262 \texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
   405 \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
   263 \texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
   406 \end{tabular}
   264 \end{tabular}
   407 \end{center}
   265 \end{center}
   408 
   266 
   409 \noindent
   267 \noindent A slightly more complicated parser is \texttt{('a'
   410 A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
   268 || 'b') $\sim$ 'b'} which parses as first character either an
   411 an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
   269 \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
   412 
   270 produces the following results.
   413 \begin{center}
   271 
   414 \begin{tabular}{rcl}
   272 \begin{center}
   415 input string & & output\medskip\\
   273 \begin{tabular}{rcl}
       
   274 input strings & & output\medskip\\
   416 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   275 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   417 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   276 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   418 \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
   277 \texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$
   419 \end{tabular}
   278 \end{tabular}
   420 \end{center}
   279 \end{center}
   421 
   280 
   422 Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
   281 \noindent Two more examples: first consider the parser \pcode{('a' ~
   423 The first parser has as output type a single character (recall the type of \texttt{CharParser}),
   282 'a') ~ 'a'} and the input \pcode{aaaa}:
   424 but the second parser produces a pair of characters as output. The alternative parser is however
   283 
   425 required to have both component parsers to have the same type. We will see later how we can 
   284 \begin{center}
   426 build this parser without the typing error.
   285 \begin{tabular}{rcl}
   427 
   286 input string & & output\medskip\\
   428 The next parser combinator does not actually combine smaller parsers, but applies
   287 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
   429 a function to the result of the parser. It is implemented in Scala as follows
   288 $\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
       
   289 \end{tabular}
       
   290 \end{center}
       
   291 
       
   292 \noindent Notice how the results nest deeper as pairs (the
       
   293 last \pcode{a} is in the unprocessed part). To consume
       
   294 everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~
       
   295 'a'}. Then the output is as follows:
       
   296 
       
   297 \begin{center}
       
   298 \begin{tabular}{rcl}
       
   299 input string & & output\medskip\\
       
   300 \texttt{\Grid{aaaa}} & $\rightarrow$ & 
       
   301 $\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
       
   302 \end{tabular}
       
   303 \end{center}
       
   304 
       
   305 \noindent This is an instance where the parser consumed
       
   306 completely the input, meaning the unprocessed part is just the
       
   307 empty string.
       
   308 
       
   309 
       
   310 Note carefully that constructing a parser such \texttt{'a' ||
       
   311 ('a' $\sim$ 'b')} will result in a tying error. The first
       
   312 parser has as output type a single character (recall the type
       
   313 of \texttt{CharParser}), but the second parser produces a pair
       
   314 of characters as output. The alternative parser is however
       
   315 required to have both component parsers to have the same type.
       
   316 We will see later how we can build this parser without the
       
   317 typing error.
       
   318 
       
   319 The next parser combinator does not actually combine smaller
       
   320 parsers, but applies a function to the result of the parser.
       
   321 It is implemented in Scala as follows
   430 
   322 
   431 \begin{center}
   323 \begin{center}
   432 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   324 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   433 class FunParser[I, T, S]
   325 class FunParser[I, T, S]
   434          (p: => Parser[I, T], 
   326          (p: => Parser[I, T],