\documentclass{article}
\usepackage{../style}
\usepackage{../langs}


\begin{document}

\section*{Handout 6 (Parser Combinators)}

While regular expressions are very useful for lexing and for recognising
many patterns in strings (like email addresses), they have their limitations. For
example there is no regular expression that can recognise the language
$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised
expressions. In languages like Lisp, which use parentheses rather
extensively, it might be of interest whether the following two expressions
are well-parenthesised (the left one is, the right one is not):

\begin{center}
$(((()()))())$ \hspace{10mm} $(((()()))()))$
\end{center}

\noindent
Not being able to solve such recognition problems is a serious limitation.
In order to solve them, we need more powerful
techniques than regular expressions. We will in particular look at \emph{context-free
languages}. They include the regular languages, as the picture below shows:


\begin{center}
\begin{tikzpicture}
  [rect/.style={draw=black!50,
    top color=white,bottom color=black!20,
    rectangle, very thick, rounded corners}]

\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
\draw (0,-1.05) node [rect] {\small regular languages};
\end{tikzpicture}
\end{center}
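Although no single regular expression can recognise well-parenthesised expressions, a short recursive program can. The following sketch is only an illustration (the helper \texttt{balanced} is our own, not part of the handout): it decides the property for the two bracket strings above by scanning once and keeping a nesting counter.

```scala
// Illustrative sketch, not from the handout: decide well-parenthesisedness
// by tracking the nesting depth while scanning the string once.
def balanced(s: String): Boolean = {
  def go(cs: List[Char], depth: Int): Boolean = cs match {
    case Nil         => depth == 0                     // every bracket closed?
    case '(' :: rest => go(rest, depth + 1)
    case ')' :: rest => depth > 0 && go(rest, depth - 1)
    case _   :: rest => go(rest, depth)
  }
  go(s.toList, 0)
}
```

For the two example strings, \texttt{balanced} returns true for the left one and false for the right one (the extra closing bracket drives the counter negative).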
|

\noindent
Context-free languages play an important role in `day-to-day' text processing and in
programming languages. Context-free languages are usually specified by grammars.
For example a grammar for well-parenthesised expressions is

\begin{center}
$P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$
\end{center}
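This rule can be read off almost literally as a recursive function, which already foreshadows the parser combinators introduced later. The sketch below is our own illustration (the helper \texttt{parseP} is hypothetical, not part of the handout): it returns the set of suffixes that remain after consuming a $P$-prefix, one alternative per branch of the rule.

```scala
// Illustrative sketch, not from the handout: the rule
//   P  ->  ( . P . ) . P  |  epsilon
// read as a function returning all suffixes left after consuming a P.
def parseP(s: List[Char]): Set[List[Char]] = {
  val viaBrackets =
    for (r1 <- if (s.startsWith(List('('))) Set(s.tail) else Set[List[Char]]();
         r2 <- parseP(r1);                                  // inner P
         r3 <- if (r2.startsWith(List(')'))) Set(r2.tail) else Set[List[Char]]();
         r4 <- parseP(r3))                                  // trailing P
    yield r4
  viaBrackets + s  // the epsilon alternative consumes nothing
}

def wellParenthesised(s: String): Boolean = parseP(s.toList).contains(Nil)
```

A string is well-parenthesised exactly when the empty rest \texttt{Nil} is among the results, that is, when the whole input can be consumed as one $P$. The recursion terminates because the bracket branch always consumes at least the two bracket characters before recursing.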
|

\noindent
In general grammars consist of finitely many rules built up from \emph{terminal symbols}
(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules
have the shape

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}$
\end{center}

\noindent
where on the left-hand side is a single non-terminal and on the right a string consisting
of both terminals and non-terminals, including the $\epsilon$-symbol for indicating the
empty string. We use the convention to separate components on
the right-hand side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions.
We also use the convention to use $|$ as a shorthand notation for several rules. For example

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
\end{center}

\noindent
means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
If there is more than one non-terminal on the left-hand side of the rules, then we need to indicate
what the \emph{starting} symbol of the grammar is. For example the grammar for arithmetic expressions
can be given as follows

\begin{center}
\begin{tabular}{lcl}
$E$ & $\rightarrow$ & $N$ \\
$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\
$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\
$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\
$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\
$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\; \ldots \;|\; 9$
\end{tabular}
\end{center}

\noindent
where $E$ is the starting symbol. A \emph{derivation} for a grammar
starts with the starting symbol of the grammar and in each step replaces one
non-terminal by a right-hand side of a rule. A derivation ends with a string
in which only terminal symbols are left. For example a derivation for the
string $(1 + 2) + 3$ is as follows:

\begin{center}
\begin{tabular}{lll}
$E$ & $\rightarrow$ & $E+E$\\
    & $\rightarrow$ & $(E)+E$\\
    & $\rightarrow$ & $(E+E)+E$\\
    & $\rightarrow$ & $(E+E)+N$\\
    & $\rightarrow$ & $(E+E)+3$\\
    & $\rightarrow$ & $(N+E)+3$\\
    & $\rightarrow^+$ & $(1+2)+3$\\
\end{tabular}
\end{center}

\noindent
The \emph{language} of a context-free grammar $G$ with start symbol $S$
is defined as the set of strings derivable by a derivation, that is

\begin{center}
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
\end{center}

\noindent
A \emph{parse-tree} encodes how a string is derived, with the starting symbol on
top and each non-terminal containing a subtree for how it is replaced in a derivation.
The parse tree for the string $(1 + 23)+4$ is as follows:

\begin{center}
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$}
    child {node {$($}}
    child {node {$E$}
      child {node {$E$} child {node {$N$} child {node {$1$}}}}
      child {node {$+$}}
      child {node {$E$}
        child {node {$N$} child {node {$2$}}}
        child {node {$N$} child {node {$3$}}}
      }
    }
    child {node {$)$}}
  }
  child {node {$+$}}
  child {node {$E$}
    child {node {$N$} child {node {$4$}}}
  };
\end{tikzpicture}
\end{center}

\noindent
We are often interested in these parse-trees since they encode the structure of
how a string is derived by a grammar. Before we come to the problem of constructing
such parse-trees, we need to consider the following two properties of grammars.
A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
$NT$, which leads to a string which again starts with $NT$. This means a derivation of the
form:

\begin{center}
$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
\end{center}

\noindent
It can easily be seen that the grammar above for arithmetic expressions is left-recursive:
for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$
lead to such derivations. Some algorithms cannot cope with left-recursive
grammars. Fortunately every left-recursive grammar can be transformed into one that is
not left-recursive, although this transformation might make the grammar less human-readable.
For example if we want to give a non-left-recursive grammar for numbers we might
specify

\begin{center}
$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;0\cdot N\;|\;1\cdot N\;|\;\ldots\;|\;9\cdot N$
\end{center}

\noindent
Using this grammar we can still derive every number string, but we will never be able
to have a derivation of the form $N \rightarrow \ldots \rightarrow N \cdot \ldots$.

The other property we have to watch out for is when a grammar is
\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
general. For example there are two parse
trees for the string $1 + 2 + 3$, namely

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$} child {node {$N$} child {node {$1$}}}}
  child {node {$+$}}
  child {node {$E$}
    child {node {$E$} child {node {$N$} child {node {$2$}}}}
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$3$}}}}
  }
;
\end{tikzpicture}
&
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$}
    child {node {$E$} child {node {$N$} child {node {$1$}}}}
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$2$}}}}
  }
  child {node {$+$}}
  child {node {$E$} child {node {$N$} child {node {$3$}}}}
;
\end{tikzpicture}
\end{tabular}
\end{center}

\noindent
In particular in programming languages we will try to avoid ambiguous
grammars because two different parse-trees for a string mean a program can
be interpreted in two different ways. In such cases we have to somehow make sure
the two different ways do not matter, or disambiguate the grammar in
some other way (for example by making the $+$ left-associative). Unfortunately
the problem of deciding whether a grammar
is ambiguous or not is in general undecidable.

Let us now turn to the problem of generating a parse-tree for a grammar and a string.
In what follows we explain \emph{parser combinators}, because they are easy
to implement and closely resemble grammar rules. Imagine that a grammar
describes the strings of natural numbers, such as the grammar $N$ shown above.
For all such strings we want to generate the parse-trees or later on we actually
want to extract the meaning of these strings, that is the concrete integers ``behind''
these strings. In Scala the parser combinators will be functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent
that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
and return a set of pairs. The first component of these pairs corresponds to what the
parser combinator was able to process from the input and the second is the unprocessed
part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
the idea being that there are potentially several ways to interpret the input. As a concrete
example, consider the case where the input is of type string, say the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent
We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
an identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
         \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent
where the first pair means the parser could recognise \texttt{if} from the input and leaves
the rest as the `unprocessed' second component of the pair; in the other case
it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
cannot recognise anything from the input, then parser combinators just return the empty
set $\varnothing$. This will indicate something ``went wrong''.

The main attraction is that we can easily build parser combinators out of smaller components
following very closely the structure of a grammar. In order to implement this in an
object-oriented programming language, like Scala, we need to specify an abstract class for parser
combinators. This abstract class requires the implementation of the function
\texttt{parse}, taking an argument of type \texttt{I} and returning a set of type
\mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I)(implicit ev: I => Seq[_]): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty))
      yield head
}
\end{lstlisting}
\end{center}

\noindent
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
which just filters out all pairs whose second component is not empty (that is, still has some
unprocessed part). The implicit \texttt{ev} merely records the assumption that the input type
can be viewed as a sequence, so that the test \texttt{isEmpty} is available. The reason for the
filtering is that at the end of parsing we are only interested in the
results where all the input has been consumed and no unprocessed part is left.

One of the simplest parser combinators recognises just a character, say $c$,
from the beginning of strings. Its behaviour is as follows:

\begin{itemize}
\item if the input string, say $s$, starts with a $c$, it returns
      the set $\{(c, \textit{tail of}\; s)\}$
\item otherwise it returns the empty set $\varnothing$
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb != "" && sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent
The \texttt{parse} function tests whether the first character of the
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
string into the recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} is empty or does not start with \texttt{c},
the parser returns the empty set (in Scala \texttt{Set()}).

More interesting are the parser combinators that build larger parsers
out of smaller component parsers. For example the alternative
parser combinator is as follows.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T],
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent
The types of this parser combinator are polymorphic (we just have \texttt{I}
for the input type, and \texttt{T} for the output type). The alternative parser
builds a new parser out of two existing parser combinators \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and return the same
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. It prevents the
evaluation of the arguments before they are used. This is often called
\emph{lazy evaluation} of the arguments.) The alternative parser runs
the input with the first parser \texttt{p} (producing a set of outputs) and then
runs the same input with \texttt{q}. The result is then just the union
of both sets, which is the operation \texttt{++} in Scala.

This parser combinator already allows us to construct a parser that recognises either
the character \texttt{a} or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}.
We can call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
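The behaviour in this table can be checked directly. The snippet below is a self-contained sketch of the classes introduced so far (it repeats \texttt{CharParser} with an added emptiness test so that it is also defined on the empty string, and omits \texttt{parse\_all} for brevity):

```scala
// Self-contained demo of CharParser and AltParser (parse_all omitted).
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  // try both alternatives on the same input and take the union
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}

val aOrB = new AltParser(CharParser('a'), CharParser('b'))
```

Calling \texttt{aOrB.parse} on \texttt{"ac"}, \texttt{"bc"} and \texttt{"cc"} reproduces the three rows of the table.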

\noindent
We receive in the first two cases a successful output (that is a non-empty set).

A bit more interesting is the \emph{sequence parser combinator} implemented in
Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T],
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1))
      yield ((head1, head2), tail2)
}
\end{lstlisting}
\end{center}

\noindent
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse}
as follows: first the parser \texttt{p} is run on the input, producing a set of pairs (\texttt{head1}, \texttt{tail1}).
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}.
Then \texttt{q} is run on these unprocessed parts,
producing again a set of pairs. The output of the sequence parser combinator is then a set
containing pairs where the first components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component is the unprocessed
part left over after running the second parser \texttt{q}. Therefore the input type of
the sequence parser combinator is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some
shorthand notation for the sequence parser combinator. So we can write for
example \texttt{'a' $\sim$ 'b'}, which is the
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
Calling this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
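Again this table can be reproduced with the class constructors directly; the following is a self-contained sketch (same hedges as before: \texttt{CharParser} carries an added emptiness test, \texttt{parse\_all} is omitted):

```scala
// Self-contained demo of the sequence parser ('a' followed by 'b').
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S]) extends Parser[I, (T, S)] {
  // run p, then run q on each leftover; pair up the two results
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)
}

val ab = new SeqParser(CharParser('a'), CharParser('b'))
```

On \texttt{"abc"} this yields the single pair of the recognised characters together with the leftover \texttt{"c"}; on \texttt{"bac"} and \texttt{"ccc"} the first component parser already fails, so the whole sequence fails with the empty set.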

\noindent
A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'}, which parses as first character either
an \texttt{a} or a \texttt{b}, followed by a \texttt{b}. This parser produces the following results.

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
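Written out with the class constructors (instead of the \texttt{||} and $\sim$ shorthand, which is not defined in this sketch), the combined parser looks as follows; the snippet is self-contained and repeats the classes from above:

```scala
// Self-contained sketch of ('a' || 'b') ~ 'b' using the raw constructors.
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)
}

// ('a' || 'b') ~ 'b' written out with the class constructors
val abThenB = new SeqParser(
  new AltParser(CharParser('a'), CharParser('b')), CharParser('b'))
```

Running \texttt{abThenB.parse} on \texttt{"abc"}, \texttt{"bbc"} and \texttt{"aac"} reproduces the three rows of the table above.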

Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a typing error.
The first parser has as output type a single character (recall the type of \texttt{CharParser}),
but the second parser produces a pair of characters as output. The alternative parser combinator,
however, requires both component parsers to have the same output type. We will see later how we can
build this parser without the typing error.

The next parser combinator does not actually combine smaller parsers, but applies
a function to the result of a parser. It is implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
       (p: => Parser[I, T],
        f: T => S) extends Parser[I, S] {
  def parse(sb: I) =
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}

\noindent
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
input, as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
produces sets of pairs of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
applies the function \texttt{f} to all the parser outputs. Since this function
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
Again Scala lets us introduce some shorthand notation for this parser combinator.
Therefore we will write \texttt{p ==> f} for it.
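A typical use of this combinator is to attach a \emph{semantic action} to a parser, for example turning a parsed digit character into the integer it denotes. The sketch below is self-contained (our own illustration; the value name \texttt{threeInt} is hypothetical):

```scala
// Self-contained demo of FunParser: map a parsed digit Char to an Int.
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class FunParser[I, T, S](p: => Parser[I, T],
                         f: T => S) extends Parser[I, S] {
  // apply f to every successful output of p, keep the leftovers unchanged
  def parse(sb: I) =
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}

// semantic action: the character '3' becomes the integer 3
val threeInt = new FunParser(CharParser('3'), (c: Char) => c.asDigit)
```

On input \texttt{"3x"} the underlying \texttt{CharParser} succeeds with the character \texttt{'3'}, which the function converts to the integer \texttt{3}; on any other input the failure of \texttt{p} propagates unchanged.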



\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: