--- a/handouts/ho06.tex Mon Nov 23 16:45:07 2015 +0000
+++ b/handouts/ho06.tex Wed Nov 25 15:59:32 2015 +0000
@@ -7,261 +7,78 @@
\section*{Handout 6 (Parser Combinators)}
-While regular expressions are very useful for lexing and for recognising
-many patterns in strings (like email addresses), they have their limitations. For
-example there is no regular expression that can recognise the language
-$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised
-expressions. In languages like Lisp, which use parentheses rather
-extensively, it might be of interest whether the following two expressions
-are well-parenthesised (the left one is, the right one is not):
-
-\begin{center}
-$(((()()))())$ \hspace{10mm} $(((()()))()))$
-\end{center}
-
-\noindent
-Not being able to solve such recognition problems is a serious limitation.
-In order to solve such recognition problems, we need more powerful
-techniques than regular expressions. We will in particular look at \emph{context-free
-languages}. They include the regular languages as the picture below shows:
-
-
-\begin{center}
-\begin{tikzpicture}
-[rect/.style={draw=black!50,
- top color=white,bottom color=black!20,
- rectangle, very thick, rounded corners}]
-
-\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
-\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
-\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
-\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
-\draw (0,-1.05) node [rect] {\small regular languages};
-\end{tikzpicture}
-\end{center}
-
-\noindent
-Context-free languages play an important role in `day-to-day' text processing and in
-programming languages. Context-free languages are usually specified by grammars.
-For example a grammar for well-parenthesised expressions is
-
-\begin{center}
-$P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$
-\end{center}
-
-\noindent
-In general grammars consist of finitely many rules built up from \emph{terminal symbols}
-(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules
-have the shape
+In what follows we explain \emph{parser combinators}, because
+they are very easy to implement. However, they only work when
+the grammar to be parsed is \emph{not} left-recursive and they
+are efficient only when the grammar is unambiguous. It is the
+responsibility of the grammar designer to ensure these two
+properties.
-\begin{center}
-$NT \;\;\rightarrow\;\; \textit{rhs}$
-\end{center}
-
-\noindent
-where on the left-hand side is a single non-terminal and on the right a string consisting
-of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
-empty string. We use the convention to separate components on
-the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions.
-We also use the convention to use $|$ as a shorthand notation for several rules. For example
-
-\begin{center}
-$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
-\end{center}
-
-\noindent
-means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
-If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
-what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
-can be given as follows
-
-\begin{center}
-\begin{tabular}{lcl}
-$E$ & $\rightarrow$ & $N$ \\
-$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\
-$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\
-$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\
-$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\
-$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$
-\end{tabular}
-\end{center}
-
-\noindent
-where $E$ is the starting symbol. A \emph{derivation} for a grammar
-starts with the staring symbol of the grammar and in each step replaces one
-non-terminal by a right-hand side of a rule. A derivation ends with a string
-in which only terminal symbols are left. For example a derivation for the
-string $(1 + 2) + 3$ is as follows:
-
-\begin{center}
-\begin{tabular}{lll}
-$E$ & $\rightarrow$ & $E+E$\\
- & $\rightarrow$ & $(E)+E$\\
- & $\rightarrow$ & $(E+E)+E$\\
- & $\rightarrow$ & $(E+E)+N$\\
- & $\rightarrow$ & $(E+E)+3$\\
- & $\rightarrow$ & $(N+E)+3$\\
- & $\rightarrow^+$ & $(1+2)+3$\\
-\end{tabular}
-\end{center}
-
-\noindent
-The \emph{language} of a context-free grammar $G$ with start symbol $S$
-is defined as the set of strings derivable by a derivation, that is
+Parser combinators can deal with any kind of input as long as
+this input is a kind of sequence, for example a string or a
+list of tokens. If the input are lists of tokens, then parser
+combinators transform them into sets of pairs, like so
\begin{center}
-$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
-\end{center}
-
-\noindent
-A \emph{parse-tree} encodes how a string is derived with the starting symbol on
-top and each non-terminal containing a subtree for how it is replaced in a derivation.
-The parse tree for the string $(1 + 23)+4$ is as follows:
-
-\begin{center}
-\begin{tikzpicture}[level distance=8mm, black]
- \node {$E$}
- child {node {$E$}
- child {node {$($}}
- child {node {$E$}
- child {node {$E$} child {node {$N$} child {node {$1$}}}}
- child {node {$+$}}
- child {node {$E$}
- child {node {$N$} child {node {$2$}}}
- child {node {$N$} child {node {$3$}}}
- }
- }
- child {node {$)$}}
- }
- child {node {$+$}}
- child {node {$E$}
- child {node {$N$} child {node {$4$}}}
- };
-\end{tikzpicture}
-\end{center}
-
-\noindent
-We are often interested in these parse-trees since they encode the structure of
-how a string is derived by a grammar. Before we come to the problem of constructing
-such parse-trees, we need to consider the following two properties of grammars.
-A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
-$NT$ which leads to a string which again starts with $NT$. This means a derivation of the
-form.
-
-\begin{center}
-$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
-\end{center}
-
-\noindent
-It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
-for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$
-show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive
-grammars. Fortunately every left-recursive grammar can be transformed into one that is
-not left-recursive, although this transformation might make the grammar less human-readable.
-For example if we want to give a non-left-recursive grammar for numbers we might
-specify
-
-\begin{center}
-$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
-\end{center}
+$\underbrace{\text{list of tokens}}_{\text{input}}$
+$\Rightarrow$
+$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
+\end{center}
-\noindent
-Using this grammar we can still derive every number string, but we will never be able
-to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
-
-The other property we have to watch out for is when a grammar is
-\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
-for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
-While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
-general. For example there are two parse
-trees for the string $1 + 2 + 3$, namely
-
-\begin{center}
-\begin{tabular}{c@{\hspace{10mm}}c}
-\begin{tikzpicture}[level distance=8mm, black]
- \node {$E$}
- child {node {$E$} child {node {$N$} child {node {$1$}}}}
- child {node {$+$}}
- child {node {$E$}
- child {node {$E$} child {node {$N$} child {node {$2$}}}}
- child {node {$+$}}
- child {node {$E$} child {node {$N$} child {node {$3$}}}}
- }
- ;
-\end{tikzpicture}
-&
-\begin{tikzpicture}[level distance=8mm, black]
- \node {$E$}
- child {node {$E$}
- child {node {$E$} child {node {$N$} child {node {$1$}}}}
- child {node {$+$}}
- child {node {$E$} child {node {$N$} child {node {$2$}}}}
- }
- child {node {$+$}}
- child {node {$E$} child {node {$N$} child {node {$3$}}}}
- ;
-\end{tikzpicture}
-\end{tabular}
-\end{center}
-
-\noindent
-In particular in programming languages we will try to avoid ambiguous
-grammars because two different parse-trees for a string mean a program can
-be interpreted in two different ways. In such cases we have to somehow make sure
-the two different ways do not matter, or disambiguate the grammar in
-some other way (for example making the $+$ left-associative). Unfortunately already
-the problem of deciding whether a grammar
-is ambiguous or not is in general undecidable.
-
-Let us now turn to the problem of generating a parse-tree for a grammar and string.
-In what follows we explain \emph{parser combinators}, because they are easy
-to implement and closely resemble grammar rules. Imagine that a grammar
-describes the strings of natural numbers, such as the grammar $N$ shown above.
-For all such strings we want to generate the parse-trees or later on we actually
-want to extract the meaning of these strings, that is the concrete integers ``behind''
-these strings. In Scala the parser combinators will be functions of type
+\noindent In Scala parser combinators will therefore be
+functions of type
\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}
-\noindent
-that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
-and return a set of pairs. The first component of these pairs corresponds to what the
-parser combinator was able to process from the input and the second is the unprocessed
-part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
-with the idea that there are potentially several ways how to interpret the input. As a concrete
-example, consider the case where the input is of type string, say the string
+\noindent that is they take as input something of type
+\texttt{I} and return a set of pairs. The first component of
+these pairs corresponds to what the parser combinator was able
+to process from the input and the second is the unprocessed
+part of the input. As we shall see shortly, a parser
+combinator might return more than one such pair, with the idea
+that there are potentially several ways how to interpret the
+input. To simplify matters we will first look at the case
+where the input to the parser combinators is just strings. As
+a concrete example, consider the case where the input is the
+string
\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}
-\noindent
-We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
-an identifier (\texttt{iffoo}). Then the output will be the set
+\noindent We might have a parser combinator which tries to
+interpret this string as a keyword (\texttt{if}) or an
+identifier (\texttt{iffoo}). Then the output will be the set
\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
\left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}
-\noindent
-where the first pair means the parser could recognise \texttt{if} from the input and leaves
-the rest as `unprocessed' as the second component of the pair; in the other case
-it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
-cannot recognise anything from the input then parser combinators just return the empty
-set $\varnothing$. This will indicate something ``went wrong''.
+\noindent where the first pair means the parser could
+recognise \texttt{if} from the input and leaves the rest as
+`unprocessed' as the second component of the pair; in the
+other case it could recognise \texttt{iffoo} and leaves
+\texttt{\VS testbar} as unprocessed. If the parser cannot
+recognise anything from the input then parser combinators just
+return the empty set $\{\}$. This will indicate something
+``went wrong''\ldots or more precisely, nothing could be
+parsed.
-The main attraction is that we can easily build parser combinators out of smaller components
-following very closely the structure of a grammar. In order to implement this in an object
-oriented programming language, like Scala, we need to specify an abstract class for parser
-combinators. This abstract class requires the implementation of the function
-\texttt{parse} taking an argument of type \texttt{I} and returns a set of type
-\mbox{\texttt{Set[(T, I)]}}.
+The main attraction of parser combinators is that we can
+easily build parser combinators out of smaller components
+following very closely the structure of a grammar. In order to
+implement this in an object-oriented programming language,
+like Scala, we need to specify an abstract class for parser
+combinators. This abstract class requires the implementation
+of the function \texttt{parse} taking an argument of type
+\texttt{I} and returns a set of type \mbox{\texttt{Set[(T,
+I)]}}.
\begin{center}
-\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+\begin{lstlisting}[language=Scala]
abstract class Parser[I, T] {
def parse(ts: I): Set[(T, I)]
@@ -272,19 +89,23 @@
\end{lstlisting}
\end{center}
-\noindent
-From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
-which just filters out all pairs whose second component is not empty (that is has still some
-unprocessed part). The reason is that at the end of parsing we are only interested in the
-results where all the input has been consumed and no unprocessed part is left.
+\noindent From the function \texttt{parse} we can then
+``centrally'' derive the function \texttt{parse\_all}, which
+just filters out all pairs whose second component is not empty
+(that is has still some unprocessed part). The reason is that
+at the end of parsing we are only interested in the results
+where all the input has been consumed and no unprocessed part
+is left.
-One of the simplest parser combinators recognises just a character, say $c$,
-from the beginning of strings. Its behaviour is as follows:
+One of the simplest parser combinators recognises just a
+character, say $c$, from the beginning of strings. Its
+behaviour can be described as follows:
\begin{itemize}
\item if the head of the input string starts with a $c$, it returns
- the set $\{(c, \textit{tail of}\; s)\}$
-\item otherwise it returns the empty set $\varnothing$
+ the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of}
+ $s$ is the unprocessed part of the input string
+\item otherwise it returns the empty set $\{\}$
\end{itemize}
\noindent
@@ -301,19 +122,30 @@
\end{lstlisting}
\end{center}
+\noindent The \texttt{parse} function tests whether the first
+character of the input string \texttt{sb} is equal to
+\texttt{c}. If yes, then it splits the string into the
+recognised part \texttt{c} and the unprocessed part
+\texttt{sb.tail}. In case \texttt{sb} does not start with
+\texttt{c} then the parser returns the empty set (in Scala
+\texttt{Set()}).
+
+
+More interesting are the parser combinators that build larger
+parsers out of smaller component parsers. For example the
+alternative parser combinator is as follows: given two
+parsers, say, $p$ and $q$, then we apply both parsers to the
+input (remember parsers are functions) and combine the input
+
+\begin{center}
+$p(\text{input}) \cup q(\text{input})$
+\end{center}
+
\noindent
-The \texttt{parse} function tests whether the first character of the
-input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
-string into the recognised part \texttt{c} and the unprocessed part
-\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
-the parser returns the empty set (in Scala \texttt{Set()}).
-
-More interesting are the parser combinators that build larger parsers
-out of smaller component parsers. For example the alternative
-parser combinator is as follows.
+In Scala we would implement alternative parsers as follows
\begin{center}
-\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+\begin{lstlisting}[language=Scala, numbers=none]
class AltParser[I, T]
(p: => Parser[I, T],
q: => Parser[I, T]) extends Parser[I, T] {
@@ -322,49 +154,70 @@
\end{lstlisting}
\end{center}
-\noindent
-The types of this parser combinator are polymorphic (we just have \texttt{I}
-for the input type, and \texttt{T} for the output type). The alternative parser
-builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
-Both need to be able to process input of type \texttt{I} and return the same
-output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the
-\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
-evaluation of the arguments before they are used. This is often called
-\emph{lazy evaluation} of the arguments.) The alternative parser should run
-the input with the first parser \texttt{p} (producing a set of outputs) and then
-run the same input with \texttt{q}. The result should be then just the union
-of both sets, which is the operation \texttt{++} in Scala.
+\noindent The types of this parser combinator are polymorphic
+(we just have \texttt{I} for the input type, and \texttt{T}
+for the output type). The alternative parser builds a new
+parser out of two existing parser combinator \texttt{p} and
+\texttt{q}. Both need to be able to process input of type
+\texttt{I} and return the same output type \texttt{Set[(T,
+I)]}. (There is an interesting detail of Scala, namely the
+\texttt{=>} in front of the types of \texttt{p} and
+\texttt{q}. They will prevent the evaluation of the arguments
+before they are used. This is often called \emph{lazy
+evaluation} of the arguments.) The alternative parser should
+run the input with the first parser \texttt{p} (producing a
+set of outputs) and then run the same input with \texttt{q}.
+The result should be then just the union of both sets, which
+is the operation \texttt{++} in Scala.
-This parser combinator already allows us to construct a parser that either
-a character \texttt{a} or \texttt{b}, as
+This parser combinator already allows us to construct a parser
+that either a character \texttt{a} or \texttt{b}, as
\begin{center}
-\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+\begin{lstlisting}[language=Scala, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}
-\noindent
-Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}.
-We can call this parser combinator with the strings
+\noindent Scala allows us to introduce some more readable
+shorthand notation for this, like \texttt{'a' || 'b'}. We can
+call this parser combinator with the strings
\begin{center}
\begin{tabular}{rcl}
-input string & & output\medskip\\
+input strings & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
-\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
+\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$
+\end{tabular}
+\end{center}
+
+\noindent We receive in the first two cases a successful
+output (that is a non-empty set). In each case, either
+\pcode{a} or \pcode{b} are the processed parts, and \pcode{c}
+is the unprocessed part. Clearly this parser cannot parse
+anything with the string \pcode{cc}.
+
+A bit more interesting is the \emph{sequence parser
+combinator}. Given two parsers, say, $p$ and $q$,
+apply first the input to $p$ producing a set of pairs;
+then apply $q$ to the unparsed parts; then combine the
+results like
+
+\begin{center}
+\begin{tabular}{lcl}
+$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ &
+$(\textit{output}_1, u_1) \in p(\text{input})
+\;\wedge\;$\\
+&& $(\textit{output}_2, u_2) \in q(u_1)\}$
\end{tabular}
\end{center}
\noindent
-We receive in the first two cases a successful output (that is a non-empty set).
-
-A bit more interesting is the \emph{sequence parser combinator} implemented in
-Scala as follows:
+This can be implemented in Scala as follows:
\begin{center}
-\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+\begin{lstlisting}[language=Scala,numbers=none]
class SeqParser[I, T, S]
(p: => Parser[I, T],
q: => Parser[I, S]) extends Parser[I, (T, S)] {
@@ -376,57 +229,96 @@
\end{lstlisting}
\end{center}
-\noindent
-This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse}
-as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
-The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}.
-Let \texttt{q} run on these unprocessed parts
-producing again a set of pairs. The output of the sequence parser combinator is then a set
-containing pairs where the first components are again pairs, namely what the first parser could parse
-together with what the second parser could parse; the second component is the unprocessed
-part left over after running the second parser \texttt{q}. Therefore the input type of
-the sequence parser combinator is as usual \texttt{I}, but the output type is
+\noindent This parser takes as input two parsers, \texttt{p}
+and \texttt{q}. It implements \texttt{parse} as follows: let
+first run the parser \texttt{p} on the input producing a set
+of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1}
+stands for the unprocessed parts left over by \texttt{p}. Let
+\texttt{q} run on these unprocessed parts producing again a
+set of pairs. The output of the sequence parser combinator is
+then a set containing pairs where the first components are
+again pairs, namely what the first parser could parse together
+with what the second parser could parse; the second component
+is the unprocessed part left over after running the second
+parser \texttt{q}. Therefore the input type of the sequence
+parser combinator is as usual \texttt{I}, but the output type
+is
\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}
-Scala allows us to provide some
-shorthand notation for the sequence parser combinator. So we can write for
-example \texttt{'a' $\sim$ 'b'}, which is the
-parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
-Calling this parser combinator with the strings
+Scala allows us to provide some shorthand notation for the
+sequence parser combinator. So we can write for example
+\texttt{'a' $\sim$ 'b'}, which is the parser combinator that
+first consumes the character \texttt{a} from a string and then
+\texttt{b}. Three examples of this parser combinator are as
+follows:
+
+\begin{center}
+\begin{tabular}{rcl}
+input strings & & output\medskip\\
+\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\
+\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$
+\end{tabular}
+\end{center}
+
+\noindent A slightly more complicated parser is \texttt{('a'
+|| 'b') $\sim$ 'b'} which parses as first character either an
+\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser
+produces the following results.
+
+\begin{center}
+\begin{tabular}{rcl}
+input strings & & output\medskip\\
+\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
+\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$
+\end{tabular}
+\end{center}
+
+\noindent Two more examples: first consider the parser \pcode{('a' ~
+'a') ~ 'a'} and the input \pcode{aaaa}:
\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
-\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
-\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
-\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
+\texttt{\Grid{aaaa}} & $\rightarrow$ &
+$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\
\end{tabular}
\end{center}
-\noindent
-A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character either
-an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
+\noindent Notice how the results nest deeper as pairs (the
+last \pcode{a} is in the unprocessed part). To consume
+everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~
+'a'}. Then the output is as follows:
\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
-\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
-\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
-\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
+\texttt{\Grid{aaaa}} & $\rightarrow$ &
+$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\
\end{tabular}
\end{center}
-Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
-The first parser has as output type a single character (recall the type of \texttt{CharParser}),
-but the second parser produces a pair of characters as output. The alternative parser is however
-required to have both component parsers to have the same type. We will see later how we can
-build this parser without the typing error.
+\noindent This is an instance where the parser consumed
+completely the input, meaning the unprocessed part is just the
+empty string.
+
-The next parser combinator does not actually combine smaller parsers, but applies
-a function to the result of the parser. It is implemented in Scala as follows
+Note carefully that constructing a parser such \texttt{'a' ||
+('a' $\sim$ 'b')} will result in a tying error. The first
+parser has as output type a single character (recall the type
+of \texttt{CharParser}), but the second parser produces a pair
+of characters as output. The alternative parser is however
+required to have both component parsers to have the same type.
+We will see later how we can build this parser without the
+typing error.
+
+The next parser combinator does not actually combine smaller
+parsers, but applies a function to the result of the parser.
+It is implemented in Scala as follows
\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
--- a/progs/token2.scala Mon Nov 23 16:45:07 2015 +0000
+++ b/progs/token2.scala Wed Nov 25 15:59:32 2015 +0000
@@ -13,6 +13,7 @@
case class CRANGE(cs: String) extends Rexp
case class PLUS(r: Rexp) extends Rexp
case class OPT(r: Rexp) extends Rexp
+case class NTIMES(r: Rexp, n: Int) extends Rexp
abstract class Val
case object Empty extends Val
@@ -59,6 +60,7 @@
case CRANGE(_) => false
case PLUS(r) => nullable(r)
case OPT(_) => true
+ case NTIMES(r, n) => if (n == 0) true else nullable(r)
}
// derivative of a regular expression w.r.t. a character
@@ -75,6 +77,7 @@
case CRANGE(cs) => if (cs.contains(c)) EMPTY else NULL
case PLUS(r) => SEQ(der(c, r), STAR(r))
case OPT(r) => ALT(der(c, r), NULL)
+ case NTIMES(r, n) => if (n == 0) NULL else der(c, SEQ(r, NTIMES(r, n - 1)))
}
// derivative w.r.t. a string (iterates der)
@@ -115,6 +118,9 @@
case RECD(x, r) => Rec(x, mkeps(r))
case PLUS(r) => Stars(List(mkeps(r)))
case OPT(_) => Right(Empty)
+ case NTIMES(r, n) => if (n == 0) Stars(Nil) else Stars(Nil.padTo(n, mkeps(r)))
+ case _ => { println ("r : " + r.toString);
+ throw new Exception("mkeps error")}
}
@@ -130,6 +136,11 @@
case (RECD(x, r1), _) => Rec(x, inj(r1, c, v))
case (PLUS(r), Seq(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs)
case (OPT(r), Left(v1)) => Left(inj(r, c, v1))
+ case (NTIMES(r, n), Seq(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs)
+ case (NTIMES(r, n), Left(Seq(v1, Stars(vs)))) => Stars(inj(r, c, v1)::vs)
+ case (NTIMES(r, n), Right(Stars(v::vs))) => Stars(mkeps(r)::inj(r, c, v)::vs)
+ case _ => { println ("r : " + r.toString + " v: " + v.toString);
+ throw new Exception("inj error")}
}
// main lexing function (produces a value)
@@ -240,7 +251,6 @@
def tokenise(r: Rexp, s: String) =
env(lexing_simp(r, s)).filterNot { (s) => s._1 == "w"}.mkString("\n")
-
// Testing
//============
@@ -252,6 +262,11 @@
result
}
+println(lexing_simp(OPT("1"), "1"))
+println(lexing_simp(OPT("1"), ""))
+println(ders("111".toList, NTIMES("1",3)))
+println(lexing_simp(NTIMES("1",3), "111"))
+
val r1 = ("a" | "ab") ~ ("bcd" | "c")
println(lexing(r1, "abcd"))