# HG changeset patch # User Christian Urban # Date 1448467172 0 # Node ID 7f8516ff408d95e4259f18fa5c078c8ab890d39d # Parent 4629448c1bd92e6181572cdf61c5d271b0e752fe updated diff -r 4629448c1bd9 -r 7f8516ff408d handouts/ho05.pdf Binary file handouts/ho05.pdf has changed diff -r 4629448c1bd9 -r 7f8516ff408d handouts/ho05.tex --- a/handouts/ho05.tex Mon Nov 23 16:45:07 2015 +0000 +++ b/handouts/ho05.tex Wed Nov 25 15:59:32 2015 +0000 @@ -5,7 +5,7 @@ \begin{document} -\section*{Handout 6 (Grammars \& Parser)} +\section*{Handout 5 (Grammars \& Parser)} While regular expressions are very useful for lexing and for recognising many patterns in strings (like email addresses), diff -r 4629448c1bd9 -r 7f8516ff408d handouts/ho06.pdf Binary file handouts/ho06.pdf has changed diff -r 4629448c1bd9 -r 7f8516ff408d handouts/ho06.tex --- a/handouts/ho06.tex Mon Nov 23 16:45:07 2015 +0000 +++ b/handouts/ho06.tex Wed Nov 25 15:59:32 2015 +0000 @@ -7,261 +7,78 @@ \section*{Handout 6 (Parser Combinators)} -While regular expressions are very useful for lexing and for recognising -many patterns in strings (like email addresses), they have their limitations. For -example there is no regular expression that can recognise the language -$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised -expressions. In languages like Lisp, which use parentheses rather -extensively, it might be of interest whether the following two expressions -are well-parenthesised (the left one is, the right one is not): - -\begin{center} -$(((()()))())$ \hspace{10mm} $(((()()))()))$ -\end{center} - -\noindent -Not being able to solve such recognition problems is a serious limitation. -In order to solve such recognition problems, we need more powerful -techniques than regular expressions. We will in particular look at \emph{context-free -languages}. They include the regular languages as the picture below shows: - - -\begin{center} -\begin{tikzpicture} -[rect/.style={draw=black!50, - top color=white,bottom color=black!20, - rectangle, very thick, rounded corners}] - -\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages}; -\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages}; -\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages}; -\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages}; -\draw (0,-1.05) node [rect] {\small regular languages}; -\end{tikzpicture} -\end{center} - -\noindent -Context-free languages play an important role in `day-to-day' text processing and in -programming languages. Context-free languages are usually specified by grammars. -For example a grammar for well-parenthesised expressions is - -\begin{center} -$P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$ -\end{center} - -\noindent -In general grammars consist of finitely many rules built up from \emph{terminal symbols} -(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules -have the shape +In what follows we explain \emph{parser combinators}, because +they are very easy to implement. However, they only work when +the grammar to be parsed is \emph{not} left-recursive and they +are efficient only when the grammar is unambiguous. It is the +responsibility of the grammar designer to ensure these two +properties. -\begin{center} -$NT \;\;\rightarrow\;\; \textit{rhs}$ -\end{center} - -\noindent -where on the left-hand side is a single non-terminal and on the right a string consisting -of both terminals and non-terminals including the $\epsilon$-symbol for indicating the -empty string. We use the convention to separate components on -the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions. -We also use the convention to use $|$ as a shorthand notation for several rules. For example - -\begin{center} -$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$ -\end{center} - -\noindent -means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$. -If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate -what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions -can be given as follows - -\begin{center} -\begin{tabular}{lcl} -$E$ & $\rightarrow$ & $N$ \\ -$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\ -$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\ -$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\ -$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\ -$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ -\end{tabular} -\end{center} - -\noindent -where $E$ is the starting symbol. A \emph{derivation} for a grammar -starts with the staring symbol of the grammar and in each step replaces one -non-terminal by a right-hand side of a rule. A derivation ends with a string -in which only terminal symbols are left. For example a derivation for the -string $(1 + 2) + 3$ is as follows: - -\begin{center} -\begin{tabular}{lll} -$E$ & $\rightarrow$ & $E+E$\\ - & $\rightarrow$ & $(E)+E$\\ - & $\rightarrow$ & $(E+E)+E$\\ - & $\rightarrow$ & $(E+E)+N$\\ - & $\rightarrow$ & $(E+E)+3$\\ - & $\rightarrow$ & $(N+E)+3$\\ - & $\rightarrow^+$ & $(1+2)+3$\\ -\end{tabular} -\end{center} - -\noindent -The \emph{language} of a context-free grammar $G$ with start symbol $S$ -is defined as the set of strings derivable by a derivation, that is +Parser combinators can deal with any kind of input as long as +this input is a kind of sequence, for example a string or a +list of tokens. If the input are lists of tokens, then parser +combinators transform them into sets of pairs, like so \begin{center} -$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ -\end{center} - -\noindent -A \emph{parse-tree} encodes how a string is derived with the starting symbol on -top and each non-terminal containing a subtree for how it is replaced in a derivation. -The parse tree for the string $(1 + 23)+4$ is as follows: - -\begin{center} -\begin{tikzpicture}[level distance=8mm, black] - \node {$E$} - child {node {$E$} - child {node {$($}} - child {node {$E$} - child {node {$E$} child {node {$N$} child {node {$1$}}}} - child {node {$+$}} - child {node {$E$} - child {node {$N$} child {node {$2$}}} - child {node {$N$} child {node {$3$}}} - } - } - child {node {$)$}} - } - child {node {$+$}} - child {node {$E$} - child {node {$N$} child {node {$4$}}} - }; -\end{tikzpicture} -\end{center} - -\noindent -We are often interested in these parse-trees since they encode the structure of -how a string is derived by a grammar. Before we come to the problem of constructing -such parse-trees, we need to consider the following two properties of grammars. -A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say -$NT$ which leads to a string which again starts with $NT$. This means a derivation of the -form. - -\begin{center} -$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$ -\end{center} - -\noindent -It can be easily seen that the grammar above for arithmetic expressions is left-recursive: -for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ -show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive -grammars. Fortunately every left-recursive grammar can be transformed into one that is -not left-recursive, although this transformation might make the grammar less human-readable. -For example if we want to give a non-left-recursive grammar for numbers we might -specify - -\begin{center} -$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$ -\end{center} +$\underbrace{\text{list of tokens}}_{\text{input}}$ +$\Rightarrow$ +$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$ +\end{center} -\noindent -Using this grammar we can still derive every number string, but we will never be able -to derive a string of the form $\ldots \rightarrow N \cdot \ldots$. - -The other property we have to watch out for is when a grammar is -\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees -for one string. Again the grammar for arithmetic expressions shown above is ambiguous. -While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in -general. For example there are two parse -trees for the string $1 + 2 + 3$, namely - -\begin{center} -\begin{tabular}{c@{\hspace{10mm}}c} -\begin{tikzpicture}[level distance=8mm, black] - \node {$E$} - child {node {$E$} child {node {$N$} child {node {$1$}}}} - child {node {$+$}} - child {node {$E$} - child {node {$E$} child {node {$N$} child {node {$2$}}}} - child {node {$+$}} - child {node {$E$} child {node {$N$} child {node {$3$}}}} - } - ; -\end{tikzpicture} -& -\begin{tikzpicture}[level distance=8mm, black] - \node {$E$} - child {node {$E$} - child {node {$E$} child {node {$N$} child {node {$1$}}}} - child {node {$+$}} - child {node {$E$} child {node {$N$} child {node {$2$}}}} - } - child {node {$+$}} - child {node {$E$} child {node {$N$} child {node {$3$}}}} - ; -\end{tikzpicture} -\end{tabular} -\end{center} - -\noindent -In particular in programming languages we will try to avoid ambiguous -grammars because two different parse-trees for a string mean a program can -be interpreted in two different ways. In such cases we have to somehow make sure -the two different ways do not matter, or disambiguate the grammar in -some other way (for example making the $+$ left-associative). Unfortunately already -the problem of deciding whether a grammar -is ambiguous or not is in general undecidable. - -Let us now turn to the problem of generating a parse-tree for a grammar and string. -In what follows we explain \emph{parser combinators}, because they are easy -to implement and closely resemble grammar rules. Imagine that a grammar -describes the strings of natural numbers, such as the grammar $N$ shown above. -For all such strings we want to generate the parse-trees or later on we actually -want to extract the meaning of these strings, that is the concrete integers ``behind'' -these strings. In Scala the parser combinators will be functions of type +\noindent In Scala parser combinators will therefore be +functions of type \begin{center} \texttt{I $\Rightarrow$ Set[(T, I)]} \end{center} -\noindent -that is they take as input something of type \texttt{I}, typically a list of tokens or a string, -and return a set of pairs. The first component of these pairs corresponds to what the -parser combinator was able to process from the input and the second is the unprocessed -part of the input. As we shall see shortly, a parser combinator might return more than one such pair, -with the idea that there are potentially several ways how to interpret the input. As a concrete -example, consider the case where the input is of type string, say the string +\noindent that is they take as input something of type +\texttt{I} and return a set of pairs. The first component of +these pairs corresponds to what the parser combinator was able +to process from the input and the second is the unprocessed +part of the input. As we shall see shortly, a parser +combinator might return more than one such pair, with the idea +that there are potentially several ways how to interpret the +input. To simplify matters we will first look at the case +where the input to the parser combinators is just strings. As +a concrete example, consider the case where the input is the +string \begin{center} \tt\Grid{iffoo\VS testbar} \end{center} -\noindent -We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or -an identifier (\texttt{iffoo}). Then the output will be the set +\noindent We might have a parser combinator which tries to +interpret this string as a keyword (\texttt{if}) or an +identifier (\texttt{iffoo}). Then the output will be the set \begin{center} $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ \end{center} -\noindent -where the first pair means the parser could recognise \texttt{if} from the input and leaves -the rest as `unprocessed' as the second component of the pair; in the other case -it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser -cannot recognise anything from the input then parser combinators just return the empty -set $\varnothing$. This will indicate something ``went wrong''. +\noindent where the first pair means the parser could +recognise \texttt{if} from the input and leaves the rest as +`unprocessed' as the second component of the pair; in the +other case it could recognise \texttt{iffoo} and leaves +\texttt{\VS testbar} as unprocessed. If the parser cannot +recognise anything from the input then parser combinators just +return the empty set $\{\}$. This will indicate something +``went wrong''\ldots or more precisely, nothing could be +parsed. -The main attraction is that we can easily build parser combinators out of smaller components -following very closely the structure of a grammar. In order to implement this in an object -oriented programming language, like Scala, we need to specify an abstract class for parser -combinators. This abstract class requires the implementation of the function -\texttt{parse} taking an argument of type \texttt{I} and returns a set of type -\mbox{\texttt{Set[(T, I)]}}. +The main attraction of parser combinators is that we can +easily build parser combinators out of smaller components +following very closely the structure of a grammar. In order to +implement this in an object-oriented programming language, +like Scala, we need to specify an abstract class for parser +combinators. This abstract class requires the implementation +of the function \texttt{parse} taking an argument of type +\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, +I)]}}. \begin{center} -\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] +\begin{lstlisting}[language=Scala] abstract class Parser[I, T] { def parse(ts: I): Set[(T, I)] @@ -272,19 +89,23 @@ \end{lstlisting} \end{center} -\noindent -From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all}, -which just filters out all pairs whose second component is not empty (that is has still some -unprocessed part). The reason is that at the end of parsing we are only interested in the -results where all the input has been consumed and no unprocessed part is left. +\noindent From the function \texttt{parse} we can then +``centrally'' derive the function \texttt{parse\_all}, which +just filters out all pairs whose second component is not empty +(that is has still some unprocessed part). The reason is that +at the end of parsing we are only interested in the results +where all the input has been consumed and no unprocessed part +is left. -One of the simplest parser combinators recognises just a character, say $c$, -from the beginning of strings. Its behaviour is as follows: +One of the simplest parser combinators recognises just a +character, say $c$, from the beginning of strings. Its +behaviour can be described as follows: \begin{itemize} \item if the head of the input string starts with a $c$, it returns - the set $\{(c, \textit{tail of}\; s)\}$ -\item otherwise it returns the empty set $\varnothing$ + the set $\{(c, \textit{tail of}\; s)\}$, where \textit{tail of} + $s$ is the unprocessed part of the input string +\item otherwise it returns the empty set $\{\}$ \end{itemize} \noindent @@ -301,19 +122,30 @@ \end{lstlisting} \end{center} +\noindent The \texttt{parse} function tests whether the first +character of the input string \texttt{sb} is equal to +\texttt{c}. If yes, then it splits the string into the +recognised part \texttt{c} and the unprocessed part +\texttt{sb.tail}. In case \texttt{sb} does not start with +\texttt{c} then the parser returns the empty set (in Scala +\texttt{Set()}). + + +More interesting are the parser combinators that build larger +parsers out of smaller component parsers. For example the +alternative parser combinator is as follows: given two +parsers, say, $p$ and $q$, then we apply both parsers to the +input (remember parsers are functions) and combine the input + +\begin{center} +$p(\text{input}) \cup q(\text{input})$ +\end{center} + \noindent -The \texttt{parse} function tests whether the first character of the -input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the -string into the recognised part \texttt{c} and the unprocessed part -\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then -the parser returns the empty set (in Scala \texttt{Set()}). - -More interesting are the parser combinators that build larger parsers -out of smaller component parsers. For example the alternative -parser combinator is as follows. +In Scala we would implement alternative parsers as follows \begin{center} -\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] +\begin{lstlisting}[language=Scala, numbers=none] class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) extends Parser[I, T] { @@ -322,49 +154,70 @@ \end{lstlisting} \end{center} -\noindent -The types of this parser combinator are polymorphic (we just have \texttt{I} -for the input type, and \texttt{T} for the output type). The alternative parser -builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}. -Both need to be able to process input of type \texttt{I} and return the same -output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the -\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the -evaluation of the arguments before they are used. This is often called -\emph{lazy evaluation} of the arguments.) The alternative parser should run -the input with the first parser \texttt{p} (producing a set of outputs) and then -run the same input with \texttt{q}. The result should be then just the union -of both sets, which is the operation \texttt{++} in Scala. +\noindent The types of this parser combinator are polymorphic +(we just have \texttt{I} for the input type, and \texttt{T} +for the output type). The alternative parser builds a new +parser out of two existing parser combinator \texttt{p} and +\texttt{q}. Both need to be able to process input of type +\texttt{I} and return the same output type \texttt{Set[(T, +I)]}. (There is an interesting detail of Scala, namely the +\texttt{=>} in front of the types of \texttt{p} and +\texttt{q}. They will prevent the evaluation of the arguments +before they are used. This is often called \emph{lazy +evaluation} of the arguments.) The alternative parser should +run the input with the first parser \texttt{p} (producing a +set of outputs) and then run the same input with \texttt{q}. +The result should be then just the union of both sets, which +is the operation \texttt{++} in Scala. -This parser combinator already allows us to construct a parser that either -a character \texttt{a} or \texttt{b}, as +This parser combinator already allows us to construct a parser +that either a character \texttt{a} or \texttt{b}, as \begin{center} -\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] +\begin{lstlisting}[language=Scala, numbers=none] new AltParser(CharParser('a'), CharParser('b')) \end{lstlisting} \end{center} -\noindent -Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. -We can call this parser combinator with the strings +\noindent Scala allows us to introduce some more readable +shorthand notation for this, like \texttt{'a' || 'b'}. We can +call this parser combinator with the strings \begin{center} \begin{tabular}{rcl} -input string & & output\medskip\\ +input strings & & output\medskip\\ \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$ +\texttt{\Grid{cc}} & $\rightarrow$ & $\{\}$ +\end{tabular} +\end{center} + +\noindent We receive in the first two cases a successful +output (that is a non-empty set). In each case, either +\pcode{a} or \pcode{b} are the processed parts, and \pcode{c} +is the unprocessed part. Clearly this parser cannot parse +anything with the string \pcode{cc}. + +A bit more interesting is the \emph{sequence parser +combinator}. Given two parsers, say, $p$ and $q$, +apply first the input to $p$ producing a set of pairs; +then apply $q$ to the unparsed parts; then combine the +results like + +\begin{center} +\begin{tabular}{lcl} +$\{((\textit{output}_1, \textit{output}_2), u_2)$ & $\;|\;$ & +$(\textit{output}_1, u_1) \in p(\text{input}) +\;\wedge\;$\\ +&& $(\textit{output}_2, u_2) \in q(u_1)\}$ \end{tabular} \end{center} \noindent -We receive in the first two cases a successful output (that is a non-empty set). - -A bit more interesting is the \emph{sequence parser combinator} implemented in -Scala as follows: +This can be implemented in Scala as follows: \begin{center} -\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] +\begin{lstlisting}[language=Scala,numbers=none] class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { @@ -376,57 +229,96 @@ \end{lstlisting} \end{center} -\noindent -This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} -as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}). -The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. -Let \texttt{q} run on these unprocessed parts -producing again a set of pairs. The output of the sequence parser combinator is then a set -containing pairs where the first components are again pairs, namely what the first parser could parse -together with what the second parser could parse; the second component is the unprocessed -part left over after running the second parser \texttt{q}. Therefore the input type of -the sequence parser combinator is as usual \texttt{I}, but the output type is +\noindent This parser takes as input two parsers, \texttt{p} +and \texttt{q}. It implements \texttt{parse} as follows: let +first run the parser \texttt{p} on the input producing a set +of pairs (\texttt{head1}, \texttt{tail1}). The \texttt{tail1} +stands for the unprocessed parts left over by \texttt{p}. Let +\texttt{q} run on these unprocessed parts producing again a +set of pairs. The output of the sequence parser combinator is +then a set containing pairs where the first components are +again pairs, namely what the first parser could parse together +with what the second parser could parse; the second component +is the unprocessed part left over after running the second +parser \texttt{q}. Therefore the input type of the sequence +parser combinator is as usual \texttt{I}, but the output type +is \begin{center} \texttt{Set[((T, S), I)]} \end{center} -Scala allows us to provide some -shorthand notation for the sequence parser combinator. So we can write for -example \texttt{'a' $\sim$ 'b'}, which is the -parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}. -Calling this parser combinator with the strings +Scala allows us to provide some shorthand notation for the +sequence parser combinator. So we can write for example +\texttt{'a' $\sim$ 'b'}, which is the parser combinator that +first consumes the character \texttt{a} from a string and then +\texttt{b}. Three examples of this parser combinator are as +follows: + +\begin{center} +\begin{tabular}{rcl} +input strings & & output\medskip\\ +\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ +\texttt{\Grid{bac}} & $\rightarrow$ & $\{\}$\\ +\texttt{\Grid{ccc}} & $\rightarrow$ & $\{\}$ +\end{tabular} +\end{center} + +\noindent A slightly more complicated parser is \texttt{('a' +|| 'b') $\sim$ 'b'} which parses as first character either an +\texttt{a} or \texttt{b} followed by a \texttt{b}. This parser +produces the following results. + +\begin{center} +\begin{tabular}{rcl} +input strings & & output\medskip\\ +\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ +\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ +\texttt{\Grid{aac}} & $\rightarrow$ & $\{\}$ +\end{tabular} +\end{center} + +\noindent Two more examples: first consider the parser \pcode{('a' ~ +'a') ~ 'a'} and the input \pcode{aaaa}: \begin{center} \begin{tabular}{rcl} input string & & output\medskip\\ -\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\ -\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$ +\texttt{\Grid{aaaa}} & $\rightarrow$ & +$\left\{(((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}})\right\}$\\ \end{tabular} \end{center} -\noindent -A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character either -an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results. +\noindent Notice how the results nest deeper as pairs (the +last \pcode{a} is in the unprocessed part). To consume +everything we can use the parser \pcode{(('a' ~'a') ~ 'a') ~ +'a'}. Then the output is as follows: \begin{center} \begin{tabular}{rcl} input string & & output\medskip\\ -\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ -\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ +\texttt{\Grid{aaaa}} & $\rightarrow$ & +$\left\{((((\texttt{\Grid{a}}, \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{\Grid{a}}), \texttt{""})\right\}$\\ \end{tabular} \end{center} -Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error. -The first parser has as output type a single character (recall the type of \texttt{CharParser}), -but the second parser produces a pair of characters as output. The alternative parser is however -required to have both component parsers to have the same type. We will see later how we can -build this parser without the typing error. +\noindent This is an instance where the parser consumed +completely the input, meaning the unprocessed part is just the +empty string. + -The next parser combinator does not actually combine smaller parsers, but applies -a function to the result of the parser. It is implemented in Scala as follows +Note carefully that constructing a parser such \texttt{'a' || +('a' $\sim$ 'b')} will result in a tying error. The first +parser has as output type a single character (recall the type +of \texttt{CharParser}), but the second parser produces a pair +of characters as output. The alternative parser is however +required to have both component parsers to have the same type. +We will see later how we can build this parser without the +typing error. + +The next parser combinator does not actually combine smaller +parsers, but applies a function to the result of the parser. +It is implemented in Scala as follows \begin{center} \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] diff -r 4629448c1bd9 -r 7f8516ff408d progs/Matcher2.thy --- a/progs/Matcher2.thy Mon Nov 23 16:45:07 2015 +0000 +++ b/progs/Matcher2.thy Wed Nov 25 15:59:32 2015 +0000 @@ -1,4 +1,4 @@ -theory Matcher2 +ytheory Matcher2 imports "Main" begin diff -r 4629448c1bd9 -r 7f8516ff408d progs/token2.scala --- a/progs/token2.scala Mon Nov 23 16:45:07 2015 +0000 +++ b/progs/token2.scala Wed Nov 25 15:59:32 2015 +0000 @@ -13,6 +13,7 @@ case class CRANGE(cs: String) extends Rexp case class PLUS(r: Rexp) extends Rexp case class OPT(r: Rexp) extends Rexp +case class NTIMES(r: Rexp, n: Int) extends Rexp abstract class Val case object Empty extends Val @@ -59,6 +60,7 @@ case CRANGE(_) => false case PLUS(r) => nullable(r) case OPT(_) => true + case NTIMES(r, n) => if (n == 0) true else nullable(r) } // derivative of a regular expression w.r.t. a character @@ -75,6 +77,7 @@ case CRANGE(cs) => if (cs.contains(c)) EMPTY else NULL case PLUS(r) => SEQ(der(c, r), STAR(r)) case OPT(r) => ALT(der(c, r), NULL) + case NTIMES(r, n) => if (n == 0) NULL else der(c, SEQ(r, NTIMES(r, n - 1))) } // derivative w.r.t. a string (iterates der) @@ -115,6 +118,9 @@ case RECD(x, r) => Rec(x, mkeps(r)) case PLUS(r) => Stars(List(mkeps(r))) case OPT(_) => Right(Empty) + case NTIMES(r, n) => if (n == 0) Stars(Nil) else Stars(Nil.padTo(n, mkeps(r))) + case _ => { println ("r : " + r.toString); + throw new Exception("mkeps error")} } @@ -130,6 +136,11 @@ case (RECD(x, r1), _) => Rec(x, inj(r1, c, v)) case (PLUS(r), Seq(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs) case (OPT(r), Left(v1)) => Left(inj(r, c, v1)) + case (NTIMES(r, n), Seq(v1, Stars(vs))) => Stars(inj(r, c, v1)::vs) + case (NTIMES(r, n), Left(Seq(v1, Stars(vs)))) => Stars(inj(r, c, v1)::vs) + case (NTIMES(r, n), Right(Stars(v::vs))) => Stars(mkeps(r)::inj(r, c, v)::vs) + case _ => { println ("r : " + r.toString + " v: " + v.toString); + throw new Exception("inj error")} } // main lexing function (produces a value) @@ -240,7 +251,6 @@ def tokenise(r: Rexp, s: String) = env(lexing_simp(r, s)).filterNot { (s) => s._1 == "w"}.mkString("\n") - // Testing //============ @@ -252,6 +262,11 @@ result } +println(lexing_simp(OPT("1"), "1")) +println(lexing_simp(OPT("1"), "")) +println(ders("111".toList, NTIMES("1",3))) +println(lexing_simp(NTIMES("1",3), "111")) + val r1 = ("a" | "ab") ~ ("bcd" | "c") println(lexing(r1, "abcd"))