\documentclass{article}\usepackage{charter}\usepackage{hyperref}\usepackage{amssymb}\usepackage{amsmath}\usepackage[T1]{fontenc}\usepackage{listings}\usepackage{xcolor}\usepackage{tikz}\usetikzlibrary{arrows}\usetikzlibrary{automata}\usetikzlibrary{shapes}\usetikzlibrary{shadows}\usetikzlibrary{positioning}\usetikzlibrary{calc}\usetikzlibrary{fit}\usetikzlibrary{backgrounds}\usepackage{fontspec}\setmonofont{Consolas}\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%\definecolor{javared}{rgb}{0.6,0,0} % for strings\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc\lstdefinelanguage{scala}{ morekeywords={abstract,case,catch,class,def,% do,else,extends,false,final,finally,% for,if,implicit,import,match,mixin,% new,null,object,override,package,% private,protected,requires,return,sealed,% super,this,throw,trait,true,try,% type,val,var,while,with,yield}, otherkeywords={=>,<-,<\%,<:,>:,\#,@}, sensitive=true, morecomment=[l]{//}, morecomment=[n]{/*}{*/}, morestring=[b]", morestring=[b]', morestring=[b]"""}\lstdefinelanguage{while}{ morekeywords={while, if, then. else, read, write}, otherkeywords={=>,<-,<\%,<:,>:,\#,@}, sensitive=true, morecomment=[l]{//}, morecomment=[n]{/*}{*/}, morestring=[b]", morestring=[b]', morestring=[b]"""}\lstset{language=Scala, basicstyle=\ttfamily, keywordstyle=\color{javapurple}\bfseries, stringstyle=\color{javagreen}, commentstyle=\color{javagreen}, morecomment=[s][\color{javadocblue}]{/**}{*/}, numbers=left, numberstyle=\tiny\color{black}, stepnumber=1, numbersep=10pt, tabsize=2, showspaces=false, showstringspaces=false}\newcommand\grid[1]{%\begin{tikzpicture}[baseline=(char.base)] \path[use as bounding box] (0,0) rectangle (1em,1em); \draw[red!50, fill=red!20] (0,0) rectangle (1em,1em); \node[inner sep=1pt,anchor=base west] (char) at (0em,\gridraiseamount) {#1};\end{tikzpicture}}\newcommand\gridraiseamount{0.12em}\makeatletter\newcommand\Grid[1]{% \@tfor\z:=#1\do{\grid{\z}}}\makeatother \newcommand\Vspace[1][.3em]{% \mbox{\kern.06em\vrule height.3ex}% \vbox{\hrule width#1}% \hbox{\vrule height.3ex}}\def\VS{\Vspace[0.6em]}\begin{document}\section*{Handout 6}While regular expressions are very useful for lexing and for recognisingmany patterns in strings (like email addresses), they have their limitations. Forexample there is no regular expression that can recognise the language $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised expressions. In languages like Lisp, which use parentheses ratherextensively, it might be of interest whether the following two expressionsare well-parenthesised (the left one is, the right one is not):\begin{center}$(((()()))())$ \hspace{10mm} $(((()()))()))$\end{center}\noindentNot being able to solve such recognition problems is a serious limitation.In order to solve such recognition problems, we need more powerful techniques than regular expressions. We will in particular look at \emph{context-freelanguages}. They include the regular languages as the picture below shows:\begin{center}\begin{tikzpicture}[rect/.style={draw=black!50, top color=white,bottom color=black!20, rectangle, very thick, rounded corners}]\draw (0,0) node [rect, text depth=30mm, text width=46mm] {all languages};\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {decidable languages};\draw (0,-0.65) node [rect, text depth=13mm] {context sensitive languages};\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {context-free languages};\draw (0,-1.05) node [rect] {regular languages};\end{tikzpicture}\end{center}\noindentContext-free languages play an important role in `day-to-day' text processing and inprogramming languages. Context-free languages are usually specified by grammars.For example a grammar for well-parenthesised expressions is\begin{center}$P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$\end{center}\noindentIn general grammars consist of finitely many rules built up from \emph{terminal symbols} (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules have the shape\begin{center}$NT \;\;\rightarrow\;\; \textit{rhs}$\end{center}\noindentwhere on the left-hand side is a single non-terminal and on the right a string consistingof both terminals and non-terminals including the $\epsilon$-symbol for indicating theempty string. We use the convention to separate components onthe right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions.We also use the convention to use $|$ as a shorthand notation for several rules. For example\begin{center}$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$\end{center}\noindentmeans that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.If there are more than one non-terminal on the left-hand side of the rules, then we need to indicatewhat is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressionscan be given as follows\begin{center}\begin{tabular}{lcl}$E$ & $\rightarrow$ & $N$ \\$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ \end{tabular}\end{center}\noindentwhere $E$ is the starting symbol. A \emph{derivation} for a grammarstarts with the staring symbol of the grammar and in each step replaces onenon-terminal by a right-hand side of a rule. A derivation ends with a stringin which only terminal symbols are left. For example a derivation for thestring $(1 + 2) + 3$ is as follows:\begin{center}\begin{tabular}{lll}$E$ & $\rightarrow$ & $E+E$\\ & $\rightarrow$ & $(E)+E$\\ & $\rightarrow$ & $(E+E)+E$\\ & $\rightarrow$ & $(E+E)+N$\\ & $\rightarrow$ & $(E+E)+3$\\ & $\rightarrow$ & $(N+E)+3$\\ & $\rightarrow^+$ & $(1+2)+3$\\\end{tabular} \end{center}\noindentThe \emph{language} of a context-free grammar $G$ with start symbol $S$ is defined as the set of strings derivable by a derivation, that is\begin{center}$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$\end{center}\noindentA \emph{parse-tree} encodes how a string is derived with the starting symbol on top and each non-terminal containing a subtree for how it is replaced in a derivation.The parse tree for the string $(1 + 23)+4$ is as follows:\begin{center}\begin{tikzpicture}[level distance=8mm, black] \node {$E$} child {node {$E$} child {node {$($}} child {node {$E$} child {node {$E$} child {node {$N$} child {node {$1$}}}} child {node {$+$}} child {node {$E$} child {node {$N$} child {node {$2$}}} child {node {$N$} child {node {$3$}}} } } child {node {$)$}} } child {node {$+$}} child {node {$E$} child {node {$N$} child {node {$4$}}} };\end{tikzpicture}\end{center}\noindentWe are often interested in these parse-trees since they encode the structure ofhow a string is derived by a grammar. Before we come to the problem of constructingsuch parse-trees, we need to consider the following two properties of grammars.A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say$NT$ which leads to a string which again starts with $NT$. This means a derivation of theform.\begin{center}$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$\end{center}\noindentIt can be easily seen that the grammar above for arithmetic expressions is left-recursive:for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive grammars. Fortunately every left-recursive grammar can be transformed into one that isnot left-recursive, although this transformation might make the grammar less human-readable.For example if we want to give a non-left-recursive grammar for numbers we mightspecify\begin{center}$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$\end{center}\noindentUsing this grammar we can still derive every number string, but we will never be able to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.The other property we have to watch out for is when a grammar is\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-treesfor one string. Again the grammar for arithmetic expressions shown above is ambiguous.While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case ingeneral. For example there are two parsetrees for the string $1 + 2 + 3$, namely\begin{center}\begin{tabular}{c@{\hspace{10mm}}c}\begin{tikzpicture}[level distance=8mm, black] \node {$E$} child {node {$E$} child {node {$N$} child {node {$1$}}}} child {node {$+$}} child {node {$E$} child {node {$E$} child {node {$N$} child {node {$2$}}}} child {node {$+$}} child {node {$E$} child {node {$N$} child {node {$3$}}}} } ;\end{tikzpicture} &\begin{tikzpicture}[level distance=8mm, black] \node {$E$} child {node {$E$} child {node {$E$} child {node {$N$} child {node {$1$}}}} child {node {$+$}} child {node {$E$} child {node {$N$} child {node {$2$}}}} } child {node {$+$}} child {node {$E$} child {node {$N$} child {node {$3$}}}} ;\end{tikzpicture}\end{tabular} \end{center}\noindentIn particular in programming languages we will try to avoid ambiguousgrammars because two different parse-trees for a string mean a program canbe interpreted in two different ways. In such cases we have to somehow make surethe two different ways do not matter, or disambiguate the grammar insome other way (for example making the $+$ left-associative). Unfortunately already the problem of deciding whether a grammaris ambiguous or not is in general undecidable. Let us now turn to the problem of generating a parse-tree for a grammar and string.In what follows we explain \emph{parser combinators}, because they are easyto implement and closely resemble grammar rules. Imagine that a grammardescribes the strings of natural numbers, such as the grammar $N$ shown above.For all such strings we want to generate the parse-trees or later on we actually want to extract the meaning of these strings, that is the concrete integers ``behind'' these strings. In Scala the parser combinators will be functions of type\begin{center}\texttt{I $\Rightarrow$ Set[(T, I)]}\end{center}\noindent that is they take as input something of type \texttt{I}, typically a list of tokens or a string,and return a set of pairs. The first component of these pairs corresponds to what theparser combinator was able to process from the input and the second is the unprocessed part of the input. As we shall see shortly, a parser combinator might return more than one such pair,with the idea that there are potentially several ways how to interpret the input. As a concreteexample, consider the case where the input is of type string, say the string\begin{center}\tt\Grid{iffoo\VS testbar}\end{center}\noindentWe might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) oran identifier (\texttt{iffoo}). Then the output will be the set\begin{center}$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$\end{center}\noindentwhere the first pair means the parser could recognise \texttt{if} from the input and leaves the rest as `unprocessed' as the second component of the pair; in the other caseit could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parsercannot recognise anything from the input then parser combinators just return the empty set $\varnothing$. This will indicate something ``went wrong''.The main attraction is that we can easily build parser combinators out of smaller componentsfollowing very closely the structure of a grammar. In order to implement this in an objectoriented programming language, like Scala, we need to specify an abstract class for parser combinators. This abstract class requires the implementation of the function\texttt{parse} taking an argument of type \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]abstract class Parser[I, T] { def parse(ts: I): Set[(T, I)] def parse_all(ts: I): Set[T] = for ((head, tail) <- parse(ts); if (tail.isEmpty)) yield head}\end{lstlisting}\end{center}\noindentFrom the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},which just filters out all pairs whose second component is not empty (that is has still someunprocessed part). The reason is that at the end of parsing we are only interested in theresults where all the input has been consumed and no unprocessed part is left.One of the simplest parser combinators recognises just a character, say $c$, from the beginning of strings. Its behaviour is as follows:\begin{itemize}\item if the head of the input string starts with a $c$, it returns the set $\{(c, \textit{tail of}\; s)\}$\item otherwise it returns the empty set $\varnothing$ \end{itemize}\noindentThe input type of this simple parser combinator for characters is\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]case class CharParser(c: Char) extends Parser[String, Char] { def parse(sb: String) = if (sb.head == c) Set((c, sb.tail)) else Set()}\end{lstlisting}\end{center}\noindentThe \texttt{parse} function tests whether the first character of the input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits thestring into the recognised part \texttt{c} and the unprocessed part\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} thenthe parser returns the empty set (in Scala \texttt{Set()}).More interesting are the parser combinators that build larger parsersout of smaller component parsers. For example the alternative parser combinator is as follows.\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) extends Parser[I, T] { def parse(sb: I) = p.parse(sb) ++ q.parse(sb)}\end{lstlisting}\end{center}\noindentThe types of this parser combinator are polymorphic (we just have \texttt{I}for the input type, and \texttt{T} for the output type). The alternative parserbuilds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.Both need to be able to process input of type \texttt{I} and return the sameoutput type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent theevaluation of the arguments before they are used. This is often called \emph{lazy evaluation} of the arguments.) The alternative parser should runthe input with the first parser \texttt{p} (producing a set of outputs) and thenrun the same input with \texttt{q}. The result should be then just the unionof both sets, which is the operation \texttt{++} in Scala.This parser combinator already allows us to construct a parser that either a character \texttt{a} or \texttt{b}, as\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]new AltParser(CharParser('a'), CharParser('b'))\end{lstlisting}\end{center}\noindentScala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. We can call this parser combinator with the strings\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}\noindentWe receive in the first two cases a successful output (that is a non-empty set).A bit more interesting is the \emph{sequence parser combinator} implemented inScala as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { def parse(sb: I) = for ((head1, tail1) <- p.parse(sb); (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)}\end{lstlisting}\end{center}\noindentThis parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. Let \texttt{q} run on these unprocessed partsproducing again a set of pairs. The output of the sequence parser combinator is then a setcontaining pairs where the first components are again pairs, namely what the first parser could parsetogether with what the second parser could parse; the second component is the unprocessedpart left over after running the second parser \texttt{q}. Therefore the input type ofthe sequence parser combinator is as usual \texttt{I}, but the output type is\begin{center}\texttt{Set[((T, S), I)]}\end{center}Scala allows us to provide someshorthand notation for the sequence parser combinator. So we can write for example \texttt{'a' $\sim$ 'b'}, which is theparser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.Calling this parser combinator with the strings\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}\noindentA slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character eitheran \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.The first parser has as output type a single character (recall the type of \texttt{CharParser}),but the second parser produces a pair of characters as output. The alternative parser is howeverrequired to have both component parsers to have the same type. We will see later how we can build this parser without the typing error.The next parser combinator does not actually combine smaller parsers, but appliesa function to the result of the parser. It is implemented in Scala as follows\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class FunParser[I, T, S] (p: => Parser[I, T], f: T => S) extends Parser[I, S] { def parse(sb: I) = for ((head, tail) <- p.parse(sb)) yield (f(head), tail)}\end{lstlisting}\end{center}\noindentThis parser combinator takes a parser \texttt{p} with output type \texttt{T} asinput as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator thenapplies the function \texttt{f} to all the parer outputs. Since this functionis of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.Again Scala lets us introduce some shorthand notation for this parser combinator. Therefore we will write \texttt{p ==> f} for it.%\bigskip%takes advantage of the full generality---have a look%what it produces if we call it with the string \texttt{abc}%%\begin{center}%\begin{tabular}{rcl}%input string & & output\medskip\\%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$%\end{tabular}%\end{center}\end{document}%%% Local Variables: %%% mode: latex %%% TeX-master: t%%% End: