% !TEX program = xelatex\documentclass{article}\usepackage{../style}\usepackage{../langs}\usepackage{../grammar}\usepackage{../graphics}% epsilon and left-recursion elimination% http://www.mollypages.org/page/grammar/index.mp%% parsing scala files%% https://scalameta.org/% chomsky normal form transformation% https://youtu.be/FNPSlnj3Vt0% Language hierachy is about complexity% How hard is it to recognise an element in a language% Pratt parsing% https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html% https://www.oilshell.org/blog/2017/03/31.html\begin{document}\section*{Handout 5 (Grammars \& Parser)}So far we have focused on regular expressions as well as matching andlexing algorithms. While regular expressions are very useful for lexingand for recognising many patterns in strings (like email addresses),they have their limitations. For example there is no regular expressionthat can recognise the language $a^nb^n$ (where you have stringsstarting with $n$ $a$'s followed by the same amount of $b$'s). Anotherexample for which there exists no regular expression is the language ofwell-parenthesised expressions. In languages like Lisp, which useparentheses rather extensively, it might be of interest to know whetherthe following two expressions are well-parenthesised or not (the leftone is, the right one is not):\begin{center}$(((()()))())$ \hspace{10mm} $(((()()))()))$\end{center}\noindent Not being able to solve such recognition problems isa serious limitation. In order to solve such recognitionproblems, we need more powerful techniques than regularexpressions. We will in particular look at \emph{context-freelanguages}. They include the regular languages as the picturebelow about language classes shows:\begin{center}\begin{tikzpicture}[rect/.style={draw=black!50, top color=white,bottom color=black!20, rectangle, very thick, rounded corners}]\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};\draw (0,-1.05) node [rect] {\small regular languages};\end{tikzpicture}\end{center}\noindent Each ``bubble'' stands for sets of languages (rememberlanguages are sets of strings). As indicated the set of regularlanguages is fully included inside the context-free languages,meaning every regular language is also context-free, but not viceversa. Below I will let you think, for example, what the context-freegrammar is for the language corresponding to the regular expression$(aaa)^*a$.Because of their convenience, context-free languages play an importantrole in `day-to-day' text processing and in programminglanguages. Context-free in this setting means that ``words'' have onemeaning only and this meaning is independent from the contextthe ``words'' appear in. For example ambiguity issues like\begin{center}\tt Time flies like an arrow. Fruit flies like bananas.\end{center} \noindentfrom natural languages where the meaning of \emph{flies} depends on thesurrounding \emph{context} are avoided as much as possible. Here isan interesting video about C++ not being a context-free language\begin{center}\url{https://www.youtube.com/watch?v=OzK8pUu4UfM}\end{center} Context-free languages are usually specified by grammars. For examplea grammar for well-parenthesised expressions can be given as follows:\begin{plstx}[margin=3cm]: \meta{P} ::= ( \cdot \meta{P} \cdot ) \cdot \meta{P} | \epsilon\\ \end{plstx}\noindent or a grammar for recognising strings consisting of ones (at least one) is\begin{plstx}[margin=3cm]: \meta{O} ::= 1 \cdot \meta{O} | 1\\\end{plstx}In general grammars consist of finitely many rules built upfrom \emph{terminal symbols} (usually lower-case letters) and\emph{non-terminal symbols} (upper-case letters written inbold like \meta{A}, \meta{N} and so on). Rules havethe shape\begin{plstx}[margin=3cm]: \meta{NT} ::= rhs\\\end{plstx}\noindent where on the left-hand side is a single non-terminaland on the right a string consisting of both terminals andnon-terminals including the $\epsilon$-symbol for indicatingthe empty string. We use the convention to separate componentson the right hand-side by using the $\cdot$ symbol, as in thegrammar for well-parenthesised expressions. We also use theconvention to use $|$ as a shorthand notation for severalrules. For example\begin{plstx}[margin=3cm]: \meta{NT} ::= rhs_1 | rhs_2\\\end{plstx}\noindent means that the non-terminal \meta{NT} can be replaced byeither $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are morethan one non-terminal on the left-hand side of the rules, thenwe need to indicate what is the \emph{starting} symbol of thegrammar. For example the grammar for arithmetic expressionscan be given as follows\begin{plstx}[margin=3cm,one per line]\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} \mid 0 \mid 1 \mid \ldots \mid 9\\\end{plstx}\noindent where \meta{E} is the starting symbol. A\emph{derivation} for a grammar starts with the startingsymbol of the grammar and in each step replaces onenon-terminal by a right-hand side of a rule. A derivation endswith a string in which only terminal symbols are left. Forexample a derivation for the string $(1 + 2) + 3$ is asfollows:\begin{center}\begin{tabular}{lll@{\hspace{2cm}}l}\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$ & by (2)\\ & $\rightarrow$ & $(\meta{E})+\meta{E}$ & by (5)\\ & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$ & by (2)\\ & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$ & by (1)\\ & $\rightarrow$ & $(\meta{E}+\meta{E})+3$ & by (6\dots)\\ & $\rightarrow$ & $(\meta{N}+\meta{E})+3$ & by (1)\\ & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\\end{tabular} \end{center}\noindent where on the right it is indicated which grammar rule has been applied. In the last step wemerged several steps into one.The \emph{language} of a context-free grammar $G$with start symbol $S$ is defined as the set of stringsderivable by a derivation, that is\begin{center}$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$\end{center}\noindentA \emph{parse-tree} encodes how a string is derived with the startingsymbol on top and each non-terminal containing a subtree for how it isreplaced in a derivation. The parse tree for the string $(1 + 23)+4$ isas follows:\begin{center}\begin{tikzpicture}[level distance=8mm, black] \node {\meta{E}} child {node {\meta{E} } child {node {$($}} child {node {\meta{E} } child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} child {node {$+$}} child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}} child {node {\meta{N} } child {node {$3$}}} } } child {node {$)$}} } child {node {$+$}} child {node {\meta{E} } child {node {\meta{N} } child {node {$4$}}} };\end{tikzpicture}\end{center}\noindent We are often interested in these parse-trees sincethey encode the structure of how a string is derived by agrammar. Before we come to the problem of constructing such parse-trees, we needto consider the following two properties of grammars. A grammar is\emph{left-recursive} if there is a derivation starting from anon-terminal, say \meta{NT} which leads to a string which again startswith \meta{NT}. This means a derivation of the form.\begin{center}$\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$\end{center}\noindent It can be easily seen that the grammar above for arithmeticexpressions is left-recursive: for example the rules $\meta{E}\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. Butnote that left-recursiveness can involve more than one step in thederivation. The problem with left-recursive grammars is that somealgorithms cannot cope with them: with left-recursive grammars they willfall into a loop. Fortunately every left-recursive grammar can betransformed into one that is not left-recursive, although thistransformation might make the grammar less ``human-readable''. Forexample if we want to give a non-left-recursive grammar for numbers wemight specify\begin{center}$\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$\end{center}\noindent Using this grammar we can still derive every numberstring, but we will never be able to derive a string of theform $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.The other property we have to watch out for is when a grammaris \emph{ambiguous}. A grammar is said to be ambiguous ifthere are two parse-trees for one string. Again the grammarfor arithmetic expressions shown above is ambiguous. While theshown parse tree for the string $(1 + 23) + 4$ is unique, thisis not the case in general. For example there are two parsetrees for the string $1 + 2 + 3$, namely\begin{center}\begin{tabular}{c@{\hspace{10mm}}c}\begin{tikzpicture}[level distance=8mm, black] \node {\meta{E} } child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} child {node {$+$}} child {node {\meta{E} } child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} child {node {$+$}} child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}} } ;\end{tikzpicture} &\begin{tikzpicture}[level distance=8mm, black] \node {\meta{E} } child {node {\meta{E} } child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} child {node {$+$}} child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} } child {node {$+$}} child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}} ;\end{tikzpicture}\end{tabular} \end{center}\noindent In particular in programming languages we will try to avoidambiguous grammars because two different parse-trees for a string mean aprogram can be interpreted in two different ways. In such cases we haveto somehow make sure the two different ways do not matter, ordisambiguate the grammar in some other way (for example making the $+$left-associative). Unfortunately already the problem of deciding whethera grammar is ambiguous or not is in general undecidable. But in simpleinstance (the ones we deal with in this module) one can usually see whena grammar is ambiguous.\subsection*{Removing Left-Recursion}Let us come back to the problem of left-recursion and consider the following grammar for binary numbers:\begin{plstx}[margin=1cm]: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\\end{plstx}\noindentIt is clear that this grammar can create all binary numbers, butit is also clear that this grammar is left-recursive. Giving thisgrammar as is to parser combinators will result in an infinite loop. Fortunately, every left-recursive grammar can be translatedinto one that is not left-recursive with the help of sometransformation rules. Suppose we identified the ``offensive'' rule, then we can separate the grammar into this offensive ruleand the ``rest'':\begin{plstx}[margin=1cm] : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\\end{plstx}\noindentTo make the idea of the transformation clearer, suppose the left-recursiverule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal followed by something called $\alpha$) and the ``rest'' is called $\beta$.That means our grammar looks schematically as follows\begin{plstx}[margin=1cm] : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\\end{plstx}\noindentTo get rid of the left-recursion, we are required to introduce a new non-terminal, say $\meta{B'}$ and transform the ruleas follows:\begin{plstx}[margin=1cm] : \meta{B} ::= \beta \cdot \meta{B'}\\ : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\\end{plstx}\noindentIn our example of binary numbers we would after the transformation end up with the rules\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\\end{plstx}\noindentA little thought should convince you that this grammar still derivesall the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$can be $\epsilon$). Less clear might be why this grammar is non-left recursive.For $\meta{B'}$ it is relatively clear because we will never be able to derive things like\begin{center}$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$\end{center} \noindentbecause there will always be a $\meta{B}$ in front of a $\meta{B'}$, and$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ cannever be in the first place. The reasoning is similar for $\meta{B}$:the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becomingleft-recursive. This transformation does not mean the grammar is thesimplest left-recursive grammar for binary numbers. For example thefollowing grammar would do as well\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\\end{plstx}\noindentThe point is that we can in principle transform every left-recursivegrammar into one that is non-left-recursive. This explains why oftenthe following grammar is used for arithmetic expressions:\begin{plstx}[margin=1cm] : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E}\\ : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\ : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\\end{plstx}\noindentIn this grammar all $\meta{E}$xpressions, $\meta{T}$erms and$\meta{F}$actors are in some way protected from beingleft-recursive. For example if you start $\meta{E}$ you can deriveanother one by going through $\meta{T}$, then $\meta{F}$, but then$\meta{E}$ is protected by the open-parenthesis in the last rule.\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}I showed above that the non-left-recursive grammar for binary numbers is\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\\end{plstx}\noindentThe transformation made the original grammar non-left-recursive, but atthe expense of introducing an $\epsilon$ in the second rule. Having anexplicit $\epsilon$-rule is annoying, not in terms of looping, but interms of efficiency. The reason is that the $\epsilon$-rule alwaysapplies but since it recognises the empty string, it does not make anyprogress with recognising a string. Better are rules like $( \cdot\meta{E} \cdot )$ where something of the input is consumed. Gettingrid of $\epsilon$-rules is also important for the CYK parsing algorithm,which can give us an insight into the complexity class of parsing.It turns out we can also by some generic transformations eliminate$\epsilon$-rules from grammars. Consider again the grammar above forbinary numbers where have a rule $\meta{B'} ::= \epsilon$. In this casewe look for rules of the (generic) form \mbox{$\meta{A} :=\alpha\cdot\meta{B'}\cdot\beta$}. That is, there are rules that use$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ andsomething follows ($\beta$). Such rules need to be replaced byadditional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.In our running example there are the two rules for $\meta{B}$ whichfall into this category\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\\end{plstx} \noindent To follow the general scheme of the transformation,the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happensto be empty. So we need to generate new rules for the form \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular example means we obtain\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\\end{plstx} \noindentUnfortunately $\meta{B'}$ is also used in the rule \begin{plstx}[margin=1cm] : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\\end{plstx}\noindentFor this we repeat the transformation, giving \begin{plstx}[margin=1cm] : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\\end{plstx}\noindentIn this case $\alpha$ was substituted with $\meta{B}$ and $\beta$was again empty. Once no rule is left over, we can simply throwaway the $\epsilon$ rule. This gives the grammar\begin{plstx}[margin=1cm] : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\\end{plstx}\noindentI let you think about whether this grammar can still recognise allbinary numbers and whether this grammar is non-left-recursive. Theprecise statement for the transformation of removing $\epsilon$-rulesis that if the original grammar was able to recognise only non-emptystrings, then the transformed grammar will be equivalent (matching thesame set of strings); if the original grammar was able to match theempty string, then the transformed grammar will be able to match thesame strings, \emph{except} the empty string. So the$\epsilon$-removal does not preserve equivalence of grammars ingeneral, but the small defect with the empty string is not importantfor practical purposes.So why are these transformations all useful? Well apart from making the parser combinators work (remember they cannot deal with left-recursion andare inefficient with $\epsilon$-rules), a second reason is that they helpwith getting any insight into the complexity of the parsing problem. The parser combinators are very easy to implement, but are far from the most efficient way of processing input (they can blow up exponentiallywith ambiguous grammars). The question remains what is the best possiblecomplexity for parsing? It turns out that this is $O(n^3)$ for context-freelanguages. To answer the question about complexity, let me describe next the CYKalgorithm (named after the authors Cocke–Younger–Kasami). This algorithmworks with grammars that are in \emph{Chomsky normalform}. In Chomskynormalform all rules must be of the form $\meta{A} ::= a$, where $a$ is a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and$\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$.The following grammar is in Chomsky normalform:\begin{plstx}[margin=1cm] : \meta{S} ::= \meta{N}\cdot \meta{P}\\ : \meta{P} ::= \meta{V}\cdot \meta{N}\\ : \meta{N} ::= \meta{N}\cdot \meta{N}\\ : \meta{N} ::= \meta{A}\cdot \meta{N}\\ : \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} | \texttt{trains}\\ : \meta{V} ::= \texttt{trains} | \texttt{team}\\ : \meta{A} ::= \texttt{The} | \texttt{the}\\\end{plstx}\noindentwhere $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,$\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''are terminals. The rough idea behind this grammar is that $\meta{S}$stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a nounand so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot\meta{N}$} states that a predicate can be a verb followed by a noun.Now the question is whether the string \begin{center} \texttt{The trainer trains the student team}\end{center}\noindentis recognised by the grammar. The CYK algorithm starts with thefollowing triangular data structure.%%\begin{figure}[t]\begin{center} \begin{tikzpicture}[scale=0.7,line width=0.8mm] \draw (-2,0) -- (4,0); \draw (-2,1) -- (4,1); \draw (-2,2) -- (3,2); \draw (-2,3) -- (2,3); \draw (-2,4) -- (1,4); \draw (-2,5) -- (0,5); \draw (-2,6) -- (-1,6); \draw (0,0) -- (0, 5); \draw (1,0) -- (1, 4); \draw (2,0) -- (2, 3); \draw (3,0) -- (3, 2); \draw (4,0) -- (4, 1); \draw (-1,0) -- (-1, 6); \draw (-2,0) -- (-2, 6); \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}}; \draw (-1.5,0.5) node {$A$}; \draw (-0.5,0.5) node {$N$}; \draw ( 0.5,0.5) node {$N,V$}; \draw ( 1.5,0.5) node {$A$}; \draw ( 2.5,0.5) node {$N$}; \draw ( 3.5,0.5) node {$N,V$};% \draw (-1.5,1.5) node {\small{}a}; % \draw (-0.5,1.5) node {\small{}b}; % \draw ( 0.5,1.5) node {\small{}c}; % \draw ( 1.5,1.5) node {\small{}d}; % \draw ( 2.5,1.5) node {\small{}e}; \draw (-2.4, 5.5) node {$1$}; \draw (-2.4, 4.5) node {$2$}; \draw (-2.4, 3.5) node {$3$}; \draw (-2.4, 2.5) node {$4$}; \draw (-2.4, 1.5) node {$5$}; \draw (-2.4, 0.5) node {$6$}; \end{tikzpicture} \end{center}%%\end{figure}\noindentThe last row contains the information about all words and theircorresponding non-terminals. For example the field for \texttt{trains}contains the information $\meta{N}$ and $\meta{V}$ because \texttt{trains} can be a``verb'' and a ``noun'' according to the grammar. The row above,let's call the corresponding fields 5a to 5e, contains informationabout \underline{2-word} parts of the sentence, namely\begin{center}\begin{tabular}{llll} 5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$ \\ 5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\ 5c) & \texttt{trains} $\mid$ \texttt{the} \\ 5d) & \texttt{the} $\mid$ \texttt{student} \\ 5e) & \texttt{student} $\mid$ \texttt{team} \\\end{tabular} \end{center} \noindentFor each of them, we look up in row 6 which non-terminals it belongs to(indicated above for 5a and 5b). For 5a, with the non-terminals\meta{A} and \meta{N}, we find the grammar rule\begin{plstx}[margin=1cm] : \meta{N} ::= \meta{A}\cdot \meta{N}\\\end{plstx}\noindentwhich means into field 5a we put the left-hand side of this rule,which in this case is the non-terminal \meta{N}. For 5b we have to checkfor both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether thereis a right-hand side of this form in the grammar. But only thegrammar rule \begin{plstx}[margin=1cm] : \meta{N} ::= \meta{N}\cdot \meta{N}\\\end{plstx}\noindentmatches, which means 5b gets also an \meta{N}. Continuing for allfields in row 5 gives:\begin{center} \begin{tikzpicture}[scale=0.7,line width=0.8mm] \draw (-2,0) -- (4,0); \draw (-2,1) -- (4,1); \draw (-2,2) -- (3,2); \draw (-2,3) -- (2,3); \draw (-2,4) -- (1,4); \draw (-2,5) -- (0,5); \draw (-2,6) -- (-1,6); \draw (0,0) -- (0, 5); \draw (1,0) -- (1, 4); \draw (2,0) -- (2, 3); \draw (3,0) -- (3, 2); \draw (4,0) -- (4, 1); \draw (-1,0) -- (-1, 6); \draw (-2,0) -- (-2, 6); \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}}; \draw (-1.5,0.5) node {$A$}; \draw (-0.5,0.5) node {$N$}; \draw ( 0.5,0.5) node {$N,V$}; \draw ( 1.5,0.5) node {$A$}; \draw ( 2.5,0.5) node {$N$}; \draw ( 3.5,0.5) node {$N,V$}; \draw (-1.5,1.5) node {$N$}; \draw (-0.5,1.5) node {$N$}; \draw ( 0.5,1.5) node {$$}; \draw ( 1.5,1.5) node {$N$}; \draw ( 2.5,1.5) node {$N$}; % \draw (-1.5,1.5) node {\small{}a}; % \draw (-0.5,1.5) node {\small{}b}; % \draw ( 0.5,1.5) node {\small{}c}; % \draw ( 1.5,1.5) node {\small{}d}; % \draw ( 2.5,1.5) node {\small{}e}; \draw (-2.4, 5.5) node {$1$}; \draw (-2.4, 4.5) node {$2$}; \draw (-2.4, 3.5) node {$3$}; \draw (-2.4, 2.5) node {$4$}; \draw (-2.4, 1.5) node {$5$}; \draw (-2.4, 0.5) node {$6$}; \end{tikzpicture} \end{center}\noindentNow row 4 is in charge of all 3-word parts of the sentence, namely\begin{center}\begin{tabular}{llll} 4a) & The $\mid$ trainer trains\\ & The trainer $\mid$ trains\\ 4b) & trainer $\mid$ trains the\\ & trainer trains $\mid$ the\\ 4c) & trains $\mid$ the student\\ & trains the $\mid$ student\\ 4d) & the $\mid$ student team\\ & the student $\mid$ team\\\end{tabular} \end{center}\noindentNote that in case of 3-word parts we have two splits. For example for4a: $\underbrace{\texttt{The}}_{A}$ and$\underbrace{\texttt{trainer trains}}_{N}$; and also$\underbrace{\texttt{The trainer}}_{N}$ and$\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we haveto look up in the rows below, which non-terminals we alreadycomputed. This allows us to look for right-hand sides in our grammar:$\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and$\meta{N}\cdot\meta{V}$, which yield the only left-hand side\meta{N}. This is what we fill in for 4a. And so on for row 4.Row 3 is about all 4-word parts in the sentence, namely\begin{center}\begin{tabular}{llll} 3a) & The trainer trains the\\ 3b) & trainer trains the student\\ 3c) & trains the student team\\\end{tabular} \end{center}\noindentEach of them can be split up in 3 ways, for example\begin{center}\begin{tabular}{llll} 3a) & The $\mid$ trainer trains the\\ & The trainer $\mid$ trains the\\ & The trainer trains $\mid$ the\\\end{tabular} \end{center}\noindentand we have to consider them all in turn to fill in the non-terminals for3a. You can guess how it continues: row 2 is for all 5-word parts, and finallythe field on the top is for the whole sentence.The idea of the CYK algorithm is that if in the top-field the startingsymbol \meta{S} appears (possibly together with other non-terminals),then the sentence is accepted by the grammar. If it does not, then thesentence is not accepted.Let us very quickly calculate the complexity of the CYKalgorithm. Lookup operations inside the triangle and in the grammarare assumed to be of constant time, $O(1)$, meaning they do notmatter. How many fields are in the triangle\ldots$\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That meansroughly $O(n^2)$ fields. How much work do we have to do for eachfield? Well, for the top-most we have to consider $n-1$ splits, whichmeans roughly $O(n)$ for each field. The overall result is a $O(n^3)$time-complexity for CYK. It turns out that this is the best worst-timecomplexity for parsing with context-free grammars.\end{document}%%% Parser combinators are now part of handout 6\subsection*{Parser Combinators}Let us now turn to the problem of generating a parse-tree fora grammar and string. In what follows we explain \emph{parsercombinators}, because they are easy to implement and closelyresemble grammar rules. Imagine that a grammar describes thestrings of natural numbers, such as the grammar \meta{N} shownabove. For all such strings we want to generate theparse-trees or later on we actually want to extract themeaning of these strings, that is the concrete integers``behind'' these strings. In Scala the parser combinators willbe functions of type\begin{center}\texttt{I $\Rightarrow$ Set[(T, I)]}\end{center}\noindent that is they take as input something of type\texttt{I}, typically a list of tokens or a string, and returna set of pairs. The first component of these pairs correspondsto what the parser combinator was able to process from theinput and the second is the unprocessed part of the input. Aswe shall see shortly, a parser combinator might return morethan one such pair, with the idea that there are potentiallyseveral ways how to interpret the input. As a concreteexample, consider the case where the input is of type string,say the string\begin{center}\tt\Grid{iffoo\VS testbar}\end{center}\noindent We might have a parser combinator which tries tointerpret this string as a keyword (\texttt{if}) or anidentifier (\texttt{iffoo}). Then the output will be the set\begin{center}$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$\end{center}\noindent where the first pair means the parser couldrecognise \texttt{if} from the input and leaves the rest as`unprocessed' as the second component of the pair; in theother case it could recognise \texttt{iffoo} and leaves\texttt{\VS testbar} as unprocessed. If the parser cannotrecognise anything from the input then parser combinators justreturn the empty set $\{\}$. This will indicatesomething ``went wrong''.The main attraction is that we can easily build parser combinators out of smaller componentsfollowing very closely the structure of a grammar. In order to implement this in an objectoriented programming language, like Scala, we need to specify an abstract class for parser combinators. This abstract class requires the implementation of the function\texttt{parse} taking an argument of type \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]abstract class Parser[I, T] { def parse(ts: I): Set[(T, I)] def parse_all(ts: I): Set[T] = for ((head, tail) <- parse(ts); if (tail.isEmpty)) yield head}\end{lstlisting}\end{center}\noindentFrom the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},which just filters out all pairs whose second component is not empty (that is has still someunprocessed part). The reason is that at the end of parsing we are only interested in theresults where all the input has been consumed and no unprocessed part is left.One of the simplest parser combinators recognises just a character, say $c$, from the beginning of strings. Its behaviour is as follows:\begin{itemize}\item if the head of the input string starts with a $c$, it returns the set $\{(c, \textit{tail of}\; s)\}$\item otherwise it returns the empty set $\varnothing$ \end{itemize}\noindentThe input type of this simple parser combinator for characters is\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. The code in Scala is as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]case class CharParser(c: Char) extends Parser[String, Char] { def parse(sb: String) = if (sb.head == c) Set((c, sb.tail)) else Set()}\end{lstlisting}\end{center}\noindentThe \texttt{parse} function tests whether the first character of the input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits thestring into the recognised part \texttt{c} and the unprocessed part\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} thenthe parser returns the empty set (in Scala \texttt{Set()}).More interesting are the parser combinators that build larger parsersout of smaller component parsers. For example the alternative parser combinator is as follows.\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class AltParser[I, T] (p: => Parser[I, T], q: => Parser[I, T]) extends Parser[I, T] { def parse(sb: I) = p.parse(sb) ++ q.parse(sb)}\end{lstlisting}\end{center}\noindentThe types of this parser combinator are polymorphic (we just have \texttt{I}for the input type, and \texttt{T} for the output type). The alternative parserbuilds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.Both need to be able to process input of type \texttt{I} and return the sameoutput type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent theevaluation of the arguments before they are used. This is often called \emph{lazy evaluation} of the arguments.) The alternative parser should runthe input with the first parser \texttt{p} (producing a set of outputs) and thenrun the same input with \texttt{q}. The result should be then just the unionof both sets, which is the operation \texttt{++} in Scala.This parser combinator already allows us to construct a parser that either a character \texttt{a} or \texttt{b}, as\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]new AltParser(CharParser('a'), CharParser('b'))\end{lstlisting}\end{center}\noindentScala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. We can call this parser combinator with the strings\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}\noindentWe receive in the first two cases a successful output (that is a non-empty set).A bit more interesting is the \emph{sequence parser combinator} implemented inScala as follows:\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class SeqParser[I, T, S] (p: => Parser[I, T], q: => Parser[I, S]) extends Parser[I, (T, S)] { def parse(sb: I) = for ((head1, tail1) <- p.parse(sb); (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)}\end{lstlisting}\end{center}\noindentThis parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. Let \texttt{q} run on these unprocessed partsproducing again a set of pairs. The output of the sequence parser combinator is then a setcontaining pairs where the first components are again pairs, namely what the first parser could parsetogether with what the second parser could parse; the second component is the unprocessedpart left over after running the second parser \texttt{q}. Therefore the input type ofthe sequence parser combinator is as usual \texttt{I}, but the output type is\begin{center}\texttt{Set[((T, S), I)]}\end{center}Scala allows us to provide someshorthand notation for the sequence parser combinator. So we can write for example \texttt{'a' $\sim$ 'b'}, which is theparser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.Calling this parser combinator with the strings\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}\noindentA slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character eitheran \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.\begin{center}\begin{tabular}{rcl}input string & & output\medskip\\\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$\end{tabular}\end{center}Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.The first parser has as output type a single character (recall the type of \texttt{CharParser}),but the second parser produces a pair of characters as output. The alternative parser is howeverrequired to have both component parsers to have the same type. We will see later how we can build this parser without the typing error.The next parser combinator does not actually combine smaller parsers, but appliesa function to the result of the parser. It is implemented in Scala as follows\begin{center}\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]class FunParser[I, T, S] (p: => Parser[I, T], f: T => S) extends Parser[I, S] { def parse(sb: I) = for ((head, tail) <- p.parse(sb)) yield (f(head), tail)}\end{lstlisting}\end{center}\noindentThis parser combinator takes a parser \texttt{p} with output type \texttt{T} asinput as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator thenapplies the function \texttt{f} to all the parer outputs. Since this functionis of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.Again Scala lets us introduce some shorthand notation for this parser combinator. Therefore we will write \texttt{p ==> f} for it.%\bigskip%takes advantage of the full generality---have a look%what it produces if we call it with the string \texttt{abc}%%\begin{center}%\begin{tabular}{rcl}%input string & & output\medskip\\%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$%\end{tabular}%\end{center}%%% Local Variables: %%% mode: latex %%% TeX-master: t%%% End: