afl-material: handouts/ho05.tex@66adcae6c762

% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../grammar}
\usepackage{../graphics}

% epsilon and left-recursion elimination
% http://www.mollypages.org/page/grammar/index.mp

%% parsing scala files
%% https://scalameta.org/

% chomsky normal form transformation
% https://youtu.be/FNPSlnj3Vt0

% Language hierachy is about complexity
%    How hard is it to recognise an element in a language

% Pratt parsing
% https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
% https://www.oilshell.org/blog/2017/03/31.html

\begin{document}

\section*{Handout 5 (Grammars \& Parser)}

So far we have focused on regular expressions as well as matching and
lexing algorithms. While regular expressions are very useful for lexing
and for recognising many patterns in strings (like email addresses),
they have their limitations. For example there is no regular expression
that can recognise the language $a^nb^n$ (where you have strings
starting with $n$ $a$'s followed by the same amount of $b$'s). Another
example for which there exists no regular expression is the language of
well-parenthesised expressions. In languages like Lisp, which use
parentheses rather extensively, it might be of interest to know whether
the following two expressions are well-parenthesised or not (the left
one is, the right one is not):

\begin{center}
$(((()()))())$  \hspace{10mm} $(((()()))()))$
\end{center}

\noindent Not being able to solve such recognition problems is
a serious limitation. In order to solve such recognition
problems, we need more powerful techniques than regular
expressions. We will in particular look at \emph{context-free
languages}. They include the regular languages as the picture
below about language classes shows:


\begin{center}
\begin{tikzpicture}
[rect/.style={draw=black!50, 
 top color=white,bottom color=black!20, 
 rectangle, very thick, rounded corners}]

\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
\draw (0,-1.05) node [rect] {\small regular languages};
\end{tikzpicture}
\end{center}

\noindent Each ``bubble'' stands for sets of languages (remember
languages are sets of strings). As indicated the set of regular
languages is fully included inside the context-free languages,
meaning every regular language is also context-free, but not vice
versa. Below I will let you think, for example, what the context-free
grammar is for the language corresponding to the regular expression
$(aaa)^*a$.

Because of their convenience, context-free languages play an important
role in `day-to-day' text processing and in programming
languages. Context-free in this setting means that ``words'' have one
meaning only and this meaning is independent from the context
the ``words'' appear in. For example ambiguity issues like

\begin{center}
\tt Time flies like an arrow. Fruit flies like bananas.
\end{center}  

\noindent
from natural languages where the meaning of \emph{flies} depends on the
surrounding \emph{context} are avoided as much as possible. Here is
an interesting video about C++ not being a context-free language

\begin{center}
\url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
\end{center}  

Context-free languages are usually specified by grammars. For example
a grammar for well-parenthesised expressions can be given as follows:

\begin{plstx}[margin=3cm]
: \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
             | \epsilon\\ 
\end{plstx}

\noindent 
or a grammar for recognising strings consisting of ones (at least one) is

\begin{plstx}[margin=3cm]
: \meta{O} ::= 1 \cdot  \meta{O} 
             | 1\\
\end{plstx}

In general grammars consist of finitely many rules built up
from \emph{terminal symbols} (usually lower-case letters) and
\emph{non-terminal symbols} (upper-case letters written in
bold like \meta{A}, \meta{N} and so on). Rules have
the shape

\begin{plstx}[margin=3cm]
: \meta{NT} ::= rhs\\
\end{plstx}
 
\noindent where on the left-hand side is a single non-terminal
and on the right a string consisting of both terminals and
non-terminals including the $\epsilon$-symbol for indicating
the empty string. We use the convention to separate components
on the right hand-side by using the $\cdot$ symbol, as in the
grammar for well-parenthesised expressions. We also use the
convention to use $|$ as a shorthand notation for several
rules. For example

\begin{plstx}[margin=3cm]
: \meta{NT} ::= rhs_1
   | rhs_2\\
\end{plstx}

\noindent means that the non-terminal \meta{NT} can be replaced by
either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
than one non-terminal on the left-hand side of the rules, then
we need to indicate what is the \emph{starting} symbol of the
grammar. For example the grammar for arithmetic expressions
can be given as follows

\begin{plstx}[margin=3cm,one per line]
\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} 
  \mid 0 \mid 1 \mid \ldots \mid 9\\
\end{plstx}

\noindent where \meta{E} is the starting symbol. A
\emph{derivation} for a grammar starts with the starting
symbol of the grammar and in each step replaces one
non-terminal by a right-hand side of a rule. A derivation ends
with a string in which only terminal symbols are left. For
example a derivation for the string $(1 + 2) + 3$ is as
follows:

\begin{center}
\begin{tabular}{lll@{\hspace{2cm}}l}
\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$          & by (2)\\
       & $\rightarrow$ & $(\meta{E})+\meta{E}$     & by (5)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$   & by (2)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$   & by (1)\\
       & $\rightarrow$ & $(\meta{E}+\meta{E})+3$   & by (6\dots)\\
       & $\rightarrow$ & $(\meta{N}+\meta{E})+3$   & by (1)\\
       & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
\end{tabular} 
\end{center}

\noindent where on the right it is indicated which 
grammar rule has been applied. In the last step we
merged several steps into one.

The \emph{language} of a context-free grammar $G$
with start symbol $S$ is defined as the set of strings
derivable by a derivation, that is

\begin{center}
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
\end{center}

\noindent
A \emph{parse-tree} encodes how a string is derived with the starting
symbol on top and each non-terminal containing a subtree for how it is
replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is
as follows:

\begin{center}
\begin{tikzpicture}[level distance=8mm, black]
  \node {\meta{E}}
    child {node {\meta{E} } 
       child {node {$($}}
       child {node {\meta{E} }       
         child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
         child {node {$+$}}
         child {node {\meta{E} } 
            child {node {\meta{N} } child {node {$2$}}}
            child {node {\meta{N} } child {node {$3$}}}
            } 
        }
       child {node {$)$}}
     }
     child {node {$+$}}
     child {node {\meta{E} }
        child {node {\meta{N} } child {node {$4$}}}
     };
\end{tikzpicture}
\end{center}

\noindent We are often interested in these parse-trees since
they encode the structure of how a string is derived by a
grammar. 

Before we come to the problem of constructing such parse-trees, we need
to consider the following two properties of grammars. A grammar is
\emph{left-recursive} if there is a derivation starting from a
non-terminal, say \meta{NT} which leads to a string which again starts
with \meta{NT}. This means a derivation of the form.

\begin{center}
$\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
\end{center}

\noindent It can be easily seen that the grammar above for arithmetic
expressions is left-recursive: for example the rules $\meta{E}
\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
note that left-recursiveness can involve more than one step in the
derivation. The problem with left-recursive grammars is that some
algorithms cannot cope with them: with left-recursive grammars they will
fall into a loop. Fortunately every left-recursive grammar can be
transformed into one that is not left-recursive, although this
transformation might make the grammar less ``human-readable''. For
example if we want to give a non-left-recursive grammar for numbers we
might specify

\begin{center}
$\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
\end{center}

\noindent Using this grammar we can still derive every number
string, but we will never be able to derive a string of the
form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.

The other property we have to watch out for is when a grammar
is \emph{ambiguous}. A grammar is said to be ambiguous if
there are two parse-trees for one string. Again the grammar
for arithmetic expressions shown above is ambiguous. While the
shown parse tree for the string $(1 + 23) + 4$ is unique, this
is not the case in general. For example there are two parse
trees for the string $1 + 2 + 3$, namely

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[level distance=8mm, black]
  \node {\meta{E} }
    child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
    child {node {$+$}}
    child {node {\meta{E} }
       child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
       child {node {$+$}}
       child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
    }
    ;
\end{tikzpicture} 
&
\begin{tikzpicture}[level distance=8mm, black]
  \node {\meta{E} }
    child {node {\meta{E} }
       child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
       child {node {$+$}}
       child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} 
    }
    child {node {$+$}}
    child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
    ;
\end{tikzpicture}
\end{tabular} 
\end{center}

\noindent In particular in programming languages we will try to avoid
ambiguous grammars because two different parse-trees for a string mean a
program can be interpreted in two different ways. In such cases we have
to somehow make sure the two different ways do not matter, or
disambiguate the grammar in some other way (for example making the $+$
left-associative). Unfortunately already the problem of deciding whether
a grammar is ambiguous or not is in general undecidable. But in simple
instance (the ones we deal with in this module) one can usually see when
a grammar is ambiguous.

\subsection*{Removing Left-Recursion}

Let us come back to the problem of left-recursion and consider the 
following grammar for binary numbers:

\begin{plstx}[margin=1cm]
: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
\end{plstx}

\noindent
It is clear that this grammar can create all binary numbers, but
it is also clear that this grammar is left-recursive. Giving this
grammar as is to parser combinators will result in an infinite 
loop. Fortunately, every left-recursive grammar can be translated
into one that is not left-recursive with the help of some
transformation rules. Suppose we identified the ``offensive'' 
rule, then we can separate the grammar into this offensive rule
and the ``rest'':

\begin{plstx}[margin=1cm]
  : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
  | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
\end{plstx}

\noindent
To make the idea of the transformation clearer, suppose the left-recursive
rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
followed by something called $\alpha$) and the ``rest'' is called $\beta$.
That means our grammar looks schematically as follows

\begin{plstx}[margin=1cm]
  : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
\end{plstx}

\noindent
To get rid of the left-recursion, we are required to introduce 
a new non-terminal, say $\meta{B'}$ and transform the rule
as follows:

\begin{plstx}[margin=1cm]
  : \meta{B} ::= \beta \cdot \meta{B'}\\
  : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
\end{plstx}

\noindent
In our example of binary numbers we would after the transformation 
end up with the rules

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
\end{plstx}

\noindent
A little thought should convince you that this grammar still derives
all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
can be $\epsilon$). Less clear might be why this grammar is non-left recursive.
For $\meta{B'}$ it is relatively clear because we will never be 
able to derive things like

\begin{center}
$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
\end{center}  

\noindent
because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
never be in the first place. The reasoning is similar for $\meta{B}$:
the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
left-recursive. This transformation does not mean the grammar is the
simplest left-recursive grammar for binary numbers. For example the
following grammar would do as well

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
\end{plstx}

\noindent
The point is that we can in principle transform every left-recursive
grammar into one that is non-left-recursive. This explains why often
the following grammar is used for arithmetic expressions:

\begin{plstx}[margin=1cm]
  : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
  : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
  : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
\end{plstx}

\noindent
In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and
$\meta{F}$actors are in some way protected from being
left-recursive. For example if you start $\meta{E}$ you can derive
another one by going through $\meta{T}$, then $\meta{F}$, but then
$\meta{E}$ is protected by the open-parenthesis in the last rule.

\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}

I showed above that the non-left-recursive grammar for binary numbers is

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
\end{plstx}

\noindent
The transformation made the original grammar non-left-recursive, but at
the expense of introducing an $\epsilon$ in the second rule. Having an
explicit $\epsilon$-rule is annoying, not in terms of looping, but in
terms of efficiency. The reason is that the $\epsilon$-rule always
applies but since it recognises the empty string, it does not make any
progress with recognising a string. Better are rules like $( \cdot
\meta{E} \cdot )$ where something of the input is consumed. Getting
rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
which can give us an insight into the complexity class of parsing.

It turns out we can also by some generic transformations eliminate
$\epsilon$-rules from grammars. Consider again the grammar above for
binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
we look for rules of the (generic) form \mbox{$\meta{A} :=
\alpha\cdot\meta{B'}\cdot\beta$}. That is, there are rules that use
$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
something follows ($\beta$). Such rules need to be replaced by
additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
In our running example there are the two rules for $\meta{B}$ which
fall into this category

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
\end{plstx} 

\noindent To follow the general scheme of the transformation,
the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens
to be empty. So we need to generate new rules for the form 
\mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
example means we obtain

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
\end{plstx} 

\noindent
Unfortunately $\meta{B'}$ is also used in the rule 

\begin{plstx}[margin=1cm]
  : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
\end{plstx}

\noindent
For this we repeat the transformation, giving 

\begin{plstx}[margin=1cm]
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
\end{plstx}

\noindent
In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
was again empty. Once no rule is left over, we can simply throw
away the $\epsilon$ rule.  This gives the grammar

\begin{plstx}[margin=1cm]
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
\end{plstx}

\noindent
I let you think about whether this grammar can still recognise all
binary numbers and whether this grammar is non-left-recursive. The
precise statement for the transformation of removing $\epsilon$-rules
is that if the original grammar was able to recognise only non-empty
strings, then the transformed grammar will be equivalent (matching the
same set of strings); if the original grammar was able to match the
empty string, then the transformed grammar will be able to match the
same strings, \emph{except} the empty string. So the
$\epsilon$-removal does not preserve equivalence of grammars in
general, but the small defect with the empty string is not important
for practical purposes.

So why are these transformations all useful? Well apart from making the 
parser combinators work (remember they cannot deal with left-recursion and
are inefficient with $\epsilon$-rules), a second reason is that they help
with getting any insight into the complexity of the parsing problem. 
The parser combinators are very easy to implement, but are far from the 
most efficient way of processing input (they can blow up exponentially
with ambiguous grammars). The question remains what is the best possible
complexity for parsing? It turns out that this is $O(n^3)$ for context-free
languages. 

To answer the question about complexity, let me describe next the CYK
algorithm (named after the authors Cockeâ€“Youngerâ€“Kasami). This algorithm
works with grammars that are in \emph{Chomsky normalform}. In Chomsky
normalform all rules must be of the form $\meta{A} ::= a$, where $a$ is 
a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and
$\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$.
The following grammar is in Chomsky normalform:

\begin{plstx}[margin=1cm]
  : \meta{S} ::= \meta{N}\cdot \meta{P}\\
  : \meta{P} ::= \meta{V}\cdot \meta{N}\\
  : \meta{N} ::= \meta{N}\cdot \meta{N}\\
  : \meta{N} ::= \meta{A}\cdot \meta{N}\\
  : \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} 
                | \texttt{trains}\\
  : \meta{V} ::= \texttt{trains} | \texttt{team}\\
  : \meta{A} ::= \texttt{The} | \texttt{the}\\
\end{plstx}

\noindent
where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,
$\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''
are terminals. The rough idea behind this grammar is that $\meta{S}$
stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun
and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot
\meta{N}$} states that a predicate can be a verb followed by a noun.
Now the question is whether the string 

\begin{center}
  \texttt{The trainer trains the student team}
\end{center}

\noindent
is recognised by the grammar. The CYK algorithm starts with the
following triangular data structure.

%%\begin{figure}[t]
\begin{center}
  \begin{tikzpicture}[scale=0.7,line width=0.8mm]
  \draw (-2,0) -- (4,0);
  \draw (-2,1) -- (4,1);
  \draw (-2,2) -- (3,2);
  \draw (-2,3) -- (2,3);
  \draw (-2,4) -- (1,4);
  \draw (-2,5) -- (0,5);
  \draw (-2,6) -- (-1,6);
  
  \draw (0,0) -- (0, 5);
  \draw (1,0) -- (1, 4);
  \draw (2,0) -- (2, 3);
  \draw (3,0) -- (3, 2);
  \draw (4,0) -- (4, 1);
  \draw (-1,0) -- (-1, 6);
  \draw (-2,0) -- (-2, 6);
  
  \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
  \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
  \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
  \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
  \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
  \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
  
  \draw (-1.5,0.5) node {$A$}; 
  \draw (-0.5,0.5) node {$N$}; 
  \draw ( 0.5,0.5) node {$N,V$}; 
  \draw ( 1.5,0.5) node {$A$}; 
  \draw ( 2.5,0.5) node {$N$}; 
  \draw ( 3.5,0.5) node {$N,V$};

%  \draw (-1.5,1.5) node {\small{}a}; 
%  \draw (-0.5,1.5) node {\small{}b}; 
%  \draw ( 0.5,1.5) node {\small{}c}; 
%  \draw ( 1.5,1.5) node {\small{}d}; 
%  \draw ( 2.5,1.5) node {\small{}e}; 
  
  \draw (-2.4, 5.5) node {$1$}; 
  \draw (-2.4, 4.5) node {$2$}; 
  \draw (-2.4, 3.5) node {$3$}; 
  \draw (-2.4, 2.5) node {$4$}; 
  \draw (-2.4, 1.5) node {$5$}; 
  \draw (-2.4, 0.5) node {$6$}; 
  \end{tikzpicture}
  \end{center}
%%\end{figure}


\noindent
The last row contains the information about all words and their
corresponding non-terminals. For example the field for \texttt{trains}
contains the information $\meta{N}$ and $\meta{V}$ because \texttt{trains} can be a
``verb'' and a ``noun'' according to the grammar.  The row above,
let's call the corresponding fields 5a to 5e, contains information
about \underline{2-word} parts of the sentence, namely

\begin{center}
\begin{tabular}{llll}
  5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
  5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
  5c) & \texttt{trains} $\mid$ \texttt{the}    \\
  5d) & \texttt{the} $\mid$ \texttt{student}   \\
  5e) & \texttt{student} $\mid$ \texttt{team}  \\
\end{tabular}  
\end{center}  

\noindent
For each of them, we look up in row 6 which non-terminals it belongs to
(indicated above for 5a and 5b). For 5a, with the non-terminals
\meta{A} and \meta{N}, we find the grammar rule

\begin{plstx}[margin=1cm]
  : \meta{N} ::= \meta{A}\cdot \meta{N}\\
\end{plstx}

\noindent
which means into field 5a we put the left-hand side of this rule,
which in this case is the non-terminal \meta{N}. For 5b we have to check
for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there
is a right-hand side of this form in the grammar. But only the
grammar rule 

\begin{plstx}[margin=1cm]
  : \meta{N} ::= \meta{N}\cdot \meta{N}\\
\end{plstx}

\noindent
matches, which means 5b gets also an \meta{N}. Continuing for all
fields in row 5 gives:

\begin{center}
  \begin{tikzpicture}[scale=0.7,line width=0.8mm]
  \draw (-2,0) -- (4,0);
  \draw (-2,1) -- (4,1);
  \draw (-2,2) -- (3,2);
  \draw (-2,3) -- (2,3);
  \draw (-2,4) -- (1,4);
  \draw (-2,5) -- (0,5);
  \draw (-2,6) -- (-1,6);
  
  \draw (0,0) -- (0, 5);
  \draw (1,0) -- (1, 4);
  \draw (2,0) -- (2, 3);
  \draw (3,0) -- (3, 2);
  \draw (4,0) -- (4, 1);
  \draw (-1,0) -- (-1, 6);
  \draw (-2,0) -- (-2, 6);
  
  \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
  \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
  \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
  \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
  \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
  \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
  
  \draw (-1.5,0.5) node {$A$}; 
  \draw (-0.5,0.5) node {$N$}; 
  \draw ( 0.5,0.5) node {$N,V$}; 
  \draw ( 1.5,0.5) node {$A$}; 
  \draw ( 2.5,0.5) node {$N$}; 
  \draw ( 3.5,0.5) node {$N,V$};

  \draw (-1.5,1.5) node {$N$}; 
  \draw (-0.5,1.5) node {$N$}; 
  \draw ( 0.5,1.5) node {$$}; 
  \draw ( 1.5,1.5) node {$N$}; 
  \draw ( 2.5,1.5) node {$N$}; 


%  \draw (-1.5,1.5) node {\small{}a}; 
%  \draw (-0.5,1.5) node {\small{}b}; 
%  \draw ( 0.5,1.5) node {\small{}c}; 
%  \draw ( 1.5,1.5) node {\small{}d}; 
%  \draw ( 2.5,1.5) node {\small{}e}; 
  
  \draw (-2.4, 5.5) node {$1$}; 
  \draw (-2.4, 4.5) node {$2$}; 
  \draw (-2.4, 3.5) node {$3$}; 
  \draw (-2.4, 2.5) node {$4$}; 
  \draw (-2.4, 1.5) node {$5$}; 
  \draw (-2.4, 0.5) node {$6$}; 
  \end{tikzpicture}
  \end{center}

\noindent
Now row 4 is in charge of all 3-word parts of the sentence, namely

\begin{center}
\begin{tabular}{llll}
  4a) & The $\mid$ trainer trains\\
      & The trainer $\mid$ trains\\
  4b) & trainer $\mid$ trains the\\
      & trainer trains $\mid$ the\\
  4c) & trains $\mid$ the student\\
      & trains the $\mid$ student\\
  4d) & the $\mid$ student team\\
      & the student $\mid$ team\\
\end{tabular}  
\end{center}

\noindent
Note that in case of 3-word parts we have two splits. For example for
4a: $\underbrace{\texttt{The}}_{A}$ and
$\underbrace{\texttt{trainer trains}}_{N}$; and also
$\underbrace{\texttt{The trainer}}_{N}$ and
$\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we have
to look up in the rows below, which non-terminals we already
computed. This allows us to look for right-hand sides in our grammar:
$\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and
$\meta{N}\cdot\meta{V}$, which yield the only left-hand side
\meta{N}. This is what we fill in for 4a. And so on for row 4.

Row 3 is about all 4-word parts in the sentence, namely

\begin{center}
\begin{tabular}{llll}
  3a) & The trainer trains the\\
  3b) & trainer trains the student\\
  3c) & trains the student team\\
\end{tabular}  
\end{center}

\noindent
Each of them can be split up in 3 ways, for example

\begin{center}
\begin{tabular}{llll}
  3a) & The $\mid$ trainer trains the\\
      & The trainer $\mid$ trains the\\
      & The trainer trains $\mid$ the\\
\end{tabular}  
\end{center}

\noindent
and we have to consider them all in turn to fill in the non-terminals for
3a. You can guess how it continues: row 2 is for all 5-word parts, and finally
the field on the top is for the whole sentence.
The idea of the CYK algorithm is that if in the top-field the starting
symbol \meta{S} appears (possibly together with other non-terminals),
then the sentence is accepted by the grammar. If it does not, then the
sentence is not accepted.

Let us very quickly calculate the complexity of the CYK
algorithm. Lookup operations inside the triangle and in the grammar
are assumed to be of constant time, $O(1)$, meaning they do not
matter. How many fields are in the triangle\ldots
$\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That means
roughly $O(n^2)$ fields. How much work do we have to do for each
field? Well, for the top-most we have to consider $n-1$ splits, which
means roughly $O(n)$ for each field. The overall result is a $O(n^3)$
time-complexity for CYK. It turns out that this is the best worst-time
complexity for parsing with context-free grammars.

\end{document}


%%% Parser combinators are now part of handout 6

\subsection*{Parser Combinators}

Let us now turn to the problem of generating a parse-tree for
a grammar and string. In what follows we explain \emph{parser
combinators}, because they are easy to implement and closely
resemble grammar rules. Imagine that a grammar describes the
strings of natural numbers, such as the grammar \meta{N}  shown
above. For all such strings we want to generate the
parse-trees or later on we actually want to extract the
meaning of these strings, that is the concrete integers
``behind'' these strings. In Scala the parser combinators will
be functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent that is they take as input something of type
\texttt{I}, typically a list of tokens or a string, and return
a set of pairs. The first component of these pairs corresponds
to what the parser combinator was able to process from the
input and the second is the unprocessed part of the input. As
we shall see shortly, a parser combinator might return more
than one such pair, with the idea that there are potentially
several ways how to interpret the input. As a concrete
example, consider the case where the input is of type string,
say the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent We might have a parser combinator which tries to
interpret this string as a keyword (\texttt{if}) or an
identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
           \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest as
`unprocessed' as the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input then parser combinators just
return the empty set $\{\}$. This will indicate
something ``went wrong''.

The main attraction is that we can easily build parser combinators out of smaller components
following very closely the structure of a grammar. In order to implement this in an object
oriented programming language, like Scala, we need to specify an abstract class for parser 
combinators. This abstract class requires the implementation of the function
\texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
\mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
      yield head
}
\end{lstlisting}
\end{center}

\noindent
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
which just filters out all pairs whose second component is not empty (that is has still some
unprocessed part). The reason is that at the end of parsing we are only interested in the
results where all the input has been consumed and no unprocessed part is left.

One of the simplest parser combinators recognises just a character, say $c$, 
from the beginning of strings. Its behaviour is as follows:

\begin{itemize}
\item if the head of the input string starts with a $c$, it returns 
	the set $\{(c, \textit{tail of}\; s)\}$
\item otherwise it returns the empty set $\varnothing$	
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) = 
    if (sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent
The \texttt{parse} function tests whether the first character of the 
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
string into the recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
the parser returns the empty set (in Scala \texttt{Set()}).

More interesting are the parser combinators that build larger parsers
out of smaller component parsers. For example the alternative 
parser combinator is as follows.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T], 
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent
The types of this parser combinator are polymorphic (we just have \texttt{I}
for the input type, and \texttt{T} for the output type). The alternative parser
builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and return the same
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
evaluation of the arguments before they are used. This is often called 
\emph{lazy evaluation} of the arguments.) The alternative parser should run
the input with the first parser \texttt{p} (producing a set of outputs) and then
run the same input with \texttt{q}. The result should be then just the union
of both sets, which is the operation \texttt{++} in Scala.

This parser combinator already allows us to construct a parser that either 
a character \texttt{a} or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
We can call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

\noindent
We receive in the first two cases a successful output (that is a non-empty set).

A bit more interesting is the \emph{sequence parser combinator} implemented in
Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T], 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) = 
    for ((head1, tail1) <- p.parse(sb); 
         (head2, tail2) <- q.parse(tail1)) 
            yield ((head1, head2), tail2)
}
\end{lstlisting}
\end{center}

\noindent
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
Let \texttt{q} run on these unprocessed parts
producing again a set of pairs. The output of the sequence parser combinator is then a set
containing pairs where the first components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component is the unprocessed
part left over after running the second parser \texttt{q}. Therefore the input type of
the sequence parser combinator is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some
shorthand notation for the sequence parser combinator. So we can write for 
example \texttt{'a'  $\sim$ 'b'}, which is the
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
Calling this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

\noindent
A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}

Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
The first parser has as output type a single character (recall the type of \texttt{CharParser}),
but the second parser produces a pair of characters as output. The alternative parser is however
required to have both component parsers to have the same type. We will see later how we can 
build this parser without the typing error.

The next parser combinator does not actually combine smaller parsers, but applies
a function to the result of the parser. It is implemented in Scala as follows

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
         (p: => Parser[I, T], 
          f: T => S) extends Parser[I, S] {
  def parse(sb: I) = 
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}


\noindent
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
applies the function \texttt{f} to all the parer outputs. Since this function
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
Again Scala lets us introduce some shorthand notation for this parser combinator. 
Therefore we will write \texttt{p ==> f} for it.

%\bigskip
%takes advantage of the full generality---have a look
%what it produces if we call it with the string \texttt{abc}
%
%\begin{center}
%\begin{tabular}{rcl}
%input string & & output\medskip\\
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
%\end{tabular}
%\end{center}






%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End:
author	Christian Urban <christian.urban@kcl.ac.uk>
	Fri, 13 Oct 2023 15:07:37 +0100 (16 months ago)
changeset 941	66adcae6c762
parent 937	dc5ab66b11cc
permissions	-rw-r--r--