diff -r 8fc109f36b78 -r eecc4d5a2172 handouts/ho05.tex
--- a/handouts/ho05.tex	Fri Nov 01 13:21:51 2019 +0000
+++ b/handouts/ho05.tex	Wed Nov 06 15:04:40 2019 +0000
@@ -14,15 +14,16 @@
 
 \section*{Handout 5 (Grammars \& Parser)}
 
-While regular expressions are very useful for lexing and for
-recognising many patterns in strings (like email addresses),
-they have their limitations. For example there is no regular
-expression that can recognise the language $a^nb^n$. Another
-example for which there exists no regular expression is the
-language of well-parenthesised expressions. In languages like
-Lisp, which use parentheses rather extensively, it might be of
-interest to know whether the following two expressions are
-well-parenthesised or not (the left one is, the right one is not):
+While regular expressions are very useful for lexing and for recognising
+many patterns in strings (like email addresses), they have their
+limitations. For example there is no regular expression that can
+recognise the language $a^nb^n$ (where you have strings with $n$ $a$'s
+followed by the same amount of $b$'s). Another example for which there
+exists no regular expression is the language of well-parenthesised
+expressions. In languages like Lisp, which use parentheses rather
+extensively, it might be of interest to know whether the following two
+expressions are well-parenthesised or not (the left one is, the right
+one is not):
 
 \begin{center}
 $(((()()))())$  \hspace{10mm} $(((()()))()))$
@@ -52,7 +53,7 @@
 
 \noindent Each ``bubble'' stands for sets of languages (remember
 languages are sets of strings). As indicated the set of regular
-languages are fully included inside the context-free languages,
+languages is fully included inside the context-free languages,
 meaning every regular language is also context-free, but not vice
 versa. Below I will let you think, for example, what the context-free
 grammar is for the language corresponding to the regular expression
@@ -61,15 +62,15 @@
 Because of their convenience, context-free languages play an important
 role in `day-to-day' text processing and in programming
 languages. Context-free in this setting means that ``words'' have one
-meaning only and this meaning is independent from in which context
-the ``words'' appear. For example ambiguity issues like
+meaning only and this meaning is independent from the context
+the ``words'' appear in. For example ambiguity issues like
 
 \begin{center}
 \tt Time flies like an arrow; fruit flies like bananas.
 \end{center}  
 
 \noindent
-from natural languages were the meaning of \emph{flies} depend on the
+from natural languages were the meaning of \emph{flies} depends on the
 surrounding \emph{context} are avoided as much as possible.
 
 Context-free languages are usually specified by grammars. For example
@@ -162,9 +163,10 @@
 \end{center}
 
 \noindent
-A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
-top and each non-terminal containing a subtree for how it is replaced in a derivation.
-The parse tree for the string $(1 + 23)+4$ is as follows:
+A \emph{parse-tree} encodes how a string is derived with the starting
+symbol on top and each non-terminal containing a subtree for how it is
+replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is
+as follows:
 
 \begin{center}
 \begin{tikzpicture}[level distance=8mm, black]
@@ -190,30 +192,30 @@
 
 \noindent We are often interested in these parse-trees since
 they encode the structure of how a string is derived by a
-grammar. Before we come to the problem of constructing such
-parse-trees, we need to consider the following two properties
-of grammars. A grammar is \emph{left-recursive} if there is a
-derivation starting from a non-terminal, say \meta{NT} which leads
-to a string which again starts with \meta{NT}. This means a
-derivation of the form.
+grammar. 
+
+Before we come to the problem of constructing such parse-trees, we need
+to consider the following two properties of grammars. A grammar is
+\emph{left-recursive} if there is a derivation starting from a
+non-terminal, say \meta{NT} which leads to a string which again starts
+with \meta{NT}. This means a derivation of the form.
 
 \begin{center}
 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
 \end{center}
 
-\noindent It can be easily seen that the grammar above for
-arithmetic expressions is left-recursive: for example the
-rules $\meta{E} \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and 
-$\meta{N} \rightarrow \meta{N}\cdot \meta{N}$ show that this 
-grammar is left-recursive. But note
-that left-recursiveness can involve more than one step in the
-derivation. The problem with left-recursive grammars is that
-some algorithms cannot cope with them: they fall into a loop.
-Fortunately every left-recursive grammar can be transformed
-into one that is not left-recursive, although this
-transformation might make the grammar less ``human-readable''.
-For example if we want to give a non-left-recursive grammar
-for numbers we might specify
+\noindent It can be easily seen that the grammar above for arithmetic
+expressions is left-recursive: for example the rules $\meta{E}
+\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
+\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
+note that left-recursiveness can involve more than one step in the
+derivation. The problem with left-recursive grammars is that some
+algorithms cannot cope with them: with left-recursive grammars they will
+fall into a loop. Fortunately every left-recursive grammar can be
+transformed into one that is not left-recursive, although this
+transformation might make the grammar less ``human-readable''. For
+example if we want to give a non-left-recursive grammar for numbers we
+might specify
 
 \begin{center}
 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
@@ -260,17 +262,211 @@
 \end{tabular} 
 \end{center}
 
-\noindent In particular in programming languages we will try
-to avoid ambiguous grammars because two different parse-trees
-for a string mean a program can be interpreted in two
-different ways. In such cases we have to somehow make sure the
-two different ways do not matter, or disambiguate the grammar
-in some other way (for example making the $+$
-left-associative). Unfortunately already the problem of
-deciding whether a grammar is ambiguous or not is in general
-undecidable. But in simple instance (the ones we deal in this
-module) one can usually see when a grammar is ambiguous.
+\noindent In particular in programming languages we will try to avoid
+ambiguous grammars because two different parse-trees for a string mean a
+program can be interpreted in two different ways. In such cases we have
+to somehow make sure the two different ways do not matter, or
+disambiguate the grammar in some other way (for example making the $+$
+left-associative). Unfortunately already the problem of deciding whether
+a grammar is ambiguous or not is in general undecidable. But in simple
+instance (the ones we deal with in this module) one can usually see when
+a grammar is ambiguous.
+
+\subsection*{Removing Left-Recursion}
+
+Let us come back to the problem of left-recursion and consider the 
+following grammar for binary numbers:
+
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
+\end{plstx}
+
+\noindent
+It is clear that this grammar can create all binary numbers, but
+it is also clear that this grammar is left-recursive. Giving this
+grammar as is to parser combinators will result in an infinite 
+loop. Fortunately, every left-recursive grammar can be translated
+into one that is not left-recursive with the help of some
+transformation rules. Suppose we identified the ``offensive'' 
+rule, then we can separate the grammar into this offensive rule
+and the ``rest'':
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
+  | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
+\end{plstx}
+
+\noindent
+To make the idea of the transformation clearer, suppose the left-recursive
+rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
+followed by something called $\alpha$) and the ``rest'' is called $\beta$.
+That means our grammar looks schematically as follows
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
+\end{plstx}
+
+\noindent
+To get rid of the left-recursion, we are required to introduce 
+a new non-terminal, say $\meta{B'}$ and transform the rule
+as follows:
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= \beta \cdot \meta{B'}\\
+  : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
+\end{plstx}
+
+\noindent
+In our example of binary numbers we would after the transformation 
+end up with the rules
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
+\end{plstx}
+
+\noindent
+A little thought should convince you that this grammar still derives
+all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
+can be $\epsilon$). Less clear might be why this grammar is non-left recursive.
+For $\meta{B'}$ it is relatively clear because we will never be 
+able to derive things like
+
+\begin{center}
+$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
+\end{center}  
+
+\noindent
+because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
+$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
+never be in the first place. The reasoning is similar for $\meta{B}$:
+the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
+left-recursive. This transformation does not mean the grammar is the
+simplest left-recursive grammar for binary numbers. For example the
+following grammar would do as well
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
+\end{plstx}
+
+\noindent
+The point is that we can in principle transform every left-recursive
+grammar into one that is non-left-recursive one. This explains why often
+the following grammar is used for arithmetic expressions:
+
+\begin{plstx}[margin=1cm]
+  : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
+  : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
+  : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
+\end{plstx}
 
+\noindent
+In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
+are in some way protected from being left-recusive. For example if you
+start $\meta{E}$ you can derive another one by going through $\meta{T}$, then 
+$\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
+
+\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
+
+I showed above that the non-left-recursive grammar for binary numbers is
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
+\end{plstx}
+
+\noindent
+The transformation made the original grammar non-left-recursive, but at
+the expense of introducing an $\epsilon$ in the second rule. Having an
+explicit $\epsilon$-rule is annoying to, not in terms of looping, but in
+terms of efficiency. The reason is that the $\epsilon$-rule always
+applies but since it recognises the empty string, it does not make any
+progress with recognising a string. Better are rules like $( \cdot
+\meta{E} \cdot )$ where something of the input is consumed. Getting
+rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
+which can give us an insight into the complexity class of parsing.
+
+It turns out we can also by some generic transformations eliminate
+$\epsilon$-rules from grammars. Consider again the grammar above for
+binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
+we look for rules of the (generic) form \mbox{$\meta{A} :=
+\alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use
+$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
+something follows ($\beta$). Such rules need to be replaced by
+additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
+In our running example there are the two rules for $\meta{B}$ which
+fall into this category
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+\end{plstx} 
+
+\noindent To follow the general scheme of the transfromation,
+the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens
+to be empty. SO we need to generate new rules for the form 
+\mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
+example means we obtain
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
+\end{plstx} 
+
+\noindent
+Unfortunately $\meta{B'}$ is also used in the rule 
+
+\begin{plstx}[margin=1cm]
+  : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
+\end{plstx}
+
+\noindent
+For this we repeat the transformation, giving 
+
+\begin{plstx}[margin=1cm]
+  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
+\end{plstx}
+
+\noindent
+In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
+was again empty. Once no rule is left over, we can simply throw
+away the $\epsilon$ rule.  This gives the grammar
+
+\begin{plstx}[margin=1cm]
+  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
+  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
+\end{plstx}
+
+\noindent
+I let you think about whether this grammar can still recognise all
+binary numbers and whether this grammar is non-left-recursive. The
+precise statement for the transformation of removing $\epsilon$-rules is
+that if the original grammar was able to recognise only non-empty
+strings, then the transformed grammar will be equivalent (matching the
+same set of strings); if the original grammar was able to match the
+empty string, then the transformed grammar will be able to match the
+same strings, \emph{except} the empty string. So the  $\epsilon$-removal
+does not preserve equivalence of grammars, but the small defect with the
+empty string is not important for practical purposes.
+
+So why are these transformations all useful? Well apart from making the 
+parser combinators work (remember they cannot deal with left-recursion and
+are inefficient with $\epsilon$-rules), a second reason is that they help
+with getting any insight into the complexity of the parsing problem. 
+The parser combinators are very easy to implement, but are far from the 
+most efficient way of processing input (they can blow up exponentially
+with ambiguous grammars). The question remains what is the best possible
+complexity for parsing? It turns out that this is $O(n^3)$ for context-free
+languages. 
+
+To answer the question about complexity, let me describe next the CYK
+algorithm (named after the authors Cocke–Younger–Kasami). This algorithm
+works with grammars that are in Chomsky normalform. 
+
+TBD
+
+\end{document}
+
+
+%%% Parser combinators are now part of handout 6
 
 \subsection*{Parser Combinators}
 
@@ -533,9 +729,11 @@
 
 
 
-\end{document}
+
+
 
 %%% Local Variables: 
 %%% mode: latex  
 %%% TeX-master: t
 %%% End: 
+