afl-material: comparison handouts/ho05.tex

-:8fc109f36b78
+:eecc4d5a2172
 \begin{document}
 \section*{Handout 5 (Grammars \& Parser)}
-While regular expressions are very useful for lexing and for
+While regular expressions are very useful for lexing and for recognising
-recognising many patterns in strings (like email addresses),
+many patterns in strings (like email addresses), they have their
-they have their limitations. For example there is no regular
+limitations. For example there is no regular expression that can
-expression that can recognise the language $a^nb^n$. Another
+recognise the language $a^nb^n$ (where you have strings with $n$ $a$'s
-example for which there exists no regular expression is the
+followed by the same number of $b$'s). Another example for which there
-language of well-parenthesised expressions. In languages like
+exists no regular expression is the language of well-parenthesised
-Lisp, which use parentheses rather extensively, it might be of
+expressions. In languages like Lisp, which use parentheses rather
-interest to know whether the following two expressions are
+extensively, it might be of interest to know whether the following two
-well-parenthesised or not (the left one is, the right one is not):
+expressions are well-parenthesised or not (the left one is, the right
+one is not):
 \begin{center}
 $(((()()))())$  \hspace{10mm} $(((()()))()))$
 \end{center}
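\noindent To see why such a language is out of reach for regular
expressions, note that a recogniser has to remember how many parentheses
are currently open, and this number is unbounded. The following Python
sketch illustrates the idea with a simple counter; the function name
\texttt{well\_parenthesised} is made up for this illustration and is
not part of the course material.

```python
def well_parenthesised(s):
    """Return True if every ')' closes an earlier '(' and none stays open."""
    depth = 0                      # number of currently open parentheses
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:          # a ')' without a matching '('
                return False
        else:
            return False           # only parentheses are allowed here
    return depth == 0              # nothing may remain open


print(well_parenthesised("(((()()))())"))    # the left example above
print(well_parenthesised("(((()()))()))"))   # the right example above
```

The counter plays the role of the unbounded memory that a finite automaton,
and therefore a regular expression, does not have.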
 \end{tikzpicture}
 \end{center}
 \noindent Each ``bubble'' stands for sets of languages (remember
 languages are sets of strings). As indicated the set of regular
-languages are fully included inside the context-free languages,
+languages is fully included inside the context-free languages,
 meaning every regular language is also context-free, but not vice
 versa. Below I will let you think, for example, what the context-free
 grammar is for the language corresponding to the regular expression
 $(aaa)^*a$.
 Because of their convenience, context-free languages play an important
 role in `day-to-day' text processing and in programming
 languages. Context-free in this setting means that ``words'' have one
-meaning only and this meaning is independent from in which context
+meaning only and this meaning is independent from the context
-the ``words'' appear. For example ambiguity issues like
+the ``words'' appear in. For example ambiguity issues like
 \begin{center}
 \tt Time flies like an arrow; fruit flies like bananas.
 \end{center}
 \noindent
-from natural languages were the meaning of \emph{flies} depend on the
+from natural languages where the meaning of \emph{flies} depends on the
 surrounding \emph{context} are avoided as much as possible.
 Context-free languages are usually specified by grammars. For example
 a grammar for well-parenthesised expressions can be given as follows:
 \begin{center}
 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
 \end{center}
 \noindent
-A \emph{parse-tree} encodes how a string is derived with the starting symbol on
+A \emph{parse-tree} encodes how a string is derived with the starting
-top and each non-terminal containing a subtree for how it is replaced in a derivation.
+symbol on top and each non-terminal containing a subtree for how it is
-The parse tree for the string $(1 + 23)+4$ is as follows:
+replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is
+as follows:
 \begin{center}
 \begin{tikzpicture}[level distance=8mm, black]
 \node {\meta{E}}
 child {node {\meta{E} }
 \end{tikzpicture}
 \end{center}
 \noindent We are often interested in these parse-trees since
 they encode the structure of how a string is derived by a
-grammar. Before we come to the problem of constructing such
+grammar.
-parse-trees, we need to consider the following two properties
-of grammars. A grammar is \emph{left-recursive} if there is a
+Before we come to the problem of constructing such parse-trees, we need
-derivation starting from a non-terminal, say \meta{NT} which leads
+to consider the following two properties of grammars. A grammar is
-to a string which again starts with \meta{NT}. This means a
+\emph{left-recursive} if there is a derivation starting from a
-derivation of the form.
+non-terminal, say \meta{NT}, which leads to a string which again starts
+with \meta{NT}. This means a derivation of the form
 \begin{center}
 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
 \end{center}
-\noindent It can be easily seen that the grammar above for
+\noindent It can be easily seen that the grammar above for arithmetic
-arithmetic expressions is left-recursive: for example the
+expressions is left-recursive: for example the rules $\meta{E}
-rules $\meta{E} \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and
+\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
-$\meta{N} \rightarrow \meta{N}\cdot \meta{N}$ show that this
+\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
-grammar is left-recursive. But note
+note that left-recursiveness can involve more than one step in the
-that left-recursiveness can involve more than one step in the
+derivation. The problem with left-recursive grammars is that some
-derivation. The problem with left-recursive grammars is that
+algorithms cannot cope with them: with left-recursive grammars they will
-some algorithms cannot cope with them: they fall into a loop.
+fall into a loop. Fortunately every left-recursive grammar can be
-Fortunately every left-recursive grammar can be transformed
+transformed into one that is not left-recursive, although this
-into one that is not left-recursive, although this
+transformation might make the grammar less ``human-readable''. For
-transformation might make the grammar less ``human-readable''.
+example if we want to give a non-left-recursive grammar for numbers we
-For example if we want to give a non-left-recursive grammar
+might specify
-for numbers we might specify
 \begin{center}
 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
 \end{center}
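\noindent As a small illustration that uses only the rule above, the
number $23$ can be derived in two steps, and in each step the leftmost
symbol produced is a digit rather than $\meta{N}$:
\begin{center}
$\meta{N} \;\rightarrow\; 2\cdot \meta{N} \;\rightarrow\; 2\cdot 3$
\end{center}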
 ;
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\noindent In particular in programming languages we will try
+\noindent In particular in programming languages we will try to avoid
-to avoid ambiguous grammars because two different parse-trees
+ambiguous grammars because two different parse-trees for a string mean a
-for a string mean a program can be interpreted in two
+program can be interpreted in two different ways. In such cases we have
-different ways. In such cases we have to somehow make sure the
+to somehow make sure the two different ways do not matter, or
-two different ways do not matter, or disambiguate the grammar
+disambiguate the grammar in some other way (for example making the $+$
-in some other way (for example making the $+$
+left-associative). Unfortunately already the problem of deciding whether
-left-associative). Unfortunately already the problem of
+a grammar is ambiguous or not is in general undecidable. But in simple
-deciding whether a grammar is ambiguous or not is in general
+instances (the ones we deal with in this module) one can usually see when
-undecidable. But in simple instance (the ones we deal in this
+a grammar is ambiguous.
-module) one can usually see when a grammar is ambiguous.
+\subsection*{Removing Left-Recursion}
+Let us come back to the problem of left-recursion and consider the
+following grammar for binary numbers:
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
+\end{plstx}
+\noindent
+It is clear that this grammar can create all binary numbers, but
+it is also clear that this grammar is left-recursive. Giving this
+grammar as is to parser combinators will result in an infinite
+loop. Fortunately, every left-recursive grammar can be translated
+into one that is not left-recursive with the help of some
+transformation rules. Suppose we identified the ``offensive''
+rule, then we can separate the grammar into this offensive rule
+and the ``rest'':
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}}
+| \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
+\end{plstx}
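\noindent The looping behaviour mentioned above can be observed with a
naive top-down recogniser that follows the left-recursive rule literally.
The Python sketch below is only an illustration (the function
\texttt{naive\_B} is hypothetical and much simpler than real parser
combinators): it returns the set of positions at which a $\meta{B}$
starting at position \texttt{i} can end.

```python
def naive_B(s, i):
    """Naive recogniser for  B ::= B . B | 0 | 1  (illustration only)."""
    ends = set()
    # alternative B . B: we first ask for a B at the SAME position i,
    # so the recursive call happens before any input is consumed
    for j in naive_B(s, i):
        ends |= naive_B(s, j)
    # alternatives 0 and 1: consume one character
    if i < len(s) and s[i] in ("0", "1"):
        ends.add(i + 1)
    return ends


try:
    naive_B("01", 0)
except RecursionError:
    print("gave up: the recogniser keeps calling itself at position 0")
```

Python stops the runaway recursion with a \texttt{RecursionError}; the
point is that the call for position $0$ immediately triggers the same call
again, so no progress is ever made.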
+\noindent
+To make the idea of the transformation clearer, suppose the left-recursive
+rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal
+followed by something called $\alpha$) and the ``rest'' is called $\beta$.
+That means our grammar looks schematically as follows
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
+\end{plstx}
+\noindent
+To get rid of the left-recursion, we are required to introduce
+a new non-terminal, say $\meta{B'}$ and transform the rule
+as follows:
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= \beta \cdot \meta{B'}\\
+: \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
+\end{plstx}
+\noindent
+In our example of binary numbers we would after the transformation
+end up with the rules
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
+\end{plstx}
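\noindent The schematic transformation can be written down as a short
program. The following Python sketch makes several assumptions that are
not in the handout: a grammar is a dictionary from non-terminals to lists
of alternatives, each alternative is a list of symbols, the empty list
stands for $\epsilon$, and the fresh non-terminal is named by appending a
prime. It only treats \emph{direct} left-recursion of a single
non-terminal.

```python
def remove_left_recursion(grammar, nt):
    """Remove direct left-recursion for the non-terminal nt.

    Turns   nt ::= nt alpha | beta
    into    nt ::= beta nt'   and   nt' ::= alpha nt' | epsilon.
    Assumes the name nt + "'" is not already used in the grammar.
    """
    new_nt = nt + "'"
    # alternatives of the form  nt . alpha  (keep only the alpha part)
    alphas = [alt[1:] for alt in grammar[nt] if alt[:1] == [nt]]
    # all remaining alternatives, the "rest" beta
    betas = [alt for alt in grammar[nt] if alt[:1] != [nt]]
    if not alphas:                 # nothing to do
        return dict(grammar)
    result = dict(grammar)
    result[nt] = [beta + [new_nt] for beta in betas]
    result[new_nt] = [alpha + [new_nt] for alpha in alphas] + [[]]
    return result


binary = {"B": [["B", "B"], ["0"], ["1"]]}
print(remove_left_recursion(binary, "B"))
```

Applied to the grammar for binary numbers, this yields the two rules for
$\meta{B}$ and $\meta{B'}$ shown above.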
+\noindent
+A little thought should convince you that this grammar still derives
+all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
+can be $\epsilon$). Less clear might be why this grammar is non-left-recursive.
+For $\meta{B'}$ it is relatively clear because we will never be
+able to derive things like
+\begin{center}
+$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
+\end{center}
+\noindent
+because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
+$\meta{B}$ now always has a $0$ or $1$ in front, so a $\meta{B'}$ can
+never be in the first place. The reasoning is similar for $\meta{B}$:
+the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
+left-recursive. The transformation does not necessarily produce the
+simplest non-left-recursive grammar for binary numbers. For example the
+following grammar would do as well
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
+\end{plstx}
+\noindent
+The point is that we can in principle transform every left-recursive
+grammar into one that is non-left-recursive. This explains why
+the following grammar is often used for arithmetic expressions:
+\begin{plstx}[margin=1cm]
+: \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
+: \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
+: \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
+\end{plstx}
+\noindent
+In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
+are in some way protected from being left-recursive. For example if you
+start with $\meta{E}$ you can derive another one by going through $\meta{T}$, then
+$\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
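\noindent Because every recursive path back to $\meta{E}$ first has to
consume an open-parenthesis, the three rules can be turned directly into
three mutually recursive functions. The Python sketch below is an
illustration only: the function names, the use of sets of end positions,
and the very simple tokeniser (every run of digits is taken to be a
\texttt{num\_token}) are assumptions of this example.

```python
import re


def parse_E(ts, i):
    """E ::= T | T + E | T - E ; returns all positions where E can end."""
    ends = set()
    for j in parse_T(ts, i):
        ends.add(j)                                   # E ::= T
        if j < len(ts) and ts[j] in ("+", "-"):
            ends |= parse_E(ts, j + 1)                # E ::= T + E | T - E
    return ends


def parse_T(ts, i):
    """T ::= F | F * T"""
    ends = set()
    for j in parse_F(ts, i):
        ends.add(j)                                   # T ::= F
        if j < len(ts) and ts[j] == "*":
            ends |= parse_T(ts, j + 1)                # T ::= F * T
    return ends


def parse_F(ts, i):
    """F ::= num_token | ( E )"""
    ends = set()
    if i < len(ts) and ts[i][0] in "0123456789":
        ends.add(i + 1)                               # F ::= num_token
    if i < len(ts) and ts[i] == "(":
        for j in parse_E(ts, i + 1):                  # F ::= ( E )
            if j < len(ts) and ts[j] == ")":
                ends.add(j + 1)
    return ends


def accepts(s):
    """True if the whole string s is derivable from E."""
    ts = re.findall(r"\d+|\S", s)     # numbers, or any single other symbol
    return len(ts) in parse_E(ts, 0)


print(accepts("(1 + 23) + 4"))
print(accepts("1 + * 2"))
```

Every recursive call either starts at a later position or happens after a
token has been consumed, which is why these functions terminate.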
+\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
+I showed above that the non-left-recursive grammar for binary numbers is
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
+\end{plstx}
+\noindent
+The transformation made the original grammar non-left-recursive, but at
+the expense of introducing an $\epsilon$ in the second rule. Having an
+explicit $\epsilon$-rule is annoying too, not in terms of looping, but in
+terms of efficiency. The reason is that the $\epsilon$-rule always
+applies but since it recognises the empty string, it does not make any
+progress with recognising a string. Better are rules like $( \cdot
+\meta{E} \cdot )$ where something of the input is consumed. Getting
+rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
+which can give us an insight into the complexity class of parsing.
+It turns out we can also by some generic transformations eliminate
+$\epsilon$-rules from grammars. Consider again the grammar above for
+binary numbers where we have a rule $\meta{B'} ::= \epsilon$. In this case
+we look for rules of the (generic) form \mbox{$\meta{A} ::=
+\alpha\cdot\meta{B'}\cdot\beta$}. That is, there are rules that use
+$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
+something follows ($\beta$). Such rules need to be supplemented by
+additional rules of the form \mbox{$\meta{A} ::= \alpha\cdot\beta$}.
+In our running example there are the two rules for $\meta{B}$ which
+fall into this category
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
+\end{plstx}
+\noindent To follow the general scheme of the transformation,
+the $\alpha$ is either $0$ or $1$, and the $\beta$ happens
+to be empty. So we need to generate new rules of the form
+\mbox{$\meta{A} ::= \alpha\cdot\beta$}, which in our particular
+example means we obtain
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
+\end{plstx}
+\noindent
+Unfortunately $\meta{B'}$ is also used in the rule
+\begin{plstx}[margin=1cm]
+: \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
+\end{plstx}
+\noindent
+For this we repeat the transformation, giving
+\begin{plstx}[margin=1cm]
+: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
+\end{plstx}
+\noindent
+In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
+was again empty. Once every rule that mentions $\meta{B'}$ has been
+treated, we can simply throw away the $\epsilon$-rule.  This gives the grammar
+\begin{plstx}[margin=1cm]
+: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
+: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
+\end{plstx}
+\noindent
+I let you think about whether this grammar can still recognise all
+binary numbers and whether this grammar is non-left-recursive. The
+precise statement for the transformation of removing $\epsilon$-rules is
+that if the original grammar was able to recognise only non-empty
+strings, then the transformed grammar will be equivalent (matching the
+same set of strings); if the original grammar was able to match the
+empty string, then the transformed grammar will be able to match the
+same strings, \emph{except} the empty string. So the  $\epsilon$-removal
+does not preserve equivalence of grammars, but the small defect with the
+empty string is not important for practical purposes.
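\noindent One way to gain confidence in such transformations is to
compare the strings the grammars generate up to some bounded length. The
Python sketch below does this for the three grammars for binary numbers.
The grammar representation (a dictionary of alternatives, with the empty
list standing for $\epsilon$) and the function name
\texttt{language\_up\_to} are assumptions of this example, and a bounded
check like this is of course only evidence, not a proof.

```python
def language_up_to(grammar, start, max_len):
    """All terminal strings of length <= max_len derivable from start."""
    langs = {nt: set() for nt in grammar}
    changed = True
    while changed:                       # iterate until nothing new appears
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                partial = {""}           # strings built from this alternative
                for sym in alt:
                    # non-terminals contribute what is known so far,
                    # terminals contribute themselves
                    choices = langs[sym] if sym in grammar else {sym}
                    partial = {p + c for p in partial for c in choices
                               if len(p) + len(c) <= max_len}
                if not partial <= langs[nt]:
                    langs[nt] |= partial
                    changed = True
    return langs[start]


original = {"B": [["B", "B"], ["0"], ["1"]]}
transformed = {"B": [["0", "B'"], ["1", "B'"]],
               "B'": [["B", "B'"], []]}
eps_free = {"B": [["0", "B'"], ["1", "B'"], ["0"], ["1"]],
            "B'": [["B", "B'"], ["B"]]}

reference = language_up_to(original, "B", 4)
print(language_up_to(transformed, "B", 4) == reference)
print(language_up_to(eps_free, "B", 4) == reference)
```

Since the sets only ever grow and contain strings of bounded length, the
loop terminates even for the left-recursive grammar.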
+So why are these transformations all useful? Well apart from making the
+parser combinators work (remember they cannot deal with left-recursion and
+are inefficient with $\epsilon$-rules), a second reason is that they help
+with getting some insight into the complexity of the parsing problem.
+The parser combinators are very easy to implement, but are far from the
+most efficient way of processing input (they can blow up exponentially
+with ambiguous grammars). The question remains how efficiently one can
+parse in general? It turns out that $O(n^3)$ can be achieved for all
+context-free languages.
+To answer the question about complexity, let me describe next the CYK
+algorithm (named after the authors Cocke--Younger--Kasami). This algorithm
+works with grammars that are in Chomsky normal form.
+TBD
+\end{document}
+%%% Parser combinators are now part of handout 6
 \subsection*{Parser Combinators}
 Let us now turn to the problem of generating a parse-tree for
 a grammar and string. In what follows we explain \emph{parser
 %\end{tabular}
 %\end{center}
-\end{document}
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t
 %%% End:

changeset 680	eecc4d5a2172
parent 665	6d74d2a0a4b0
child 681	7b7736bea3ca