diff -r 8fc109f36b78 -r eecc4d5a2172 handouts/ho05.tex --- a/handouts/ho05.tex Fri Nov 01 13:21:51 2019 +0000 +++ b/handouts/ho05.tex Wed Nov 06 15:04:40 2019 +0000 @@ -14,15 +14,16 @@ \section*{Handout 5 (Grammars \& Parser)} -While regular expressions are very useful for lexing and for -recognising many patterns in strings (like email addresses), -they have their limitations. For example there is no regular -expression that can recognise the language $a^nb^n$. Another -example for which there exists no regular expression is the -language of well-parenthesised expressions. In languages like -Lisp, which use parentheses rather extensively, it might be of -interest to know whether the following two expressions are -well-parenthesised or not (the left one is, the right one is not): +While regular expressions are very useful for lexing and for recognising +many patterns in strings (like email addresses), they have their +limitations. For example there is no regular expression that can +recognise the language $a^nb^n$ (where you have strings with $n$ $a$'s +followed by the same amount of $b$'s). Another example for which there +exists no regular expression is the language of well-parenthesised +expressions. In languages like Lisp, which use parentheses rather +extensively, it might be of interest to know whether the following two +expressions are well-parenthesised or not (the left one is, the right +one is not): \begin{center} $(((()()))())$ \hspace{10mm} $(((()()))()))$ @@ -52,7 +53,7 @@ \noindent Each ``bubble'' stands for sets of languages (remember languages are sets of strings). As indicated the set of regular -languages are fully included inside the context-free languages, +languages is fully included inside the context-free languages, meaning every regular language is also context-free, but not vice versa. Below I will let you think, for example, what the context-free grammar is for the language corresponding to the regular expression @@ -61,15 +62,15 @@ Because of their convenience, context-free languages play an important role in `day-to-day' text processing and in programming languages. Context-free in this setting means that ``words'' have one -meaning only and this meaning is independent from in which context -the ``words'' appear. For example ambiguity issues like +meaning only and this meaning is independent from the context +the ``words'' appear in. For example ambiguity issues like \begin{center} \tt Time flies like an arrow; fruit flies like bananas. \end{center} \noindent -from natural languages were the meaning of \emph{flies} depend on the +from natural languages were the meaning of \emph{flies} depends on the surrounding \emph{context} are avoided as much as possible. Context-free languages are usually specified by grammars. For example @@ -162,9 +163,10 @@ \end{center} \noindent -A \emph{parse-tree} encodes how a string is derived with the starting symbol on -top and each non-terminal containing a subtree for how it is replaced in a derivation. -The parse tree for the string $(1 + 23)+4$ is as follows: +A \emph{parse-tree} encodes how a string is derived with the starting +symbol on top and each non-terminal containing a subtree for how it is +replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is +as follows: \begin{center} \begin{tikzpicture}[level distance=8mm, black] @@ -190,30 +192,30 @@ \noindent We are often interested in these parse-trees since they encode the structure of how a string is derived by a -grammar. Before we come to the problem of constructing such -parse-trees, we need to consider the following two properties -of grammars. A grammar is \emph{left-recursive} if there is a -derivation starting from a non-terminal, say \meta{NT} which leads -to a string which again starts with \meta{NT}. This means a -derivation of the form. +grammar. + +Before we come to the problem of constructing such parse-trees, we need +to consider the following two properties of grammars. A grammar is +\emph{left-recursive} if there is a derivation starting from a +non-terminal, say \meta{NT} which leads to a string which again starts +with \meta{NT}. This means a derivation of the form. \begin{center} $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$ \end{center} -\noindent It can be easily seen that the grammar above for -arithmetic expressions is left-recursive: for example the -rules $\meta{E} \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and -$\meta{N} \rightarrow \meta{N}\cdot \meta{N}$ show that this -grammar is left-recursive. But note -that left-recursiveness can involve more than one step in the -derivation. The problem with left-recursive grammars is that -some algorithms cannot cope with them: they fall into a loop. -Fortunately every left-recursive grammar can be transformed -into one that is not left-recursive, although this -transformation might make the grammar less ``human-readable''. -For example if we want to give a non-left-recursive grammar -for numbers we might specify +\noindent It can be easily seen that the grammar above for arithmetic +expressions is left-recursive: for example the rules $\meta{E} +\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow +\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But +note that left-recursiveness can involve more than one step in the +derivation. The problem with left-recursive grammars is that some +algorithms cannot cope with them: with left-recursive grammars they will +fall into a loop. Fortunately every left-recursive grammar can be +transformed into one that is not left-recursive, although this +transformation might make the grammar less ``human-readable''. For +example if we want to give a non-left-recursive grammar for numbers we +might specify \begin{center} $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\; @@ -260,17 +262,211 @@ \end{tabular} \end{center} -\noindent In particular in programming languages we will try -to avoid ambiguous grammars because two different parse-trees -for a string mean a program can be interpreted in two -different ways. In such cases we have to somehow make sure the -two different ways do not matter, or disambiguate the grammar -in some other way (for example making the $+$ -left-associative). Unfortunately already the problem of -deciding whether a grammar is ambiguous or not is in general -undecidable. But in simple instance (the ones we deal in this -module) one can usually see when a grammar is ambiguous. +\noindent In particular in programming languages we will try to avoid +ambiguous grammars because two different parse-trees for a string mean a +program can be interpreted in two different ways. In such cases we have +to somehow make sure the two different ways do not matter, or +disambiguate the grammar in some other way (for example making the $+$ +left-associative). Unfortunately already the problem of deciding whether +a grammar is ambiguous or not is in general undecidable. But in simple +instance (the ones we deal with in this module) one can usually see when +a grammar is ambiguous. + +\subsection*{Removing Left-Recursion} + +Let us come back to the problem of left-recursion and consider the +following grammar for binary numbers: + +\begin{plstx}[margin=1cm] +: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\ +\end{plstx} + +\noindent +It is clear that this grammar can create all binary numbers, but +it is also clear that this grammar is left-recursive. Giving this +grammar as is to parser combinators will result in an infinite +loop. Fortunately, every left-recursive grammar can be translated +into one that is not left-recursive with the help of some +transformation rules. Suppose we identified the ``offensive'' +rule, then we can separate the grammar into this offensive rule +and the ``rest'': + +\begin{plstx}[margin=1cm] + : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} + | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\ +\end{plstx} + +\noindent +To make the idea of the transformation clearer, suppose the left-recursive +rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal +followed by something called $\alpha$) and the ``rest'' is called $\beta$. +That means our grammar looks schematically as follows + +\begin{plstx}[margin=1cm] + : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\ +\end{plstx} + +\noindent +To get rid of the left-recursion, we are required to introduce +a new non-terminal, say $\meta{B'}$ and transform the rule +as follows: + +\begin{plstx}[margin=1cm] + : \meta{B} ::= \beta \cdot \meta{B'}\\ + : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\ +\end{plstx} + +\noindent +In our example of binary numbers we would after the transformation +end up with the rules + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ + : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ +\end{plstx} + +\noindent +A little thought should convince you that this grammar still derives +all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$ +can be $\epsilon$). Less clear might be why this grammar is non-left recursive. +For $\meta{B'}$ it is relatively clear because we will never be +able to derive things like + +\begin{center} +$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$ +\end{center} + +\noindent +because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and +$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can +never be in the first place. The reasoning is similar for $\meta{B}$: +the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming +left-recursive. This transformation does not mean the grammar is the +simplest left-recursive grammar for binary numbers. For example the +following grammar would do as well + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\ +\end{plstx} + +\noindent +The point is that we can in principle transform every left-recursive +grammar into one that is non-left-recursive one. This explains why often +the following grammar is used for arithmetic expressions: + +\begin{plstx}[margin=1cm] + : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E}\\ + : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\ + : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\ +\end{plstx} +\noindent +In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors +are in some way protected from being left-recusive. For example if you +start $\meta{E}$ you can derive another one by going through $\meta{T}$, then +$\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis. + +\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm} + +I showed above that the non-left-recursive grammar for binary numbers is + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ + : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ +\end{plstx} + +\noindent +The transformation made the original grammar non-left-recursive, but at +the expense of introducing an $\epsilon$ in the second rule. Having an +explicit $\epsilon$-rule is annoying to, not in terms of looping, but in +terms of efficiency. The reason is that the $\epsilon$-rule always +applies but since it recognises the empty string, it does not make any +progress with recognising a string. Better are rules like $( \cdot +\meta{E} \cdot )$ where something of the input is consumed. Getting +rid of $\epsilon$-rules is also important for the CYK parsing algorithm, +which can give us an insight into the complexity class of parsing. + +It turns out we can also by some generic transformations eliminate +$\epsilon$-rules from grammars. Consider again the grammar above for +binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case +we look for rules of the (generic) form \mbox{$\meta{A} := +\alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use +$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and +something follows ($\beta$). Such rules need to be replaced by +additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}. +In our running example there are the two rules for $\meta{B}$ which +fall into this category + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ +\end{plstx} + +\noindent To follow the general scheme of the transfromation, +the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens +to be empty. SO we need to generate new rules for the form +\mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular +example means we obtain + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ +\end{plstx} + +\noindent +Unfortunately $\meta{B'}$ is also used in the rule + +\begin{plstx}[margin=1cm] + : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\ +\end{plstx} + +\noindent +For this we repeat the transformation, giving + +\begin{plstx}[margin=1cm] + : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ +\end{plstx} + +\noindent +In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$ +was again empty. Once no rule is left over, we can simply throw +away the $\epsilon$ rule. This gives the grammar + +\begin{plstx}[margin=1cm] + : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ + : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ +\end{plstx} + +\noindent +I let you think about whether this grammar can still recognise all +binary numbers and whether this grammar is non-left-recursive. The +precise statement for the transformation of removing $\epsilon$-rules is +that if the original grammar was able to recognise only non-empty +strings, then the transformed grammar will be equivalent (matching the +same set of strings); if the original grammar was able to match the +empty string, then the transformed grammar will be able to match the +same strings, \emph{except} the empty string. So the $\epsilon$-removal +does not preserve equivalence of grammars, but the small defect with the +empty string is not important for practical purposes. + +So why are these transformations all useful? Well apart from making the +parser combinators work (remember they cannot deal with left-recursion and +are inefficient with $\epsilon$-rules), a second reason is that they help +with getting any insight into the complexity of the parsing problem. +The parser combinators are very easy to implement, but are far from the +most efficient way of processing input (they can blow up exponentially +with ambiguous grammars). The question remains what is the best possible +complexity for parsing? It turns out that this is $O(n^3)$ for context-free +languages. + +To answer the question about complexity, let me describe next the CYK +algorithm (named after the authors Cocke–Younger–Kasami). This algorithm +works with grammars that are in Chomsky normalform. + +TBD + +\end{document} + + +%%% Parser combinators are now part of handout 6 \subsection*{Parser Combinators} @@ -533,9 +729,11 @@ -\end{document} + + %%% Local Variables: %%% mode: latex %%% TeX-master: t %%% End: +