handouts/ho05.tex
changeset 680 eecc4d5a2172
parent 665 6d74d2a0a4b0
child 681 7b7736bea3ca
equal deleted inserted replaced
679:8fc109f36b78 680:eecc4d5a2172
    12 
    12 
    13 \begin{document}
    13 \begin{document}
    14 
    14 
    15 \section*{Handout 5 (Grammars \& Parser)}
    15 \section*{Handout 5 (Grammars \& Parser)}
    16 
    16 
    17 While regular expressions are very useful for lexing and for
    17 While regular expressions are very useful for lexing and for recognising
    18 recognising many patterns in strings (like email addresses),
    18 many patterns in strings (like email addresses), they have their
    19 they have their limitations. For example there is no regular
    19 limitations. For example there is no regular expression that can
    20 expression that can recognise the language $a^nb^n$. Another
    20 recognise the language $a^nb^n$ (where you have strings with $n$ $a$'s
    21 example for which there exists no regular expression is the
    21 followed by the same amount of $b$'s). Another example for which there
    22 language of well-parenthesised expressions. In languages like
    22 exists no regular expression is the language of well-parenthesised
    23 Lisp, which use parentheses rather extensively, it might be of
    23 expressions. In languages like Lisp, which use parentheses rather
    24 interest to know whether the following two expressions are
    24 extensively, it might be of interest to know whether the following two
    25 well-parenthesised or not (the left one is, the right one is not):
    25 expressions are well-parenthesised or not (the left one is, the right
       
    26 one is not):
    26 
    27 
    27 \begin{center}
    28 \begin{center}
    28 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    29 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    29 \end{center}
    30 \end{center}
    30 
    31 
    50 \end{tikzpicture}
    51 \end{tikzpicture}
    51 \end{center}
    52 \end{center}
    52 
    53 
    53 \noindent Each ``bubble'' stands for sets of languages (remember
    54 \noindent Each ``bubble'' stands for sets of languages (remember
    54 languages are sets of strings). As indicated the set of regular
    55 languages are sets of strings). As indicated the set of regular
    55 languages are fully included inside the context-free languages,
    56 languages is fully included inside the context-free languages,
    56 meaning every regular language is also context-free, but not vice
    57 meaning every regular language is also context-free, but not vice
    57 versa. Below I will let you think, for example, what the context-free
    58 versa. Below I will let you think, for example, what the context-free
    58 grammar is for the language corresponding to the regular expression
    59 grammar is for the language corresponding to the regular expression
    59 $(aaa)^*a$.
    60 $(aaa)^*a$.
    60 
    61 
    61 Because of their convenience, context-free languages play an important
    62 Because of their convenience, context-free languages play an important
    62 role in `day-to-day' text processing and in programming
    63 role in `day-to-day' text processing and in programming
    63 languages. Context-free in this setting means that ``words'' have one
    64 languages. Context-free in this setting means that ``words'' have one
    64 meaning only and this meaning is independent from in which context
    65 meaning only and this meaning is independent from the context
    65 the ``words'' appear. For example ambiguity issues like
    66 the ``words'' appear in. For example ambiguity issues like
    66 
    67 
    67 \begin{center}
    68 \begin{center}
    68 \tt Time flies like an arrow; fruit flies like bananas.
    69 \tt Time flies like an arrow; fruit flies like bananas.
    69 \end{center}  
    70 \end{center}  
    70 
    71 
    71 \noindent
    72 \noindent
    72 from natural languages were the meaning of \emph{flies} depend on the
    73 from natural languages were the meaning of \emph{flies} depends on the
    73 surrounding \emph{context} are avoided as much as possible.
    74 surrounding \emph{context} are avoided as much as possible.
    74 
    75 
    75 Context-free languages are usually specified by grammars. For example
    76 Context-free languages are usually specified by grammars. For example
    76 a grammar for well-parenthesised expressions can be given as follows:
    77 a grammar for well-parenthesised expressions can be given as follows:
    77 
    78 
   160 \begin{center}
   161 \begin{center}
   161 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
   162 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
   162 \end{center}
   163 \end{center}
   163 
   164 
   164 \noindent
   165 \noindent
   165 A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
   166 A \emph{parse-tree} encodes how a string is derived with the starting
   166 top and each non-terminal containing a subtree for how it is replaced in a derivation.
   167 symbol on top and each non-terminal containing a subtree for how it is
   167 The parse tree for the string $(1 + 23)+4$ is as follows:
   168 replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is
       
   169 as follows:
   168 
   170 
   169 \begin{center}
   171 \begin{center}
   170 \begin{tikzpicture}[level distance=8mm, black]
   172 \begin{tikzpicture}[level distance=8mm, black]
   171   \node {\meta{E}}
   173   \node {\meta{E}}
   172     child {node {\meta{E} } 
   174     child {node {\meta{E} } 
   188 \end{tikzpicture}
   190 \end{tikzpicture}
   189 \end{center}
   191 \end{center}
   190 
   192 
   191 \noindent We are often interested in these parse-trees since
   193 \noindent We are often interested in these parse-trees since
   192 they encode the structure of how a string is derived by a
   194 they encode the structure of how a string is derived by a
   193 grammar. Before we come to the problem of constructing such
   195 grammar. 
   194 parse-trees, we need to consider the following two properties
   196 
   195 of grammars. A grammar is \emph{left-recursive} if there is a
   197 Before we come to the problem of constructing such parse-trees, we need
   196 derivation starting from a non-terminal, say \meta{NT} which leads
   198 to consider the following two properties of grammars. A grammar is
   197 to a string which again starts with \meta{NT}. This means a
   199 \emph{left-recursive} if there is a derivation starting from a
   198 derivation of the form.
   200 non-terminal, say \meta{NT} which leads to a string which again starts
       
   201 with \meta{NT}. This means a derivation of the form.
   199 
   202 
   200 \begin{center}
   203 \begin{center}
   201 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
   204 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
   202 \end{center}
   205 \end{center}
   203 
   206 
   204 \noindent It can be easily seen that the grammar above for
   207 \noindent It can be easily seen that the grammar above for arithmetic
   205 arithmetic expressions is left-recursive: for example the
   208 expressions is left-recursive: for example the rules $\meta{E}
   206 rules $\meta{E} \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and 
   209 \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
   207 $\meta{N} \rightarrow \meta{N}\cdot \meta{N}$ show that this 
   210 \meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
   208 grammar is left-recursive. But note
   211 note that left-recursiveness can involve more than one step in the
   209 that left-recursiveness can involve more than one step in the
   212 derivation. The problem with left-recursive grammars is that some
   210 derivation. The problem with left-recursive grammars is that
   213 algorithms cannot cope with them: with left-recursive grammars they will
   211 some algorithms cannot cope with them: they fall into a loop.
   214 fall into a loop. Fortunately every left-recursive grammar can be
   212 Fortunately every left-recursive grammar can be transformed
   215 transformed into one that is not left-recursive, although this
   213 into one that is not left-recursive, although this
   216 transformation might make the grammar less ``human-readable''. For
   214 transformation might make the grammar less ``human-readable''.
   217 example if we want to give a non-left-recursive grammar for numbers we
   215 For example if we want to give a non-left-recursive grammar
   218 might specify
   216 for numbers we might specify
       
   217 
   219 
   218 \begin{center}
   220 \begin{center}
   219 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
   221 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
   220 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
   222 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
   221 \end{center}
   223 \end{center}
   258     ;
   260     ;
   259 \end{tikzpicture}
   261 \end{tikzpicture}
   260 \end{tabular} 
   262 \end{tabular} 
   261 \end{center}
   263 \end{center}
   262 
   264 
   263 \noindent In particular in programming languages we will try
   265 \noindent In particular in programming languages we will try to avoid
   264 to avoid ambiguous grammars because two different parse-trees
   266 ambiguous grammars because two different parse-trees for a string mean a
   265 for a string mean a program can be interpreted in two
   267 program can be interpreted in two different ways. In such cases we have
   266 different ways. In such cases we have to somehow make sure the
   268 to somehow make sure the two different ways do not matter, or
   267 two different ways do not matter, or disambiguate the grammar
   269 disambiguate the grammar in some other way (for example making the $+$
   268 in some other way (for example making the $+$
   270 left-associative). Unfortunately already the problem of deciding whether
   269 left-associative). Unfortunately already the problem of
   271 a grammar is ambiguous or not is in general undecidable. But in simple
   270 deciding whether a grammar is ambiguous or not is in general
   272 instance (the ones we deal with in this module) one can usually see when
   271 undecidable. But in simple instance (the ones we deal in this
   273 a grammar is ambiguous.
   272 module) one can usually see when a grammar is ambiguous.
   274 
   273 
   275 \subsection*{Removing Left-Recursion}
       
   276 
       
   277 Let us come back to the problem of left-recursion and consider the 
       
   278 following grammar for binary numbers:
       
   279 
       
   280 \begin{plstx}[margin=1cm]
       
   281 : \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
       
   282 \end{plstx}
       
   283 
       
   284 \noindent
       
   285 It is clear that this grammar can create all binary numbers, but
       
   286 it is also clear that this grammar is left-recursive. Giving this
       
   287 grammar as is to parser combinators will result in an infinite 
       
   288 loop. Fortunately, every left-recursive grammar can be translated
       
   289 into one that is not left-recursive with the help of some
       
   290 transformation rules. Suppose we identified the ``offensive'' 
       
   291 rule, then we can separate the grammar into this offensive rule
       
   292 and the ``rest'':
       
   293 
       
   294 \begin{plstx}[margin=1cm]
       
   295   : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
       
   296   | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
       
   297 \end{plstx}
       
   298 
       
   299 \noindent
       
   300 To make the idea of the transformation clearer, suppose the left-recursive
       
   301 rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
       
   302 followed by something called $\alpha$) and the ``rest'' is called $\beta$.
       
   303 That means our grammar looks schematically as follows
       
   304 
       
   305 \begin{plstx}[margin=1cm]
       
   306   : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
       
   307 \end{plstx}
       
   308 
       
   309 \noindent
       
   310 To get rid of the left-recursion, we are required to introduce 
       
   311 a new non-terminal, say $\meta{B'}$ and transform the rule
       
   312 as follows:
       
   313 
       
   314 \begin{plstx}[margin=1cm]
       
   315   : \meta{B} ::= \beta \cdot \meta{B'}\\
       
   316   : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
       
   317 \end{plstx}
       
   318 
       
   319 \noindent
       
   320 In our example of binary numbers we would after the transformation 
       
   321 end up with the rules
       
   322 
       
   323 \begin{plstx}[margin=1cm]
       
   324   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
       
   325   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
       
   326 \end{plstx}
       
   327 
       
   328 \noindent
       
   329 A little thought should convince you that this grammar still derives
       
   330 all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
       
   331 can be $\epsilon$). Less clear might be why this grammar is non-left recursive.
       
   332 For $\meta{B'}$ it is relatively clear because we will never be 
       
   333 able to derive things like
       
   334 
       
   335 \begin{center}
       
   336 $\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
       
   337 \end{center}  
       
   338 
       
   339 \noindent
       
   340 because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
       
   341 $\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
       
   342 never be in the first place. The reasoning is similar for $\meta{B}$:
       
   343 the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
       
   344 left-recursive. This transformation does not mean the grammar is the
       
   345 simplest left-recursive grammar for binary numbers. For example the
       
   346 following grammar would do as well
       
   347 
       
   348 \begin{plstx}[margin=1cm]
       
   349   : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
       
   350 \end{plstx}
       
   351 
       
   352 \noindent
       
   353 The point is that we can in principle transform every left-recursive
       
   354 grammar into one that is non-left-recursive one. This explains why often
       
   355 the following grammar is used for arithmetic expressions:
       
   356 
       
   357 \begin{plstx}[margin=1cm]
       
   358   : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
       
   359   : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
       
   360   : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
       
   361 \end{plstx}
       
   362 
       
   363 \noindent
       
   364 In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
       
   365 are in some way protected from being left-recusive. For example if you
       
   366 start $\meta{E}$ you can derive another one by going through $\meta{T}$, then 
       
   367 $\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
       
   368 
       
   369 \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
       
   370 
       
   371 I showed above that the non-left-recursive grammar for binary numbers is
       
   372 
       
   373 \begin{plstx}[margin=1cm]
       
   374   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
       
   375   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
       
   376 \end{plstx}
       
   377 
       
   378 \noindent
       
   379 The transformation made the original grammar non-left-recursive, but at
       
   380 the expense of introducing an $\epsilon$ in the second rule. Having an
       
   381 explicit $\epsilon$-rule is annoying to, not in terms of looping, but in
       
   382 terms of efficiency. The reason is that the $\epsilon$-rule always
       
   383 applies but since it recognises the empty string, it does not make any
       
   384 progress with recognising a string. Better are rules like $( \cdot
       
   385 \meta{E} \cdot )$ where something of the input is consumed. Getting
       
   386 rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
       
   387 which can give us an insight into the complexity class of parsing.
       
   388 
       
   389 It turns out we can also by some generic transformations eliminate
       
   390 $\epsilon$-rules from grammars. Consider again the grammar above for
       
   391 binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
       
   392 we look for rules of the (generic) form \mbox{$\meta{A} :=
       
   393 \alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use
       
   394 $\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
       
   395 something follows ($\beta$). Such rules need to be replaced by
       
   396 additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
       
   397 In our running example there are the two rules for $\meta{B}$ which
       
   398 fall into this category
       
   399 
       
   400 \begin{plstx}[margin=1cm]
       
   401   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
       
   402 \end{plstx} 
       
   403 
       
   404 \noindent To follow the general scheme of the transfromation,
       
   405 the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens
       
   406 to be empty. SO we need to generate new rules for the form 
       
   407 \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
       
   408 example means we obtain
       
   409 
       
   410 \begin{plstx}[margin=1cm]
       
   411   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
       
   412 \end{plstx} 
       
   413 
       
   414 \noindent
       
   415 Unfortunately $\meta{B'}$ is also used in the rule 
       
   416 
       
   417 \begin{plstx}[margin=1cm]
       
   418   : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
       
   419 \end{plstx}
       
   420 
       
   421 \noindent
       
   422 For this we repeat the transformation, giving 
       
   423 
       
   424 \begin{plstx}[margin=1cm]
       
   425   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
       
   426 \end{plstx}
       
   427 
       
   428 \noindent
       
   429 In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
       
   430 was again empty. Once no rule is left over, we can simply throw
       
   431 away the $\epsilon$ rule.  This gives the grammar
       
   432 
       
   433 \begin{plstx}[margin=1cm]
       
   434   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
       
   435   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
       
   436 \end{plstx}
       
   437 
       
   438 \noindent
       
   439 I let you think about whether this grammar can still recognise all
       
   440 binary numbers and whether this grammar is non-left-recursive. The
       
   441 precise statement for the transformation of removing $\epsilon$-rules is
       
   442 that if the original grammar was able to recognise only non-empty
       
   443 strings, then the transformed grammar will be equivalent (matching the
       
   444 same set of strings); if the original grammar was able to match the
       
   445 empty string, then the transformed grammar will be able to match the
       
   446 same strings, \emph{except} the empty string. So the  $\epsilon$-removal
       
   447 does not preserve equivalence of grammars, but the small defect with the
       
   448 empty string is not important for practical purposes.
       
   449 
       
   450 So why are these transformations all useful? Well apart from making the 
       
   451 parser combinators work (remember they cannot deal with left-recursion and
       
   452 are inefficient with $\epsilon$-rules), a second reason is that they help
       
   453 with getting any insight into the complexity of the parsing problem. 
       
   454 The parser combinators are very easy to implement, but are far from the 
       
   455 most efficient way of processing input (they can blow up exponentially
       
   456 with ambiguous grammars). The question remains what is the best possible
       
   457 complexity for parsing? It turns out that this is $O(n^3)$ for context-free
       
   458 languages. 
       
   459 
       
   460 To answer the question about complexity, let me describe next the CYK
       
   461 algorithm (named after the authors Cocke–Younger–Kasami). This algorithm
       
   462 works with grammars that are in Chomsky normalform. 
       
   463 
       
   464 TBD
       
   465 
       
   466 \end{document}
       
   467 
       
   468 
       
   469 %%% Parser combinators are now part of handout 6
   274 
   470 
   275 \subsection*{Parser Combinators}
   471 \subsection*{Parser Combinators}
   276 
   472 
   277 Let us now turn to the problem of generating a parse-tree for
   473 Let us now turn to the problem of generating a parse-tree for
   278 a grammar and string. In what follows we explain \emph{parser
   474 a grammar and string. In what follows we explain \emph{parser
   531 %\end{tabular}
   727 %\end{tabular}
   532 %\end{center}
   728 %\end{center}
   533 
   729 
   534 
   730 
   535 
   731 
   536 \end{document}
   732 
       
   733 
   537 
   734 
   538 %%% Local Variables: 
   735 %%% Local Variables: 
   539 %%% mode: latex  
   736 %%% mode: latex  
   540 %%% TeX-master: t
   737 %%% TeX-master: t
   541 %%% End: 
   738 %%% End: 
       
   739