50 \end{tikzpicture} |
51 \end{tikzpicture} |
51 \end{center} |
52 \end{center} |
52 |
53 |
53 \noindent Each ``bubble'' stands for sets of languages (remember |
54 \noindent Each ``bubble'' stands for sets of languages (remember |
54 languages are sets of strings). As indicated the set of regular |
55 languages are sets of strings). As indicated the set of regular |
55 languages are fully included inside the context-free languages, |
56 languages is fully included inside the context-free languages, |
56 meaning every regular language is also context-free, but not vice |
57 meaning every regular language is also context-free, but not vice |
57 versa. Below I will let you think, for example, what the context-free |
58 versa. Below I will let you think, for example, what the context-free |
58 grammar is for the language corresponding to the regular expression |
59 grammar is for the language corresponding to the regular expression |
59 $(aaa)^*a$. |
60 $(aaa)^*a$. |
60 |
61 |
61 Because of their convenience, context-free languages play an important |
62 Because of their convenience, context-free languages play an important |
62 role in `day-to-day' text processing and in programming |
63 role in `day-to-day' text processing and in programming |
63 languages. Context-free in this setting means that ``words'' have one |
64 languages. Context-free in this setting means that ``words'' have one |
64 meaning only and this meaning is independent from in which context |
65 meaning only and this meaning is independent from the context |
65 the ``words'' appear. For example ambiguity issues like |
66 the ``words'' appear in. For example ambiguity issues like |
66 |
67 |
67 \begin{center} |
68 \begin{center} |
68 \tt Time flies like an arrow; fruit flies like bananas. |
69 \tt Time flies like an arrow; fruit flies like bananas. |
69 \end{center} |
70 \end{center} |
70 |
71 |
71 \noindent |
72 \noindent |
72 from natural languages were the meaning of \emph{flies} depend on the |
73 from natural languages were the meaning of \emph{flies} depends on the |
73 surrounding \emph{context} are avoided as much as possible. |
74 surrounding \emph{context} are avoided as much as possible. |
74 |
75 |
75 Context-free languages are usually specified by grammars. For example |
76 Context-free languages are usually specified by grammars. For example |
76 a grammar for well-parenthesised expressions can be given as follows: |
77 a grammar for well-parenthesised expressions can be given as follows: |
77 |
78 |
188 \end{tikzpicture} |
190 \end{tikzpicture} |
189 \end{center} |
191 \end{center} |
190 |
192 |
191 \noindent We are often interested in these parse-trees since |
193 \noindent We are often interested in these parse-trees since |
192 they encode the structure of how a string is derived by a |
194 they encode the structure of how a string is derived by a |
193 grammar. Before we come to the problem of constructing such |
195 grammar. |
194 parse-trees, we need to consider the following two properties |
196 |
195 of grammars. A grammar is \emph{left-recursive} if there is a |
197 Before we come to the problem of constructing such parse-trees, we need |
196 derivation starting from a non-terminal, say \meta{NT} which leads |
198 to consider the following two properties of grammars. A grammar is |
197 to a string which again starts with \meta{NT}. This means a |
199 \emph{left-recursive} if there is a derivation starting from a |
198 derivation of the form. |
200 non-terminal, say \meta{NT} which leads to a string which again starts |
|
201 with \meta{NT}. This means a derivation of the form. |
199 |
202 |
200 \begin{center} |
203 \begin{center} |
201 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$ |
204 $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$ |
202 \end{center} |
205 \end{center} |
203 |
206 |
204 \noindent It can be easily seen that the grammar above for |
207 \noindent It can be easily seen that the grammar above for arithmetic |
205 arithmetic expressions is left-recursive: for example the |
208 expressions is left-recursive: for example the rules $\meta{E} |
206 rules $\meta{E} \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and |
209 \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow |
207 $\meta{N} \rightarrow \meta{N}\cdot \meta{N}$ show that this |
210 \meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But |
208 grammar is left-recursive. But note |
211 note that left-recursiveness can involve more than one step in the |
209 that left-recursiveness can involve more than one step in the |
212 derivation. The problem with left-recursive grammars is that some |
210 derivation. The problem with left-recursive grammars is that |
213 algorithms cannot cope with them: with left-recursive grammars they will |
211 some algorithms cannot cope with them: they fall into a loop. |
214 fall into a loop. Fortunately every left-recursive grammar can be |
212 Fortunately every left-recursive grammar can be transformed |
215 transformed into one that is not left-recursive, although this |
213 into one that is not left-recursive, although this |
216 transformation might make the grammar less ``human-readable''. For |
214 transformation might make the grammar less ``human-readable''. |
217 example if we want to give a non-left-recursive grammar for numbers we |
215 For example if we want to give a non-left-recursive grammar |
218 might specify |
216 for numbers we might specify |
|
217 |
219 |
218 \begin{center} |
220 \begin{center} |
219 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\; |
221 $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\; |
220 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$ |
222 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$ |
221 \end{center} |
223 \end{center} |
258 ; |
260 ; |
259 \end{tikzpicture} |
261 \end{tikzpicture} |
260 \end{tabular} |
262 \end{tabular} |
261 \end{center} |
263 \end{center} |
262 |
264 |
263 \noindent In particular in programming languages we will try |
265 \noindent In particular in programming languages we will try to avoid |
264 to avoid ambiguous grammars because two different parse-trees |
266 ambiguous grammars because two different parse-trees for a string mean a |
265 for a string mean a program can be interpreted in two |
267 program can be interpreted in two different ways. In such cases we have |
266 different ways. In such cases we have to somehow make sure the |
268 to somehow make sure the two different ways do not matter, or |
267 two different ways do not matter, or disambiguate the grammar |
269 disambiguate the grammar in some other way (for example making the $+$ |
268 in some other way (for example making the $+$ |
270 left-associative). Unfortunately already the problem of deciding whether |
269 left-associative). Unfortunately already the problem of |
271 a grammar is ambiguous or not is in general undecidable. But in simple |
270 deciding whether a grammar is ambiguous or not is in general |
272 instance (the ones we deal with in this module) one can usually see when |
271 undecidable. But in simple instance (the ones we deal in this |
273 a grammar is ambiguous. |
272 module) one can usually see when a grammar is ambiguous. |
274 |
273 |
275 \subsection*{Removing Left-Recursion} |
|
276 |
|
277 Let us come back to the problem of left-recursion and consider the |
|
278 following grammar for binary numbers: |
|
279 |
|
280 \begin{plstx}[margin=1cm] |
|
281 : \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\ |
|
282 \end{plstx} |
|
283 |
|
284 \noindent |
|
285 It is clear that this grammar can create all binary numbers, but |
|
286 it is also clear that this grammar is left-recursive. Giving this |
|
287 grammar as is to parser combinators will result in an infinite |
|
288 loop. Fortunately, every left-recursive grammar can be translated |
|
289 into one that is not left-recursive with the help of some |
|
290 transformation rules. Suppose we identified the ``offensive'' |
|
291 rule, then we can separate the grammar into this offensive rule |
|
292 and the ``rest'': |
|
293 |
|
294 \begin{plstx}[margin=1cm] |
|
295 : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} |
|
296 | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\ |
|
297 \end{plstx} |
|
298 |
|
299 \noindent |
|
300 To make the idea of the transformation clearer, suppose the left-recursive |
|
301 rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal |
|
302 followed by something called $\alpha$) and the ``rest'' is called $\beta$. |
|
303 That means our grammar looks schematically as follows |
|
304 |
|
305 \begin{plstx}[margin=1cm] |
|
306 : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\ |
|
307 \end{plstx} |
|
308 |
|
309 \noindent |
|
310 To get rid of the left-recursion, we are required to introduce |
|
311 a new non-terminal, say $\meta{B'}$ and transform the rule |
|
312 as follows: |
|
313 |
|
314 \begin{plstx}[margin=1cm] |
|
315 : \meta{B} ::= \beta \cdot \meta{B'}\\ |
|
316 : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\ |
|
317 \end{plstx} |
|
318 |
|
319 \noindent |
|
320 In our example of binary numbers we would after the transformation |
|
321 end up with the rules |
|
322 |
|
323 \begin{plstx}[margin=1cm] |
|
324 : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
325 : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ |
|
326 \end{plstx} |
|
327 |
|
328 \noindent |
|
329 A little thought should convince you that this grammar still derives |
|
330 all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$ |
|
331 can be $\epsilon$). Less clear might be why this grammar is non-left recursive. |
|
332 For $\meta{B'}$ it is relatively clear because we will never be |
|
333 able to derive things like |
|
334 |
|
335 \begin{center} |
|
336 $\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$ |
|
337 \end{center} |
|
338 |
|
339 \noindent |
|
340 because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and |
|
341 $\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can |
|
342 never be in the first place. The reasoning is similar for $\meta{B}$: |
|
343 the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming |
|
344 left-recursive. This transformation does not mean the grammar is the |
|
345 simplest left-recursive grammar for binary numbers. For example the |
|
346 following grammar would do as well |
|
347 |
|
348 \begin{plstx}[margin=1cm] |
|
349 : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\ |
|
350 \end{plstx} |
|
351 |
|
352 \noindent |
|
353 The point is that we can in principle transform every left-recursive |
|
354 grammar into one that is non-left-recursive one. This explains why often |
|
355 the following grammar is used for arithmetic expressions: |
|
356 |
|
357 \begin{plstx}[margin=1cm] |
|
358 : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E}\\ |
|
359 : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\ |
|
360 : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\ |
|
361 \end{plstx} |
|
362 |
|
363 \noindent |
|
364 In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors |
|
365 are in some way protected from being left-recusive. For example if you |
|
366 start $\meta{E}$ you can derive another one by going through $\meta{T}$, then |
|
367 $\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis. |
|
368 |
|
369 \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm} |
|
370 |
|
371 I showed above that the non-left-recursive grammar for binary numbers is |
|
372 |
|
373 \begin{plstx}[margin=1cm] |
|
374 : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
375 : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ |
|
376 \end{plstx} |
|
377 |
|
378 \noindent |
|
379 The transformation made the original grammar non-left-recursive, but at |
|
380 the expense of introducing an $\epsilon$ in the second rule. Having an |
|
381 explicit $\epsilon$-rule is annoying to, not in terms of looping, but in |
|
382 terms of efficiency. The reason is that the $\epsilon$-rule always |
|
383 applies but since it recognises the empty string, it does not make any |
|
384 progress with recognising a string. Better are rules like $( \cdot |
|
385 \meta{E} \cdot )$ where something of the input is consumed. Getting |
|
386 rid of $\epsilon$-rules is also important for the CYK parsing algorithm, |
|
387 which can give us an insight into the complexity class of parsing. |
|
388 |
|
389 It turns out we can also by some generic transformations eliminate |
|
390 $\epsilon$-rules from grammars. Consider again the grammar above for |
|
391 binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case |
|
392 we look for rules of the (generic) form \mbox{$\meta{A} := |
|
393 \alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use |
|
394 $\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and |
|
395 something follows ($\beta$). Such rules need to be replaced by |
|
396 additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}. |
|
397 In our running example there are the two rules for $\meta{B}$ which |
|
398 fall into this category |
|
399 |
|
400 \begin{plstx}[margin=1cm] |
|
401 : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
402 \end{plstx} |
|
403 |
|
404 \noindent To follow the general scheme of the transfromation, |
|
405 the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens |
|
406 to be empty. SO we need to generate new rules for the form |
|
407 \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular |
|
408 example means we obtain |
|
409 |
|
410 \begin{plstx}[margin=1cm] |
|
411 : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ |
|
412 \end{plstx} |
|
413 |
|
414 \noindent |
|
415 Unfortunately $\meta{B'}$ is also used in the rule |
|
416 |
|
417 \begin{plstx}[margin=1cm] |
|
418 : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\ |
|
419 \end{plstx} |
|
420 |
|
421 \noindent |
|
422 For this we repeat the transformation, giving |
|
423 |
|
424 \begin{plstx}[margin=1cm] |
|
425 : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ |
|
426 \end{plstx} |
|
427 |
|
428 \noindent |
|
429 In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$ |
|
430 was again empty. Once no rule is left over, we can simply throw |
|
431 away the $\epsilon$ rule. This gives the grammar |
|
432 |
|
433 \begin{plstx}[margin=1cm] |
|
434 : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ |
|
435 : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ |
|
436 \end{plstx} |
|
437 |
|
438 \noindent |
|
439 I let you think about whether this grammar can still recognise all |
|
440 binary numbers and whether this grammar is non-left-recursive. The |
|
441 precise statement for the transformation of removing $\epsilon$-rules is |
|
442 that if the original grammar was able to recognise only non-empty |
|
443 strings, then the transformed grammar will be equivalent (matching the |
|
444 same set of strings); if the original grammar was able to match the |
|
445 empty string, then the transformed grammar will be able to match the |
|
446 same strings, \emph{except} the empty string. So the $\epsilon$-removal |
|
447 does not preserve equivalence of grammars, but the small defect with the |
|
448 empty string is not important for practical purposes. |
|
449 |
|
450 So why are these transformations all useful? Well apart from making the |
|
451 parser combinators work (remember they cannot deal with left-recursion and |
|
452 are inefficient with $\epsilon$-rules), a second reason is that they help |
|
453 with getting any insight into the complexity of the parsing problem. |
|
454 The parser combinators are very easy to implement, but are far from the |
|
455 most efficient way of processing input (they can blow up exponentially |
|
456 with ambiguous grammars). The question remains what is the best possible |
|
457 complexity for parsing? It turns out that this is $O(n^3)$ for context-free |
|
458 languages. |
|
459 |
|
460 To answer the question about complexity, let me describe next the CYK |
|
461 algorithm (named after the authors Cocke–Younger–Kasami). This algorithm |
|
462 works with grammars that are in Chomsky normalform. |
|
463 |
|
464 TBD |
|
465 |
|
466 \end{document} |
|
467 |
|
468 |
|
469 %%% Parser combinators are now part of handout 6 |
274 |
470 |
275 \subsection*{Parser Combinators} |
471 \subsection*{Parser Combinators} |
276 |
472 |
277 Let us now turn to the problem of generating a parse-tree for |
473 Let us now turn to the problem of generating a parse-tree for |
278 a grammar and string. In what follows we explain \emph{parser |
474 a grammar and string. In what follows we explain \emph{parser |