handouts/ho05.tex
changeset 937 dc5ab66b11cc
parent 798 aaf0bd0a211d
child 941 66adcae6c762
equal deleted inserted replaced
936:0b5f06539a84 937:dc5ab66b11cc
    82 \end{center}  
    82 \end{center}  
    83 
    83 
    84 \noindent
    84 \noindent
    85 from natural languages were the meaning of \emph{flies} depends on the
    85 from natural languages were the meaning of \emph{flies} depends on the
    86 surrounding \emph{context} are avoided as much as possible. Here is
    86 surrounding \emph{context} are avoided as much as possible. Here is
    87 an interesting video about C++ being not a context-free language
    87 an interesting video about C++ not being a context-free language
    88 
    88 
    89 \begin{center}
    89 \begin{center}
    90 \url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
    90 \url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
    91 \end{center}  
    91 \end{center}  
    92 
    92 
    97 : \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
    97 : \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
    98              | \epsilon\\ 
    98              | \epsilon\\ 
    99 \end{plstx}
    99 \end{plstx}
   100 
   100 
   101 \noindent 
   101 \noindent 
   102 or a grammar for recognising strings consisting of ones is
   102 or a grammar for recognising strings consisting of ones (at least one) is
   103 
   103 
   104 \begin{plstx}[margin=3cm]
   104 \begin{plstx}[margin=3cm]
   105 : \meta{O} ::= 1 \cdot  \meta{O} 
   105 : \meta{O} ::= 1 \cdot  \meta{O} 
   106              | 1\\
   106              | 1\\
   107 \end{plstx}
   107 \end{plstx}
   376   : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
   376   : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
   377   : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
   377   : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
   378 \end{plstx}
   378 \end{plstx}
   379 
   379 
   380 \noindent
   380 \noindent
   381 In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
   381 In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and
   382 are in some way protected from being left-recusive. For example if you
   382 $\meta{F}$actors are in some way protected from being
   383 start $\meta{E}$ you can derive another one by going through $\meta{T}$, then 
   383 left-recusive. For example if you start $\meta{E}$ you can derive
   384 $\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
   384 another one by going through $\meta{T}$, then $\meta{F}$, but then
       
   385 $\meta{E}$ is protected by the open-parenthesis in the last rule.
   385 
   386 
   386 \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
   387 \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
   387 
   388 
   388 I showed above that the non-left-recursive grammar for binary numbers is
   389 I showed above that the non-left-recursive grammar for binary numbers is
   389 
   390 
   393 \end{plstx}
   394 \end{plstx}
   394 
   395 
   395 \noindent
   396 \noindent
   396 The transformation made the original grammar non-left-recursive, but at
   397 The transformation made the original grammar non-left-recursive, but at
   397 the expense of introducing an $\epsilon$ in the second rule. Having an
   398 the expense of introducing an $\epsilon$ in the second rule. Having an
   398 explicit $\epsilon$-rule is annoying to, not in terms of looping, but in
   399 explicit $\epsilon$-rule is annoying, not in terms of looping, but in
   399 terms of efficiency. The reason is that the $\epsilon$-rule always
   400 terms of efficiency. The reason is that the $\epsilon$-rule always
   400 applies but since it recognises the empty string, it does not make any
   401 applies but since it recognises the empty string, it does not make any
   401 progress with recognising a string. Better are rules like $( \cdot
   402 progress with recognising a string. Better are rules like $( \cdot
   402 \meta{E} \cdot )$ where something of the input is consumed. Getting
   403 \meta{E} \cdot )$ where something of the input is consumed. Getting
   403 rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
   404 rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
   453 \end{plstx}
   454 \end{plstx}
   454 
   455 
   455 \noindent
   456 \noindent
   456 I let you think about whether this grammar can still recognise all
   457 I let you think about whether this grammar can still recognise all
   457 binary numbers and whether this grammar is non-left-recursive. The
   458 binary numbers and whether this grammar is non-left-recursive. The
   458 precise statement for the transformation of removing $\epsilon$-rules is
   459 precise statement for the transformation of removing $\epsilon$-rules
   459 that if the original grammar was able to recognise only non-empty
   460 is that if the original grammar was able to recognise only non-empty
   460 strings, then the transformed grammar will be equivalent (matching the
   461 strings, then the transformed grammar will be equivalent (matching the
   461 same set of strings); if the original grammar was able to match the
   462 same set of strings); if the original grammar was able to match the
   462 empty string, then the transformed grammar will be able to match the
   463 empty string, then the transformed grammar will be able to match the
   463 same strings, \emph{except} the empty string. So the  $\epsilon$-removal
   464 same strings, \emph{except} the empty string. So the
   464 does not preserve equivalence of grammars, but the small defect with the
   465 $\epsilon$-removal does not preserve equivalence of grammars in
   465 empty string is not important for practical purposes.
   466 general, but the small defect with the empty string is not important
       
   467 for practical purposes.
   466 
   468 
   467 So why are these transformations all useful? Well apart from making the 
   469 So why are these transformations all useful? Well apart from making the 
   468 parser combinators work (remember they cannot deal with left-recursion and
   470 parser combinators work (remember they cannot deal with left-recursion and
   469 are inefficient with $\epsilon$-rules), a second reason is that they help
   471 are inefficient with $\epsilon$-rules), a second reason is that they help
   470 with getting any insight into the complexity of the parsing problem. 
   472 with getting any insight into the complexity of the parsing problem. 
   561 
   563 
   562 
   564 
   563 \noindent
   565 \noindent
   564 The last row contains the information about all words and their
   566 The last row contains the information about all words and their
   565 corresponding non-terminals. For example the field for \texttt{trains}
   567 corresponding non-terminals. For example the field for \texttt{trains}
   566 contains the information $\meta{N}$ and $\meta{V}$ because it can be a
   568 contains the information $\meta{N}$ and $\meta{V}$ because \texttt{trains} can be a
   567 ``verb'' and a ``noun'' according to the grammar.  The row above,
   569 ``verb'' and a ``noun'' according to the grammar.  The row above,
   568 let's call the corresponding fields 5a to 5e, contains information
   570 let's call the corresponding fields 5a to 5e, contains information
   569 about 2-word parts of the sentence, namely
   571 about \underline{2-word} parts of the sentence, namely
   570 
   572 
   571 \begin{center}
   573 \begin{center}
   572 \begin{tabular}{llll}
   574 \begin{tabular}{llll}
   573   5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
   575   5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
   574   5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
   576   5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\