handouts/ho05.tex
changeset 798 aaf0bd0a211d
parent 727 eb9343126625
child 937 dc5ab66b11cc
equal deleted inserted replaced
797:ddcb616e036a 798:aaf0bd0a211d
     1 % !TEX program = xelatex
     1 % !TEX program = xelatex
     2 \documentclass{article}
     2 \documentclass{article}
     3 \usepackage{../style}
     3 \usepackage{../style}
     4 \usepackage{../langs}
     4 \usepackage{../langs}
     5 \usepackage{../grammar}
     5 \usepackage{../grammar}
       
     6 \usepackage{../graphics}
     6 
     7 
     7 % epsilon and left-recursion elimination
     8 % epsilon and left-recursion elimination
     8 % http://www.mollypages.org/page/grammar/index.mp
     9 % http://www.mollypages.org/page/grammar/index.mp
     9 
    10 
    10 %% parsing scala files
    11 %% parsing scala files
   417   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
   418   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
   418 \end{plstx} 
   419 \end{plstx} 
   419 
   420 
   420 \noindent To follow the general scheme of the transfromation,
   421 \noindent To follow the general scheme of the transfromation,
   421 the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens
   422 the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens
   422 to be empty. SO we need to generate new rules for the form 
   423 to be empty. So we need to generate new rules for the form 
   423 \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
   424 \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
   424 example means we obtain
   425 example means we obtain
   425 
   426 
   426 \begin{plstx}[margin=1cm]
   427 \begin{plstx}[margin=1cm]
   427   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
   428   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
   507 
   508 
   508 \noindent
   509 \noindent
   509 is recognised by the grammar. The CYK algorithm starts with the
   510 is recognised by the grammar. The CYK algorithm starts with the
   510 following triangular data structure.
   511 following triangular data structure.
   511 
   512 
   512 \begin{figure}[t]
   513 %%\begin{figure}[t]
   513 \begin{center}
   514 \begin{center}
   514   \begin{tikzpicture}[scale=0.8,line width=0.8mm]
   515   \begin{tikzpicture}[scale=0.7,line width=0.8mm]
   515   \draw (-2,0) -- (4,0);
   516   \draw (-2,0) -- (4,0);
   516   \draw (-2,1) -- (4,1);
   517   \draw (-2,1) -- (4,1);
   517   \draw (-2,2) -- (3,2);
   518   \draw (-2,2) -- (3,2);
   518   \draw (-2,3) -- (2,3);
   519   \draw (-2,3) -- (2,3);
   519   \draw (-2,4) -- (1,4);
   520   \draw (-2,4) -- (1,4);
   540   \draw ( 0.5,0.5) node {$N,V$}; 
   541   \draw ( 0.5,0.5) node {$N,V$}; 
   541   \draw ( 1.5,0.5) node {$A$}; 
   542   \draw ( 1.5,0.5) node {$A$}; 
   542   \draw ( 2.5,0.5) node {$N$}; 
   543   \draw ( 2.5,0.5) node {$N$}; 
   543   \draw ( 3.5,0.5) node {$N,V$};
   544   \draw ( 3.5,0.5) node {$N,V$};
   544 
   545 
       
   546 %  \draw (-1.5,1.5) node {\small{}a}; 
       
   547 %  \draw (-0.5,1.5) node {\small{}b}; 
       
   548 %  \draw ( 0.5,1.5) node {\small{}c}; 
       
   549 %  \draw ( 1.5,1.5) node {\small{}d}; 
       
   550 %  \draw ( 2.5,1.5) node {\small{}e}; 
       
   551   
   545   \draw (-2.4, 5.5) node {$1$}; 
   552   \draw (-2.4, 5.5) node {$1$}; 
   546   \draw (-2.4, 4.5) node {$2$}; 
   553   \draw (-2.4, 4.5) node {$2$}; 
   547   \draw (-2.4, 3.5) node {$3$}; 
   554   \draw (-2.4, 3.5) node {$3$}; 
   548   \draw (-2.4, 2.5) node {$4$}; 
   555   \draw (-2.4, 2.5) node {$4$}; 
   549   \draw (-2.4, 1.5) node {$5$}; 
   556   \draw (-2.4, 1.5) node {$5$}; 
   550   \draw (-2.4, 0.5) node {$6$}; 
   557   \draw (-2.4, 0.5) node {$6$}; 
   551   \end{tikzpicture}
   558   \end{tikzpicture}
   552   \end{center}
   559   \end{center}
   553 \end{figure}
   560 %%\end{figure}
       
   561 
       
   562 
       
   563 \noindent
       
   564 The last row contains the information about all words and their
       
   565 corresponding non-terminals. For example the field for \texttt{trains}
       
   566 contains the information $\meta{N}$ and $\meta{V}$ because it can be a
       
   567 ``verb'' and a ``noun'' according to the grammar.  The row above,
       
   568 let's call the corresponding fields 5a to 5e, contains information
       
   569 about 2-word parts of the sentence, namely
       
   570 
       
   571 \begin{center}
       
   572 \begin{tabular}{llll}
       
   573   5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
       
   574   5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
       
   575   5c) & \texttt{trains} $\mid$ \texttt{the}    \\
       
   576   5d) & \texttt{the} $\mid$ \texttt{student}   \\
       
   577   5e) & \texttt{student} $\mid$ \texttt{team}  \\
       
   578 \end{tabular}  
       
   579 \end{center}  
       
   580 
       
   581 \noindent
       
   582 For each of them, we look up in row 6 which non-terminals it belongs to
       
   583 (indicated above for 5a and 5b). For 5a, with the non-terminals
       
   584 \meta{A} and \meta{N}, we find the grammar rule
       
   585 
       
   586 \begin{plstx}[margin=1cm]
       
   587   : \meta{N} ::= \meta{A}\cdot \meta{N}\\
       
   588 \end{plstx}
       
   589 
       
   590 \noindent
       
   591 which means into field 5a we put the left-hand side of this rule,
       
   592 which in this case is the non-terminal \meta{N}. For 5b we have to check
       
   593 for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there
       
   594 is a right-hand side of this form in the grammar. But only the
       
   595 grammar rule 
       
   596 
       
   597 \begin{plstx}[margin=1cm]
       
   598   : \meta{N} ::= \meta{N}\cdot \meta{N}\\
       
   599 \end{plstx}
       
   600 
       
   601 \noindent
       
   602 matches, which means 5b gets also an \meta{N}. Continuing for all
       
   603 fields in row 5 gives:
       
   604 
       
   605 \begin{center}
       
   606   \begin{tikzpicture}[scale=0.7,line width=0.8mm]
       
   607   \draw (-2,0) -- (4,0);
       
   608   \draw (-2,1) -- (4,1);
       
   609   \draw (-2,2) -- (3,2);
       
   610   \draw (-2,3) -- (2,3);
       
   611   \draw (-2,4) -- (1,4);
       
   612   \draw (-2,5) -- (0,5);
       
   613   \draw (-2,6) -- (-1,6);
       
   614   
       
   615   \draw (0,0) -- (0, 5);
       
   616   \draw (1,0) -- (1, 4);
       
   617   \draw (2,0) -- (2, 3);
       
   618   \draw (3,0) -- (3, 2);
       
   619   \draw (4,0) -- (4, 1);
       
   620   \draw (-1,0) -- (-1, 6);
       
   621   \draw (-2,0) -- (-2, 6);
       
   622   
       
   623   \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
       
   624   \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
       
   625   \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
       
   626   \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
       
   627   \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
       
   628   \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
       
   629   
       
   630   \draw (-1.5,0.5) node {$A$}; 
       
   631   \draw (-0.5,0.5) node {$N$}; 
       
   632   \draw ( 0.5,0.5) node {$N,V$}; 
       
   633   \draw ( 1.5,0.5) node {$A$}; 
       
   634   \draw ( 2.5,0.5) node {$N$}; 
       
   635   \draw ( 3.5,0.5) node {$N,V$};
       
   636 
       
   637   \draw (-1.5,1.5) node {$N$}; 
       
   638   \draw (-0.5,1.5) node {$N$}; 
       
   639   \draw ( 0.5,1.5) node {$$}; 
       
   640   \draw ( 1.5,1.5) node {$N$}; 
       
   641   \draw ( 2.5,1.5) node {$N$}; 
       
   642 
       
   643 
       
   644 %  \draw (-1.5,1.5) node {\small{}a}; 
       
   645 %  \draw (-0.5,1.5) node {\small{}b}; 
       
   646 %  \draw ( 0.5,1.5) node {\small{}c}; 
       
   647 %  \draw ( 1.5,1.5) node {\small{}d}; 
       
   648 %  \draw ( 2.5,1.5) node {\small{}e}; 
       
   649   
       
   650   \draw (-2.4, 5.5) node {$1$}; 
       
   651   \draw (-2.4, 4.5) node {$2$}; 
       
   652   \draw (-2.4, 3.5) node {$3$}; 
       
   653   \draw (-2.4, 2.5) node {$4$}; 
       
   654   \draw (-2.4, 1.5) node {$5$}; 
       
   655   \draw (-2.4, 0.5) node {$6$}; 
       
   656   \end{tikzpicture}
       
   657   \end{center}
       
   658 
       
   659 \noindent
       
   660 Now row 4 is in charge of all 3-word parts of the sentence, namely
       
   661 
       
   662 \begin{center}
       
   663 \begin{tabular}{llll}
       
   664   4a) & The $\mid$ trainer trains\\
       
   665       & The trainer $\mid$ trains\\
       
   666   4b) & trainer $\mid$ trains the\\
       
   667       & trainer trains $\mid$ the\\
       
   668   4c) & trains $\mid$ the student\\
       
   669       & trains the $\mid$ student\\
       
   670   4d) & the $\mid$ student team\\
       
   671       & the student $\mid$ team\\
       
   672 \end{tabular}  
       
   673 \end{center}
       
   674 
       
   675 \noindent
       
   676 Note that in case of 3-word parts we have two splits. For example for
       
   677 4a: $\underbrace{\texttt{The}}_{A}$ and
       
   678 $\underbrace{\texttt{trainer trains}}_{N}$; and also
       
   679 $\underbrace{\texttt{The trainer}}_{N}$ and
       
   680 $\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we have
       
   681 to look up in the rows below, which non-terminals we already
       
   682 computed. This allows us to look for right-hand sides in our grammar:
       
   683 $\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and
       
   684 $\meta{N}\cdot\meta{V}$, which yield the only left-hand side
       
   685 \meta{N}. This is what we fill in for 4a. And so on for row 4.
       
   686 
       
   687 Row 3 is about all 4-word parts in the sentence, namely
       
   688 
       
   689 \begin{center}
       
   690 \begin{tabular}{llll}
       
   691   3a) & The trainer trains the\\
       
   692   3b) & trainer trains the student\\
       
   693   3c) & trains the student team\\
       
   694 \end{tabular}  
       
   695 \end{center}
       
   696 
       
   697 \noindent
       
   698 Each of them can be split up in 3 ways, for example
       
   699 
       
   700 \begin{center}
       
   701 \begin{tabular}{llll}
       
   702   3a) & The $\mid$ trainer trains the\\
       
   703       & The trainer $\mid$ trains the\\
       
   704       & The trainer trains $\mid$ the\\
       
   705 \end{tabular}  
       
   706 \end{center}
       
   707 
       
   708 \noindent
       
   709 and we have to consider them all in turn to fill in the non-terminals for
       
   710 3a. You can guess how it continues: row 2 is for all 5-word parts, and finally
       
   711 the field on the top is for the whole sentence.
       
   712 The idea of the CYK algorithm is that if in the top-field the starting
       
   713 symbol \meta{S} appears (possibly together with other non-terminals),
       
   714 then the sentence is accepted by the grammar. If it does not, then the
       
   715 sentence is not accepted.
       
   716 
       
   717 Let us very quickly calculate the complexity of the CYK
       
   718 algorithm. Lookup operations inside the triangle and in the grammar
       
   719 are assumed to be of constant time, $O(1)$, meaning they do not
       
   720 matter. How many fields are in the triangle\ldots
       
   721 $\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That means
       
   722 roughly $O(n^2)$ fields. How much work do we have to do for each
       
   723 field? Well, for the top-most we have to consider $n-1$ splits, which
       
   724 means roughly $O(n)$ for each field. The overall result is a $O(n^3)$
       
   725 time-complexity for CYK. It turns out that this is the best worst-time
       
   726 complexity for parsing with context-free grammars.
   554 
   727 
   555 \end{document}
   728 \end{document}
   556 
   729 
   557 
   730 
   558 %%% Parser combinators are now part of handout 6
   731 %%% Parser combinators are now part of handout 6