| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Tue, 01 Sep 2020 16:00:37 +0100 | |
| changeset 752 | 1f1a293549c1 | 
| parent 727 | a526f7f20cad | 
| child 798 | 409e5014edde | 
| permissions | -rw-r--r-- | 
| 665 | 1 | % !TEX program = xelatex | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \documentclass{article}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 3 | \usepackage{../style}
 | 
| 217 
cd6066f1056a
updated handouts
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
183diff
changeset | 4 | \usepackage{../langs}
 | 
| 459 | 5 | \usepackage{../grammar}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 6 | |
| 545 | 7 | % epsilon and left-recursion elimination | 
| 8 | % http://www.mollypages.org/page/grammar/index.mp | |
| 9 | ||
| 618 | 10 | %% parsing scala files | 
| 691 | 11 | %% https://scalameta.org/ | 
| 12 | ||
| 13 | % chomsky normal form transformation | |
| 14 | % https://youtu.be/FNPSlnj3Vt0 | |
| 618 | 15 | |
| 710 | 16 | % Language hierachy is about complexity | 
| 17 | % How hard is it to recognise an element in a language | |
| 18 | ||
| 722 | 19 | % Pratt parsing | 
| 20 | % https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html | |
| 21 | % https://www.oilshell.org/blog/2017/03/31.html | |
| 22 | ||
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 23 | \begin{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 24 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
362diff
changeset | 25 | \section*{Handout 5 (Grammars \& Parser)}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 26 | |
| 727 | 27 | So far we have focused on regular expressions as well as matching and | 
| 28 | lexing algorithms. While regular expressions are very useful for lexing | |
| 29 | and for recognising many patterns in strings (like email addresses), | |
| 30 | they have their limitations. For example there is no regular expression | |
| 31 | that can recognise the language $a^nb^n$ (where you have strings | |
| 32 | starting with $n$ $a$'s followed by the same amount of $b$'s). Another | |
| 33 | example for which there exists no regular expression is the language of | |
| 34 | well-parenthesised expressions. In languages like Lisp, which use | |
| 35 | parentheses rather extensively, it might be of interest to know whether | |
| 36 | the following two expressions are well-parenthesised or not (the left | |
| 37 | one is, the right one is not): | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 38 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 39 | \begin{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 40 | $(((()()))())$  \hspace{10mm} $(((()()))()))$
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 41 | \end{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 42 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 43 | \noindent Not being able to solve such recognition problems is | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 44 | a serious limitation. In order to solve such recognition | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 45 | problems, we need more powerful techniques than regular | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 46 | expressions. We will in particular look at \emph{context-free
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 47 | languages}. They include the regular languages as the picture | 
| 582 | 48 | below about language classes shows: | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 49 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 50 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 51 | \begin{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 52 | \begin{tikzpicture}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 53 | [rect/.style={draw=black!50, 
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 54 | top color=white,bottom color=black!20, | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 55 | rectangle, very thick, rounded corners}] | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 56 | |
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 57 | \draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 58 | \draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 59 | \draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 60 | \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 61 | \draw (0,-1.05) node [rect] {\small regular languages};
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 62 | \end{tikzpicture}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 63 | \end{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 64 | |
| 582 | 65 | \noindent Each ``bubble'' stands for sets of languages (remember | 
| 66 | languages are sets of strings). As indicated the set of regular | |
| 680 | 67 | languages is fully included inside the context-free languages, | 
| 582 | 68 | meaning every regular language is also context-free, but not vice | 
| 665 | 69 | versa. Below I will let you think, for example, what the context-free | 
| 582 | 70 | grammar is for the language corresponding to the regular expression | 
| 71 | $(aaa)^*a$. | |
| 72 | ||
| 73 | Because of their convenience, context-free languages play an important | |
| 74 | role in `day-to-day' text processing and in programming | |
| 75 | languages. Context-free in this setting means that ``words'' have one | |
| 680 | 76 | meaning only and this meaning is independent from the context | 
| 77 | the ``words'' appear in. For example ambiguity issues like | |
| 582 | 78 | |
| 79 | \begin{center}
 | |
| 682 | 80 | \tt Time flies like an arrow. Fruit flies like bananas. | 
| 582 | 81 | \end{center}  
 | 
| 82 | ||
| 83 | \noindent | |
| 680 | 84 | from natural languages were the meaning of \emph{flies} depends on the
 | 
| 686 | 85 | surrounding \emph{context} are avoided as much as possible. Here is
 | 
| 86 | an interesting video about C++ being not a context-free language | |
| 87 | ||
| 88 | \begin{center}
 | |
| 89 | \url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
 | |
| 90 | \end{center}  
 | |
| 582 | 91 | |
| 92 | Context-free languages are usually specified by grammars. For example | |
| 93 | a grammar for well-parenthesised expressions can be given as follows: | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 94 | |
| 459 | 95 | \begin{plstx}[margin=3cm]
 | 
| 96 | : \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
 | |
| 97 | | \epsilon\\ | |
| 98 | \end{plstx}
 | |
| 99 | ||
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 100 | \noindent | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 101 | or a grammar for recognising strings consisting of ones is | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 102 | |
| 459 | 103 | \begin{plstx}[margin=3cm]
 | 
| 104 | : \meta{O} ::= 1 \cdot  \meta{O} 
 | |
| 105 | | 1\\ | |
| 106 | \end{plstx}
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 107 | |
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 108 | In general grammars consist of finitely many rules built up | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 109 | from \emph{terminal symbols} (usually lower-case letters) and
 | 
| 582 | 110 | \emph{non-terminal symbols} (upper-case letters written in
 | 
| 111 | bold like \meta{A}, \meta{N} and so on). Rules have
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 112 | the shape | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 113 | |
| 459 | 114 | \begin{plstx}[margin=3cm]
 | 
| 115 | : \meta{NT} ::= rhs\\
 | |
| 116 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 117 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 118 | \noindent where on the left-hand side is a single non-terminal | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 119 | and on the right a string consisting of both terminals and | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 120 | non-terminals including the $\epsilon$-symbol for indicating | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 121 | the empty string. We use the convention to separate components | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 122 | on the right hand-side by using the $\cdot$ symbol, as in the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 123 | grammar for well-parenthesised expressions. We also use the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 124 | convention to use $|$ as a shorthand notation for several | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 125 | rules. For example | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 126 | |
| 459 | 127 | \begin{plstx}[margin=3cm]
 | 
| 128 | : \meta{NT} ::= rhs_1
 | |
| 129 | | rhs_2\\ | |
| 130 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 131 | |
| 459 | 132 | \noindent means that the non-terminal \meta{NT} can be replaced by
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 133 | either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 134 | than one non-terminal on the left-hand side of the rules, then | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 135 | we need to indicate what is the \emph{starting} symbol of the
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 136 | grammar. For example the grammar for arithmetic expressions | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 137 | can be given as follows | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 138 | |
| 459 | 139 | \begin{plstx}[margin=3cm,one per line]
 | 
| 140 | \mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
 | |
| 141 | \mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
 | |
| 142 | \mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
 | |
| 143 | \mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
 | |
| 144 | \mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
 | |
| 145 | \mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} 
 | |
| 146 | \mid 0 \mid 1 \mid \ldots \mid 9\\ | |
| 147 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 148 | |
| 459 | 149 | \noindent where \meta{E} is the starting symbol. A
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 150 | \emph{derivation} for a grammar starts with the starting
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 151 | symbol of the grammar and in each step replaces one | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 152 | non-terminal by a right-hand side of a rule. A derivation ends | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 153 | with a string in which only terminal symbols are left. For | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 154 | example a derivation for the string $(1 + 2) + 3$ is as | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 155 | follows: | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 156 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 157 | \begin{center}
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 158 | \begin{tabular}{lll@{\hspace{2cm}}l}
 | 
| 459 | 159 | \meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$          & by (2)\\
 | 
| 160 |        & $\rightarrow$ & $(\meta{E})+\meta{E}$     & by (5)\\
 | |
| 161 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$   & by (2)\\
 | |
| 162 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$   & by (1)\\
 | |
| 163 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+3$   & by (6\dots)\\
 | |
| 164 |        & $\rightarrow$ & $(\meta{N}+\meta{E})+3$   & by (1)\\
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 165 | & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\ | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 166 | \end{tabular} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 167 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 168 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 169 | \noindent where on the right it is indicated which | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 170 | grammar rule has been applied. In the last step we | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 171 | merged several steps into one. | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 172 | |
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 173 | The \emph{language} of a context-free grammar $G$
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 174 | with start symbol $S$ is defined as the set of strings | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 175 | derivable by a derivation, that is | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 176 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 177 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 178 | $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 179 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 180 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 181 | \noindent | 
| 680 | 182 | A \emph{parse-tree} encodes how a string is derived with the starting
 | 
| 183 | symbol on top and each non-terminal containing a subtree for how it is | |
| 184 | replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is | |
| 185 | as follows: | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 186 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 187 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 188 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 189 |   \node {\meta{E}}
 | 
| 190 |     child {node {\meta{E} } 
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 191 |        child {node {$($}}
 | 
| 665 | 192 |        child {node {\meta{E} }       
 | 
| 193 |          child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 194 |          child {node {$+$}}
 | 
| 665 | 195 |          child {node {\meta{E} } 
 | 
| 196 |             child {node {\meta{N} } child {node {$2$}}}
 | |
| 197 |             child {node {\meta{N} } child {node {$3$}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 198 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 199 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 200 |        child {node {$)$}}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 201 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 202 |      child {node {$+$}}
 | 
| 665 | 203 |      child {node {\meta{E} }
 | 
| 204 |         child {node {\meta{N} } child {node {$4$}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 205 | }; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 206 | \end{tikzpicture}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 207 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 208 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 209 | \noindent We are often interested in these parse-trees since | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 210 | they encode the structure of how a string is derived by a | 
| 680 | 211 | grammar. | 
| 212 | ||
| 213 | Before we come to the problem of constructing such parse-trees, we need | |
| 214 | to consider the following two properties of grammars. A grammar is | |
| 215 | \emph{left-recursive} if there is a derivation starting from a
 | |
| 216 | non-terminal, say \meta{NT} which leads to a string which again starts
 | |
| 217 | with \meta{NT}. This means a derivation of the form.
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 218 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 219 | \begin{center}
 | 
| 665 | 220 | $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 221 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 222 | |
| 680 | 223 | \noindent It can be easily seen that the grammar above for arithmetic | 
| 224 | expressions is left-recursive: for example the rules $\meta{E}
 | |
| 225 | \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
 | |
| 226 | \meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
 | |
| 227 | note that left-recursiveness can involve more than one step in the | |
| 228 | derivation. The problem with left-recursive grammars is that some | |
| 229 | algorithms cannot cope with them: with left-recursive grammars they will | |
| 230 | fall into a loop. Fortunately every left-recursive grammar can be | |
| 231 | transformed into one that is not left-recursive, although this | |
| 232 | transformation might make the grammar less ``human-readable''. For | |
| 233 | example if we want to give a non-left-recursive grammar for numbers we | |
| 234 | might specify | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 235 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 236 | \begin{center}
 | 
| 665 | 237 | $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
 | 
| 238 | 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 239 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 240 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 241 | \noindent Using this grammar we can still derive every number | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 242 | string, but we will never be able to derive a string of the | 
| 665 | 243 | form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 244 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 245 | The other property we have to watch out for is when a grammar | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 246 | is \emph{ambiguous}. A grammar is said to be ambiguous if
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 247 | there are two parse-trees for one string. Again the grammar | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 248 | for arithmetic expressions shown above is ambiguous. While the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 249 | shown parse tree for the string $(1 + 23) + 4$ is unique, this | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 250 | is not the case in general. For example there are two parse | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 251 | trees for the string $1 + 2 + 3$, namely | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 252 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 253 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 254 | \begin{tabular}{c@{\hspace{10mm}}c}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 255 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 256 |   \node {\meta{E} }
 | 
| 257 |     child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 258 |     child {node {$+$}}
 | 
| 665 | 259 |     child {node {\meta{E} }
 | 
| 260 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 261 |        child {node {$+$}}
 | 
| 665 | 262 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 263 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 264 | ; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 265 | \end{tikzpicture} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 266 | & | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 267 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 268 |   \node {\meta{E} }
 | 
| 269 |     child {node {\meta{E} }
 | |
| 270 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 271 |        child {node {$+$}}
 | 
| 665 | 272 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} 
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 273 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 274 |     child {node {$+$}}
 | 
| 665 | 275 |     child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 276 | ; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 277 | \end{tikzpicture}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 278 | \end{tabular} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 279 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 280 | |
| 680 | 281 | \noindent In particular in programming languages we will try to avoid | 
| 282 | ambiguous grammars because two different parse-trees for a string mean a | |
| 283 | program can be interpreted in two different ways. In such cases we have | |
| 284 | to somehow make sure the two different ways do not matter, or | |
| 285 | disambiguate the grammar in some other way (for example making the $+$ | |
| 286 | left-associative). Unfortunately already the problem of deciding whether | |
| 287 | a grammar is ambiguous or not is in general undecidable. But in simple | |
| 288 | instance (the ones we deal with in this module) one can usually see when | |
| 289 | a grammar is ambiguous. | |
| 290 | ||
| 291 | \subsection*{Removing Left-Recursion}
 | |
| 292 | ||
| 293 | Let us come back to the problem of left-recursion and consider the | |
| 294 | following grammar for binary numbers: | |
| 295 | ||
| 296 | \begin{plstx}[margin=1cm]
 | |
| 297 | : \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
 | |
| 298 | \end{plstx}
 | |
| 299 | ||
| 300 | \noindent | |
| 301 | It is clear that this grammar can create all binary numbers, but | |
| 302 | it is also clear that this grammar is left-recursive. Giving this | |
| 303 | grammar as is to parser combinators will result in an infinite | |
| 304 | loop. Fortunately, every left-recursive grammar can be translated | |
| 305 | into one that is not left-recursive with the help of some | |
| 306 | transformation rules. Suppose we identified the ``offensive'' | |
| 307 | rule, then we can separate the grammar into this offensive rule | |
| 308 | and the ``rest'': | |
| 309 | ||
| 310 | \begin{plstx}[margin=1cm]
 | |
| 311 |   : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
 | |
| 312 |   | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
 | |
| 313 | \end{plstx}
 | |
| 314 | ||
| 315 | \noindent | |
| 316 | To make the idea of the transformation clearer, suppose the left-recursive | |
| 317 | rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
 | |
| 318 | followed by something called $\alpha$) and the ``rest'' is called $\beta$. | |
| 319 | That means our grammar looks schematically as follows | |
| 320 | ||
| 321 | \begin{plstx}[margin=1cm]
 | |
| 322 |   : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
 | |
| 323 | \end{plstx}
 | |
| 324 | ||
| 325 | \noindent | |
| 326 | To get rid of the left-recursion, we are required to introduce | |
| 327 | a new non-terminal, say $\meta{B'}$ and transform the rule
 | |
| 328 | as follows: | |
| 329 | ||
| 330 | \begin{plstx}[margin=1cm]
 | |
| 331 |   : \meta{B} ::= \beta \cdot \meta{B'}\\
 | |
| 332 |   : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
 | |
| 333 | \end{plstx}
 | |
| 334 | ||
| 335 | \noindent | |
| 336 | In our example of binary numbers we would after the transformation | |
| 337 | end up with the rules | |
| 338 | ||
| 339 | \begin{plstx}[margin=1cm]
 | |
| 340 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 341 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | |
| 342 | \end{plstx}
 | |
| 343 | ||
| 344 | \noindent | |
| 345 | A little thought should convince you that this grammar still derives | |
| 346 | all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
 | |
| 347 | can be $\epsilon$). Less clear might be why this grammar is non-left recursive. | |
| 348 | For $\meta{B'}$ it is relatively clear because we will never be 
 | |
| 349 | able to derive things like | |
| 350 | ||
| 351 | \begin{center}
 | |
| 352 | $\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
 | |
| 353 | \end{center}  
 | |
| 354 | ||
| 355 | \noindent | |
| 356 | because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
 | |
| 357 | $\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
 | |
| 358 | never be in the first place. The reasoning is similar for $\meta{B}$:
 | |
| 359 | the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
 | |
| 360 | left-recursive. This transformation does not mean the grammar is the | |
| 361 | simplest left-recursive grammar for binary numbers. For example the | |
| 362 | following grammar would do as well | |
| 363 | ||
| 364 | \begin{plstx}[margin=1cm]
 | |
| 365 |   : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
 | |
| 366 | \end{plstx}
 | |
| 367 | ||
| 368 | \noindent | |
| 369 | The point is that we can in principle transform every left-recursive | |
| 370 | grammar into one that is non-left-recursive one. This explains why often | |
| 371 | the following grammar is used for arithmetic expressions: | |
| 372 | ||
| 373 | \begin{plstx}[margin=1cm]
 | |
| 374 |   : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
 | |
| 375 |   : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
 | |
| 376 |   : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
 | |
| 377 | \end{plstx}
 | |
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 378 | |
| 680 | 379 | \noindent | 
| 380 | In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
 | |
| 381 | are in some way protected from being left-recusive. For example if you | |
| 382 | start $\meta{E}$ you can derive another one by going through $\meta{T}$, then 
 | |
| 383 | $\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
 | |
| 384 | ||
| 385 | \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
 | |
| 386 | ||
| 387 | I showed above that the non-left-recursive grammar for binary numbers is | |
| 388 | ||
| 389 | \begin{plstx}[margin=1cm]
 | |
| 390 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 391 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | |
| 392 | \end{plstx}
 | |
| 393 | ||
| 394 | \noindent | |
| 395 | The transformation made the original grammar non-left-recursive, but at | |
| 396 | the expense of introducing an $\epsilon$ in the second rule. Having an | |
| 397 | explicit $\epsilon$-rule is annoying to, not in terms of looping, but in | |
| 398 | terms of efficiency. The reason is that the $\epsilon$-rule always | |
| 399 | applies but since it recognises the empty string, it does not make any | |
| 400 | progress with recognising a string. Better are rules like $( \cdot | |
| 401 | \meta{E} \cdot )$ where something of the input is consumed. Getting
 | |
| 402 | rid of $\epsilon$-rules is also important for the CYK parsing algorithm, | |
| 403 | which can give us an insight into the complexity class of parsing. | |
| 404 | ||
| 405 | It turns out we can also by some generic transformations eliminate | |
| 406 | $\epsilon$-rules from grammars. Consider again the grammar above for | |
| 407 | binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
 | |
| 408 | we look for rules of the (generic) form \mbox{$\meta{A} :=
 | |
| 409 | \alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use
 | |
| 410 | $\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
 | |
| 411 | something follows ($\beta$). Such rules need to be replaced by | |
| 412 | additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
 | |
| 413 | In our running example there are the two rules for $\meta{B}$ which
 | |
| 414 | fall into this category | |
| 415 | ||
| 416 | \begin{plstx}[margin=1cm]
 | |
| 417 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 418 | \end{plstx} 
 | |
| 419 | ||
| 420 | \noindent To follow the general scheme of the transfromation, | |
| 421 | the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens | |
| 422 | to be empty. SO we need to generate new rules for the form | |
| 423 | \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
 | |
| 424 | example means we obtain | |
| 425 | ||
| 426 | \begin{plstx}[margin=1cm]
 | |
| 427 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | |
| 428 | \end{plstx} 
 | |
| 429 | ||
| 430 | \noindent | |
| 431 | Unfortunately $\meta{B'}$ is also used in the rule 
 | |
| 432 | ||
| 433 | \begin{plstx}[margin=1cm]
 | |
| 434 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
 | |
| 435 | \end{plstx}
 | |
| 436 | ||
| 437 | \noindent | |
| 438 | For this we repeat the transformation, giving | |
| 439 | ||
| 440 | \begin{plstx}[margin=1cm]
 | |
| 441 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | |
| 442 | \end{plstx}
 | |
| 443 | ||
| 444 | \noindent | |
| 445 | In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
 | |
| 446 | was again empty. Once no rule is left over, we can simply throw | |
| 447 | away the $\epsilon$ rule. This gives the grammar | |
| 448 | ||
| 449 | \begin{plstx}[margin=1cm]
 | |
| 450 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | |
| 451 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | |
| 452 | \end{plstx}
 | |
| 453 | ||
| 454 | \noindent | |
| 455 | I let you think about whether this grammar can still recognise all | |
| 456 | binary numbers and whether this grammar is non-left-recursive. The | |
| 457 | precise statement for the transformation of removing $\epsilon$-rules is | |
| 458 | that if the original grammar was able to recognise only non-empty | |
| 459 | strings, then the transformed grammar will be equivalent (matching the | |
| 460 | same set of strings); if the original grammar was able to match the | |
| 461 | empty string, then the transformed grammar will be able to match the | |
| 462 | same strings, \emph{except} the empty string. So the  $\epsilon$-removal
 | |
| 463 | does not preserve equivalence of grammars, but the small defect with the | |
| 464 | empty string is not important for practical purposes. | |
| 465 | ||
| 466 | So why are these transformations all useful? Well apart from making the | |
| 467 | parser combinators work (remember they cannot deal with left-recursion and | |
| 468 | are inefficient with $\epsilon$-rules), a second reason is that they help | |
| 469 | with getting any insight into the complexity of the parsing problem. | |
| 470 | The parser combinators are very easy to implement, but are far from the | |
| 471 | most efficient way of processing input (they can blow up exponentially | |
| 472 | with ambiguous grammars). The question remains what is the best possible | |
| 473 | complexity for parsing? It turns out that this is $O(n^3)$ for context-free | |
| 474 | languages. | |
| 475 | ||
| 476 | To answer the question about complexity, let me describe next the CYK | |
| 477 | algorithm (named after the authors Cocke–Younger–Kasami). This algorithm | |
| 681 | 478 | works with grammars that are in \emph{Chomsky normalform}. In Chomsky
 | 
| 479 | normalform all rules must be of the form $\meta{A} ::= a$, where $a$ is 
 | |
| 480 | a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and
 | |
| 481 | $\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$.
 | |
| 482 | The following grammar is in Chomsky normalform: | |
| 483 | ||
| 484 | \begin{plstx}[margin=1cm]
 | |
| 682 | 485 |   : \meta{S} ::= \meta{N}\cdot \meta{P}\\
 | 
| 486 |   : \meta{P} ::= \meta{V}\cdot \meta{N}\\
 | |
| 487 |   : \meta{N} ::= \meta{N}\cdot \meta{N}\\
 | |
| 488 |   : \meta{N} ::= \meta{A}\cdot \meta{N}\\
 | |
| 489 |   : \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} 
 | |
| 490 |                 | \texttt{trains}\\
 | |
| 491 |   : \meta{V} ::= \texttt{trains} | \texttt{team}\\
 | |
| 492 |   : \meta{A} ::= \texttt{The} | \texttt{the}\\
 | |
| 681 | 493 | \end{plstx}
 | 
| 494 | ||
| 495 | \noindent | |
| 496 | where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,
 | |
| 497 | $\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''
 | |
| 498 | are terminals. The rough idea behind this grammar is that $\meta{S}$
 | |
| 499 | stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun
 | |
| 500 | and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot
 | |
| 501 | \meta{N}$} states that a predicate can be a verb followed by a noun.
 | |
| 502 | Now the question is whether the string | |
| 503 | ||
| 504 | \begin{center}
 | |
| 505 |   \texttt{The trainer trains the student team}
 | |
| 506 | \end{center}
 | |
| 507 | ||
| 508 | \noindent | |
| 509 | is recognised by the grammar. The CYK algorithm starts with the | |
| 510 | following triangular data structure. | |
| 680 | 511 | |
| 682 | 512 | \begin{figure}[t]
 | 
| 513 | \begin{center}
 | |
| 514 |   \begin{tikzpicture}[scale=0.8,line width=0.8mm]
 | |
| 515 | \draw (-2,0) -- (4,0); | |
| 516 | \draw (-2,1) -- (4,1); | |
| 517 | \draw (-2,2) -- (3,2); | |
| 518 | \draw (-2,3) -- (2,3); | |
| 519 | \draw (-2,4) -- (1,4); | |
| 520 | \draw (-2,5) -- (0,5); | |
| 521 | \draw (-2,6) -- (-1,6); | |
| 522 | ||
| 523 | \draw (0,0) -- (0, 5); | |
| 524 | \draw (1,0) -- (1, 4); | |
| 525 | \draw (2,0) -- (2, 3); | |
| 526 | \draw (3,0) -- (3, 2); | |
| 527 | \draw (4,0) -- (4, 1); | |
| 528 | \draw (-1,0) -- (-1, 6); | |
| 529 | \draw (-2,0) -- (-2, 6); | |
| 530 | ||
| 531 |   \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
 | |
| 532 |   \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
 | |
| 533 |   \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
 | |
| 534 |   \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
 | |
| 535 |   \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
 | |
| 536 |   \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
 | |
| 537 | ||
| 538 |   \draw (-1.5,0.5) node {$A$}; 
 | |
| 539 |   \draw (-0.5,0.5) node {$N$}; 
 | |
| 540 |   \draw ( 0.5,0.5) node {$N,V$}; 
 | |
| 541 |   \draw ( 1.5,0.5) node {$A$}; 
 | |
| 542 |   \draw ( 2.5,0.5) node {$N$}; 
 | |
| 543 |   \draw ( 3.5,0.5) node {$N,V$};
 | |
| 544 | ||
| 545 |   \draw (-2.4, 5.5) node {$1$}; 
 | |
| 546 |   \draw (-2.4, 4.5) node {$2$}; 
 | |
| 547 |   \draw (-2.4, 3.5) node {$3$}; 
 | |
| 548 |   \draw (-2.4, 2.5) node {$4$}; 
 | |
| 549 |   \draw (-2.4, 1.5) node {$5$}; 
 | |
| 550 |   \draw (-2.4, 0.5) node {$6$}; 
 | |
| 551 |   \end{tikzpicture}
 | |
| 552 |   \end{center}
 | |
| 553 | \end{figure}
 | |
| 680 | 554 | |
| 555 | \end{document}
 | |
| 556 | ||
| 557 | ||
| 558 | %%% Parser combinators are now part of handout 6 | |
| 459 | 559 | |
| 560 | \subsection*{Parser Combinators}
 | |
| 561 | ||
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 562 | Let us now turn to the problem of generating a parse-tree for | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 563 | a grammar and string. In what follows we explain \emph{parser
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 564 | combinators}, because they are easy to implement and closely | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 565 | resemble grammar rules. Imagine that a grammar describes the | 
| 665 | 566 | strings of natural numbers, such as the grammar \meta{N}  shown
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 567 | above. For all such strings we want to generate the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 568 | parse-trees or later on we actually want to extract the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 569 | meaning of these strings, that is the concrete integers | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 570 | ``behind'' these strings. In Scala the parser combinators will | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 571 | be functions of type | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 572 | |
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 573 | \begin{center}
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 574 | \texttt{I $\Rightarrow$ Set[(T, I)]}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 575 | \end{center}
 | 
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 576 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 577 | \noindent that is they take as input something of type | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 578 | \texttt{I}, typically a list of tokens or a string, and return
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 579 | a set of pairs. The first component of these pairs corresponds | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 580 | to what the parser combinator was able to process from the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 581 | input and the second is the unprocessed part of the input. As | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 582 | we shall see shortly, a parser combinator might return more | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 583 | than one such pair, with the idea that there are potentially | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 584 | several ways how to interpret the input. As a concrete | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 585 | example, consider the case where the input is of type string, | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 586 | say the string | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 587 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 588 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 589 | \tt\Grid{iffoo\VS testbar}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 590 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 591 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 592 | \noindent We might have a parser combinator which tries to | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 593 | interpret this string as a keyword (\texttt{if}) or an
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 594 | identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 595 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 596 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 597 | $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 598 |            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 599 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 600 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 601 | \noindent where the first pair means the parser could | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 602 | recognise \texttt{if} from the input and leaves the rest as
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 603 | `unprocessed' as the second component of the pair; in the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 604 | other case it could recognise \texttt{iffoo} and leaves
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 605 | \texttt{\VS testbar} as unprocessed. If the parser cannot
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 606 | recognise anything from the input then parser combinators just | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 607 | return the empty set $\{\}$. This will indicate
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 608 | something ``went wrong''. | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 609 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 610 | The main attraction is that we can easily build parser combinators out of smaller components | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 611 | following very closely the structure of a grammar. In order to implement this in an object | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 612 | oriented programming language, like Scala, we need to specify an abstract class for parser | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 613 | combinators. This abstract class requires the implementation of the function | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 614 | \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 615 | \mbox{\texttt{Set[(T, I)]}}.
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 616 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 617 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 618 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 619 | abstract class Parser[I, T] {
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 620 | def parse(ts: I): Set[(T, I)] | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 621 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 622 | def parse_all(ts: I): Set[T] = | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 623 | for ((head, tail) <- parse(ts); if (tail.isEmpty)) | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 624 | yield head | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 625 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 626 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 627 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 628 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 629 | \noindent | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 630 | From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 631 | which just filters out all pairs whose second component is not empty (that is has still some | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 632 | unprocessed part). The reason is that at the end of parsing we are only interested in the | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 633 | results where all the input has been consumed and no unprocessed part is left. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 634 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 635 | One of the simplest parser combinators recognises just a character, say $c$, | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 636 | from the beginning of strings. Its behaviour is as follows: | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 637 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 638 | \begin{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 639 | \item if the head of the input string starts with a $c$, it returns | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 640 | 	the set $\{(c, \textit{tail of}\; s)\}$
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 641 | \item otherwise it returns the empty set $\varnothing$ | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 642 | \end{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 643 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 644 | \noindent | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 645 | The input type of this simple parser combinator for characters is | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 646 | \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 647 | The code in Scala is as follows: | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 648 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 649 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 650 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 651 | case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 652 | def parse(sb: String) = | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 653 | if (sb.head == c) Set((c, sb.tail)) else Set() | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 654 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 655 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 656 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 657 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 658 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 659 | The \texttt{parse} function tests whether the first character of the 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 660 | input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 661 | string into the recognised part \texttt{c} and the unprocessed part
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 662 | \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 663 | the parser returns the empty set (in Scala \texttt{Set()}).
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 664 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 665 | More interesting are the parser combinators that build larger parsers | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 666 | out of smaller component parsers. For example the alternative | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 667 | parser combinator is as follows. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 668 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 669 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 670 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 671 | class AltParser[I, T] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 672 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 673 |         q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 674 | def parse(sb: I) = p.parse(sb) ++ q.parse(sb) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 675 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 676 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 677 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 678 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 679 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 680 | The types of this parser combinator are polymorphic (we just have \texttt{I}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 681 | for the input type, and \texttt{T} for the output type). The alternative parser
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 682 | builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 683 | Both need to be able to process input of type \texttt{I} and return the same
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 684 | output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 685 | \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 686 | evaluation of the arguments before they are used. This is often called | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 687 | \emph{lazy evaluation} of the arguments.) The alternative parser should run
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 688 | the input with the first parser \texttt{p} (producing a set of outputs) and then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 689 | run the same input with \texttt{q}. The result should be then just the union
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 690 | of both sets, which is the operation \texttt{++} in Scala.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 691 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 692 | This parser combinator already allows us to construct a parser that either | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 693 | a character \texttt{a} or \texttt{b}, as
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 694 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 695 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 696 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 697 | new AltParser(CharParser('a'), CharParser('b'))
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 698 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 699 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 700 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 701 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 702 | Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 703 | We can call this parser combinator with the strings | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 704 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 705 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 706 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 707 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 708 | \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 709 | \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 710 | \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 711 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 712 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 713 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 714 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 715 | We receive in the first two cases a successful output (that is a non-empty set). | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 716 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 717 | A bit more interesting is the \emph{sequence parser combinator} implemented in
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 718 | Scala as follows: | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 719 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 720 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 721 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 722 | class SeqParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 723 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 724 |         q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 725 | def parse(sb: I) = | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 726 | for ((head1, tail1) <- p.parse(sb); | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 727 | (head2, tail2) <- q.parse(tail1)) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 728 | yield ((head1, head2), tail2) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 729 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 730 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 731 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 732 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 733 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 734 | This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 735 | as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 736 | The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 737 | Let \texttt{q} run on these unprocessed parts
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 738 | producing again a set of pairs. The output of the sequence parser combinator is then a set | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 739 | containing pairs where the first components are again pairs, namely what the first parser could parse | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 740 | together with what the second parser could parse; the second component is the unprocessed | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 741 | part left over after running the second parser \texttt{q}. Therefore the input type of
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 742 | the sequence parser combinator is as usual \texttt{I}, but the output type is
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 743 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 744 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 745 | \texttt{Set[((T, S), I)]}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 746 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 747 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 748 | Scala allows us to provide some | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 749 | shorthand notation for the sequence parser combinator. So we can write for | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 750 | example \texttt{'a'  $\sim$ 'b'}, which is the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 751 | parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 752 | Calling this parser combinator with the strings | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 753 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 754 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 755 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 756 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 757 | \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 758 | \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 759 | \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 760 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 761 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 762 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 763 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 764 | A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 765 | an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 766 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 767 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 768 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 769 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 770 | \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 771 | \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 772 | \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 773 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 774 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 775 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 776 | Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 777 | The first parser has as output type a single character (recall the type of \texttt{CharParser}),
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 778 | but the second parser produces a pair of characters as output. The alternative parser is however | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 779 | required to have both component parsers to have the same type. We will see later how we can | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 780 | build this parser without the typing error. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 781 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 782 | The next parser combinator does not actually combine smaller parsers, but applies | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 783 | a function to the result of the parser. It is implemented in Scala as follows | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 784 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 785 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 786 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 787 | class FunParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 788 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 789 |           f: T => S) extends Parser[I, S] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 790 | def parse(sb: I) = | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 791 | for ((head, tail) <- p.parse(sb)) yield (f(head), tail) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 792 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 793 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 794 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 795 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 796 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 797 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 798 | This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 799 | input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 800 | produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 801 | applies the function \texttt{f} to all the parer outputs. Since this function
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 802 | is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 803 | Again Scala lets us introduce some shorthand notation for this parser combinator. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 804 | Therefore we will write \texttt{p ==> f} for it.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 805 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 806 | %\bigskip | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 807 | %takes advantage of the full generality---have a look | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 808 | %what it produces if we call it with the string \texttt{abc}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 809 | % | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 810 | %\begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 811 | %\begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 812 | %input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 813 | %\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 814 | %\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 815 | %\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 816 | %\end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 817 | %\end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 818 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 819 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 820 | |
| 680 | 821 | |
| 822 | ||
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 823 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 824 | %%% Local Variables: | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 825 | %%% mode: latex | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 826 | %%% TeX-master: t | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 827 | %%% End: | 
| 680 | 828 |