| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Fri, 17 Oct 2025 11:20:49 +0100 | |
| changeset 1010 | ae9ffbf979ff | 
| parent 941 | 66adcae6c762 | 
| permissions | -rw-r--r-- | 
| 665 | 1 | % !TEX program = xelatex | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 2 | \documentclass{article}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 3 | \usepackage{../style}
 | 
| 217 
cd6066f1056a
updated handouts
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
183diff
changeset | 4 | \usepackage{../langs}
 | 
| 459 | 5 | \usepackage{../grammar}
 | 
| 798 | 6 | \usepackage{../graphics}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 7 | |
| 545 | 8 | % epsilon and left-recursion elimination | 
| 9 | % http://www.mollypages.org/page/grammar/index.mp | |
| 10 | ||
| 618 | 11 | %% parsing scala files | 
| 691 | 12 | %% https://scalameta.org/ | 
| 13 | ||
| 14 | % chomsky normal form transformation | |
| 15 | % https://youtu.be/FNPSlnj3Vt0 | |
| 618 | 16 | |
| 710 | 17 | % Language hierachy is about complexity | 
| 18 | % How hard is it to recognise an element in a language | |
| 19 | ||
| 722 | 20 | % Pratt parsing | 
| 21 | % https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html | |
| 22 | % https://www.oilshell.org/blog/2017/03/31.html | |
| 23 | ||
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 24 | \begin{document}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 25 | |
| 385 
7f8516ff408d
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
362diff
changeset | 26 | \section*{Handout 5 (Grammars \& Parser)}
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 27 | |
| 727 | 28 | So far we have focused on regular expressions as well as matching and | 
| 29 | lexing algorithms. While regular expressions are very useful for lexing | |
| 30 | and for recognising many patterns in strings (like email addresses), | |
| 31 | they have their limitations. For example there is no regular expression | |
| 32 | that can recognise the language $a^nb^n$ (where you have strings | |
| 33 | starting with $n$ $a$'s followed by the same amount of $b$'s). Another | |
| 34 | example for which there exists no regular expression is the language of | |
| 35 | well-parenthesised expressions. In languages like Lisp, which use | |
| 36 | parentheses rather extensively, it might be of interest to know whether | |
| 37 | the following two expressions are well-parenthesised or not (the left | |
| 38 | one is, the right one is not): | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 39 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 40 | \begin{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 41 | $(((()()))())$  \hspace{10mm} $(((()()))()))$
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 42 | \end{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 43 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 44 | \noindent Not being able to solve such recognition problems is | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 45 | a serious limitation. In order to solve such recognition | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 46 | problems, we need more powerful techniques than regular | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 47 | expressions. We will in particular look at \emph{context-free
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 48 | languages}. They include the regular languages as the picture | 
| 582 | 49 | below about language classes shows: | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 50 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 51 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 52 | \begin{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 53 | \begin{tikzpicture}
 | 
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 54 | [rect/.style={draw=black!50, 
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 55 | top color=white,bottom color=black!20, | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 56 | rectangle, very thick, rounded corners}] | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 57 | |
| 297 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 58 | \draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 59 | \draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 60 | \draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 61 | \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
 | 
| 
5c51839c88fd
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
292diff
changeset | 62 | \draw (0,-1.05) node [rect] {\small regular languages};
 | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 63 | \end{tikzpicture}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 64 | \end{center}
 | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 65 | |
| 582 | 66 | \noindent Each ``bubble'' stands for sets of languages (remember | 
| 67 | languages are sets of strings). As indicated the set of regular | |
| 680 | 68 | languages is fully included inside the context-free languages, | 
| 582 | 69 | meaning every regular language is also context-free, but not vice | 
| 665 | 70 | versa. Below I will let you think, for example, what the context-free | 
| 582 | 71 | grammar is for the language corresponding to the regular expression | 
| 72 | $(aaa)^*a$. | |
| 73 | ||
| 74 | Because of their convenience, context-free languages play an important | |
| 75 | role in `day-to-day' text processing and in programming | |
| 76 | languages. Context-free in this setting means that ``words'' have one | |
| 680 | 77 | meaning only and this meaning is independent from the context | 
| 78 | the ``words'' appear in. For example ambiguity issues like | |
| 582 | 79 | |
| 80 | \begin{center}
 | |
| 682 | 81 | \tt Time flies like an arrow. Fruit flies like bananas. | 
| 582 | 82 | \end{center}  
 | 
| 83 | ||
| 84 | \noindent | |
| 941 | 85 | from natural languages where the meaning of \emph{flies} depends on the
 | 
| 686 | 86 | surrounding \emph{context} are avoided as much as possible. Here is
 | 
| 937 | 87 | an interesting video about C++ not being a context-free language | 
| 686 | 88 | |
| 89 | \begin{center}
 | |
| 90 | \url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
 | |
| 91 | \end{center}  
 | |
| 582 | 92 | |
| 93 | Context-free languages are usually specified by grammars. For example | |
| 94 | a grammar for well-parenthesised expressions can be given as follows: | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 95 | |
| 459 | 96 | \begin{plstx}[margin=3cm]
 | 
| 97 | : \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
 | |
| 98 | | \epsilon\\ | |
| 99 | \end{plstx}
 | |
| 100 | ||
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 101 | \noindent | 
| 937 | 102 | or a grammar for recognising strings consisting of ones (at least one) is | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 103 | |
| 459 | 104 | \begin{plstx}[margin=3cm]
 | 
| 105 | : \meta{O} ::= 1 \cdot  \meta{O} 
 | |
| 106 | | 1\\ | |
| 107 | \end{plstx}
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 108 | |
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 109 | In general grammars consist of finitely many rules built up | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 110 | from \emph{terminal symbols} (usually lower-case letters) and
 | 
| 582 | 111 | \emph{non-terminal symbols} (upper-case letters written in
 | 
| 112 | bold like \meta{A}, \meta{N} and so on). Rules have
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 113 | the shape | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 114 | |
| 459 | 115 | \begin{plstx}[margin=3cm]
 | 
| 116 | : \meta{NT} ::= rhs\\
 | |
| 117 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 118 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 119 | \noindent where on the left-hand side is a single non-terminal | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 120 | and on the right a string consisting of both terminals and | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 121 | non-terminals including the $\epsilon$-symbol for indicating | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 122 | the empty string. We use the convention to separate components | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 123 | on the right hand-side by using the $\cdot$ symbol, as in the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 124 | grammar for well-parenthesised expressions. We also use the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 125 | convention to use $|$ as a shorthand notation for several | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 126 | rules. For example | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 127 | |
| 459 | 128 | \begin{plstx}[margin=3cm]
 | 
| 129 | : \meta{NT} ::= rhs_1
 | |
| 130 | | rhs_2\\ | |
| 131 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 132 | |
| 459 | 133 | \noindent means that the non-terminal \meta{NT} can be replaced by
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 134 | either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 135 | than one non-terminal on the left-hand side of the rules, then | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 136 | we need to indicate what is the \emph{starting} symbol of the
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 137 | grammar. For example the grammar for arithmetic expressions | 
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 138 | can be given as follows | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 139 | |
| 459 | 140 | \begin{plstx}[margin=3cm,one per line]
 | 
| 141 | \mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
 | |
| 142 | \mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
 | |
| 143 | \mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
 | |
| 144 | \mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
 | |
| 145 | \mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
 | |
| 146 | \mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} 
 | |
| 147 | \mid 0 \mid 1 \mid \ldots \mid 9\\ | |
| 148 | \end{plstx}
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 149 | |
| 459 | 150 | \noindent where \meta{E} is the starting symbol. A
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 151 | \emph{derivation} for a grammar starts with the starting
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 152 | symbol of the grammar and in each step replaces one | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 153 | non-terminal by a right-hand side of a rule. A derivation ends | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 154 | with a string in which only terminal symbols are left. For | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 155 | example a derivation for the string $(1 + 2) + 3$ is as | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 156 | follows: | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 157 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 158 | \begin{center}
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 159 | \begin{tabular}{lll@{\hspace{2cm}}l}
 | 
| 459 | 160 | \meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$          & by (2)\\
 | 
| 161 |        & $\rightarrow$ & $(\meta{E})+\meta{E}$     & by (5)\\
 | |
| 162 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$   & by (2)\\
 | |
| 163 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$   & by (1)\\
 | |
| 164 |        & $\rightarrow$ & $(\meta{E}+\meta{E})+3$   & by (6\dots)\\
 | |
| 165 |        & $\rightarrow$ & $(\meta{N}+\meta{E})+3$   & by (1)\\
 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 166 | & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\ | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 167 | \end{tabular} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 168 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 169 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 170 | \noindent where on the right it is indicated which | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 171 | grammar rule has been applied. In the last step we | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 172 | merged several steps into one. | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 173 | |
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 174 | The \emph{language} of a context-free grammar $G$
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 175 | with start symbol $S$ is defined as the set of strings | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 176 | derivable by a derivation, that is | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 177 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 178 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 179 | $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 180 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 181 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 182 | \noindent | 
| 680 | 183 | A \emph{parse-tree} encodes how a string is derived with the starting
 | 
| 184 | symbol on top and each non-terminal containing a subtree for how it is | |
| 185 | replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is | |
| 186 | as follows: | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 187 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 188 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 189 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 190 |   \node {\meta{E}}
 | 
| 191 |     child {node {\meta{E} } 
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 192 |        child {node {$($}}
 | 
| 665 | 193 |        child {node {\meta{E} }       
 | 
| 194 |          child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 195 |          child {node {$+$}}
 | 
| 665 | 196 |          child {node {\meta{E} } 
 | 
| 197 |             child {node {\meta{N} } child {node {$2$}}}
 | |
| 198 |             child {node {\meta{N} } child {node {$3$}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 199 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 200 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 201 |        child {node {$)$}}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 202 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 203 |      child {node {$+$}}
 | 
| 665 | 204 |      child {node {\meta{E} }
 | 
| 205 |         child {node {\meta{N} } child {node {$4$}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 206 | }; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 207 | \end{tikzpicture}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 208 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 209 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 210 | \noindent We are often interested in these parse-trees since | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 211 | they encode the structure of how a string is derived by a | 
| 680 | 212 | grammar. | 
| 213 | ||
| 214 | Before we come to the problem of constructing such parse-trees, we need | |
| 215 | to consider the following two properties of grammars. A grammar is | |
| 216 | \emph{left-recursive} if there is a derivation starting from a
 | |
| 217 | non-terminal, say \meta{NT} which leads to a string which again starts
 | |
| 218 | with \meta{NT}. This means a derivation of the form.
 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 219 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 220 | \begin{center}
 | 
| 665 | 221 | $\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 222 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 223 | |
| 680 | 224 | \noindent It can be easily seen that the grammar above for arithmetic | 
| 225 | expressions is left-recursive: for example the rules $\meta{E}
 | |
| 226 | \rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
 | |
| 227 | \meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
 | |
| 228 | note that left-recursiveness can involve more than one step in the | |
| 229 | derivation. The problem with left-recursive grammars is that some | |
| 230 | algorithms cannot cope with them: with left-recursive grammars they will | |
| 231 | fall into a loop. Fortunately every left-recursive grammar can be | |
| 232 | transformed into one that is not left-recursive, although this | |
| 233 | transformation might make the grammar less ``human-readable''. For | |
| 234 | example if we want to give a non-left-recursive grammar for numbers we | |
| 235 | might specify | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 236 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 237 | \begin{center}
 | 
| 665 | 238 | $\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
 | 
| 239 | 1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 240 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 241 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 242 | \noindent Using this grammar we can still derive every number | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 243 | string, but we will never be able to derive a string of the | 
| 665 | 244 | form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 245 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 246 | The other property we have to watch out for is when a grammar | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 247 | is \emph{ambiguous}. A grammar is said to be ambiguous if
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 248 | there are two parse-trees for one string. Again the grammar | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 249 | for arithmetic expressions shown above is ambiguous. While the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 250 | shown parse tree for the string $(1 + 23) + 4$ is unique, this | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 251 | is not the case in general. For example there are two parse | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 252 | trees for the string $1 + 2 + 3$, namely | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 253 | |
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 254 | \begin{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 255 | \begin{tabular}{c@{\hspace{10mm}}c}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 256 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 257 |   \node {\meta{E} }
 | 
| 258 |     child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 259 |     child {node {$+$}}
 | 
| 665 | 260 |     child {node {\meta{E} }
 | 
| 261 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 262 |        child {node {$+$}}
 | 
| 665 | 263 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 264 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 265 | ; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 266 | \end{tikzpicture} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 267 | & | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 268 | \begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 269 |   \node {\meta{E} }
 | 
| 270 |     child {node {\meta{E} }
 | |
| 271 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | |
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 272 |        child {node {$+$}}
 | 
| 665 | 273 |        child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} 
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 274 | } | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 275 |     child {node {$+$}}
 | 
| 665 | 276 |     child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 175 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 277 | ; | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 278 | \end{tikzpicture}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 279 | \end{tabular} 
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 280 | \end{center}
 | 
| 
5801e8c0e528
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
173diff
changeset | 281 | |
| 680 | 282 | \noindent In particular in programming languages we will try to avoid | 
| 283 | ambiguous grammars because two different parse-trees for a string mean a | |
| 284 | program can be interpreted in two different ways. In such cases we have | |
| 285 | to somehow make sure the two different ways do not matter, or | |
| 286 | disambiguate the grammar in some other way (for example making the $+$ | |
| 287 | left-associative). Unfortunately already the problem of deciding whether | |
| 288 | a grammar is ambiguous or not is in general undecidable. But in simple | |
| 289 | instance (the ones we deal with in this module) one can usually see when | |
| 290 | a grammar is ambiguous. | |
| 291 | ||
| 292 | \subsection*{Removing Left-Recursion}
 | |
| 293 | ||
| 294 | Let us come back to the problem of left-recursion and consider the | |
| 295 | following grammar for binary numbers: | |
| 296 | ||
| 297 | \begin{plstx}[margin=1cm]
 | |
| 298 | : \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
 | |
| 299 | \end{plstx}
 | |
| 300 | ||
| 301 | \noindent | |
| 302 | It is clear that this grammar can create all binary numbers, but | |
| 303 | it is also clear that this grammar is left-recursive. Giving this | |
| 304 | grammar as is to parser combinators will result in an infinite | |
| 305 | loop. Fortunately, every left-recursive grammar can be translated | |
| 306 | into one that is not left-recursive with the help of some | |
| 307 | transformation rules. Suppose we identified the ``offensive'' | |
| 308 | rule, then we can separate the grammar into this offensive rule | |
| 309 | and the ``rest'': | |
| 310 | ||
| 311 | \begin{plstx}[margin=1cm]
 | |
| 312 |   : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
 | |
| 313 |   | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
 | |
| 314 | \end{plstx}
 | |
| 315 | ||
| 316 | \noindent | |
| 317 | To make the idea of the transformation clearer, suppose the left-recursive | |
| 318 | rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
 | |
| 319 | followed by something called $\alpha$) and the ``rest'' is called $\beta$. | |
| 320 | That means our grammar looks schematically as follows | |
| 321 | ||
| 322 | \begin{plstx}[margin=1cm]
 | |
| 323 |   : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
 | |
| 324 | \end{plstx}
 | |
| 325 | ||
| 326 | \noindent | |
| 327 | To get rid of the left-recursion, we are required to introduce | |
| 328 | a new non-terminal, say $\meta{B'}$ and transform the rule
 | |
| 329 | as follows: | |
| 330 | ||
| 331 | \begin{plstx}[margin=1cm]
 | |
| 332 |   : \meta{B} ::= \beta \cdot \meta{B'}\\
 | |
| 333 |   : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
 | |
| 334 | \end{plstx}
 | |
| 335 | ||
| 336 | \noindent | |
| 337 | In our example of binary numbers we would after the transformation | |
| 338 | end up with the rules | |
| 339 | ||
| 340 | \begin{plstx}[margin=1cm]
 | |
| 341 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 342 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | |
| 343 | \end{plstx}
 | |
| 344 | ||
| 345 | \noindent | |
| 346 | A little thought should convince you that this grammar still derives | |
| 347 | all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
 | |
| 348 | can be $\epsilon$). Less clear might be why this grammar is non-left recursive. | |
| 349 | For $\meta{B'}$ it is relatively clear because we will never be 
 | |
| 350 | able to derive things like | |
| 351 | ||
| 352 | \begin{center}
 | |
| 353 | $\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
 | |
| 354 | \end{center}  
 | |
| 355 | ||
| 356 | \noindent | |
| 357 | because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
 | |
| 358 | $\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
 | |
| 359 | never be in the first place. The reasoning is similar for $\meta{B}$:
 | |
| 360 | the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
 | |
| 361 | left-recursive. This transformation does not mean the grammar is the | |
| 362 | simplest left-recursive grammar for binary numbers. For example the | |
| 363 | following grammar would do as well | |
| 364 | ||
| 365 | \begin{plstx}[margin=1cm]
 | |
| 366 |   : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
 | |
| 367 | \end{plstx}
 | |
| 368 | ||
| 369 | \noindent | |
| 370 | The point is that we can in principle transform every left-recursive | |
| 941 | 371 | grammar into one that is non-left-recursive. This explains why often | 
| 680 | 372 | the following grammar is used for arithmetic expressions: | 
| 373 | ||
| 374 | \begin{plstx}[margin=1cm]
 | |
| 375 |   : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
 | |
| 376 |   : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
 | |
| 377 |   : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
 | |
| 378 | \end{plstx}
 | |
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 379 | |
| 680 | 380 | \noindent | 
| 937 | 381 | In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and
 | 
| 382 | $\meta{F}$actors are in some way protected from being
 | |
| 941 | 383 | left-recursive. For example if you start $\meta{E}$ you can derive
 | 
| 937 | 384 | another one by going through $\meta{T}$, then $\meta{F}$, but then
 | 
| 385 | $\meta{E}$ is protected by the open-parenthesis in the last rule.
 | |
| 680 | 386 | |
| 387 | \subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
 | |
| 388 | ||
| 389 | I showed above that the non-left-recursive grammar for binary numbers is | |
| 390 | ||
| 391 | \begin{plstx}[margin=1cm]
 | |
| 392 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 393 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | |
| 394 | \end{plstx}
 | |
| 395 | ||
| 396 | \noindent | |
| 397 | The transformation made the original grammar non-left-recursive, but at | |
| 398 | the expense of introducing an $\epsilon$ in the second rule. Having an | |
| 937 | 399 | explicit $\epsilon$-rule is annoying, not in terms of looping, but in | 
| 680 | 400 | terms of efficiency. The reason is that the $\epsilon$-rule always | 
| 401 | applies but since it recognises the empty string, it does not make any | |
| 402 | progress with recognising a string. Better are rules like $( \cdot | |
| 403 | \meta{E} \cdot )$ where something of the input is consumed. Getting
 | |
| 404 | rid of $\epsilon$-rules is also important for the CYK parsing algorithm, | |
| 405 | which can give us an insight into the complexity class of parsing. | |
| 406 | ||
| 407 | It turns out we can also by some generic transformations eliminate | |
| 408 | $\epsilon$-rules from grammars. Consider again the grammar above for | |
| 409 | binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
 | |
| 410 | we look for rules of the (generic) form \mbox{$\meta{A} :=
 | |
| 941 | 411 | \alpha\cdot\meta{B'}\cdot\beta$}. That is, there are rules that use
 | 
| 680 | 412 | $\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
 | 
| 413 | something follows ($\beta$). Such rules need to be replaced by | |
| 414 | additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
 | |
| 415 | In our running example there are the two rules for $\meta{B}$ which
 | |
| 416 | fall into this category | |
| 417 | ||
| 418 | \begin{plstx}[margin=1cm]
 | |
| 419 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | |
| 420 | \end{plstx} 
 | |
| 421 | ||
| 941 | 422 | \noindent To follow the general scheme of the transformation, | 
| 680 | 423 | the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens | 
| 798 | 424 | to be empty. So we need to generate new rules for the form | 
| 680 | 425 | \mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
 | 
| 426 | example means we obtain | |
| 427 | ||
| 428 | \begin{plstx}[margin=1cm]
 | |
| 429 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | |
| 430 | \end{plstx} 
 | |
| 431 | ||
| 432 | \noindent | |
| 433 | Unfortunately $\meta{B'}$ is also used in the rule 
 | |
| 434 | ||
| 435 | \begin{plstx}[margin=1cm]
 | |
| 436 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
 | |
| 437 | \end{plstx}
 | |
| 438 | ||
| 439 | \noindent | |
| 440 | For this we repeat the transformation, giving | |
| 441 | ||
| 442 | \begin{plstx}[margin=1cm]
 | |
| 443 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | |
| 444 | \end{plstx}
 | |
| 445 | ||
| 446 | \noindent | |
| 447 | In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
 | |
| 448 | was again empty. Once no rule is left over, we can simply throw | |
| 449 | away the $\epsilon$ rule. This gives the grammar | |
| 450 | ||
| 451 | \begin{plstx}[margin=1cm]
 | |
| 452 |   : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | |
| 453 |   : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | |
| 454 | \end{plstx}
 | |
| 455 | ||
| 456 | \noindent | |
| 457 | I let you think about whether this grammar can still recognise all | |
| 458 | binary numbers and whether this grammar is non-left-recursive. The | |
| 937 | 459 | precise statement for the transformation of removing $\epsilon$-rules | 
| 460 | is that if the original grammar was able to recognise only non-empty | |
| 680 | 461 | strings, then the transformed grammar will be equivalent (matching the | 
| 462 | same set of strings); if the original grammar was able to match the | |
| 463 | empty string, then the transformed grammar will be able to match the | |
| 937 | 464 | same strings, \emph{except} the empty string. So the
 | 
| 465 | $\epsilon$-removal does not preserve equivalence of grammars in | |
| 466 | general, but the small defect with the empty string is not important | |
| 467 | for practical purposes. | |
| 680 | 468 | |
| 469 | So why are these transformations all useful? Well apart from making the | |
| 470 | parser combinators work (remember they cannot deal with left-recursion and | |
| 471 | are inefficient with $\epsilon$-rules), a second reason is that they help | |
| 472 | with getting any insight into the complexity of the parsing problem. | |
| 473 | The parser combinators are very easy to implement, but are far from the | |
| 474 | most efficient way of processing input (they can blow up exponentially | |
| 475 | with ambiguous grammars). The question remains what is the best possible | |
| 476 | complexity for parsing? It turns out that this is $O(n^3)$ for context-free | |
| 477 | languages. | |
| 478 | ||
| 479 | To answer the question about complexity, let me describe next the CYK | |
| 480 | algorithm (named after the authors Cocke–Younger–Kasami). This algorithm | |
| 681 | 481 | works with grammars that are in \emph{Chomsky normalform}. In Chomsky
 | 
| 482 | normalform all rules must be of the form $\meta{A} ::= a$, where $a$ is 
 | |
| 483 | a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and
 | |
| 484 | $\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$.
 | |
| 485 | The following grammar is in Chomsky normalform: | |
| 486 | ||
| 487 | \begin{plstx}[margin=1cm]
 | |
| 682 | 488 |   : \meta{S} ::= \meta{N}\cdot \meta{P}\\
 | 
| 489 |   : \meta{P} ::= \meta{V}\cdot \meta{N}\\
 | |
| 490 |   : \meta{N} ::= \meta{N}\cdot \meta{N}\\
 | |
| 491 |   : \meta{N} ::= \meta{A}\cdot \meta{N}\\
 | |
| 492 |   : \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} 
 | |
| 493 |                 | \texttt{trains}\\
 | |
| 494 |   : \meta{V} ::= \texttt{trains} | \texttt{team}\\
 | |
| 495 |   : \meta{A} ::= \texttt{The} | \texttt{the}\\
 | |
| 681 | 496 | \end{plstx}
 | 
| 497 | ||
| 498 | \noindent | |
| 499 | where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,
 | |
| 500 | $\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''
 | |
| 501 | are terminals. The rough idea behind this grammar is that $\meta{S}$
 | |
| 502 | stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun
 | |
| 503 | and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot
 | |
| 504 | \meta{N}$} states that a predicate can be a verb followed by a noun.
 | |
| 505 | Now the question is whether the string | |
| 506 | ||
| 507 | \begin{center}
 | |
| 508 |   \texttt{The trainer trains the student team}
 | |
| 509 | \end{center}
 | |
| 510 | ||
| 511 | \noindent | |
| 512 | is recognised by the grammar. The CYK algorithm starts with the | |
| 513 | following triangular data structure. | |
| 680 | 514 | |
| 798 | 515 | %%\begin{figure}[t]
 | 
| 682 | 516 | \begin{center}
 | 
| 798 | 517 |   \begin{tikzpicture}[scale=0.7,line width=0.8mm]
 | 
| 682 | 518 | \draw (-2,0) -- (4,0); | 
| 519 | \draw (-2,1) -- (4,1); | |
| 520 | \draw (-2,2) -- (3,2); | |
| 521 | \draw (-2,3) -- (2,3); | |
| 522 | \draw (-2,4) -- (1,4); | |
| 523 | \draw (-2,5) -- (0,5); | |
| 524 | \draw (-2,6) -- (-1,6); | |
| 525 | ||
| 526 | \draw (0,0) -- (0, 5); | |
| 527 | \draw (1,0) -- (1, 4); | |
| 528 | \draw (2,0) -- (2, 3); | |
| 529 | \draw (3,0) -- (3, 2); | |
| 530 | \draw (4,0) -- (4, 1); | |
| 531 | \draw (-1,0) -- (-1, 6); | |
| 532 | \draw (-2,0) -- (-2, 6); | |
| 533 | ||
| 534 |   \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
 | |
| 535 |   \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
 | |
| 536 |   \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
 | |
| 537 |   \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
 | |
| 538 |   \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
 | |
| 539 |   \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
 | |
| 540 | ||
| 541 |   \draw (-1.5,0.5) node {$A$}; 
 | |
| 542 |   \draw (-0.5,0.5) node {$N$}; 
 | |
| 543 |   \draw ( 0.5,0.5) node {$N,V$}; 
 | |
| 544 |   \draw ( 1.5,0.5) node {$A$}; 
 | |
| 545 |   \draw ( 2.5,0.5) node {$N$}; 
 | |
| 546 |   \draw ( 3.5,0.5) node {$N,V$};
 | |
| 547 | ||
| 798 | 548 | %  \draw (-1.5,1.5) node {\small{}a}; 
 | 
| 549 | %  \draw (-0.5,1.5) node {\small{}b}; 
 | |
| 550 | %  \draw ( 0.5,1.5) node {\small{}c}; 
 | |
| 551 | %  \draw ( 1.5,1.5) node {\small{}d}; 
 | |
| 552 | %  \draw ( 2.5,1.5) node {\small{}e}; 
 | |
| 553 | ||
| 682 | 554 |   \draw (-2.4, 5.5) node {$1$}; 
 | 
| 555 |   \draw (-2.4, 4.5) node {$2$}; 
 | |
| 556 |   \draw (-2.4, 3.5) node {$3$}; 
 | |
| 557 |   \draw (-2.4, 2.5) node {$4$}; 
 | |
| 558 |   \draw (-2.4, 1.5) node {$5$}; 
 | |
| 559 |   \draw (-2.4, 0.5) node {$6$}; 
 | |
| 560 |   \end{tikzpicture}
 | |
| 561 |   \end{center}
 | |
| 798 | 562 | %%\end{figure}
 | 
| 563 | ||
| 564 | ||
| 565 | \noindent | |
| 566 | The last row contains the information about all words and their | |
| 567 | corresponding non-terminals. For example the field for \texttt{trains}
 | |
| 937 | 568 | contains the information $\meta{N}$ and $\meta{V}$ because \texttt{trains} can be a
 | 
| 798 | 569 | ``verb'' and a ``noun'' according to the grammar. The row above, | 
| 570 | let's call the corresponding fields 5a to 5e, contains information | |
| 937 | 571 | about \underline{2-word} parts of the sentence, namely
 | 
| 798 | 572 | |
| 573 | \begin{center}
 | |
| 574 | \begin{tabular}{llll}
 | |
| 575 |   5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
 | |
| 576 |   5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
 | |
| 577 |   5c) & \texttt{trains} $\mid$ \texttt{the}    \\
 | |
| 578 |   5d) & \texttt{the} $\mid$ \texttt{student}   \\
 | |
| 579 |   5e) & \texttt{student} $\mid$ \texttt{team}  \\
 | |
| 580 | \end{tabular}  
 | |
| 581 | \end{center}  
 | |
| 582 | ||
| 583 | \noindent | |
| 584 | For each of them, we look up in row 6 which non-terminals it belongs to | |
| 585 | (indicated above for 5a and 5b). For 5a, with the non-terminals | |
| 586 | \meta{A} and \meta{N}, we find the grammar rule
 | |
| 587 | ||
| 588 | \begin{plstx}[margin=1cm]
 | |
| 589 |   : \meta{N} ::= \meta{A}\cdot \meta{N}\\
 | |
| 590 | \end{plstx}
 | |
| 591 | ||
| 592 | \noindent | |
| 593 | which means into field 5a we put the left-hand side of this rule, | |
| 594 | which in this case is the non-terminal \meta{N}. For 5b we have to check
 | |
| 595 | for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there
 | |
| 596 | is a right-hand side of this form in the grammar. But only the | |
| 597 | grammar rule | |
| 598 | ||
| 599 | \begin{plstx}[margin=1cm]
 | |
| 600 |   : \meta{N} ::= \meta{N}\cdot \meta{N}\\
 | |
| 601 | \end{plstx}
 | |
| 602 | ||
| 603 | \noindent | |
| 604 | matches, which means 5b gets also an \meta{N}. Continuing for all
 | |
| 605 | fields in row 5 gives: | |
| 606 | ||
| 607 | \begin{center}
 | |
| 608 |   \begin{tikzpicture}[scale=0.7,line width=0.8mm]
 | |
| 609 | \draw (-2,0) -- (4,0); | |
| 610 | \draw (-2,1) -- (4,1); | |
| 611 | \draw (-2,2) -- (3,2); | |
| 612 | \draw (-2,3) -- (2,3); | |
| 613 | \draw (-2,4) -- (1,4); | |
| 614 | \draw (-2,5) -- (0,5); | |
| 615 | \draw (-2,6) -- (-1,6); | |
| 616 | ||
| 617 | \draw (0,0) -- (0, 5); | |
| 618 | \draw (1,0) -- (1, 4); | |
| 619 | \draw (2,0) -- (2, 3); | |
| 620 | \draw (3,0) -- (3, 2); | |
| 621 | \draw (4,0) -- (4, 1); | |
| 622 | \draw (-1,0) -- (-1, 6); | |
| 623 | \draw (-2,0) -- (-2, 6); | |
| 624 | ||
| 625 |   \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
 | |
| 626 |   \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
 | |
| 627 |   \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
 | |
| 628 |   \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
 | |
| 629 |   \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
 | |
| 630 |   \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
 | |
| 631 | ||
| 632 |   \draw (-1.5,0.5) node {$A$}; 
 | |
| 633 |   \draw (-0.5,0.5) node {$N$}; 
 | |
| 634 |   \draw ( 0.5,0.5) node {$N,V$}; 
 | |
| 635 |   \draw ( 1.5,0.5) node {$A$}; 
 | |
| 636 |   \draw ( 2.5,0.5) node {$N$}; 
 | |
| 637 |   \draw ( 3.5,0.5) node {$N,V$};
 | |
| 638 | ||
| 639 |   \draw (-1.5,1.5) node {$N$}; 
 | |
| 640 |   \draw (-0.5,1.5) node {$N$}; 
 | |
| 641 |   \draw ( 0.5,1.5) node {$$}; 
 | |
| 642 |   \draw ( 1.5,1.5) node {$N$}; 
 | |
| 643 |   \draw ( 2.5,1.5) node {$N$}; 
 | |
| 644 | ||
| 645 | ||
| 646 | %  \draw (-1.5,1.5) node {\small{}a}; 
 | |
| 647 | %  \draw (-0.5,1.5) node {\small{}b}; 
 | |
| 648 | %  \draw ( 0.5,1.5) node {\small{}c}; 
 | |
| 649 | %  \draw ( 1.5,1.5) node {\small{}d}; 
 | |
| 650 | %  \draw ( 2.5,1.5) node {\small{}e}; 
 | |
| 651 | ||
| 652 |   \draw (-2.4, 5.5) node {$1$}; 
 | |
| 653 |   \draw (-2.4, 4.5) node {$2$}; 
 | |
| 654 |   \draw (-2.4, 3.5) node {$3$}; 
 | |
| 655 |   \draw (-2.4, 2.5) node {$4$}; 
 | |
| 656 |   \draw (-2.4, 1.5) node {$5$}; 
 | |
| 657 |   \draw (-2.4, 0.5) node {$6$}; 
 | |
| 658 |   \end{tikzpicture}
 | |
| 659 |   \end{center}
 | |
| 660 | ||
| 661 | \noindent | |
| 662 | Now row 4 is in charge of all 3-word parts of the sentence, namely | |
| 663 | ||
| 664 | \begin{center}
 | |
| 665 | \begin{tabular}{llll}
 | |
| 666 | 4a) & The $\mid$ trainer trains\\ | |
| 667 | & The trainer $\mid$ trains\\ | |
| 668 | 4b) & trainer $\mid$ trains the\\ | |
| 669 | & trainer trains $\mid$ the\\ | |
| 670 | 4c) & trains $\mid$ the student\\ | |
| 671 | & trains the $\mid$ student\\ | |
| 672 | 4d) & the $\mid$ student team\\ | |
| 673 | & the student $\mid$ team\\ | |
| 674 | \end{tabular}  
 | |
| 675 | \end{center}
 | |
| 676 | ||
| 677 | \noindent | |
| 678 | Note that in case of 3-word parts we have two splits. For example for | |
| 679 | 4a: $\underbrace{\texttt{The}}_{A}$ and
 | |
| 680 | $\underbrace{\texttt{trainer trains}}_{N}$; and also
 | |
| 681 | $\underbrace{\texttt{The trainer}}_{N}$ and
 | |
| 682 | $\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we have
 | |
| 683 | to look up in the rows below, which non-terminals we already | |
| 684 | computed. This allows us to look for right-hand sides in our grammar: | |
| 685 | $\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and
 | |
| 686 | $\meta{N}\cdot\meta{V}$, which yield the only left-hand side
 | |
| 687 | \meta{N}. This is what we fill in for 4a. And so on for row 4.
 | |
| 688 | ||
| 689 | Row 3 is about all 4-word parts in the sentence, namely | |
| 690 | ||
| 691 | \begin{center}
 | |
| 692 | \begin{tabular}{llll}
 | |
| 693 | 3a) & The trainer trains the\\ | |
| 694 | 3b) & trainer trains the student\\ | |
| 695 | 3c) & trains the student team\\ | |
| 696 | \end{tabular}  
 | |
| 697 | \end{center}
 | |
| 698 | ||
| 699 | \noindent | |
| 700 | Each of them can be split up in 3 ways, for example | |
| 701 | ||
| 702 | \begin{center}
 | |
| 703 | \begin{tabular}{llll}
 | |
| 704 | 3a) & The $\mid$ trainer trains the\\ | |
| 705 | & The trainer $\mid$ trains the\\ | |
| 706 | & The trainer trains $\mid$ the\\ | |
| 707 | \end{tabular}  
 | |
| 708 | \end{center}
 | |
| 709 | ||
| 710 | \noindent | |
| 711 | and we have to consider them all in turn to fill in the non-terminals for | |
| 712 | 3a. You can guess how it continues: row 2 is for all 5-word parts, and finally | |
| 713 | the field on the top is for the whole sentence. | |
| 714 | The idea of the CYK algorithm is that if in the top-field the starting | |
| 715 | symbol \meta{S} appears (possibly together with other non-terminals),
 | |
| 716 | then the sentence is accepted by the grammar. If it does not, then the | |
| 717 | sentence is not accepted. | |
| 718 | ||
| 719 | Let us very quickly calculate the complexity of the CYK | |
| 720 | algorithm. Lookup operations inside the triangle and in the grammar | |
| 721 | are assumed to be of constant time, $O(1)$, meaning they do not | |
| 722 | matter. How many fields are in the triangle\ldots | |
| 723 | $\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That means
 | |
| 724 | roughly $O(n^2)$ fields. How much work do we have to do for each | |
| 725 | field? Well, for the top-most we have to consider $n-1$ splits, which | |
| 726 | means roughly $O(n)$ for each field. The overall result is a $O(n^3)$ | |
| 727 | time-complexity for CYK. It turns out that this is the best worst-time | |
| 728 | complexity for parsing with context-free grammars. | |
| 680 | 729 | |
| 730 | \end{document}
 | |
| 731 | ||
| 732 | ||
| 733 | %%% Parser combinators are now part of handout 6 | |
| 459 | 734 | |
| 735 | \subsection*{Parser Combinators}
 | |
| 736 | ||
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 737 | Let us now turn to the problem of generating a parse-tree for | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 738 | a grammar and string. In what follows we explain \emph{parser
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 739 | combinators}, because they are easy to implement and closely | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 740 | resemble grammar rules. Imagine that a grammar describes the | 
| 665 | 741 | strings of natural numbers, such as the grammar \meta{N}  shown
 | 
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 742 | above. For all such strings we want to generate the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 743 | parse-trees or later on we actually want to extract the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 744 | meaning of these strings, that is the concrete integers | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 745 | ``behind'' these strings. In Scala the parser combinators will | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 746 | be functions of type | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 747 | |
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 748 | \begin{center}
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 749 | \texttt{I $\Rightarrow$ Set[(T, I)]}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 750 | \end{center}
 | 
| 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 751 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 752 | \noindent that is they take as input something of type | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 753 | \texttt{I}, typically a list of tokens or a string, and return
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 754 | a set of pairs. The first component of these pairs corresponds | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 755 | to what the parser combinator was able to process from the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 756 | input and the second is the unprocessed part of the input. As | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 757 | we shall see shortly, a parser combinator might return more | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 758 | than one such pair, with the idea that there are potentially | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 759 | several ways how to interpret the input. As a concrete | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 760 | example, consider the case where the input is of type string, | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 761 | say the string | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 762 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 763 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 764 | \tt\Grid{iffoo\VS testbar}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 765 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 766 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 767 | \noindent We might have a parser combinator which tries to | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 768 | interpret this string as a keyword (\texttt{if}) or an
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 769 | identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 770 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 771 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 772 | $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 773 |            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 774 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 775 | |
| 362 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 776 | \noindent where the first pair means the parser could | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 777 | recognise \texttt{if} from the input and leaves the rest as
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 778 | `unprocessed' as the second component of the pair; in the | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 779 | other case it could recognise \texttt{iffoo} and leaves
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 780 | \texttt{\VS testbar} as unprocessed. If the parser cannot
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 781 | recognise anything from the input then parser combinators just | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 782 | return the empty set $\{\}$. This will indicate
 | 
| 
57ea439feaff
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
360diff
changeset | 783 | something ``went wrong''. | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 784 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 785 | The main attraction is that we can easily build parser combinators out of smaller components | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 786 | following very closely the structure of a grammar. In order to implement this in an object | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 787 | oriented programming language, like Scala, we need to specify an abstract class for parser | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 788 | combinators. This abstract class requires the implementation of the function | 
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 789 | \texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 790 | \mbox{\texttt{Set[(T, I)]}}.
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 791 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 792 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 793 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 794 | abstract class Parser[I, T] {
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 795 | def parse(ts: I): Set[(T, I)] | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 796 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 797 | def parse_all(ts: I): Set[T] = | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 798 | for ((head, tail) <- parse(ts); if (tail.isEmpty)) | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 799 | yield head | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 800 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 801 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 802 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 803 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 804 | \noindent | 
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 805 | From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 806 | which just filters out all pairs whose second component is not empty (that is has still some | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 807 | unprocessed part). The reason is that at the end of parsing we are only interested in the | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 808 | results where all the input has been consumed and no unprocessed part is left. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 809 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 810 | One of the simplest parser combinators recognises just a character, say $c$, | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 811 | from the beginning of strings. Its behaviour is as follows: | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 812 | |
| 177 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 813 | \begin{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 814 | \item if the head of the input string starts with a $c$, it returns | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 815 | 	the set $\{(c, \textit{tail of}\; s)\}$
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 816 | \item otherwise it returns the empty set $\varnothing$ | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 817 | \end{itemize}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 818 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 819 | \noindent | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 820 | The input type of this simple parser combinator for characters is | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 821 | \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 822 | The code in Scala is as follows: | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 823 | |
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 824 | \begin{center}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 825 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 826 | case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 827 | def parse(sb: String) = | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 828 | if (sb.head == c) Set((c, sb.tail)) else Set() | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 829 | } | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 830 | \end{lstlisting}
 | 
| 
53def1fbf472
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
176diff
changeset | 831 | \end{center}
 | 
| 176 
3c2653fc8b5a
updated
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
175diff
changeset | 832 | |
| 183 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 833 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 834 | The \texttt{parse} function tests whether the first character of the 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 835 | input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 836 | string into the recognised part \texttt{c} and the unprocessed part
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 837 | \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 838 | the parser returns the empty set (in Scala \texttt{Set()}).
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 839 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 840 | More interesting are the parser combinators that build larger parsers | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 841 | out of smaller component parsers. For example the alternative | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 842 | parser combinator is as follows. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 843 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 844 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 845 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 846 | class AltParser[I, T] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 847 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 848 |         q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 849 | def parse(sb: I) = p.parse(sb) ++ q.parse(sb) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 850 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 851 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 852 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 853 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 854 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 855 | The types of this parser combinator are polymorphic (we just have \texttt{I}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 856 | for the input type, and \texttt{T} for the output type). The alternative parser
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 857 | builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 858 | Both need to be able to process input of type \texttt{I} and return the same
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 859 | output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 860 | \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 861 | evaluation of the arguments before they are used. This is often called | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 862 | \emph{lazy evaluation} of the arguments.) The alternative parser should run
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 863 | the input with the first parser \texttt{p} (producing a set of outputs) and then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 864 | run the same input with \texttt{q}. The result should be then just the union
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 865 | of both sets, which is the operation \texttt{++} in Scala.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 866 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 867 | This parser combinator already allows us to construct a parser that either | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 868 | a character \texttt{a} or \texttt{b}, as
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 869 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 870 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 871 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 872 | new AltParser(CharParser('a'), CharParser('b'))
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 873 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 874 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 875 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 876 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 877 | Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 878 | We can call this parser combinator with the strings | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 879 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 880 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 881 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 882 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 883 | \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 884 | \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 885 | \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 886 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 887 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 888 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 889 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 890 | We receive in the first two cases a successful output (that is a non-empty set). | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 891 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 892 | A bit more interesting is the \emph{sequence parser combinator} implemented in
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 893 | Scala as follows: | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 894 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 895 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 896 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 897 | class SeqParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 898 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 899 |         q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 900 | def parse(sb: I) = | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 901 | for ((head1, tail1) <- p.parse(sb); | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 902 | (head2, tail2) <- q.parse(tail1)) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 903 | yield ((head1, head2), tail2) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 904 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 905 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 906 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 907 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 908 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 909 | This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 910 | as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 911 | The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 912 | Let \texttt{q} run on these unprocessed parts
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 913 | producing again a set of pairs. The output of the sequence parser combinator is then a set | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 914 | containing pairs where the first components are again pairs, namely what the first parser could parse | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 915 | together with what the second parser could parse; the second component is the unprocessed | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 916 | part left over after running the second parser \texttt{q}. Therefore the input type of
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 917 | the sequence parser combinator is as usual \texttt{I}, but the output type is
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 918 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 919 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 920 | \texttt{Set[((T, S), I)]}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 921 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 922 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 923 | Scala allows us to provide some | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 924 | shorthand notation for the sequence parser combinator. So we can write for | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 925 | example \texttt{'a'  $\sim$ 'b'}, which is the
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 926 | parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 927 | Calling this parser combinator with the strings | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 928 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 929 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 930 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 931 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 932 | \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 933 | \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 934 | \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 935 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 936 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 937 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 938 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 939 | A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 940 | an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 941 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 942 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 943 | \begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 944 | input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 945 | \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 946 | \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 947 | \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 948 | \end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 949 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 950 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 951 | Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 952 | The first parser has as output type a single character (recall the type of \texttt{CharParser}),
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 953 | but the second parser produces a pair of characters as output. The alternative parser is however | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 954 | required to have both component parsers to have the same type. We will see later how we can | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 955 | build this parser without the typing error. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 956 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 957 | The next parser combinator does not actually combine smaller parsers, but applies | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 958 | a function to the result of the parser. It is implemented in Scala as follows | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 959 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 960 | \begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 961 | \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 962 | class FunParser[I, T, S] | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 963 | (p: => Parser[I, T], | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 964 |           f: T => S) extends Parser[I, S] {
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 965 | def parse(sb: I) = | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 966 | for ((head, tail) <- p.parse(sb)) yield (f(head), tail) | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 967 | } | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 968 | \end{lstlisting}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 969 | \end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 970 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 971 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 972 | \noindent | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 973 | This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 974 | input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 975 | produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 976 | applies the function \texttt{f} to all the parer outputs. Since this function
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 977 | is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 978 | Again Scala lets us introduce some shorthand notation for this parser combinator. | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 979 | Therefore we will write \texttt{p ==> f} for it.
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 980 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 981 | %\bigskip | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 982 | %takes advantage of the full generality---have a look | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 983 | %what it produces if we call it with the string \texttt{abc}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 984 | % | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 985 | %\begin{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 986 | %\begin{tabular}{rcl}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 987 | %input string & & output\medskip\\ | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 988 | %\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 989 | %\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 990 | %\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 991 | %\end{tabular}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 992 | %\end{center}
 | 
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 993 | |
| 
b17eff695c7f
added new stuff
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 
177diff
changeset | 994 | |
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 995 | |
| 680 | 996 | |
| 997 | ||
| 173 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 998 | |
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 999 | %%% Local Variables: | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 1000 | %%% mode: latex | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 1001 | %%% TeX-master: t | 
| 
7cfb7a6f7c99
added slides
 Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff
changeset | 1002 | %%% End: | 
| 680 | 1003 |