5 |
5 |
6 \begin{document} |
6 \begin{document} |
7 |
7 |
8 \section*{Handout 6 (Grammars \& Parser)} |
8 \section*{Handout 6 (Grammars \& Parser)} |
9 |
9 |
10 While regular expressions are very useful for lexing and for recognising |
10 While regular expressions are very useful for lexing and for |
11 many patterns in strings (like email addresses), they have their limitations. For |
11 recognising many patterns in strings (like email addresses), |
12 example there is no regular expression that can recognise the language |
12 they have their limitations. For example there is no regular |
13 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised |
13 expression that can recognise the language $a^nb^n$. Another |
14 expressions. In languages like Lisp, which use parentheses rather |
14 example for which there exists no regular expression is the |
15 extensively, it might be of interest whether the following two expressions |
15 language of well-parenthesised expressions. In languages like |
16 are well-parenthesised (the left one is, the right one is not): |
16 Lisp, which use parentheses rather extensively, it might be of |
|
17 interest whether the following two expressions are |
|
18 well-parenthesised (the left one is, the right one is not): |
17 |
19 |
18 \begin{center} |
20 \begin{center} |
19 $(((()()))())$ \hspace{10mm} $(((()()))()))$ |
21 $(((()()))())$ \hspace{10mm} $(((()()))()))$ |
20 \end{center} |
22 \end{center} |
21 |
23 |
22 \noindent |
24 \noindent Not being able to solve such recognition problems is |
23 Not being able to solve such recognition problems is a serious limitation. |
25 a serious limitation. In order to solve such recognition |
24 In order to solve such recognition problems, we need more powerful |
26 problems, we need more powerful techniques than regular |
25 techniques than regular expressions. We will in particular look at \emph{context-free |
27 expressions. We will in particular look at \emph{context-free |
26 languages}. They include the regular languages as the picture below shows: |
28 languages}. They include the regular languages as the picture |
|
29 below shows: |
27 |
30 |
28 |
31 |
29 \begin{center} |
32 \begin{center} |
30 \begin{tikzpicture} |
33 \begin{tikzpicture} |
31 [rect/.style={draw=black!50, |
34 [rect/.style={draw=black!50, |
38 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages}; |
41 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages}; |
39 \draw (0,-1.05) node [rect] {\small regular languages}; |
42 \draw (0,-1.05) node [rect] {\small regular languages}; |
40 \end{tikzpicture} |
43 \end{tikzpicture} |
41 \end{center} |
44 \end{center} |
42 |
45 |
43 \noindent |
46 \noindent Context-free languages play an important role in |
44 Context-free languages play an important role in `day-to-day' text processing and in |
47 `day-to-day' text processing and in programming languages. |
45 programming languages. Context-free languages are usually specified by grammars. |
48 Context-free languages are usually specified by grammars. For |
46 For example a grammar for well-parenthesised expressions is |
49 example a grammar for well-parenthesised expressions is |
47 |
50 |
48 \begin{center} |
51 \begin{center} |
49 $P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$ |
52 $P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$ |
50 \end{center} |
53 \end{center} |
51 |
54 |
52 \noindent |
55 \noindent |
53 In general grammars consist of finitely many rules built up from \emph{terminal symbols} |
56 or a grammar for recognising strings consisting of ones is |
54 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules |
57 |
55 have the shape |
58 \begin{center} |
|
59 $O \;\;\rightarrow\;\; 1 \cdot O \;|\; 1$ |
|
60 \end{center} |
|
61 |
|
62 In general grammars consist of finitely many rules built up |
|
63 from \emph{terminal symbols} (usually lower-case letters) and |
|
64 \emph{non-terminal symbols} (upper-case letters). Rules have |
|
65 the shape |
56 |
66 |
57 \begin{center} |
67 \begin{center} |
58 $NT \;\;\rightarrow\;\; \textit{rhs}$ |
68 $NT \;\;\rightarrow\;\; \textit{rhs}$ |
59 \end{center} |
69 \end{center} |
60 |
70 |
61 \noindent |
71 \noindent where on the left-hand side is a single non-terminal |
62 where on the left-hand side is a single non-terminal and on the right a string consisting |
72 and on the right a string consisting of both terminals and |
63 of both terminals and non-terminals including the $\epsilon$-symbol for indicating the |
73 non-terminals including the $\epsilon$-symbol for indicating |
64 empty string. We use the convention to separate components on |
74 the empty string. We use the convention to separate components |
65 the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions. |
75 on the right hand-side by using the $\cdot$ symbol, as in the |
66 We also use the convention to use $|$ as a shorthand notation for several rules. For example |
76 grammar for well-parenthesised expressions. We also use the |
|
77 convention to use $|$ as a shorthand notation for several |
|
78 rules. For example |
67 |
79 |
68 \begin{center} |
80 \begin{center} |
69 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$ |
81 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$ |
70 \end{center} |
82 \end{center} |
71 |
83 |
72 \noindent |
84 \noindent means that the non-terminal $NT$ can be replaced by |
73 means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$. |
85 either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more |
74 If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate |
86 than one non-terminal on the left-hand side of the rules, then |
75 what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions |
87 we need to indicate what is the \emph{starting} symbol of the |
|
88 grammar. For example the grammar for arithmetic expressions |
76 can be given as follows |
89 can be given as follows |
77 |
90 |
78 \begin{center} |
91 \begin{center} |
79 \begin{tabular}{lcl} |
92 \begin{tabular}{lcl@{\hspace{2cm}}l} |
80 $E$ & $\rightarrow$ & $N$ \\ |
93 $E$ & $\rightarrow$ & $N$ & (1)\\ |
81 $E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\ |
94 $E$ & $\rightarrow$ & $E \cdot + \cdot E$ & (2)\\ |
82 $E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\ |
95 $E$ & $\rightarrow$ & $E \cdot - \cdot E$ & (3)\\ |
83 $E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\ |
96 $E$ & $\rightarrow$ & $E \cdot * \cdot E$ & (4)\\ |
84 $E$ & $\rightarrow$ & $( \cdot E \cdot )$\\ |
97 $E$ & $\rightarrow$ & $( \cdot E \cdot )$ & (5)\\ |
85 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ |
98 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ & (6\ldots) |
86 \end{tabular} |
99 \end{tabular} |
87 \end{center} |
100 \end{center} |
88 |
101 |
89 \noindent |
102 \noindent where $E$ is the starting symbol. A |
90 where $E$ is the starting symbol. A \emph{derivation} for a grammar |
103 \emph{derivation} for a grammar starts with the starting |
91 starts with the staring symbol of the grammar and in each step replaces one |
104 symbol of the grammar and in each step replaces one |
92 non-terminal by a right-hand side of a rule. A derivation ends with a string |
105 non-terminal by a right-hand side of a rule. A derivation ends |
93 in which only terminal symbols are left. For example a derivation for the |
106 with a string in which only terminal symbols are left. For |
94 string $(1 + 2) + 3$ is as follows: |
107 example a derivation for the string $(1 + 2) + 3$ is as |
95 |
108 follows: |
96 \begin{center} |
109 |
97 \begin{tabular}{lll} |
110 \begin{center} |
98 $E$ & $\rightarrow$ & $E+E$\\ |
111 \begin{tabular}{lll@{\hspace{2cm}}l} |
99 & $\rightarrow$ & $(E)+E$\\ |
112 $E$ & $\rightarrow$ & $E+E$ & by (2)\\ |
100 & $\rightarrow$ & $(E+E)+E$\\ |
113 & $\rightarrow$ & $(E)+E$ & by (5)\\ |
101 & $\rightarrow$ & $(E+E)+N$\\ |
114 & $\rightarrow$ & $(E+E)+E$ & by (2)\\ |
102 & $\rightarrow$ & $(E+E)+3$\\ |
115 & $\rightarrow$ & $(E+E)+N$ & by (1)\\ |
103 & $\rightarrow$ & $(N+E)+3$\\ |
116 & $\rightarrow$ & $(E+E)+3$ & by (6\dots)\\ |
104 & $\rightarrow^+$ & $(1+2)+3$\\ |
117 & $\rightarrow$ & $(N+E)+3$ & by (1)\\ |
|
118 & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\ |
105 \end{tabular} |
119 \end{tabular} |
106 \end{center} |
120 \end{center} |
107 |
121 |
108 \noindent |
122 \noindent where on the right it is indicated which |
109 The \emph{language} of a context-free grammar $G$ with start symbol $S$ |
123 grammar rule has been applied. In the last step we |
110 is defined as the set of strings derivable by a derivation, that is |
124 merged several steps into one. |
|
125 |
|
126 The \emph{language} of a context-free grammar $G$ |
|
127 with start symbol $S$ is defined as the set of strings |
|
128 derivable by a derivation, that is |
111 |
129 |
112 \begin{center} |
130 \begin{center} |
113 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ |
131 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ |
114 \end{center} |
132 \end{center} |
115 |
133 |
138 child {node {$N$} child {node {$4$}}} |
156 child {node {$N$} child {node {$4$}}} |
139 }; |
157 }; |
140 \end{tikzpicture} |
158 \end{tikzpicture} |
141 \end{center} |
159 \end{center} |
142 |
160 |
143 \noindent |
161 \noindent We are often interested in these parse-trees since |
144 We are often interested in these parse-trees since they encode the structure of |
162 they encode the structure of how a string is derived by a |
145 how a string is derived by a grammar. Before we come to the problem of constructing |
163 grammar. Before we come to the problem of constructing such |
146 such parse-trees, we need to consider the following two properties of grammars. |
164 parse-trees, we need to consider the following two properties |
147 A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say |
165 of grammars. A grammar is \emph{left-recursive} if there is a |
148 $NT$ which leads to a string which again starts with $NT$. This means a derivation of the |
166 derivation starting from a non-terminal, say $NT$ which leads |
149 form. |
167 to a string which again starts with $NT$. This means a |
|
168 derivation of the form. |
150 |
169 |
151 \begin{center} |
170 \begin{center} |
152 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$ |
171 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$ |
153 \end{center} |
172 \end{center} |
154 |
173 |
155 \noindent |
174 \noindent It can be easily seen that the grammar above for |
156 It can be easily seen that the grammar above for arithmetic expressions is left-recursive: |
175 arithmetic expressions is left-recursive: for example the |
157 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ |
176 rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow |
158 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive |
177 N\cdot N$ show that this grammar is left-recursive. But note |
159 grammars. Fortunately every left-recursive grammar can be transformed into one that is |
178 that left-recursiveness can involve more than one step in the |
160 not left-recursive, although this transformation might make the grammar less human-readable. |
179 derivation. The problem with left-recursive grammars is that |
161 For example if we want to give a non-left-recursive grammar for numbers we might |
180 some algorithms cannot cope with them: they fall into a loop. |
162 specify |
181 Fortunately every left-recursive grammar can be transformed |
|
182 into one that is not left-recursive, although this |
|
183 transformation might make the grammar less ``human-readable''. |
|
184 For example if we want to give a non-left-recursive grammar |
|
185 for numbers we might specify |
163 |
186 |
164 \begin{center} |
187 \begin{center} |
165 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$ |
188 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$ |
166 \end{center} |
189 \end{center} |
167 |
190 |
168 \noindent |
191 \noindent Using this grammar we can still derive every number |
169 Using this grammar we can still derive every number string, but we will never be able |
192 string, but we will never be able to derive a string of the |
170 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$. |
193 form $N \to \ldots \to N \cdot \ldots$. |
171 |
194 |
172 The other property we have to watch out for is when a grammar is |
195 The other property we have to watch out for is when a grammar |
173 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees |
196 is \emph{ambiguous}. A grammar is said to be ambiguous if |
174 for one string. Again the grammar for arithmetic expressions shown above is ambiguous. |
197 there are two parse-trees for one string. Again the grammar |
175 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in |
198 for arithmetic expressions shown above is ambiguous. While the |
176 general. For example there are two parse |
199 shown parse tree for the string $(1 + 23) + 4$ is unique, this |
|
200 is not the case in general. For example there are two parse |
177 trees for the string $1 + 2 + 3$, namely |
201 trees for the string $1 + 2 + 3$, namely |
178 |
202 |
179 \begin{center} |
203 \begin{center} |
180 \begin{tabular}{c@{\hspace{10mm}}c} |
204 \begin{tabular}{c@{\hspace{10mm}}c} |
181 \begin{tikzpicture}[level distance=8mm, black] |
205 \begin{tikzpicture}[level distance=8mm, black] |
202 ; |
226 ; |
203 \end{tikzpicture} |
227 \end{tikzpicture} |
204 \end{tabular} |
228 \end{tabular} |
205 \end{center} |
229 \end{center} |
206 |
230 |
207 \noindent |
231 \noindent In particular in programming languages we will try |
208 In particular in programming languages we will try to avoid ambiguous |
232 to avoid ambiguous grammars because two different parse-trees |
209 grammars because two different parse-trees for a string mean a program can |
233 for a string mean a program can be interpreted in two |
210 be interpreted in two different ways. In such cases we have to somehow make sure |
234 different ways. In such cases we have to somehow make sure the |
211 the two different ways do not matter, or disambiguate the grammar in |
235 two different ways do not matter, or disambiguate the grammar |
212 some other way (for example making the $+$ left-associative). Unfortunately already |
236 in some other way (for example making the $+$ |
213 the problem of deciding whether a grammar |
237 left-associative). Unfortunately already the problem of |
214 is ambiguous or not is in general undecidable. |
238 deciding whether a grammar is ambiguous or not is in general |
215 |
239 undecidable. But in simple instance (the ones we deal in this |
216 Let us now turn to the problem of generating a parse-tree for a grammar and string. |
240 module) one can usually see when a grammar is ambiguous. |
217 In what follows we explain \emph{parser combinators}, because they are easy |
241 |
218 to implement and closely resemble grammar rules. Imagine that a grammar |
242 Let us now turn to the problem of generating a parse-tree for |
219 describes the strings of natural numbers, such as the grammar $N$ shown above. |
243 a grammar and string. In what follows we explain \emph{parser |
220 For all such strings we want to generate the parse-trees or later on we actually |
244 combinators}, because they are easy to implement and closely |
221 want to extract the meaning of these strings, that is the concrete integers ``behind'' |
245 resemble grammar rules. Imagine that a grammar describes the |
222 these strings. In Scala the parser combinators will be functions of type |
246 strings of natural numbers, such as the grammar $N$ shown |
|
247 above. For all such strings we want to generate the |
|
248 parse-trees or later on we actually want to extract the |
|
249 meaning of these strings, that is the concrete integers |
|
250 ``behind'' these strings. In Scala the parser combinators will |
|
251 be functions of type |
223 |
252 |
224 \begin{center} |
253 \begin{center} |
225 \texttt{I $\Rightarrow$ Set[(T, I)]} |
254 \texttt{I $\Rightarrow$ Set[(T, I)]} |
226 \end{center} |
255 \end{center} |
227 |
256 |
228 \noindent |
257 \noindent that is they take as input something of type |
229 that is they take as input something of type \texttt{I}, typically a list of tokens or a string, |
258 \texttt{I}, typically a list of tokens or a string, and return |
230 and return a set of pairs. The first component of these pairs corresponds to what the |
259 a set of pairs. The first component of these pairs corresponds |
231 parser combinator was able to process from the input and the second is the unprocessed |
260 to what the parser combinator was able to process from the |
232 part of the input. As we shall see shortly, a parser combinator might return more than one such pair, |
261 input and the second is the unprocessed part of the input. As |
233 with the idea that there are potentially several ways how to interpret the input. As a concrete |
262 we shall see shortly, a parser combinator might return more |
234 example, consider the case where the input is of type string, say the string |
263 than one such pair, with the idea that there are potentially |
|
264 several ways how to interpret the input. As a concrete |
|
265 example, consider the case where the input is of type string, |
|
266 say the string |
235 |
267 |
236 \begin{center} |
268 \begin{center} |
237 \tt\Grid{iffoo\VS testbar} |
269 \tt\Grid{iffoo\VS testbar} |
238 \end{center} |
270 \end{center} |
239 |
271 |
240 \noindent |
272 \noindent We might have a parser combinator which tries to |
241 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or |
273 interpret this string as a keyword (\texttt{if}) or an |
242 an identifier (\texttt{iffoo}). Then the output will be the set |
274 identifier (\texttt{iffoo}). Then the output will be the set |
243 |
275 |
244 \begin{center} |
276 \begin{center} |
245 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
277 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
246 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
278 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
247 \end{center} |
279 \end{center} |
248 |
280 |
249 \noindent |
281 \noindent where the first pair means the parser could |
250 where the first pair means the parser could recognise \texttt{if} from the input and leaves |
282 recognise \texttt{if} from the input and leaves the rest as |
251 the rest as `unprocessed' as the second component of the pair; in the other case |
283 `unprocessed' as the second component of the pair; in the |
252 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser |
284 other case it could recognise \texttt{iffoo} and leaves |
253 cannot recognise anything from the input then parser combinators just return the empty |
285 \texttt{\VS testbar} as unprocessed. If the parser cannot |
254 set $\varnothing$. This will indicate something ``went wrong''. |
286 recognise anything from the input then parser combinators just |
|
287 return the empty set $\{\}$. This will indicate |
|
288 something ``went wrong''. |
255 |
289 |
256 The main attraction is that we can easily build parser combinators out of smaller components |
290 The main attraction is that we can easily build parser combinators out of smaller components |
257 following very closely the structure of a grammar. In order to implement this in an object |
291 following very closely the structure of a grammar. In order to implement this in an object |
258 oriented programming language, like Scala, we need to specify an abstract class for parser |
292 oriented programming language, like Scala, we need to specify an abstract class for parser |
259 combinators. This abstract class requires the implementation of the function |
293 combinators. This abstract class requires the implementation of the function |