\documentclass{article}
\usepackage{../style}
\usepackage{../langs}


\begin{document}

\section*{Handout 6 (Parser Combinators)}

While regular expressions are very useful for lexing and for recognising
many patterns in strings (like email addresses), they have their limitations. For
example there is no regular expression that can recognise the language
$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised
expressions. In languages like Lisp, which use parentheses rather
extensively, it might be of interest whether the following two expressions
are well-parenthesised (the left one is, the right one is not):

\begin{center}
$(((()()))())$ \hspace{10mm} $(((()()))()))$
\end{center}

\noindent
Not being able to solve such recognition problems is a serious limitation.
In order to solve them, we need more powerful
techniques than regular expressions. We will in particular look at \emph{context-free
languages}. They include the regular languages, as the picture below shows:


\begin{center}
\begin{tikzpicture}
  [rect/.style={draw=black!50,
    top color=white,bottom color=black!20,
    rectangle, very thick, rounded corners}]

\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
\draw (0,-1.05) node [rect] {\small regular languages};
\end{tikzpicture}
\end{center}
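Although no single regular expression can recognise well-parenthesised expressions, a short recursive program can. The following sketch is only an illustration (the helper \texttt{balanced} is our own, not part of the handout): it decides the property for the two bracket strings above by scanning once and keeping a nesting counter.

```scala
// Illustrative sketch, not from the handout: decide well-parenthesisedness
// by tracking the nesting depth while scanning the string once.
def balanced(s: String): Boolean = {
  def go(cs: List[Char], depth: Int): Boolean = cs match {
    case Nil         => depth == 0                     // every bracket closed?
    case '(' :: rest => go(rest, depth + 1)
    case ')' :: rest => depth > 0 && go(rest, depth - 1)
    case _   :: rest => go(rest, depth)
  }
  go(s.toList, 0)
}
```

For the two example strings, \texttt{balanced} returns true for the left one and false for the right one (the extra closing bracket drives the counter negative).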
|

\noindent
Context-free languages play an important role in `day-to-day' text processing and in
programming languages. Context-free languages are usually specified by grammars.
For example a grammar for well-parenthesised expressions is

\begin{center}
$P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$
\end{center}
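This rule can be read off almost literally as a recursive function, which already foreshadows the parser combinators introduced later. The sketch below is our own illustration (the helper \texttt{parseP} is hypothetical, not part of the handout): it returns the set of suffixes that remain after consuming a $P$-prefix, one alternative per branch of the rule.

```scala
// Illustrative sketch, not from the handout: the rule
//   P  ->  ( . P . ) . P  |  epsilon
// read as a function returning all suffixes left after consuming a P.
def parseP(s: List[Char]): Set[List[Char]] = {
  val viaBrackets =
    for (r1 <- if (s.startsWith(List('('))) Set(s.tail) else Set[List[Char]]();
         r2 <- parseP(r1);                                  // inner P
         r3 <- if (r2.startsWith(List(')'))) Set(r2.tail) else Set[List[Char]]();
         r4 <- parseP(r3))                                  // trailing P
    yield r4
  viaBrackets + s  // the epsilon alternative consumes nothing
}

def wellParenthesised(s: String): Boolean = parseP(s.toList).contains(Nil)
```

A string is well-parenthesised exactly when the empty rest \texttt{Nil} is among the results, that is, when the whole input can be consumed as one $P$. The recursion terminates because the bracket branch always consumes at least the two bracket characters before recursing.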
|

\noindent
In general grammars consist of finitely many rules built up from \emph{terminal symbols}
(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules
have the shape

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}$
\end{center}

\noindent
where on the left-hand side is a single non-terminal and on the right a string consisting
of both terminals and non-terminals, including the $\epsilon$-symbol for indicating the
empty string. We use the convention to separate components on
the right-hand side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions.
We also use the convention to use $|$ as a shorthand notation for several rules. For example

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
\end{center}

\noindent
means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
If there is more than one non-terminal on the left-hand side of the rules, then we need to indicate
what the \emph{starting} symbol of the grammar is. For example the grammar for arithmetic expressions
can be given as follows

\begin{center}
\begin{tabular}{lcl}
$E$ & $\rightarrow$ & $N$ \\
$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\
$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\
$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\
$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\
$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\; \ldots \;|\; 9$
\end{tabular}
\end{center}

\noindent
where $E$ is the starting symbol. A \emph{derivation} for a grammar
starts with the starting symbol of the grammar and in each step replaces one
non-terminal by a right-hand side of a rule. A derivation ends with a string
in which only terminal symbols are left. For example a derivation for the
string $(1 + 2) + 3$ is as follows:

\begin{center}
\begin{tabular}{lll}
$E$ & $\rightarrow$ & $E+E$\\
    & $\rightarrow$ & $(E)+E$\\
    & $\rightarrow$ & $(E+E)+E$\\
    & $\rightarrow$ & $(E+E)+N$\\
    & $\rightarrow$ & $(E+E)+3$\\
    & $\rightarrow$ & $(N+E)+3$\\
    & $\rightarrow^+$ & $(1+2)+3$\\
\end{tabular}
\end{center}

\noindent
The \emph{language} of a context-free grammar $G$ with start symbol $S$
is defined as the set of strings derivable by a derivation, that is

\begin{center}
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
\end{center}

\noindent
A \emph{parse-tree} encodes how a string is derived, with the starting symbol on
top and each non-terminal containing a subtree for how it is replaced in a derivation.
The parse tree for the string $(1 + 23)+4$ is as follows:

\begin{center}
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$}
    child {node {$($}}
    child {node {$E$}
      child {node {$E$} child {node {$N$} child {node {$1$}}}}
      child {node {$+$}}
      child {node {$E$}
        child {node {$N$} child {node {$2$}}}
        child {node {$N$} child {node {$3$}}}
      }
    }
    child {node {$)$}}
  }
  child {node {$+$}}
  child {node {$E$}
    child {node {$N$} child {node {$4$}}}
  };
\end{tikzpicture}
\end{center}

\noindent
We are often interested in these parse-trees since they encode the structure of
how a string is derived by a grammar. Before we come to the problem of constructing
such parse-trees, we need to consider the following two properties of grammars.
A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
$NT$, which leads to a string which again starts with $NT$. This means a derivation of the
form:

\begin{center}
$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
\end{center}

\noindent
It can easily be seen that the grammar above for arithmetic expressions is left-recursive:
for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$
lead to such derivations. Some algorithms cannot cope with left-recursive
grammars. Fortunately every left-recursive grammar can be transformed into one that is
not left-recursive, although this transformation might make the grammar less human-readable.
For example if we want to give a non-left-recursive grammar for numbers we might
specify

\begin{center}
$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;0\cdot N\;|\;1\cdot N\;|\;\ldots\;|\;9\cdot N$
\end{center}

\noindent
Using this grammar we can still derive every number string, but we will never be able
to have a derivation of the form $N \rightarrow \ldots \rightarrow N \cdot \ldots$.

The other property we have to watch out for is when a grammar is
\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
general. For example there are two parse
trees for the string $1 + 2 + 3$, namely

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$} child {node {$N$} child {node {$1$}}}}
  child {node {$+$}}
  child {node {$E$}
    child {node {$E$} child {node {$N$} child {node {$2$}}}}
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$3$}}}}
  }
;
\end{tikzpicture}
&
\begin{tikzpicture}[level distance=8mm, black]
\node {$E$}
  child {node {$E$}
    child {node {$E$} child {node {$N$} child {node {$1$}}}}
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$2$}}}}
  }
  child {node {$+$}}
  child {node {$E$} child {node {$N$} child {node {$3$}}}}
;
\end{tikzpicture}
\end{tabular}
\end{center}

\noindent
In particular in programming languages we will try to avoid ambiguous
grammars because two different parse-trees for a string mean a program can
be interpreted in two different ways. In such cases we have to somehow make sure
the two different ways do not matter, or disambiguate the grammar in
some other way (for example by making the $+$ left-associative). Unfortunately
the problem of deciding whether a grammar
is ambiguous or not is in general undecidable.

Let us now turn to the problem of generating a parse-tree for a grammar and a string.
In what follows we explain \emph{parser combinators}, because they are easy
to implement and closely resemble grammar rules. Imagine that a grammar
describes the strings of natural numbers, such as the grammar $N$ shown above.
For all such strings we want to generate the parse-trees or later on we actually
want to extract the meaning of these strings, that is the concrete integers ``behind''
these strings. In Scala the parser combinators will be functions of type

\begin{center}
\texttt{I $\Rightarrow$ Set[(T, I)]}
\end{center}

\noindent
that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
and return a set of pairs. The first component of these pairs corresponds to what the
parser combinator was able to process from the input and the second is the unprocessed
part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
the idea being that there are potentially several ways to interpret the input. As a concrete
example, consider the case where the input is of type string, say the string

\begin{center}
\tt\Grid{iffoo\VS testbar}
\end{center}

\noindent
We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
an identifier (\texttt{iffoo}). Then the output will be the set

\begin{center}
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
         \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
\end{center}

\noindent
where the first pair means the parser could recognise \texttt{if} from the input and leaves
the rest as the `unprocessed' second component of the pair; in the other case
it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
cannot recognise anything from the input, then parser combinators just return the empty
set $\varnothing$. This will indicate something ``went wrong''.

The main attraction is that we can easily build parser combinators out of smaller components
following very closely the structure of a grammar. In order to implement this in an
object-oriented programming language, like Scala, we need to specify an abstract class for parser
combinators. This abstract class requires the implementation of the function
\texttt{parse}, taking an argument of type \texttt{I} and returning a set of type
\mbox{\texttt{Set[(T, I)]}}.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  def parse_all(ts: I)(implicit ev: I => Seq[_]): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty))
      yield head
}
\end{lstlisting}
\end{center}

\noindent
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
which just filters out all pairs whose second component is not empty (that is, still has some
unprocessed part). The implicit \texttt{ev} merely records the assumption that the input type
can be viewed as a sequence, so that the test \texttt{isEmpty} is available. The reason for the
filtering is that at the end of parsing we are only interested in the
results where all the input has been consumed and no unprocessed part is left.

One of the simplest parser combinators recognises just a character, say $c$,
from the beginning of strings. Its behaviour is as follows:

\begin{itemize}
\item if the input string, say $s$, starts with a $c$, it returns
      the set $\{(c, \textit{tail of}\; s)\}$
\item otherwise it returns the empty set $\varnothing$
\end{itemize}

\noindent
The input type of this simple parser combinator for characters is
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
The code in Scala is as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb != "" && sb.head == c) Set((c, sb.tail)) else Set()
}
\end{lstlisting}
\end{center}

\noindent
The \texttt{parse} function tests whether the first character of the
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
string into the recognised part \texttt{c} and the unprocessed part
\texttt{sb.tail}. In case \texttt{sb} is empty or does not start with \texttt{c},
the parser returns the empty set (in Scala \texttt{Set()}).

More interesting are the parser combinators that build larger parsers
out of smaller component parsers. For example the alternative
parser combinator is as follows.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
       (p: => Parser[I, T],
        q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
\end{lstlisting}
\end{center}

\noindent
The types of this parser combinator are polymorphic (we just have \texttt{I}
for the input type, and \texttt{T} for the output type). The alternative parser
builds a new parser out of two existing parser combinators \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and return the same
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. It prevents the
evaluation of the arguments before they are used. This is often called
\emph{lazy evaluation} of the arguments.) The alternative parser runs
the input with the first parser \texttt{p} (producing a set of outputs) and then
runs the same input with \texttt{q}. The result is then just the union
of both sets, which is the operation \texttt{++} in Scala.

This parser combinator already allows us to construct a parser that recognises either
the character \texttt{a} or \texttt{b}, as

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
new AltParser(CharParser('a'), CharParser('b'))
\end{lstlisting}
\end{center}

\noindent
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}.
We can call this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
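The behaviour in this table can be checked directly. The snippet below is a self-contained sketch of the classes introduced so far (it repeats \texttt{CharParser} with an added emptiness test so that it is also defined on the empty string, and omits \texttt{parse\_all} for brevity):

```scala
// Self-contained demo of CharParser and AltParser (parse_all omitted).
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  // try both alternatives on the same input and take the union
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}

val aOrB = new AltParser(CharParser('a'), CharParser('b'))
```

Calling \texttt{aOrB.parse} on \texttt{"ac"}, \texttt{"bc"} and \texttt{"cc"} reproduces the three rows of the table.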

\noindent
We receive in the first two cases a successful output (that is a non-empty set).

A bit more interesting is the \emph{sequence parser combinator} implemented in
Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class SeqParser[I, T, S]
       (p: => Parser[I, T],
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1))
      yield ((head1, head2), tail2)
}
\end{lstlisting}
\end{center}

\noindent
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse}
as follows: first the parser \texttt{p} is run on the input, producing a set of pairs (\texttt{head1}, \texttt{tail1}).
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}.
Then \texttt{q} is run on these unprocessed parts,
producing again a set of pairs. The output of the sequence parser combinator is then a set
containing pairs where the first components are again pairs, namely what the first parser could parse
together with what the second parser could parse; the second component is the unprocessed
part left over after running the second parser \texttt{q}. Therefore the input type of
the sequence parser combinator is as usual \texttt{I}, but the output type is

\begin{center}
\texttt{Set[((T, S), I)]}
\end{center}

Scala allows us to provide some
shorthand notation for the sequence parser combinator. So we can write for
example \texttt{'a' $\sim$ 'b'}, which is the
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
Calling this parser combinator with the strings

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
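Again this table can be reproduced with the class constructors directly; the following is a self-contained sketch (same hedges as before: \texttt{CharParser} carries an added emptiness test, \texttt{parse\_all} is omitted):

```scala
// Self-contained demo of the sequence parser ('a' followed by 'b').
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S]) extends Parser[I, (T, S)] {
  // run p, then run q on each leftover; pair up the two results
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)
}

val ab = new SeqParser(CharParser('a'), CharParser('b'))
```

On \texttt{"abc"} this yields the single pair of the recognised characters together with the leftover \texttt{"c"}; on \texttt{"bac"} and \texttt{"ccc"} the first component parser already fails, so the whole sequence fails with the empty set.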

\noindent
A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'}, which parses as first character either
an \texttt{a} or a \texttt{b}, followed by a \texttt{b}. This parser produces the following results.

\begin{center}
\begin{tabular}{rcl}
input string & & output\medskip\\
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
\end{tabular}
\end{center}
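Written out with the class constructors (instead of the \texttt{||} and $\sim$ shorthand, which is not defined in this sketch), the combined parser looks as follows; the snippet is self-contained and repeats the classes from above:

```scala
// Self-contained sketch of ('a' || 'b') ~ 'b' using the raw constructors.
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[I, T](p: => Parser[I, T],
                      q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}

class SeqParser[I, T, S](p: => Parser[I, T],
                         q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1)) yield ((head1, head2), tail2)
}

// ('a' || 'b') ~ 'b' written out with the class constructors
val abThenB = new SeqParser(
  new AltParser(CharParser('a'), CharParser('b')), CharParser('b'))
```

Running \texttt{abThenB.parse} on \texttt{"abc"}, \texttt{"bbc"} and \texttt{"aac"} reproduces the three rows of the table above.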

Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a typing error.
The first parser has as output type a single character (recall the type of \texttt{CharParser}),
but the second parser produces a pair of characters as output. The alternative parser combinator,
however, requires both component parsers to have the same output type. We will see later how we can
build this parser without the typing error.

The next parser combinator does not actually combine smaller parsers, but applies
a function to the result of a parser. It is implemented in Scala as follows:

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class FunParser[I, T, S]
       (p: => Parser[I, T],
        f: T => S) extends Parser[I, S] {
  def parse(sb: I) =
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}
\end{lstlisting}
\end{center}

\noindent
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
input, as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
produces sets of pairs of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
applies the function \texttt{f} to all the parser outputs. Since this function
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
Again Scala lets us introduce some shorthand notation for this parser combinator.
Therefore we will write \texttt{p ==> f} for it.
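A typical use of this combinator is to attach a \emph{semantic action} to a parser, for example turning a parsed digit character into the integer it denotes. The sketch below is self-contained (our own illustration; the value name \texttt{threeInt} is hypothetical):

```scala
// Self-contained demo of FunParser: map a parsed digit Char to an Int.
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class FunParser[I, T, S](p: => Parser[I, T],
                         f: T => S) extends Parser[I, S] {
  // apply f to every successful output of p, keep the leftovers unchanged
  def parse(sb: I) =
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}

// semantic action: the character '3' becomes the integer 3
val threeInt = new FunParser(CharParser('3'), (c: Char) => c.asDigit)
```

On input \texttt{"3x"} the underlying \texttt{CharParser} succeeds with the character \texttt{'3'}, which the function converts to the integer \texttt{3}; on any other input the failure of \texttt{p} propagates unchanged.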



\end{document}

%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: