| author | Christian Urban <christian.urban@kcl.ac.uk> | 
| Tue, 30 May 2023 16:58:06 +0100 | |
| changeset 909 | a29d74fb8697 | 
| parent 798 | 409e5014edde | 
| child 936 | aabd9168c7ac | 
| permissions | -rw-r--r-- | 
| 665 | 1  | 
% !TEX program = xelatex  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
2  | 
\documentclass{article}
 | 
| 
297
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
3  | 
\usepackage{../style}
 | 
| 
217
 
cd6066f1056a
updated handouts
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
183 
diff
changeset
 | 
4  | 
\usepackage{../langs}
 | 
| 459 | 5  | 
\usepackage{../grammar}
 | 
| 798 | 6  | 
\usepackage{../graphics}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
7  | 
|
| 545 | 8  | 
% epsilon and left-recursion elimination  | 
9  | 
% http://www.mollypages.org/page/grammar/index.mp  | 
|
10  | 
||
| 618 | 11  | 
%% parsing scala files  | 
| 691 | 12  | 
%% https://scalameta.org/  | 
13  | 
||
14  | 
% chomsky normal form transformation  | 
|
15  | 
% https://youtu.be/FNPSlnj3Vt0  | 
|
| 618 | 16  | 
|
| 710 | 17  | 
% Language hierachy is about complexity  | 
18  | 
% How hard is it to recognise an element in a language  | 
|
19  | 
||
| 722 | 20  | 
% Pratt parsing  | 
21  | 
% https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html  | 
|
22  | 
% https://www.oilshell.org/blog/2017/03/31.html  | 
|
23  | 
||
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
24  | 
\begin{document}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
25  | 
|
| 
385
 
7f8516ff408d
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
362 
diff
changeset
 | 
26  | 
\section*{Handout 5 (Grammars \& Parser)}
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
27  | 
|
| 727 | 28  | 
So far we have focused on regular expressions as well as matching and  | 
29  | 
lexing algorithms. While regular expressions are very useful for lexing  | 
|
30  | 
and for recognising many patterns in strings (like email addresses),  | 
|
31  | 
they have their limitations. For example there is no regular expression  | 
|
32  | 
that can recognise the language $a^nb^n$ (where you have strings  | 
|
33  | 
starting with $n$ $a$'s followed by the same amount of $b$'s). Another  | 
|
34  | 
example for which there exists no regular expression is the language of  | 
|
35  | 
well-parenthesised expressions. In languages like Lisp, which use  | 
|
36  | 
parentheses rather extensively, it might be of interest to know whether  | 
|
37  | 
the following two expressions are well-parenthesised or not (the left  | 
|
38  | 
one is, the right one is not):  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
39  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
40  | 
\begin{center}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
41  | 
$(((()()))())$  \hspace{10mm} $(((()()))()))$
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
42  | 
\end{center}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
43  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
44  | 
\noindent Not being able to solve such recognition problems is  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
45  | 
a serious limitation. In order to solve such recognition  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
46  | 
problems, we need more powerful techniques than regular  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
47  | 
expressions. We will in particular look at \emph{context-free
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
48  | 
languages}. They include the regular languages as the picture  | 
| 582 | 49  | 
below about language classes shows:  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
50  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
51  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
52  | 
\begin{center}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
53  | 
\begin{tikzpicture}
 | 
| 
297
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
54  | 
[rect/.style={draw=black!50, 
 | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
55  | 
top color=white,bottom color=black!20,  | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
56  | 
rectangle, very thick, rounded corners}]  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
57  | 
|
| 
297
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
58  | 
\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
 | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
59  | 
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
 | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
60  | 
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
 | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
61  | 
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
 | 
| 
 
5c51839c88fd
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
292 
diff
changeset
 | 
62  | 
\draw (0,-1.05) node [rect] {\small regular languages};
 | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
63  | 
\end{tikzpicture}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
64  | 
\end{center}
 | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
65  | 
|
| 582 | 66  | 
\noindent Each ``bubble'' stands for sets of languages (remember  | 
67  | 
languages are sets of strings). As indicated the set of regular  | 
|
| 680 | 68  | 
languages is fully included inside the context-free languages,  | 
| 582 | 69  | 
meaning every regular language is also context-free, but not vice  | 
| 665 | 70  | 
versa. Below I will let you think, for example, what the context-free  | 
| 582 | 71  | 
grammar is for the language corresponding to the regular expression  | 
72  | 
$(aaa)^*a$.  | 
|
73  | 
||
74  | 
Because of their convenience, context-free languages play an important  | 
|
75  | 
role in `day-to-day' text processing and in programming  | 
|
76  | 
languages. Context-free in this setting means that ``words'' have one  | 
|
| 680 | 77  | 
meaning only and this meaning is independent from the context  | 
78  | 
the ``words'' appear in. For example ambiguity issues like  | 
|
| 582 | 79  | 
|
80  | 
\begin{center}
 | 
|
| 682 | 81  | 
\tt Time flies like an arrow. Fruit flies like bananas.  | 
| 582 | 82  | 
\end{center}  
 | 
83  | 
||
84  | 
\noindent  | 
|
| 680 | 85  | 
from natural languages were the meaning of \emph{flies} depends on the
 | 
| 686 | 86  | 
surrounding \emph{context} are avoided as much as possible. Here is
 | 
87  | 
an interesting video about C++ being not a context-free language  | 
|
88  | 
||
89  | 
\begin{center}
 | 
|
90  | 
\url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
 | 
|
91  | 
\end{center}  
 | 
|
| 582 | 92  | 
|
93  | 
Context-free languages are usually specified by grammars. For example  | 
|
94  | 
a grammar for well-parenthesised expressions can be given as follows:  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
95  | 
|
| 459 | 96  | 
\begin{plstx}[margin=3cm]
 | 
97  | 
: \meta{P} ::=  ( \cdot  \meta{P} \cdot ) \cdot \meta{P}
 | 
|
98  | 
| \epsilon\\  | 
|
99  | 
\end{plstx}
 | 
|
100  | 
||
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
101  | 
\noindent  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
102  | 
or a grammar for recognising strings consisting of ones is  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
103  | 
|
| 459 | 104  | 
\begin{plstx}[margin=3cm]
 | 
105  | 
: \meta{O} ::= 1 \cdot  \meta{O} 
 | 
|
106  | 
| 1\\  | 
|
107  | 
\end{plstx}
 | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
108  | 
|
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
109  | 
In general grammars consist of finitely many rules built up  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
110  | 
from \emph{terminal symbols} (usually lower-case letters) and
 | 
| 582 | 111  | 
\emph{non-terminal symbols} (upper-case letters written in
 | 
112  | 
bold like \meta{A}, \meta{N} and so on). Rules have
 | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
113  | 
the shape  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
114  | 
|
| 459 | 115  | 
\begin{plstx}[margin=3cm]
 | 
116  | 
: \meta{NT} ::= rhs\\
 | 
|
117  | 
\end{plstx}
 | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
118  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
119  | 
\noindent where on the left-hand side is a single non-terminal  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
120  | 
and on the right a string consisting of both terminals and  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
121  | 
non-terminals including the $\epsilon$-symbol for indicating  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
122  | 
the empty string. We use the convention to separate components  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
123  | 
on the right hand-side by using the $\cdot$ symbol, as in the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
124  | 
grammar for well-parenthesised expressions. We also use the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
125  | 
convention to use $|$ as a shorthand notation for several  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
126  | 
rules. For example  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
127  | 
|
| 459 | 128  | 
\begin{plstx}[margin=3cm]
 | 
129  | 
: \meta{NT} ::= rhs_1
 | 
|
130  | 
| rhs_2\\  | 
|
131  | 
\end{plstx}
 | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
132  | 
|
| 459 | 133  | 
\noindent means that the non-terminal \meta{NT} can be replaced by
 | 
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
134  | 
either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
135  | 
than one non-terminal on the left-hand side of the rules, then  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
136  | 
we need to indicate what is the \emph{starting} symbol of the
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
137  | 
grammar. For example the grammar for arithmetic expressions  | 
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
138  | 
can be given as follows  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
139  | 
|
| 459 | 140  | 
\begin{plstx}[margin=3cm,one per line]
 | 
141  | 
\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
 | 
|
142  | 
\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
 | 
|
143  | 
\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
 | 
|
144  | 
\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
 | 
|
145  | 
\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
 | 
|
146  | 
\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} 
 | 
|
147  | 
\mid 0 \mid 1 \mid \ldots \mid 9\\  | 
|
148  | 
\end{plstx}
 | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
149  | 
|
| 459 | 150  | 
\noindent where \meta{E} is the starting symbol. A
 | 
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
151  | 
\emph{derivation} for a grammar starts with the starting
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
152  | 
symbol of the grammar and in each step replaces one  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
153  | 
non-terminal by a right-hand side of a rule. A derivation ends  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
154  | 
with a string in which only terminal symbols are left. For  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
155  | 
example a derivation for the string $(1 + 2) + 3$ is as  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
156  | 
follows:  | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
157  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
158  | 
\begin{center}
 | 
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
159  | 
\begin{tabular}{lll@{\hspace{2cm}}l}
 | 
| 459 | 160  | 
\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$          & by (2)\\
 | 
161  | 
       & $\rightarrow$ & $(\meta{E})+\meta{E}$     & by (5)\\
 | 
|
162  | 
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$   & by (2)\\
 | 
|
163  | 
       & $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$   & by (1)\\
 | 
|
164  | 
       & $\rightarrow$ & $(\meta{E}+\meta{E})+3$   & by (6\dots)\\
 | 
|
165  | 
       & $\rightarrow$ & $(\meta{N}+\meta{E})+3$   & by (1)\\
 | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
166  | 
& $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\  | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
167  | 
\end{tabular} 
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
168  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
169  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
170  | 
\noindent where on the right it is indicated which  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
171  | 
grammar rule has been applied. In the last step we  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
172  | 
merged several steps into one.  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
173  | 
|
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
174  | 
The \emph{language} of a context-free grammar $G$
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
175  | 
with start symbol $S$ is defined as the set of strings  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
176  | 
derivable by a derivation, that is  | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
177  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
178  | 
\begin{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
179  | 
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
180  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
181  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
182  | 
\noindent  | 
| 680 | 183  | 
A \emph{parse-tree} encodes how a string is derived with the starting
 | 
184  | 
symbol on top and each non-terminal containing a subtree for how it is  | 
|
185  | 
replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is  | 
|
186  | 
as follows:  | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
187  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
188  | 
\begin{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
189  | 
\begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 190  | 
  \node {\meta{E}}
 | 
191  | 
    child {node {\meta{E} } 
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
192  | 
       child {node {$($}}
 | 
| 665 | 193  | 
       child {node {\meta{E} }       
 | 
194  | 
         child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
195  | 
         child {node {$+$}}
 | 
| 665 | 196  | 
         child {node {\meta{E} } 
 | 
197  | 
            child {node {\meta{N} } child {node {$2$}}}
 | 
|
198  | 
            child {node {\meta{N} } child {node {$3$}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
199  | 
}  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
200  | 
}  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
201  | 
       child {node {$)$}}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
202  | 
}  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
203  | 
     child {node {$+$}}
 | 
| 665 | 204  | 
     child {node {\meta{E} }
 | 
205  | 
        child {node {\meta{N} } child {node {$4$}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
206  | 
};  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
207  | 
\end{tikzpicture}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
208  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
209  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
210  | 
\noindent We are often interested in these parse-trees since  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
211  | 
they encode the structure of how a string is derived by a  | 
| 680 | 212  | 
grammar.  | 
213  | 
||
214  | 
Before we come to the problem of constructing such parse-trees, we need  | 
|
215  | 
to consider the following two properties of grammars. A grammar is  | 
|
216  | 
\emph{left-recursive} if there is a derivation starting from a
 | 
|
217  | 
non-terminal, say \meta{NT} which leads to a string which again starts
 | 
|
218  | 
with \meta{NT}. This means a derivation of the form.
 | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
219  | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
220  | 
\begin{center}
 | 
| 665 | 221  | 
$\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
222  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
223  | 
|
| 680 | 224  | 
\noindent It can be easily seen that the grammar above for arithmetic  | 
225  | 
expressions is left-recursive: for example the rules $\meta{E}
 | 
|
226  | 
\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
 | 
|
227  | 
\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But
 | 
|
228  | 
note that left-recursiveness can involve more than one step in the  | 
|
229  | 
derivation. The problem with left-recursive grammars is that some  | 
|
230  | 
algorithms cannot cope with them: with left-recursive grammars they will  | 
|
231  | 
fall into a loop. Fortunately every left-recursive grammar can be  | 
|
232  | 
transformed into one that is not left-recursive, although this  | 
|
233  | 
transformation might make the grammar less ``human-readable''. For  | 
|
234  | 
example if we want to give a non-left-recursive grammar for numbers we  | 
|
235  | 
might specify  | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
236  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
237  | 
\begin{center}
 | 
| 665 | 238  | 
$\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;
 | 
239  | 
1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
240  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
241  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
242  | 
\noindent Using this grammar we can still derive every number  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
243  | 
string, but we will never be able to derive a string of the  | 
| 665 | 244  | 
form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
245  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
246  | 
The other property we have to watch out for is when a grammar  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
247  | 
is \emph{ambiguous}. A grammar is said to be ambiguous if
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
248  | 
there are two parse-trees for one string. Again the grammar  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
249  | 
for arithmetic expressions shown above is ambiguous. While the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
250  | 
shown parse tree for the string $(1 + 23) + 4$ is unique, this  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
251  | 
is not the case in general. For example there are two parse  | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
252  | 
trees for the string $1 + 2 + 3$, namely  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
253  | 
|
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
254  | 
\begin{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
255  | 
\begin{tabular}{c@{\hspace{10mm}}c}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
256  | 
\begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 257  | 
  \node {\meta{E} }
 | 
258  | 
    child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
259  | 
    child {node {$+$}}
 | 
| 665 | 260  | 
    child {node {\meta{E} }
 | 
261  | 
       child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
262  | 
       child {node {$+$}}
 | 
| 665 | 263  | 
       child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
264  | 
}  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
265  | 
;  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
266  | 
\end{tikzpicture} 
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
267  | 
&  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
268  | 
\begin{tikzpicture}[level distance=8mm, black]
 | 
| 665 | 269  | 
  \node {\meta{E} }
 | 
270  | 
    child {node {\meta{E} }
 | 
|
271  | 
       child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
 | 
|
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
272  | 
       child {node {$+$}}
 | 
| 665 | 273  | 
       child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} 
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
274  | 
}  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
275  | 
    child {node {$+$}}
 | 
| 665 | 276  | 
    child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
 | 
| 
175
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
277  | 
;  | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
278  | 
\end{tikzpicture}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
279  | 
\end{tabular} 
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
280  | 
\end{center}
 | 
| 
 
5801e8c0e528
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
173 
diff
changeset
 | 
281  | 
|
| 680 | 282  | 
\noindent In particular in programming languages we will try to avoid  | 
283  | 
ambiguous grammars because two different parse-trees for a string mean a  | 
|
284  | 
program can be interpreted in two different ways. In such cases we have  | 
|
285  | 
to somehow make sure the two different ways do not matter, or  | 
|
286  | 
disambiguate the grammar in some other way (for example making the $+$  | 
|
287  | 
left-associative). Unfortunately already the problem of deciding whether  | 
|
288  | 
a grammar is ambiguous or not is in general undecidable. But in simple  | 
|
289  | 
instance (the ones we deal with in this module) one can usually see when  | 
|
290  | 
a grammar is ambiguous.  | 
|
291  | 
||
292  | 
\subsection*{Removing Left-Recursion}
 | 
|
293  | 
||
294  | 
Let us come back to the problem of left-recursion and consider the  | 
|
295  | 
following grammar for binary numbers:  | 
|
296  | 
||
297  | 
\begin{plstx}[margin=1cm]
 | 
|
298  | 
: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\
 | 
|
299  | 
\end{plstx}
 | 
|
300  | 
||
301  | 
\noindent  | 
|
302  | 
It is clear that this grammar can create all binary numbers, but  | 
|
303  | 
it is also clear that this grammar is left-recursive. Giving this  | 
|
304  | 
grammar as is to parser combinators will result in an infinite  | 
|
305  | 
loop. Fortunately, every left-recursive grammar can be translated  | 
|
306  | 
into one that is not left-recursive with the help of some  | 
|
307  | 
transformation rules. Suppose we identified the ``offensive''  | 
|
308  | 
rule, then we can separate the grammar into this offensive rule  | 
|
309  | 
and the ``rest'':  | 
|
310  | 
||
311  | 
\begin{plstx}[margin=1cm]
 | 
|
312  | 
  : \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} 
 | 
|
313  | 
  | \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\
 | 
|
314  | 
\end{plstx}
 | 
|
315  | 
||
316  | 
\noindent  | 
|
317  | 
To make the idea of the transformation clearer, suppose the left-recursive  | 
|
318  | 
rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal 
 | 
|
319  | 
followed by something called $\alpha$) and the ``rest'' is called $\beta$.  | 
|
320  | 
That means our grammar looks schematically as follows  | 
|
321  | 
||
322  | 
\begin{plstx}[margin=1cm]
 | 
|
323  | 
  : \meta{B} ::= \meta{B} \cdot \alpha | \beta\\
 | 
|
324  | 
\end{plstx}
 | 
|
325  | 
||
326  | 
\noindent  | 
|
327  | 
To get rid of the left-recursion, we are required to introduce  | 
|
328  | 
a new non-terminal, say $\meta{B'}$ and transform the rule
 | 
|
329  | 
as follows:  | 
|
330  | 
||
331  | 
\begin{plstx}[margin=1cm]
 | 
|
332  | 
  : \meta{B} ::= \beta \cdot \meta{B'}\\
 | 
|
333  | 
  : \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\
 | 
|
334  | 
\end{plstx}
 | 
|
335  | 
||
336  | 
\noindent  | 
|
337  | 
In our example of binary numbers we would after the transformation  | 
|
338  | 
end up with the rules  | 
|
339  | 
||
340  | 
\begin{plstx}[margin=1cm]
 | 
|
341  | 
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | 
|
342  | 
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | 
|
343  | 
\end{plstx}
 | 
|
344  | 
||
345  | 
\noindent  | 
|
346  | 
A little thought should convince you that this grammar still derives  | 
|
347  | 
all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
 | 
|
348  | 
can be $\epsilon$). Less clear might be why this grammar is non-left recursive.  | 
|
349  | 
For $\meta{B'}$ it is relatively clear because we will never be 
 | 
|
350  | 
able to derive things like  | 
|
351  | 
||
352  | 
\begin{center}
 | 
|
353  | 
$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
 | 
|
354  | 
\end{center}  
 | 
|
355  | 
||
356  | 
\noindent  | 
|
357  | 
because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
 | 
|
358  | 
$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
 | 
|
359  | 
never be in the first place. The reasoning is similar for $\meta{B}$:
 | 
|
360  | 
the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
 | 
|
361  | 
left-recursive. This transformation does not mean the grammar is the  | 
|
362  | 
simplest left-recursive grammar for binary numbers. For example the  | 
|
363  | 
following grammar would do as well  | 
|
364  | 
||
365  | 
\begin{plstx}[margin=1cm]
 | 
|
366  | 
  : \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\
 | 
|
367  | 
\end{plstx}
 | 
|
368  | 
||
369  | 
\noindent  | 
|
370  | 
The point is that we can in principle transform every left-recursive  | 
|
371  | 
grammar into one that is non-left-recursive one. This explains why often  | 
|
372  | 
the following grammar is used for arithmetic expressions:  | 
|
373  | 
||
374  | 
\begin{plstx}[margin=1cm]
 | 
|
375  | 
  : \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
 | 
|
376  | 
  : \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
 | 
|
377  | 
  : \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\
 | 
|
378  | 
\end{plstx}
 | 
|
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
379  | 
|
| 680 | 380  | 
\noindent  | 
381  | 
In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors
 | 
|
382  | 
are in some way protected from being left-recusive. For example if you  | 
|
383  | 
start $\meta{E}$ you can derive another one by going through $\meta{T}$, then 
 | 
|
384  | 
$\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis.
 | 
|
385  | 
||
386  | 
\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
 | 
|
387  | 
||
388  | 
I showed above that the non-left-recursive grammar for binary numbers is  | 
|
389  | 
||
390  | 
\begin{plstx}[margin=1cm]
 | 
|
391  | 
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | 
|
392  | 
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\
 | 
|
393  | 
\end{plstx}
 | 
|
394  | 
||
395  | 
\noindent  | 
|
396  | 
The transformation made the original grammar non-left-recursive, but at  | 
|
397  | 
the expense of introducing an $\epsilon$ in the second rule. Having an  | 
|
398  | 
explicit $\epsilon$-rule is annoying to, not in terms of looping, but in  | 
|
399  | 
terms of efficiency. The reason is that the $\epsilon$-rule always  | 
|
400  | 
applies but since it recognises the empty string, it does not make any  | 
|
401  | 
progress with recognising a string. Better are rules like $( \cdot  | 
|
402  | 
\meta{E} \cdot )$ where something of the input is consumed. Getting
 | 
|
403  | 
rid of $\epsilon$-rules is also important for the CYK parsing algorithm,  | 
|
404  | 
which can give us an insight into the complexity class of parsing.  | 
|
405  | 
||
406  | 
It turns out we can also by some generic transformations eliminate  | 
|
407  | 
$\epsilon$-rules from grammars. Consider again the grammar above for  | 
|
408  | 
binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case
 | 
|
409  | 
we look for rules of the (generic) form \mbox{$\meta{A} :=
 | 
|
410  | 
\alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use
 | 
|
411  | 
$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
 | 
|
412  | 
something follows ($\beta$). Such rules need to be replaced by  | 
|
413  | 
additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}.
 | 
|
414  | 
In our running example there are the two rules for $\meta{B}$ which
 | 
|
415  | 
fall into this category  | 
|
416  | 
||
417  | 
\begin{plstx}[margin=1cm]
 | 
|
418  | 
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\
 | 
|
419  | 
\end{plstx} 
 | 
|
420  | 
||
421  | 
\noindent To follow the general scheme of the transfromation,  | 
|
422  | 
the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens  | 
|
| 798 | 423  | 
to be empty. So we need to generate new rules for the form  | 
| 680 | 424  | 
\mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular 
 | 
425  | 
example means we obtain  | 
|
426  | 
||
427  | 
\begin{plstx}[margin=1cm]
 | 
|
428  | 
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | 
|
429  | 
\end{plstx} 
 | 
|
430  | 
||
431  | 
\noindent  | 
|
432  | 
Unfortunately $\meta{B'}$ is also used in the rule 
 | 
|
433  | 
||
434  | 
\begin{plstx}[margin=1cm]
 | 
|
435  | 
  : \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
 | 
|
436  | 
\end{plstx}
 | 
|
437  | 
||
438  | 
\noindent  | 
|
439  | 
For this we repeat the transformation, giving  | 
|
440  | 
||
441  | 
\begin{plstx}[margin=1cm]
 | 
|
442  | 
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | 
|
443  | 
\end{plstx}
 | 
|
444  | 
||
445  | 
\noindent  | 
|
446  | 
In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
 | 
|
447  | 
was again empty. Once no rule is left over, we can simply throw  | 
|
448  | 
away the $\epsilon$ rule. This gives the grammar  | 
|
449  | 
||
450  | 
\begin{plstx}[margin=1cm]
 | 
|
451  | 
  : \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\
 | 
|
452  | 
  : \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\
 | 
|
453  | 
\end{plstx}
 | 
|
454  | 
||
455  | 
\noindent  | 
|
456  | 
I let you think about whether this grammar can still recognise all  | 
|
457  | 
binary numbers and whether this grammar is non-left-recursive. The  | 
|
458  | 
precise statement for the transformation of removing $\epsilon$-rules is  | 
|
459  | 
that if the original grammar was able to recognise only non-empty  | 
|
460  | 
strings, then the transformed grammar will be equivalent (matching the  | 
|
461  | 
same set of strings); if the original grammar was able to match the  | 
|
462  | 
empty string, then the transformed grammar will be able to match the  | 
|
463  | 
same strings, \emph{except} the empty string. So the  $\epsilon$-removal
 | 
|
464  | 
does not preserve equivalence of grammars, but the small defect with the  | 
|
465  | 
empty string is not important for practical purposes.  | 
|
466  | 
||
467  | 
So why are these transformations all useful? Well apart from making the  | 
|
468  | 
parser combinators work (remember they cannot deal with left-recursion and  | 
|
469  | 
are inefficient with $\epsilon$-rules), a second reason is that they help  | 
|
470  | 
with getting any insight into the complexity of the parsing problem.  | 
|
471  | 
The parser combinators are very easy to implement, but are far from the  | 
|
472  | 
most efficient way of processing input (they can blow up exponentially  | 
|
473  | 
with ambiguous grammars). The question remains what is the best possible  | 
|
474  | 
complexity for parsing? It turns out that this is $O(n^3)$ for context-free  | 
|
475  | 
languages.  | 
|
476  | 
||
477  | 
To answer the question about complexity, let me describe next the CYK  | 
|
478  | 
algorithm (named after the authors Cocke–Younger–Kasami). This algorithm  | 
|
| 681 | 479  | 
works with grammars that are in \emph{Chomsky normalform}. In Chomsky
 | 
480  | 
normalform all rules must be of the form $\meta{A} ::= a$, where $a$ is 
 | 
|
481  | 
a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and
 | 
|
482  | 
$\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$.
 | 
|
483  | 
The following grammar is in Chomsky normalform:  | 
|
484  | 
||
485  | 
\begin{plstx}[margin=1cm]
 | 
|
| 682 | 486  | 
  : \meta{S} ::= \meta{N}\cdot \meta{P}\\
 | 
487  | 
  : \meta{P} ::= \meta{V}\cdot \meta{N}\\
 | 
|
488  | 
  : \meta{N} ::= \meta{N}\cdot \meta{N}\\
 | 
|
489  | 
  : \meta{N} ::= \meta{A}\cdot \meta{N}\\
 | 
|
490  | 
  : \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} 
 | 
|
491  | 
                | \texttt{trains}\\
 | 
|
492  | 
  : \meta{V} ::= \texttt{trains} | \texttt{team}\\
 | 
|
493  | 
  : \meta{A} ::= \texttt{The} | \texttt{the}\\
 | 
|
| 681 | 494  | 
\end{plstx}
 | 
495  | 
||
496  | 
\noindent  | 
|
497  | 
where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,
 | 
|
498  | 
$\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''
 | 
|
499  | 
are terminals. The rough idea behind this grammar is that $\meta{S}$
 | 
|
500  | 
stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun
 | 
|
501  | 
and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot
 | 
|
502  | 
\meta{N}$} states that a predicate can be a verb followed by a noun.
 | 
|
503  | 
Now the question is whether the string  | 
|
504  | 
||
505  | 
\begin{center}
 | 
|
506  | 
  \texttt{The trainer trains the student team}
 | 
|
507  | 
\end{center}
 | 
|
508  | 
||
509  | 
\noindent  | 
|
510  | 
is recognised by the grammar. The CYK algorithm starts with the  | 
|
511  | 
following triangular data structure.  | 
|
| 680 | 512  | 
|
| 798 | 513  | 
%%\begin{figure}[t]
 | 
| 682 | 514  | 
\begin{center}
 | 
| 798 | 515  | 
  \begin{tikzpicture}[scale=0.7,line width=0.8mm]
 | 
| 682 | 516  | 
\draw (-2,0) -- (4,0);  | 
517  | 
\draw (-2,1) -- (4,1);  | 
|
518  | 
\draw (-2,2) -- (3,2);  | 
|
519  | 
\draw (-2,3) -- (2,3);  | 
|
520  | 
\draw (-2,4) -- (1,4);  | 
|
521  | 
\draw (-2,5) -- (0,5);  | 
|
522  | 
\draw (-2,6) -- (-1,6);  | 
|
523  | 
||
524  | 
\draw (0,0) -- (0, 5);  | 
|
525  | 
\draw (1,0) -- (1, 4);  | 
|
526  | 
\draw (2,0) -- (2, 3);  | 
|
527  | 
\draw (3,0) -- (3, 2);  | 
|
528  | 
\draw (4,0) -- (4, 1);  | 
|
529  | 
\draw (-1,0) -- (-1, 6);  | 
|
530  | 
\draw (-2,0) -- (-2, 6);  | 
|
531  | 
||
532  | 
  \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
 | 
|
533  | 
  \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
 | 
|
534  | 
  \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
 | 
|
535  | 
  \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
 | 
|
536  | 
  \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
 | 
|
537  | 
  \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
 | 
|
538  | 
||
539  | 
  \draw (-1.5,0.5) node {$A$}; 
 | 
|
540  | 
  \draw (-0.5,0.5) node {$N$}; 
 | 
|
541  | 
  \draw ( 0.5,0.5) node {$N,V$}; 
 | 
|
542  | 
  \draw ( 1.5,0.5) node {$A$}; 
 | 
|
543  | 
  \draw ( 2.5,0.5) node {$N$}; 
 | 
|
544  | 
  \draw ( 3.5,0.5) node {$N,V$};
 | 
|
545  | 
||
| 798 | 546  | 
%  \draw (-1.5,1.5) node {\small{}a}; 
 | 
547  | 
%  \draw (-0.5,1.5) node {\small{}b}; 
 | 
|
548  | 
%  \draw ( 0.5,1.5) node {\small{}c}; 
 | 
|
549  | 
%  \draw ( 1.5,1.5) node {\small{}d}; 
 | 
|
550  | 
%  \draw ( 2.5,1.5) node {\small{}e}; 
 | 
|
551  | 
||
| 682 | 552  | 
  \draw (-2.4, 5.5) node {$1$}; 
 | 
553  | 
  \draw (-2.4, 4.5) node {$2$}; 
 | 
|
554  | 
  \draw (-2.4, 3.5) node {$3$}; 
 | 
|
555  | 
  \draw (-2.4, 2.5) node {$4$}; 
 | 
|
556  | 
  \draw (-2.4, 1.5) node {$5$}; 
 | 
|
557  | 
  \draw (-2.4, 0.5) node {$6$}; 
 | 
|
558  | 
  \end{tikzpicture}
 | 
|
559  | 
  \end{center}
 | 
|
| 798 | 560  | 
%%\end{figure}
 | 
561  | 
||
562  | 
||
563  | 
\noindent  | 
|
564  | 
The last row contains the information about all words and their  | 
|
565  | 
corresponding non-terminals. For example the field for \texttt{trains}
 | 
|
566  | 
contains the information $\meta{N}$ and $\meta{V}$ because it can be a
 | 
|
567  | 
``verb'' and a ``noun'' according to the grammar. The row above,  | 
|
568  | 
let's call the corresponding fields 5a to 5e, contains information  | 
|
569  | 
about 2-word parts of the sentence, namely  | 
|
570  | 
||
571  | 
\begin{center}
 | 
|
572  | 
\begin{tabular}{llll}
 | 
|
573  | 
  5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$   \\
 | 
|
574  | 
  5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
 | 
|
575  | 
  5c) & \texttt{trains} $\mid$ \texttt{the}    \\
 | 
|
576  | 
  5d) & \texttt{the} $\mid$ \texttt{student}   \\
 | 
|
577  | 
  5e) & \texttt{student} $\mid$ \texttt{team}  \\
 | 
|
578  | 
\end{tabular}  
 | 
|
579  | 
\end{center}  
 | 
|
580  | 
||
581  | 
\noindent  | 
|
582  | 
For each of them, we look up in row 6 which non-terminals it belongs to  | 
|
583  | 
(indicated above for 5a and 5b). For 5a, with the non-terminals  | 
|
584  | 
\meta{A} and \meta{N}, we find the grammar rule
 | 
|
585  | 
||
586  | 
\begin{plstx}[margin=1cm]
 | 
|
587  | 
  : \meta{N} ::= \meta{A}\cdot \meta{N}\\
 | 
|
588  | 
\end{plstx}
 | 
|
589  | 
||
590  | 
\noindent  | 
|
591  | 
which means into field 5a we put the left-hand side of this rule,  | 
|
592  | 
which in this case is the non-terminal \meta{N}. For 5b we have to check
 | 
|
593  | 
for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there
 | 
|
594  | 
is a right-hand side of this form in the grammar. But only the  | 
|
595  | 
grammar rule  | 
|
596  | 
||
597  | 
\begin{plstx}[margin=1cm]
 | 
|
598  | 
  : \meta{N} ::= \meta{N}\cdot \meta{N}\\
 | 
|
599  | 
\end{plstx}
 | 
|
600  | 
||
601  | 
\noindent  | 
|
602  | 
matches, which means 5b gets also an \meta{N}. Continuing for all
 | 
|
603  | 
fields in row 5 gives:  | 
|
604  | 
||
605  | 
\begin{center}
 | 
|
606  | 
  \begin{tikzpicture}[scale=0.7,line width=0.8mm]
 | 
|
607  | 
\draw (-2,0) -- (4,0);  | 
|
608  | 
\draw (-2,1) -- (4,1);  | 
|
609  | 
\draw (-2,2) -- (3,2);  | 
|
610  | 
\draw (-2,3) -- (2,3);  | 
|
611  | 
\draw (-2,4) -- (1,4);  | 
|
612  | 
\draw (-2,5) -- (0,5);  | 
|
613  | 
\draw (-2,6) -- (-1,6);  | 
|
614  | 
||
615  | 
\draw (0,0) -- (0, 5);  | 
|
616  | 
\draw (1,0) -- (1, 4);  | 
|
617  | 
\draw (2,0) -- (2, 3);  | 
|
618  | 
\draw (3,0) -- (3, 2);  | 
|
619  | 
\draw (4,0) -- (4, 1);  | 
|
620  | 
\draw (-1,0) -- (-1, 6);  | 
|
621  | 
\draw (-2,0) -- (-2, 6);  | 
|
622  | 
||
623  | 
  \draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; 
 | 
|
624  | 
  \draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; 
 | 
|
625  | 
  \draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; 
 | 
|
626  | 
  \draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; 
 | 
|
627  | 
  \draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; 
 | 
|
628  | 
  \draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
 | 
|
629  | 
||
630  | 
  \draw (-1.5,0.5) node {$A$}; 
 | 
|
631  | 
  \draw (-0.5,0.5) node {$N$}; 
 | 
|
632  | 
  \draw ( 0.5,0.5) node {$N,V$}; 
 | 
|
633  | 
  \draw ( 1.5,0.5) node {$A$}; 
 | 
|
634  | 
  \draw ( 2.5,0.5) node {$N$}; 
 | 
|
635  | 
  \draw ( 3.5,0.5) node {$N,V$};
 | 
|
636  | 
||
637  | 
  \draw (-1.5,1.5) node {$N$}; 
 | 
|
638  | 
  \draw (-0.5,1.5) node {$N$}; 
 | 
|
639  | 
  \draw ( 0.5,1.5) node {$$}; 
 | 
|
640  | 
  \draw ( 1.5,1.5) node {$N$}; 
 | 
|
641  | 
  \draw ( 2.5,1.5) node {$N$}; 
 | 
|
642  | 
||
643  | 
||
644  | 
%  \draw (-1.5,1.5) node {\small{}a}; 
 | 
|
645  | 
%  \draw (-0.5,1.5) node {\small{}b}; 
 | 
|
646  | 
%  \draw ( 0.5,1.5) node {\small{}c}; 
 | 
|
647  | 
%  \draw ( 1.5,1.5) node {\small{}d}; 
 | 
|
648  | 
%  \draw ( 2.5,1.5) node {\small{}e}; 
 | 
|
649  | 
||
650  | 
  \draw (-2.4, 5.5) node {$1$}; 
 | 
|
651  | 
  \draw (-2.4, 4.5) node {$2$}; 
 | 
|
652  | 
  \draw (-2.4, 3.5) node {$3$}; 
 | 
|
653  | 
  \draw (-2.4, 2.5) node {$4$}; 
 | 
|
654  | 
  \draw (-2.4, 1.5) node {$5$}; 
 | 
|
655  | 
  \draw (-2.4, 0.5) node {$6$}; 
 | 
|
656  | 
  \end{tikzpicture}
 | 
|
657  | 
  \end{center}
 | 
|
658  | 
||
659  | 
\noindent  | 
|
660  | 
Now row 4 is in charge of all 3-word parts of the sentence, namely  | 
|
661  | 
||
662  | 
\begin{center}
 | 
|
663  | 
\begin{tabular}{llll}
 | 
|
664  | 
4a) & The $\mid$ trainer trains\\  | 
|
665  | 
& The trainer $\mid$ trains\\  | 
|
666  | 
4b) & trainer $\mid$ trains the\\  | 
|
667  | 
& trainer trains $\mid$ the\\  | 
|
668  | 
4c) & trains $\mid$ the student\\  | 
|
669  | 
& trains the $\mid$ student\\  | 
|
670  | 
4d) & the $\mid$ student team\\  | 
|
671  | 
& the student $\mid$ team\\  | 
|
672  | 
\end{tabular}  
 | 
|
673  | 
\end{center}
 | 
|
674  | 
||
675  | 
\noindent  | 
|
676  | 
Note that in case of 3-word parts we have two splits. For example for  | 
|
677  | 
4a: $\underbrace{\texttt{The}}_{A}$ and
 | 
|
678  | 
$\underbrace{\texttt{trainer trains}}_{N}$; and also
 | 
|
679  | 
$\underbrace{\texttt{The trainer}}_{N}$ and
 | 
|
680  | 
$\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we have
 | 
|
681  | 
to look up in the rows below, which non-terminals we already  | 
|
682  | 
computed. This allows us to look for right-hand sides in our grammar:  | 
|
683  | 
$\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and
 | 
|
684  | 
$\meta{N}\cdot\meta{V}$, which yield the only left-hand side
 | 
|
685  | 
\meta{N}. This is what we fill in for 4a. And so on for row 4.
 | 
|
686  | 
||
687  | 
Row 3 is about all 4-word parts in the sentence, namely  | 
|
688  | 
||
689  | 
\begin{center}
 | 
|
690  | 
\begin{tabular}{llll}
 | 
|
691  | 
3a) & The trainer trains the\\  | 
|
692  | 
3b) & trainer trains the student\\  | 
|
693  | 
3c) & trains the student team\\  | 
|
694  | 
\end{tabular}  
 | 
|
695  | 
\end{center}
 | 
|
696  | 
||
697  | 
\noindent  | 
|
698  | 
Each of them can be split up in 3 ways, for example  | 
|
699  | 
||
700  | 
\begin{center}
 | 
|
701  | 
\begin{tabular}{llll}
 | 
|
702  | 
3a) & The $\mid$ trainer trains the\\  | 
|
703  | 
& The trainer $\mid$ trains the\\  | 
|
704  | 
& The trainer trains $\mid$ the\\  | 
|
705  | 
\end{tabular}  
 | 
|
706  | 
\end{center}
 | 
|
707  | 
||
708  | 
\noindent  | 
|
709  | 
and we have to consider them all in turn to fill in the non-terminals for  | 
|
710  | 
3a. You can guess how it continues: row 2 is for all 5-word parts, and finally  | 
|
711  | 
the field on the top is for the whole sentence.  | 
|
712  | 
The idea of the CYK algorithm is that if in the top-field the starting  | 
|
713  | 
symbol \meta{S} appears (possibly together with other non-terminals),
 | 
|
714  | 
then the sentence is accepted by the grammar. If it does not, then the  | 
|
715  | 
sentence is not accepted.  | 
|
716  | 
||
717  | 
Let us very quickly calculate the complexity of the CYK  | 
|
718  | 
algorithm. Lookup operations inside the triangle and in the grammar  | 
|
719  | 
are assumed to be of constant time, $O(1)$, meaning they do not  | 
|
720  | 
matter. How many fields are in the triangle\ldots  | 
|
721  | 
$\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That means
 | 
|
722  | 
roughly $O(n^2)$ fields. How much work do we have to do for each  | 
|
723  | 
field? Well, for the top-most we have to consider $n-1$ splits, which  | 
|
724  | 
means roughly $O(n)$ for each field. The overall result is a $O(n^3)$  | 
|
725  | 
time-complexity for CYK. It turns out that this is the best worst-time  | 
|
726  | 
complexity for parsing with context-free grammars.  | 
|
| 680 | 727  | 
|
728  | 
\end{document}
 | 
|
729  | 
||
730  | 
||
731  | 
%%% Parser combinators are now part of handout 6  | 
|
| 459 | 732  | 
|
733  | 
\subsection*{Parser Combinators}
 | 
|
734  | 
||
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
735  | 
Let us now turn to the problem of generating a parse-tree for  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
736  | 
a grammar and string. In what follows we explain \emph{parser
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
737  | 
combinators}, because they are easy to implement and closely  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
738  | 
resemble grammar rules. Imagine that a grammar describes the  | 
| 665 | 739  | 
strings of natural numbers, such as the grammar \meta{N}  shown
 | 
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
740  | 
above. For all such strings we want to generate the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
741  | 
parse-trees or later on we actually want to extract the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
742  | 
meaning of these strings, that is the concrete integers  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
743  | 
``behind'' these strings. In Scala the parser combinators will  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
744  | 
be functions of type  | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
745  | 
|
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
746  | 
\begin{center}
 | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
747  | 
\texttt{I $\Rightarrow$ Set[(T, I)]}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
748  | 
\end{center}
 | 
| 
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
749  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
750  | 
\noindent that is they take as input something of type  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
751  | 
\texttt{I}, typically a list of tokens or a string, and return
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
752  | 
a set of pairs. The first component of these pairs corresponds  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
753  | 
to what the parser combinator was able to process from the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
754  | 
input and the second is the unprocessed part of the input. As  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
755  | 
we shall see shortly, a parser combinator might return more  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
756  | 
than one such pair, with the idea that there are potentially  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
757  | 
several ways how to interpret the input. As a concrete  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
758  | 
example, consider the case where the input is of type string,  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
759  | 
say the string  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
760  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
761  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
762  | 
\tt\Grid{iffoo\VS testbar}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
763  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
764  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
765  | 
\noindent We might have a parser combinator which tries to  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
766  | 
interpret this string as a keyword (\texttt{if}) or an
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
767  | 
identifier (\texttt{iffoo}). Then the output will be the set
 | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
768  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
769  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
770  | 
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
771  | 
           \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
772  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
773  | 
|
| 
362
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
774  | 
\noindent where the first pair means the parser could  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
775  | 
recognise \texttt{if} from the input and leaves the rest as
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
776  | 
`unprocessed' as the second component of the pair; in the  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
777  | 
other case it could recognise \texttt{iffoo} and leaves
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
778  | 
\texttt{\VS testbar} as unprocessed. If the parser cannot
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
779  | 
recognise anything from the input then parser combinators just  | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
780  | 
return the empty set $\{\}$. This will indicate
 | 
| 
 
57ea439feaff
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
360 
diff
changeset
 | 
781  | 
something ``went wrong''.  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
782  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
783  | 
The main attraction is that we can easily build parser combinators out of smaller components  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
784  | 
following very closely the structure of a grammar. In order to implement this in an object  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
785  | 
oriented programming language, like Scala, we need to specify an abstract class for parser  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
786  | 
combinators. This abstract class requires the implementation of the function  | 
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
787  | 
\texttt{parse} taking an argument of type \texttt{I} and returns a set of type  
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
788  | 
\mbox{\texttt{Set[(T, I)]}}.
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
789  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
790  | 
\begin{center}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
791  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
792  | 
abstract class Parser[I, T] {
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
793  | 
def parse(ts: I): Set[(T, I)]  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
794  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
795  | 
def parse_all(ts: I): Set[T] =  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
796  | 
for ((head, tail) <- parse(ts); if (tail.isEmpty))  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
797  | 
yield head  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
798  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
799  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
800  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
801  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
802  | 
\noindent  | 
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
803  | 
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
804  | 
which just filters out all pairs whose second component is not empty (that is has still some  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
805  | 
unprocessed part). The reason is that at the end of parsing we are only interested in the  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
806  | 
results where all the input has been consumed and no unprocessed part is left.  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
807  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
808  | 
One of the simplest parser combinators recognises just a character, say $c$,  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
809  | 
from the beginning of strings. Its behaviour is as follows:  | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
810  | 
|
| 
177
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
811  | 
\begin{itemize}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
812  | 
\item if the head of the input string starts with a $c$, it returns  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
813  | 
	the set $\{(c, \textit{tail of}\; s)\}$
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
814  | 
\item otherwise it returns the empty set $\varnothing$  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
815  | 
\end{itemize}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
816  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
817  | 
\noindent  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
818  | 
The input type of this simple parser combinator for characters is  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
819  | 
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
820  | 
The code in Scala is as follows:  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
821  | 
|
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
822  | 
\begin{center}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
823  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
824  | 
case class CharParser(c: Char) extends Parser[String, Char] {
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
825  | 
def parse(sb: String) =  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
826  | 
if (sb.head == c) Set((c, sb.tail)) else Set()  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
827  | 
}  | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
828  | 
\end{lstlisting}
 | 
| 
 
53def1fbf472
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
176 
diff
changeset
 | 
829  | 
\end{center}
 | 
| 
176
 
3c2653fc8b5a
updated
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
175 
diff
changeset
 | 
830  | 
|
| 
183
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
831  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
832  | 
The \texttt{parse} function tests whether the first character of the 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
833  | 
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
834  | 
string into the recognised part \texttt{c} and the unprocessed part
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
835  | 
\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
836  | 
the parser returns the empty set (in Scala \texttt{Set()}).
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
837  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
838  | 
More interesting are the parser combinators that build larger parsers  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
839  | 
out of smaller component parsers. For example the alternative  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
840  | 
parser combinator is as follows.  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
841  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
842  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
843  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
844  | 
class AltParser[I, T]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
845  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
846  | 
        q: => Parser[I, T]) extends Parser[I, T] {
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
847  | 
def parse(sb: I) = p.parse(sb) ++ q.parse(sb)  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
848  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
849  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
850  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
851  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
852  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
853  | 
The types of this parser combinator are polymorphic (we just have \texttt{I}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
854  | 
for the input type, and \texttt{T} for the output type). The alternative parser
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
855  | 
builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
856  | 
Both need to be able to process input of type \texttt{I} and return the same
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
857  | 
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
858  | 
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
859  | 
evaluation of the arguments before they are used. This is often called  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
860  | 
\emph{lazy evaluation} of the arguments.) The alternative parser should run
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
861  | 
the input with the first parser \texttt{p} (producing a set of outputs) and then
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
862  | 
run the same input with \texttt{q}. The result should be then just the union
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
863  | 
of both sets, which is the operation \texttt{++} in Scala.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
864  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
865  | 
This parser combinator already allows us to construct a parser that either  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
866  | 
a character \texttt{a} or \texttt{b}, as
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
867  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
868  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
869  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
870  | 
new AltParser(CharParser('a'), CharParser('b'))
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
871  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
872  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
873  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
874  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
875  | 
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
876  | 
We can call this parser combinator with the strings  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
877  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
878  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
879  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
880  | 
input string & & output\medskip\\  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
881  | 
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
882  | 
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
883  | 
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
884  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
885  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
886  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
887  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
888  | 
We receive in the first two cases a successful output (that is a non-empty set).  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
889  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
890  | 
A bit more interesting is the \emph{sequence parser combinator} implemented in
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
891  | 
Scala as follows:  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
892  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
893  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
894  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
895  | 
class SeqParser[I, T, S]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
896  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
897  | 
        q: => Parser[I, S]) extends Parser[I, (T, S)] {
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
898  | 
def parse(sb: I) =  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
899  | 
for ((head1, tail1) <- p.parse(sb);  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
900  | 
(head2, tail2) <- q.parse(tail1))  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
901  | 
yield ((head1, head2), tail2)  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
902  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
903  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
904  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
905  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
906  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
907  | 
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
908  | 
as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}).
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
909  | 
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. 
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
910  | 
Let \texttt{q} run on these unprocessed parts
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
911  | 
producing again a set of pairs. The output of the sequence parser combinator is then a set  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
912  | 
containing pairs where the first components are again pairs, namely what the first parser could parse  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
913  | 
together with what the second parser could parse; the second component is the unprocessed  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
914  | 
part left over after running the second parser \texttt{q}. Therefore the input type of
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
915  | 
the sequence parser combinator is as usual \texttt{I}, but the output type is
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
916  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
917  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
918  | 
\texttt{Set[((T, S), I)]}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
919  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
920  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
921  | 
Scala allows us to provide some  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
922  | 
shorthand notation for the sequence parser combinator. So we can write for  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
923  | 
example \texttt{'a'  $\sim$ 'b'}, which is the
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
924  | 
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
925  | 
Calling this parser combinator with the strings  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
926  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
927  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
928  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
929  | 
input string & & output\medskip\\  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
930  | 
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
931  | 
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
932  | 
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
933  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
934  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
935  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
936  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
937  | 
A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'} which parses as first character either
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
938  | 
an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
939  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
940  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
941  | 
\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
942  | 
input string & & output\medskip\\  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
943  | 
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
944  | 
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
945  | 
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
946  | 
\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
947  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
948  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
949  | 
Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
950  | 
The first parser has as output type a single character (recall the type of \texttt{CharParser}),
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
951  | 
but the second parser produces a pair of characters as output. The alternative parser is however  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
952  | 
required to have both component parsers to have the same type. We will see later how we can  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
953  | 
build this parser without the typing error.  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
954  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
955  | 
The next parser combinator does not actually combine smaller parsers, but applies  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
956  | 
a function to the result of the parser. It is implemented in Scala as follows  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
957  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
958  | 
\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
959  | 
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
960  | 
class FunParser[I, T, S]  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
961  | 
(p: => Parser[I, T],  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
962  | 
          f: T => S) extends Parser[I, S] {
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
963  | 
def parse(sb: I) =  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
964  | 
for ((head, tail) <- p.parse(sb)) yield (f(head), tail)  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
965  | 
}  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
966  | 
\end{lstlisting}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
967  | 
\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
968  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
969  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
970  | 
\noindent  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
971  | 
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
972  | 
input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
973  | 
produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
974  | 
applies the function \texttt{f} to all the parer outputs. Since this function
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
975  | 
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
976  | 
Again Scala lets us introduce some shorthand notation for this parser combinator.  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
977  | 
Therefore we will write \texttt{p ==> f} for it.
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
978  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
979  | 
%\bigskip  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
980  | 
%takes advantage of the full generality---have a look  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
981  | 
%what it produces if we call it with the string \texttt{abc}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
982  | 
%  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
983  | 
%\begin{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
984  | 
%\begin{tabular}{rcl}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
985  | 
%input string & & output\medskip\\  | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
986  | 
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
987  | 
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
988  | 
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
989  | 
%\end{tabular}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
990  | 
%\end{center}
 | 
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
991  | 
|
| 
 
b17eff695c7f
added new stuff
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents: 
177 
diff
changeset
 | 
992  | 
|
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
993  | 
|
| 680 | 994  | 
|
995  | 
||
| 
173
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
996  | 
|
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
997  | 
%%% Local Variables:  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
998  | 
%%% mode: latex  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
999  | 
%%% TeX-master: t  | 
| 
 
7cfb7a6f7c99
added slides
 
Christian Urban <christian dot urban at kcl dot ac dot uk> 
parents:  
diff
changeset
 | 
1000  | 
%%% End:  | 
| 680 | 1001  |