author | Christian Urban <christian.urban@kcl.ac.uk> |
Sat, 23 Sep 2023 21:22:17 +0100 | |
changeset 925 | ddb521b57e0c |
parent 798 | aaf0bd0a211d |
child 937 | dc5ab66b11cc |
permissions | -rw-r--r-- |
665 | 1 |
% !TEX program = xelatex |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
2 |
\documentclass{article} |
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
3 |
\usepackage{../style} |
217
cd6066f1056a
updated handouts
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
183
diff
changeset
|
4 |
\usepackage{../langs} |
459 | 5 |
\usepackage{../grammar} |
798 | 6 |
\usepackage{../graphics} |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
7 |
|
545 | 8 |
% epsilon and left-recursion elimination |
9 |
% http://www.mollypages.org/page/grammar/index.mp |
|
10 |
||
618 | 11 |
%% parsing scala files |
691 | 12 |
%% https://scalameta.org/ |
13 |
||
14 |
% chomsky normal form transformation |
|
15 |
% https://youtu.be/FNPSlnj3Vt0 |
|
618 | 16 |
|
710 | 17 |
% Language hierachy is about complexity |
18 |
% How hard is it to recognise an element in a language |
|
19 |
||
722 | 20 |
% Pratt parsing |
21 |
% https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html |
|
22 |
% https://www.oilshell.org/blog/2017/03/31.html |
|
23 |
||
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
24 |
\begin{document} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
25 |
|
385
7f8516ff408d
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
362
diff
changeset
|
26 |
\section*{Handout 5 (Grammars \& Parser)} |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
27 |
|
727 | 28 |
So far we have focused on regular expressions as well as matching and |
29 |
lexing algorithms. While regular expressions are very useful for lexing |
|
30 |
and for recognising many patterns in strings (like email addresses), |
|
31 |
they have their limitations. For example there is no regular expression |
|
32 |
that can recognise the language $a^nb^n$ (where you have strings |
|
33 |
starting with $n$ $a$'s followed by the same amount of $b$'s). Another |
|
34 |
example for which there exists no regular expression is the language of |
|
35 |
well-parenthesised expressions. In languages like Lisp, which use |
|
36 |
parentheses rather extensively, it might be of interest to know whether |
|
37 |
the following two expressions are well-parenthesised or not (the left |
|
38 |
one is, the right one is not): |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
39 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
40 |
\begin{center} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
41 |
$(((()()))())$ \hspace{10mm} $(((()()))()))$ |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
42 |
\end{center} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
43 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
44 |
\noindent Not being able to solve such recognition problems is |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
45 |
a serious limitation. In order to solve such recognition |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
46 |
problems, we need more powerful techniques than regular |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
47 |
expressions. We will in particular look at \emph{context-free |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
48 |
languages}. They include the regular languages as the picture |
582 | 49 |
below about language classes shows: |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
50 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
51 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
52 |
\begin{center} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
53 |
\begin{tikzpicture} |
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
54 |
[rect/.style={draw=black!50, |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
55 |
top color=white,bottom color=black!20, |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
56 |
rectangle, very thick, rounded corners}] |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
57 |
|
297
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
58 |
\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages}; |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
59 |
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages}; |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
60 |
\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages}; |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
61 |
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages}; |
5c51839c88fd
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
292
diff
changeset
|
62 |
\draw (0,-1.05) node [rect] {\small regular languages}; |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
63 |
\end{tikzpicture} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
64 |
\end{center} |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
65 |
|
582 | 66 |
\noindent Each ``bubble'' stands for sets of languages (remember |
67 |
languages are sets of strings). As indicated the set of regular |
|
680 | 68 |
languages is fully included inside the context-free languages, |
582 | 69 |
meaning every regular language is also context-free, but not vice |
665 | 70 |
versa. Below I will let you think, for example, what the context-free |
582 | 71 |
grammar is for the language corresponding to the regular expression |
72 |
$(aaa)^*a$. |
|
73 |
||
74 |
Because of their convenience, context-free languages play an important |
|
75 |
role in `day-to-day' text processing and in programming |
|
76 |
languages. Context-free in this setting means that ``words'' have one |
|
680 | 77 |
meaning only and this meaning is independent from the context |
78 |
the ``words'' appear in. For example ambiguity issues like |
|
582 | 79 |
|
80 |
\begin{center} |
|
682 | 81 |
\tt Time flies like an arrow. Fruit flies like bananas. |
582 | 82 |
\end{center} |
83 |
||
84 |
\noindent |
|
680 | 85 |
from natural languages were the meaning of \emph{flies} depends on the |
686 | 86 |
surrounding \emph{context} are avoided as much as possible. Here is |
87 |
an interesting video about C++ being not a context-free language |
|
88 |
||
89 |
\begin{center} |
|
90 |
\url{https://www.youtube.com/watch?v=OzK8pUu4UfM} |
|
91 |
\end{center} |
|
582 | 92 |
|
93 |
Context-free languages are usually specified by grammars. For example |
|
94 |
a grammar for well-parenthesised expressions can be given as follows: |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
95 |
|
459 | 96 |
\begin{plstx}[margin=3cm] |
97 |
: \meta{P} ::= ( \cdot \meta{P} \cdot ) \cdot \meta{P} |
|
98 |
| \epsilon\\ |
|
99 |
\end{plstx} |
|
100 |
||
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
101 |
\noindent |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
102 |
or a grammar for recognising strings consisting of ones is |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
103 |
|
459 | 104 |
\begin{plstx}[margin=3cm] |
105 |
: \meta{O} ::= 1 \cdot \meta{O} |
|
106 |
| 1\\ |
|
107 |
\end{plstx} |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
108 |
|
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
109 |
In general grammars consist of finitely many rules built up |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
110 |
from \emph{terminal symbols} (usually lower-case letters) and |
582 | 111 |
\emph{non-terminal symbols} (upper-case letters written in |
112 |
bold like \meta{A}, \meta{N} and so on). Rules have |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
113 |
the shape |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
114 |
|
459 | 115 |
\begin{plstx}[margin=3cm] |
116 |
: \meta{NT} ::= rhs\\ |
|
117 |
\end{plstx} |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
118 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
119 |
\noindent where on the left-hand side is a single non-terminal |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
120 |
and on the right a string consisting of both terminals and |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
121 |
non-terminals including the $\epsilon$-symbol for indicating |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
122 |
the empty string. We use the convention to separate components |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
123 |
on the right hand-side by using the $\cdot$ symbol, as in the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
124 |
grammar for well-parenthesised expressions. We also use the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
125 |
convention to use $|$ as a shorthand notation for several |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
126 |
rules. For example |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
127 |
|
459 | 128 |
\begin{plstx}[margin=3cm] |
129 |
: \meta{NT} ::= rhs_1 |
|
130 |
| rhs_2\\ |
|
131 |
\end{plstx} |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
132 |
|
459 | 133 |
\noindent means that the non-terminal \meta{NT} can be replaced by |
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
134 |
either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
135 |
than one non-terminal on the left-hand side of the rules, then |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
136 |
we need to indicate what is the \emph{starting} symbol of the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
137 |
grammar. For example the grammar for arithmetic expressions |
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
138 |
can be given as follows |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
139 |
|
459 | 140 |
\begin{plstx}[margin=3cm,one per line] |
141 |
\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\ |
|
142 |
\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\ |
|
143 |
\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\ |
|
144 |
\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\ |
|
145 |
\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\ |
|
146 |
\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N} |
|
147 |
\mid 0 \mid 1 \mid \ldots \mid 9\\ |
|
148 |
\end{plstx} |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
149 |
|
459 | 150 |
\noindent where \meta{E} is the starting symbol. A |
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
151 |
\emph{derivation} for a grammar starts with the starting |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
152 |
symbol of the grammar and in each step replaces one |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
153 |
non-terminal by a right-hand side of a rule. A derivation ends |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
154 |
with a string in which only terminal symbols are left. For |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
155 |
example a derivation for the string $(1 + 2) + 3$ is as |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
156 |
follows: |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
157 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
158 |
\begin{center} |
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
159 |
\begin{tabular}{lll@{\hspace{2cm}}l} |
459 | 160 |
\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$ & by (2)\\ |
161 |
& $\rightarrow$ & $(\meta{E})+\meta{E}$ & by (5)\\ |
|
162 |
& $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$ & by (2)\\ |
|
163 |
& $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$ & by (1)\\ |
|
164 |
& $\rightarrow$ & $(\meta{E}+\meta{E})+3$ & by (6\dots)\\ |
|
165 |
& $\rightarrow$ & $(\meta{N}+\meta{E})+3$ & by (1)\\ |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
166 |
& $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\ |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
167 |
\end{tabular} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
168 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
169 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
170 |
\noindent where on the right it is indicated which |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
171 |
grammar rule has been applied. In the last step we |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
172 |
merged several steps into one. |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
173 |
|
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
174 |
The \emph{language} of a context-free grammar $G$ |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
175 |
with start symbol $S$ is defined as the set of strings |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
176 |
derivable by a derivation, that is |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
177 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
178 |
\begin{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
179 |
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
180 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
181 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
182 |
\noindent |
680 | 183 |
A \emph{parse-tree} encodes how a string is derived with the starting |
184 |
symbol on top and each non-terminal containing a subtree for how it is |
|
185 |
replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is |
|
186 |
as follows: |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
187 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
188 |
\begin{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
189 |
\begin{tikzpicture}[level distance=8mm, black] |
665 | 190 |
\node {\meta{E}} |
191 |
child {node {\meta{E} } |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
192 |
child {node {$($}} |
665 | 193 |
child {node {\meta{E} } |
194 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
195 |
child {node {$+$}} |
665 | 196 |
child {node {\meta{E} } |
197 |
child {node {\meta{N} } child {node {$2$}}} |
|
198 |
child {node {\meta{N} } child {node {$3$}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
199 |
} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
200 |
} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
201 |
child {node {$)$}} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
202 |
} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
203 |
child {node {$+$}} |
665 | 204 |
child {node {\meta{E} } |
205 |
child {node {\meta{N} } child {node {$4$}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
206 |
}; |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
207 |
\end{tikzpicture} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
208 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
209 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
210 |
\noindent We are often interested in these parse-trees since |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
211 |
they encode the structure of how a string is derived by a |
680 | 212 |
grammar. |
213 |
||
214 |
Before we come to the problem of constructing such parse-trees, we need |
|
215 |
to consider the following two properties of grammars. A grammar is |
|
216 |
\emph{left-recursive} if there is a derivation starting from a |
|
217 |
non-terminal, say \meta{NT} which leads to a string which again starts |
|
218 |
with \meta{NT}. This means a derivation of the form. |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
219 |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
220 |
\begin{center} |
665 | 221 |
$\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$ |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
222 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
223 |
|
680 | 224 |
\noindent It can be easily seen that the grammar above for arithmetic |
225 |
expressions is left-recursive: for example the rules $\meta{E} |
|
226 |
\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow |
|
227 |
\meta{N}\cdot \meta{N}$ show that this grammar is left-recursive. But |
|
228 |
note that left-recursiveness can involve more than one step in the |
|
229 |
derivation. The problem with left-recursive grammars is that some |
|
230 |
algorithms cannot cope with them: with left-recursive grammars they will |
|
231 |
fall into a loop. Fortunately every left-recursive grammar can be |
|
232 |
transformed into one that is not left-recursive, although this |
|
233 |
transformation might make the grammar less ``human-readable''. For |
|
234 |
example if we want to give a non-left-recursive grammar for numbers we |
|
235 |
might specify |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
236 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
237 |
\begin{center} |
665 | 238 |
$\meta{N} \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\; |
239 |
1\cdot \meta{N}\;|\;2\cdot \meta{N}\;|\;\ldots\;|\;9\cdot \meta{N}$ |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
240 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
241 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
242 |
\noindent Using this grammar we can still derive every number |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
243 |
string, but we will never be able to derive a string of the |
665 | 244 |
form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$. |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
245 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
246 |
The other property we have to watch out for is when a grammar |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
247 |
is \emph{ambiguous}. A grammar is said to be ambiguous if |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
248 |
there are two parse-trees for one string. Again the grammar |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
249 |
for arithmetic expressions shown above is ambiguous. While the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
250 |
shown parse tree for the string $(1 + 23) + 4$ is unique, this |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
251 |
is not the case in general. For example there are two parse |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
252 |
trees for the string $1 + 2 + 3$, namely |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
253 |
|
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
254 |
\begin{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
255 |
\begin{tabular}{c@{\hspace{10mm}}c} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
256 |
\begin{tikzpicture}[level distance=8mm, black] |
665 | 257 |
\node {\meta{E} } |
258 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
259 |
child {node {$+$}} |
665 | 260 |
child {node {\meta{E} } |
261 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
262 |
child {node {$+$}} |
665 | 263 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}} |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
264 |
} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
265 |
; |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
266 |
\end{tikzpicture} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
267 |
& |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
268 |
\begin{tikzpicture}[level distance=8mm, black] |
665 | 269 |
\node {\meta{E} } |
270 |
child {node {\meta{E} } |
|
271 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}} |
|
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
272 |
child {node {$+$}} |
665 | 273 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}} |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
274 |
} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
275 |
child {node {$+$}} |
665 | 276 |
child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}} |
175
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
277 |
; |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
278 |
\end{tikzpicture} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
279 |
\end{tabular} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
280 |
\end{center} |
5801e8c0e528
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
173
diff
changeset
|
281 |
|
680 | 282 |
\noindent In particular in programming languages we will try to avoid |
283 |
ambiguous grammars because two different parse-trees for a string mean a |
|
284 |
program can be interpreted in two different ways. In such cases we have |
|
285 |
to somehow make sure the two different ways do not matter, or |
|
286 |
disambiguate the grammar in some other way (for example making the $+$ |
|
287 |
left-associative). Unfortunately already the problem of deciding whether |
|
288 |
a grammar is ambiguous or not is in general undecidable. But in simple |
|
289 |
instance (the ones we deal with in this module) one can usually see when |
|
290 |
a grammar is ambiguous. |
|
291 |
||
292 |
\subsection*{Removing Left-Recursion} |
|
293 |
||
294 |
Let us come back to the problem of left-recursion and consider the |
|
295 |
following grammar for binary numbers: |
|
296 |
||
297 |
\begin{plstx}[margin=1cm] |
|
298 |
: \meta{B} ::= \meta{B} \cdot \meta{B} | 0 | 1\\ |
|
299 |
\end{plstx} |
|
300 |
||
301 |
\noindent |
|
302 |
It is clear that this grammar can create all binary numbers, but |
|
303 |
it is also clear that this grammar is left-recursive. Giving this |
|
304 |
grammar as is to parser combinators will result in an infinite |
|
305 |
loop. Fortunately, every left-recursive grammar can be translated |
|
306 |
into one that is not left-recursive with the help of some |
|
307 |
transformation rules. Suppose we identified the ``offensive'' |
|
308 |
rule, then we can separate the grammar into this offensive rule |
|
309 |
and the ``rest'': |
|
310 |
||
311 |
\begin{plstx}[margin=1cm] |
|
312 |
: \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}} |
|
313 |
| \underbrace{0 \;\;|\;\; 1}_{\textit{rest}}\\ |
|
314 |
\end{plstx} |
|
315 |
||
316 |
\noindent |
|
317 |
To make the idea of the transformation clearer, suppose the left-recursive |
|
318 |
rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal |
|
319 |
followed by something called $\alpha$) and the ``rest'' is called $\beta$. |
|
320 |
That means our grammar looks schematically as follows |
|
321 |
||
322 |
\begin{plstx}[margin=1cm] |
|
323 |
: \meta{B} ::= \meta{B} \cdot \alpha | \beta\\ |
|
324 |
\end{plstx} |
|
325 |
||
326 |
\noindent |
|
327 |
To get rid of the left-recursion, we are required to introduce |
|
328 |
a new non-terminal, say $\meta{B'}$ and transform the rule |
|
329 |
as follows: |
|
330 |
||
331 |
\begin{plstx}[margin=1cm] |
|
332 |
: \meta{B} ::= \beta \cdot \meta{B'}\\ |
|
333 |
: \meta{B'} ::= \alpha \cdot \meta{B'} | \epsilon\\ |
|
334 |
\end{plstx} |
|
335 |
||
336 |
\noindent |
|
337 |
In our example of binary numbers we would after the transformation |
|
338 |
end up with the rules |
|
339 |
||
340 |
\begin{plstx}[margin=1cm] |
|
341 |
: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
342 |
: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ |
|
343 |
\end{plstx} |
|
344 |
||
345 |
\noindent |
|
346 |
A little thought should convince you that this grammar still derives |
|
347 |
all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$ |
|
348 |
can be $\epsilon$). Less clear might be why this grammar is non-left recursive. |
|
349 |
For $\meta{B'}$ it is relatively clear because we will never be |
|
350 |
able to derive things like |
|
351 |
||
352 |
\begin{center} |
|
353 |
$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$ |
|
354 |
\end{center} |
|
355 |
||
356 |
\noindent |
|
357 |
because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and |
|
358 |
$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can |
|
359 |
never be in the first place. The reasoning is similar for $\meta{B}$: |
|
360 |
the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming |
|
361 |
left-recursive. This transformation does not mean the grammar is the |
|
362 |
simplest left-recursive grammar for binary numbers. For example the |
|
363 |
following grammar would do as well |
|
364 |
||
365 |
\begin{plstx}[margin=1cm] |
|
366 |
: \meta{B} ::= 0 \cdot \meta{B} | 1 \cdot \meta{B} | 0 | 1\\ |
|
367 |
\end{plstx} |
|
368 |
||
369 |
\noindent |
|
370 |
The point is that we can in principle transform every left-recursive |
|
371 |
grammar into one that is non-left-recursive one. This explains why often |
|
372 |
the following grammar is used for arithmetic expressions: |
|
373 |
||
374 |
\begin{plstx}[margin=1cm] |
|
375 |
: \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E}\\ |
|
376 |
: \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\ |
|
377 |
: \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\ |
|
378 |
\end{plstx} |
|
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
379 |
|
680 | 380 |
\noindent |
381 |
In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and $\meta{F}$actors |
|
382 |
are in some way protected from being left-recusive. For example if you |
|
383 |
start $\meta{E}$ you can derive another one by going through $\meta{T}$, then |
|
384 |
$\meta{F}$, but then $\meta{E}$ is protected by the open-parenthesis. |
|
385 |
||
386 |
\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm} |
|
387 |
||
388 |
I showed above that the non-left-recursive grammar for binary numbers is |
|
389 |
||
390 |
\begin{plstx}[margin=1cm] |
|
391 |
: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
392 |
: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \epsilon\\ |
|
393 |
\end{plstx} |
|
394 |
||
395 |
\noindent |
|
396 |
The transformation made the original grammar non-left-recursive, but at |
|
397 |
the expense of introducing an $\epsilon$ in the second rule. Having an |
|
398 |
explicit $\epsilon$-rule is annoying to, not in terms of looping, but in |
|
399 |
terms of efficiency. The reason is that the $\epsilon$-rule always |
|
400 |
applies but since it recognises the empty string, it does not make any |
|
401 |
progress with recognising a string. Better are rules like $( \cdot |
|
402 |
\meta{E} \cdot )$ where something of the input is consumed. Getting |
|
403 |
rid of $\epsilon$-rules is also important for the CYK parsing algorithm, |
|
404 |
which can give us an insight into the complexity class of parsing. |
|
405 |
||
406 |
It turns out we can also by some generic transformations eliminate |
|
407 |
$\epsilon$-rules from grammars. Consider again the grammar above for |
|
408 |
binary numbers where have a rule $\meta{B'} ::= \epsilon$. In this case |
|
409 |
we look for rules of the (generic) form \mbox{$\meta{A} := |
|
410 |
\alpha\cdot\meta{B'}\cdot\beta$}. That is there are rules that use |
|
411 |
$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and |
|
412 |
something follows ($\beta$). Such rules need to be replaced by |
|
413 |
additional rules of the form \mbox{$\meta{A} := \alpha\cdot\beta$}. |
|
414 |
In our running example there are the two rules for $\meta{B}$ which |
|
415 |
fall into this category |
|
416 |
||
417 |
\begin{plstx}[margin=1cm] |
|
418 |
: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'}\\ |
|
419 |
\end{plstx} |
|
420 |
||
421 |
\noindent To follow the general scheme of the transfromation, |
|
422 |
the $\alpha$ is either is either $0$ or $1$, and the $\beta$ happens |
|
798 | 423 |
to be empty. So we need to generate new rules for the form |
680 | 424 |
\mbox{$\meta{A} := \alpha\cdot\beta$}, which in our particular |
425 |
example means we obtain |
|
426 |
||
427 |
\begin{plstx}[margin=1cm] |
|
428 |
: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ |
|
429 |
\end{plstx} |
|
430 |
||
431 |
\noindent |
|
432 |
Unfortunately $\meta{B'}$ is also used in the rule |
|
433 |
||
434 |
\begin{plstx}[margin=1cm] |
|
435 |
: \meta{B'} ::= \meta{B} \cdot \meta{B'}\\ |
|
436 |
\end{plstx} |
|
437 |
||
438 |
\noindent |
|
439 |
For this we repeat the transformation, giving |
|
440 |
||
441 |
\begin{plstx}[margin=1cm] |
|
442 |
: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ |
|
443 |
\end{plstx} |
|
444 |
||
445 |
\noindent |
|
446 |
In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$ |
|
447 |
was again empty. Once no rule is left over, we can simply throw |
|
448 |
away the $\epsilon$ rule. This gives the grammar |
|
449 |
||
450 |
\begin{plstx}[margin=1cm] |
|
451 |
: \meta{B} ::= 0 \cdot \meta{B'} | 1 \cdot \meta{B'} | 0 | 1\\ |
|
452 |
: \meta{B'} ::= \meta{B} \cdot \meta{B'} | \meta{B}\\ |
|
453 |
\end{plstx} |
|
454 |
||
455 |
\noindent |
|
456 |
I let you think about whether this grammar can still recognise all |
|
457 |
binary numbers and whether this grammar is non-left-recursive. The |
|
458 |
precise statement for the transformation of removing $\epsilon$-rules is |
|
459 |
that if the original grammar was able to recognise only non-empty |
|
460 |
strings, then the transformed grammar will be equivalent (matching the |
|
461 |
same set of strings); if the original grammar was able to match the |
|
462 |
empty string, then the transformed grammar will be able to match the |
|
463 |
same strings, \emph{except} the empty string. So the $\epsilon$-removal |
|
464 |
does not preserve equivalence of grammars, but the small defect with the |
|
465 |
empty string is not important for practical purposes. |
|
466 |
||
467 |
So why are these transformations all useful? Well apart from making the |
|
468 |
parser combinators work (remember they cannot deal with left-recursion and |
|
469 |
are inefficient with $\epsilon$-rules), a second reason is that they help |
|
470 |
with getting any insight into the complexity of the parsing problem. |
|
471 |
The parser combinators are very easy to implement, but are far from the |
|
472 |
most efficient way of processing input (they can blow up exponentially |
|
473 |
with ambiguous grammars). The question remains what is the best possible |
|
474 |
complexity for parsing? It turns out that this is $O(n^3)$ for context-free |
|
475 |
languages. |
|
476 |
||
477 |
To answer the question about complexity, let me describe next the CYK |
|
478 |
algorithm (named after the authors Cocke–Younger–Kasami). This algorithm |
|
681 | 479 |
works with grammars that are in \emph{Chomsky normalform}. In Chomsky |
480 |
normalform all rules must be of the form $\meta{A} ::= a$, where $a$ is |
|
481 |
a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and |
|
482 |
$\meta{B}$ need to be non-terminals. And no rule can contain $\epsilon$. |
|
483 |
The following grammar is in Chomsky normalform: |
|
484 |
||
485 |
\begin{plstx}[margin=1cm] |
|
682 | 486 |
: \meta{S} ::= \meta{N}\cdot \meta{P}\\ |
487 |
: \meta{P} ::= \meta{V}\cdot \meta{N}\\ |
|
488 |
: \meta{N} ::= \meta{N}\cdot \meta{N}\\ |
|
489 |
: \meta{N} ::= \meta{A}\cdot \meta{N}\\ |
|
490 |
: \meta{N} ::= \texttt{student} | \texttt{trainer} | \texttt{team} |
|
491 |
| \texttt{trains}\\ |
|
492 |
: \meta{V} ::= \texttt{trains} | \texttt{team}\\ |
|
493 |
: \meta{A} ::= \texttt{The} | \texttt{the}\\ |
|
681 | 494 |
\end{plstx} |
495 |
||
496 |
\noindent |
|
497 |
where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$, |
|
498 |
$\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words'' |
|
499 |
are terminals. The rough idea behind this grammar is that $\meta{S}$ |
|
500 |
stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun |
|
501 |
and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot |
|
502 |
\meta{N}$} states that a predicate can be a verb followed by a noun. |
|
503 |
Now the question is whether the string |
|
504 |
||
505 |
\begin{center} |
|
506 |
\texttt{The trainer trains the student team} |
|
507 |
\end{center} |
|
508 |
||
509 |
\noindent |
|
510 |
is recognised by the grammar. The CYK algorithm starts with the |
|
511 |
following triangular data structure. |
|
680 | 512 |
|
798 | 513 |
%%\begin{figure}[t] |
682 | 514 |
\begin{center} |
798 | 515 |
\begin{tikzpicture}[scale=0.7,line width=0.8mm] |
682 | 516 |
\draw (-2,0) -- (4,0); |
517 |
\draw (-2,1) -- (4,1); |
|
518 |
\draw (-2,2) -- (3,2); |
|
519 |
\draw (-2,3) -- (2,3); |
|
520 |
\draw (-2,4) -- (1,4); |
|
521 |
\draw (-2,5) -- (0,5); |
|
522 |
\draw (-2,6) -- (-1,6); |
|
523 |
||
524 |
\draw (0,0) -- (0, 5); |
|
525 |
\draw (1,0) -- (1, 4); |
|
526 |
\draw (2,0) -- (2, 3); |
|
527 |
\draw (3,0) -- (3, 2); |
|
528 |
\draw (4,0) -- (4, 1); |
|
529 |
\draw (-1,0) -- (-1, 6); |
|
530 |
\draw (-2,0) -- (-2, 6); |
|
531 |
||
532 |
\draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; |
|
533 |
\draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; |
|
534 |
\draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; |
|
535 |
\draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; |
|
536 |
\draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; |
|
537 |
\draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}}; |
|
538 |
||
539 |
\draw (-1.5,0.5) node {$A$}; |
|
540 |
\draw (-0.5,0.5) node {$N$}; |
|
541 |
\draw ( 0.5,0.5) node {$N,V$}; |
|
542 |
\draw ( 1.5,0.5) node {$A$}; |
|
543 |
\draw ( 2.5,0.5) node {$N$}; |
|
544 |
\draw ( 3.5,0.5) node {$N,V$}; |
|
545 |
||
798 | 546 |
% \draw (-1.5,1.5) node {\small{}a}; |
547 |
% \draw (-0.5,1.5) node {\small{}b}; |
|
548 |
% \draw ( 0.5,1.5) node {\small{}c}; |
|
549 |
% \draw ( 1.5,1.5) node {\small{}d}; |
|
550 |
% \draw ( 2.5,1.5) node {\small{}e}; |
|
551 |
||
682 | 552 |
\draw (-2.4, 5.5) node {$1$}; |
553 |
\draw (-2.4, 4.5) node {$2$}; |
|
554 |
\draw (-2.4, 3.5) node {$3$}; |
|
555 |
\draw (-2.4, 2.5) node {$4$}; |
|
556 |
\draw (-2.4, 1.5) node {$5$}; |
|
557 |
\draw (-2.4, 0.5) node {$6$}; |
|
558 |
\end{tikzpicture} |
|
559 |
\end{center} |
|
798 | 560 |
%%\end{figure} |
561 |
||
562 |
||
563 |
\noindent |
|
564 |
The last row contains the information about all words and their |
|
565 |
corresponding non-terminals. For example the field for \texttt{trains} |
|
566 |
contains the information $\meta{N}$ and $\meta{V}$ because it can be a |
|
567 |
``verb'' and a ``noun'' according to the grammar. The row above, |
|
568 |
let's call the corresponding fields 5a to 5e, contains information |
|
569 |
about 2-word parts of the sentence, namely |
|
570 |
||
571 |
\begin{center} |
|
572 |
\begin{tabular}{llll} |
|
573 |
5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$ \\ |
|
574 |
5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\ |
|
575 |
5c) & \texttt{trains} $\mid$ \texttt{the} \\ |
|
576 |
5d) & \texttt{the} $\mid$ \texttt{student} \\ |
|
577 |
5e) & \texttt{student} $\mid$ \texttt{team} \\ |
|
578 |
\end{tabular} |
|
579 |
\end{center} |
|
580 |
||
581 |
\noindent |
|
582 |
For each of them, we look up in row 6 which non-terminals it belongs to |
|
583 |
(indicated above for 5a and 5b). For 5a, with the non-terminals |
|
584 |
\meta{A} and \meta{N}, we find the grammar rule |
|
585 |
||
586 |
\begin{plstx}[margin=1cm] |
|
587 |
: \meta{N} ::= \meta{A}\cdot \meta{N}\\ |
|
588 |
\end{plstx} |
|
589 |
||
590 |
\noindent |
|
591 |
which means into field 5a we put the left-hand side of this rule, |
|
592 |
which in this case is the non-terminal \meta{N}. For 5b we have to check |
|
593 |
for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there |
|
594 |
is a right-hand side of this form in the grammar. But only the |
|
595 |
grammar rule |
|
596 |
||
597 |
\begin{plstx}[margin=1cm] |
|
598 |
: \meta{N} ::= \meta{N}\cdot \meta{N}\\ |
|
599 |
\end{plstx} |
|
600 |
||
601 |
\noindent |
|
602 |
matches, which means 5b gets also an \meta{N}. Continuing for all |
|
603 |
fields in row 5 gives: |
|
604 |
||
605 |
\begin{center} |
|
606 |
\begin{tikzpicture}[scale=0.7,line width=0.8mm] |
|
607 |
\draw (-2,0) -- (4,0); |
|
608 |
\draw (-2,1) -- (4,1); |
|
609 |
\draw (-2,2) -- (3,2); |
|
610 |
\draw (-2,3) -- (2,3); |
|
611 |
\draw (-2,4) -- (1,4); |
|
612 |
\draw (-2,5) -- (0,5); |
|
613 |
\draw (-2,6) -- (-1,6); |
|
614 |
||
615 |
\draw (0,0) -- (0, 5); |
|
616 |
\draw (1,0) -- (1, 4); |
|
617 |
\draw (2,0) -- (2, 3); |
|
618 |
\draw (3,0) -- (3, 2); |
|
619 |
\draw (4,0) -- (4, 1); |
|
620 |
\draw (-1,0) -- (-1, 6); |
|
621 |
\draw (-2,0) -- (-2, 6); |
|
622 |
||
623 |
\draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}}; |
|
624 |
\draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}}; |
|
625 |
\draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}}; |
|
626 |
\draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}}; |
|
627 |
\draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}}; |
|
628 |
\draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}}; |
|
629 |
||
630 |
\draw (-1.5,0.5) node {$A$}; |
|
631 |
\draw (-0.5,0.5) node {$N$}; |
|
632 |
\draw ( 0.5,0.5) node {$N,V$}; |
|
633 |
\draw ( 1.5,0.5) node {$A$}; |
|
634 |
\draw ( 2.5,0.5) node {$N$}; |
|
635 |
\draw ( 3.5,0.5) node {$N,V$}; |
|
636 |
||
637 |
\draw (-1.5,1.5) node {$N$}; |
|
638 |
\draw (-0.5,1.5) node {$N$}; |
|
639 |
\draw ( 0.5,1.5) node {$$}; |
|
640 |
\draw ( 1.5,1.5) node {$N$}; |
|
641 |
\draw ( 2.5,1.5) node {$N$}; |
|
642 |
||
643 |
||
644 |
% \draw (-1.5,1.5) node {\small{}a}; |
|
645 |
% \draw (-0.5,1.5) node {\small{}b}; |
|
646 |
% \draw ( 0.5,1.5) node {\small{}c}; |
|
647 |
% \draw ( 1.5,1.5) node {\small{}d}; |
|
648 |
% \draw ( 2.5,1.5) node {\small{}e}; |
|
649 |
||
650 |
\draw (-2.4, 5.5) node {$1$}; |
|
651 |
\draw (-2.4, 4.5) node {$2$}; |
|
652 |
\draw (-2.4, 3.5) node {$3$}; |
|
653 |
\draw (-2.4, 2.5) node {$4$}; |
|
654 |
\draw (-2.4, 1.5) node {$5$}; |
|
655 |
\draw (-2.4, 0.5) node {$6$}; |
|
656 |
\end{tikzpicture} |
|
657 |
\end{center} |
|
658 |
||
659 |
\noindent |
|
660 |
Now row 4 is in charge of all 3-word parts of the sentence, namely |
|
661 |
||
662 |
\begin{center} |
|
663 |
\begin{tabular}{llll} |
|
664 |
4a) & The $\mid$ trainer trains\\ |
|
665 |
& The trainer $\mid$ trains\\ |
|
666 |
4b) & trainer $\mid$ trains the\\ |
|
667 |
& trainer trains $\mid$ the\\ |
|
668 |
4c) & trains $\mid$ the student\\ |
|
669 |
& trains the $\mid$ student\\ |
|
670 |
4d) & the $\mid$ student team\\ |
|
671 |
& the student $\mid$ team\\ |
|
672 |
\end{tabular} |
|
673 |
\end{center} |
|
674 |
||
675 |
\noindent |
|
676 |
Note that in case of 3-word parts we have two splits. For example for |
|
677 |
4a: $\underbrace{\texttt{The}}_{A}$ and |
|
678 |
$\underbrace{\texttt{trainer trains}}_{N}$; and also |
|
679 |
$\underbrace{\texttt{The trainer}}_{N}$ and |
|
680 |
$\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we have |
|
681 |
to look up in the rows below, which non-terminals we already |
|
682 |
computed. This allows us to look for right-hand sides in our grammar: |
|
683 |
$\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and |
|
684 |
$\meta{N}\cdot\meta{V}$, which yield the only left-hand side |
|
685 |
\meta{N}. This is what we fill in for 4a. And so on for row 4. |
|
686 |
||
687 |
Row 3 is about all 4-word parts in the sentence, namely |
|
688 |
||
689 |
\begin{center} |
|
690 |
\begin{tabular}{llll} |
|
691 |
3a) & The trainer trains the\\ |
|
692 |
3b) & trainer trains the student\\ |
|
693 |
3c) & trains the student team\\ |
|
694 |
\end{tabular} |
|
695 |
\end{center} |
|
696 |
||
697 |
\noindent |
|
698 |
Each of them can be split up in 3 ways, for example |
|
699 |
||
700 |
\begin{center} |
|
701 |
\begin{tabular}{llll} |
|
702 |
3a) & The $\mid$ trainer trains the\\ |
|
703 |
& The trainer $\mid$ trains the\\ |
|
704 |
& The trainer trains $\mid$ the\\ |
|
705 |
\end{tabular} |
|
706 |
\end{center} |
|
707 |
||
708 |
\noindent |
|
709 |
and we have to consider them all in turn to fill in the non-terminals for |
|
710 |
3a. You can guess how it continues: row 2 is for all 5-word parts, and finally |
|
711 |
the field on the top is for the whole sentence. |
|
712 |
The idea of the CYK algorithm is that if in the top-field the starting |
|
713 |
symbol \meta{S} appears (possibly together with other non-terminals), |
|
714 |
then the sentence is accepted by the grammar. If it does not, then the |
|
715 |
sentence is not accepted. |
|
716 |
||
717 |
Let us very quickly calculate the complexity of the CYK |
|
718 |
algorithm. Lookup operations inside the triangle and in the grammar |
|
719 |
are assumed to be of constant time, $O(1)$, meaning they do not |
|
720 |
matter. How many fields are in the triangle\ldots |
|
721 |
$\frac{n}{2} * (n+1)$, where $n$ is the size of the input. That means |
|
722 |
roughly $O(n^2)$ fields. How much work do we have to do for each |
|
723 |
field? Well, for the top-most we have to consider $n-1$ splits, which |
|
724 |
means roughly $O(n)$ for each field. The overall result is a $O(n^3)$ |
|
725 |
time-complexity for CYK. It turns out that this is the best worst-time |
|
726 |
complexity for parsing with context-free grammars. |
|
680 | 727 |
|
728 |
\end{document} |
|
729 |
||
730 |
||
731 |
%%% Parser combinators are now part of handout 6 |
|
459 | 732 |
|
733 |
\subsection*{Parser Combinators} |
|
734 |
||
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
735 |
Let us now turn to the problem of generating a parse-tree for |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
736 |
a grammar and string. In what follows we explain \emph{parser |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
737 |
combinators}, because they are easy to implement and closely |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
738 |
resemble grammar rules. Imagine that a grammar describes the |
665 | 739 |
strings of natural numbers, such as the grammar \meta{N} shown |
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
740 |
above. For all such strings we want to generate the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
741 |
parse-trees or later on we actually want to extract the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
742 |
meaning of these strings, that is the concrete integers |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
743 |
``behind'' these strings. In Scala the parser combinators will |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
744 |
be functions of type |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
745 |
|
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
746 |
\begin{center} |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
747 |
\texttt{I $\Rightarrow$ Set[(T, I)]} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
748 |
\end{center} |
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
749 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
750 |
\noindent that is they take as input something of type |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
751 |
\texttt{I}, typically a list of tokens or a string, and return |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
752 |
a set of pairs. The first component of these pairs corresponds |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
753 |
to what the parser combinator was able to process from the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
754 |
input and the second is the unprocessed part of the input. As |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
755 |
we shall see shortly, a parser combinator might return more |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
756 |
than one such pair, with the idea that there are potentially |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
757 |
several ways how to interpret the input. As a concrete |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
758 |
example, consider the case where the input is of type string, |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
759 |
say the string |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
760 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
761 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
762 |
\tt\Grid{iffoo\VS testbar} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
763 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
764 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
765 |
\noindent We might have a parser combinator which tries to |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
766 |
interpret this string as a keyword (\texttt{if}) or an |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
767 |
identifier (\texttt{iffoo}). Then the output will be the set |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
768 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
769 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
770 |
$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
771 |
\left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
772 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
773 |
|
362
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
774 |
\noindent where the first pair means the parser could |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
775 |
recognise \texttt{if} from the input and leaves the rest as |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
776 |
`unprocessed' as the second component of the pair; in the |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
777 |
other case it could recognise \texttt{iffoo} and leaves |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
778 |
\texttt{\VS testbar} as unprocessed. If the parser cannot |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
779 |
recognise anything from the input then parser combinators just |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
780 |
return the empty set $\{\}$. This will indicate |
57ea439feaff
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
360
diff
changeset
|
781 |
something ``went wrong''. |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
782 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
783 |
The main attraction is that we can easily build parser combinators out of smaller components |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
784 |
following very closely the structure of a grammar. In order to implement this in an object |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
785 |
oriented programming language, like Scala, we need to specify an abstract class for parser |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
786 |
combinators. This abstract class requires the implementation of the function |
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
787 |
\texttt{parse} taking an argument of type \texttt{I} and returns a set of type |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
788 |
\mbox{\texttt{Set[(T, I)]}}. |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
789 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
790 |
\begin{center} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
791 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
792 |
abstract class Parser[I, T] { |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
793 |
def parse(ts: I): Set[(T, I)] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
794 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
795 |
def parse_all(ts: I): Set[T] = |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
796 |
for ((head, tail) <- parse(ts); if (tail.isEmpty)) |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
797 |
yield head |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
798 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
799 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
800 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
801 |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
802 |
\noindent |
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
803 |
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all}, |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
804 |
which just filters out all pairs whose second component is not empty (that is has still some |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
805 |
unprocessed part). The reason is that at the end of parsing we are only interested in the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
806 |
results where all the input has been consumed and no unprocessed part is left. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
807 |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
808 |
One of the simplest parser combinators recognises just a character, say $c$, |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
809 |
from the beginning of strings. Its behaviour is as follows: |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
810 |
|
177
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
811 |
\begin{itemize} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
812 |
\item if the head of the input string starts with a $c$, it returns |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
813 |
the set $\{(c, \textit{tail of}\; s)\}$ |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
814 |
\item otherwise it returns the empty set $\varnothing$ |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
815 |
\end{itemize} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
816 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
817 |
\noindent |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
818 |
The input type of this simple parser combinator for characters is |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
819 |
\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
820 |
The code in Scala is as follows: |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
821 |
|
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
822 |
\begin{center} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
823 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
824 |
case class CharParser(c: Char) extends Parser[String, Char] { |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
825 |
def parse(sb: String) = |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
826 |
if (sb.head == c) Set((c, sb.tail)) else Set() |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
827 |
} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
828 |
\end{lstlisting} |
53def1fbf472
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
176
diff
changeset
|
829 |
\end{center} |
176
3c2653fc8b5a
updated
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
175
diff
changeset
|
830 |
|
183
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
831 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
832 |
The \texttt{parse} function tests whether the first character of the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
833 |
input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
834 |
string into the recognised part \texttt{c} and the unprocessed part |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
835 |
\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
836 |
the parser returns the empty set (in Scala \texttt{Set()}). |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
837 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
838 |
More interesting are the parser combinators that build larger parsers |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
839 |
out of smaller component parsers. For example the alternative |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
840 |
parser combinator is as follows. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
841 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
842 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
843 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
844 |
class AltParser[I, T] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
845 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
846 |
q: => Parser[I, T]) extends Parser[I, T] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
847 |
def parse(sb: I) = p.parse(sb) ++ q.parse(sb) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
848 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
849 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
850 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
851 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
852 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
853 |
The types of this parser combinator are polymorphic (we just have \texttt{I} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
854 |
for the input type, and \texttt{T} for the output type). The alternative parser |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
855 |
builds a new parser out of two existing parser combinator \texttt{p} and \texttt{q}. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
856 |
Both need to be able to process input of type \texttt{I} and return the same |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
857 |
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
858 |
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
859 |
evaluation of the arguments before they are used. This is often called |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
860 |
\emph{lazy evaluation} of the arguments.) The alternative parser should run |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
861 |
the input with the first parser \texttt{p} (producing a set of outputs) and then |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
862 |
run the same input with \texttt{q}. The result should be then just the union |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
863 |
of both sets, which is the operation \texttt{++} in Scala. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
864 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
865 |
This parser combinator already allows us to construct a parser that either |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
866 |
a character \texttt{a} or \texttt{b}, as |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
867 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
868 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
869 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
870 |
new AltParser(CharParser('a'), CharParser('b')) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
871 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
872 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
873 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
874 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
875 |
Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
876 |
We can call this parser combinator with the strings |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
877 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
878 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
879 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
880 |
input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
881 |
\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
882 |
\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
883 |
\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
884 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
885 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
886 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
887 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
888 |
We receive in the first two cases a successful output (that is a non-empty set). |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
889 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
890 |
A bit more interesting is the \emph{sequence parser combinator} implemented in |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
891 |
Scala as follows: |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
892 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
893 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
894 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
895 |
class SeqParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
896 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
897 |
q: => Parser[I, S]) extends Parser[I, (T, S)] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
898 |
def parse(sb: I) = |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
899 |
for ((head1, tail1) <- p.parse(sb); |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
900 |
(head2, tail2) <- q.parse(tail1)) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
901 |
yield ((head1, head2), tail2) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
902 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
903 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
904 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
905 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
906 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
907 |
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
908 |
as follows: let first run the parser \texttt{p} on the input producing a set of pairs (\texttt{head1}, \texttt{tail1}). |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
909 |
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
910 |
Let \texttt{q} run on these unprocessed parts |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
911 |
producing again a set of pairs. The output of the sequence parser combinator is then a set |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
912 |
containing pairs where the first components are again pairs, namely what the first parser could parse |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
913 |
together with what the second parser could parse; the second component is the unprocessed |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
914 |
part left over after running the second parser \texttt{q}. Therefore the input type of |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
915 |
the sequence parser combinator is as usual \texttt{I}, but the output type is |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
916 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
917 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
918 |
\texttt{Set[((T, S), I)]} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
919 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
920 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
921 |
Scala allows us to provide some |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
922 |
shorthand notation for the sequence parser combinator. So we can write for |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
923 |
example \texttt{'a' $\sim$ 'b'}, which is the |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
924 |
parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
925 |
Calling this parser combinator with the strings |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
926 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
927 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
928 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
929 |
input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
930 |
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
931 |
\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
932 |
\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
933 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
934 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
935 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
936 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
937 |
A slightly more complicated parser is \texttt{('a' || 'b') $\sim$ 'b'} which parses as first character either |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
938 |
an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
939 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
940 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
941 |
\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
942 |
input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
943 |
\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
944 |
\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
945 |
\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
946 |
\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
947 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
948 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
949 |
Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a tying error. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
950 |
The first parser has as output type a single character (recall the type of \texttt{CharParser}), |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
951 |
but the second parser produces a pair of characters as output. The alternative parser is however |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
952 |
required to have both component parsers to have the same type. We will see later how we can |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
953 |
build this parser without the typing error. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
954 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
955 |
The next parser combinator does not actually combine smaller parsers, but applies |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
956 |
a function to the result of the parser. It is implemented in Scala as follows |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
957 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
958 |
\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
959 |
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
960 |
class FunParser[I, T, S] |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
961 |
(p: => Parser[I, T], |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
962 |
f: T => S) extends Parser[I, S] { |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
963 |
def parse(sb: I) = |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
964 |
for ((head, tail) <- p.parse(sb)) yield (f(head), tail) |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
965 |
} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
966 |
\end{lstlisting} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
967 |
\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
968 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
969 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
970 |
\noindent |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
971 |
This parser combinator takes a parser \texttt{p} with output type \texttt{T} as |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
972 |
input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
973 |
produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
974 |
applies the function \texttt{f} to all the parer outputs. Since this function |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
975 |
is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
976 |
Again Scala lets us introduce some shorthand notation for this parser combinator. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
977 |
Therefore we will write \texttt{p ==> f} for it. |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
978 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
979 |
%\bigskip |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
980 |
%takes advantage of the full generality---have a look |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
981 |
%what it produces if we call it with the string \texttt{abc} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
982 |
% |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
983 |
%\begin{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
984 |
%\begin{tabular}{rcl} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
985 |
%input string & & output\medskip\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
986 |
%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
987 |
%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
988 |
%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$ |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
989 |
%\end{tabular} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
990 |
%\end{center} |
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
991 |
|
b17eff695c7f
added new stuff
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
177
diff
changeset
|
992 |
|
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
993 |
|
680 | 994 |
|
995 |
||
173
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
996 |
|
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
997 |
%%% Local Variables: |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
998 |
%%% mode: latex |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
999 |
%%% TeX-master: t |
7cfb7a6f7c99
added slides
Christian Urban <christian dot urban at kcl dot ac dot uk>
parents:
diff
changeset
|
1000 |
%%% End: |
680 | 1001 |