afl-material: handouts/ho05.tex@0c491eff5b01 (annotated)

665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	1	% !TEX program = xelatex
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	2	\documentclass{article}
297 5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	3	\usepackage{../style}
217 cd6066f1056a updated handouts Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 183 diff changeset	4	\usepackage{../langs}
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	5	\usepackage{../grammar}
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	6	\usepackage{../graphics}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	7
545 76a98ed71a2a updated Christian Urban <urbanc@in.tum.de> parents: 459 diff changeset	8	% epsilon and left-recursion elimination
76a98ed71a2a updated Christian Urban <urbanc@in.tum.de> parents: 459 diff changeset	9	% http://www.mollypages.org/page/grammar/index.mp
76a98ed71a2a updated Christian Urban <urbanc@in.tum.de> parents: 459 diff changeset	10
618 f4818c95a32e updated Christian Urban <urbanc@in.tum.de> parents: 582 diff changeset	11	%% parsing scala files
691 991849dfbcb1 updated Christian Urban <urbanc@in.tum.de> parents: 686 diff changeset	12	%% https://scalameta.org/
991849dfbcb1 updated Christian Urban <urbanc@in.tum.de> parents: 686 diff changeset	13
991849dfbcb1 updated Christian Urban <urbanc@in.tum.de> parents: 686 diff changeset	14	% chomsky normal form transformation
991849dfbcb1 updated Christian Urban <urbanc@in.tum.de> parents: 686 diff changeset	15	% https://youtu.be/FNPSlnj3Vt0
618 f4818c95a32e updated Christian Urban <urbanc@in.tum.de> parents: 582 diff changeset	16
710 183663740fb7 updated Christian Urban <urbanc@in.tum.de> parents: 691 diff changeset	17	% Language hierarchy is about complexity
183663740fb7 updated Christian Urban <urbanc@in.tum.de> parents: 691 diff changeset	18	% How hard is it to recognise an element in a language
183663740fb7 updated Christian Urban <urbanc@in.tum.de> parents: 691 diff changeset	19
722 14914b57e207 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 710 diff changeset	20	% Pratt parsing
14914b57e207 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 710 diff changeset	21	% https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
14914b57e207 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 710 diff changeset	22	% https://www.oilshell.org/blog/2017/03/31.html
14914b57e207 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 710 diff changeset	23
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	24	\begin{document}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	25
385 7f8516ff408d updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 362 diff changeset	26	\section*{Handout 5 (Grammars \& Parser)}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	27
727 eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	28	So far we have focused on regular expressions as well as matching and
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	29	lexing algorithms. While regular expressions are very useful for lexing
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	30	and for recognising many patterns in strings (like email addresses),
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	31	they have their limitations. For example there is no regular expression
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	32	that can recognise the language $a^nb^n$ (where you have strings
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	33	starting with $n$ $a$'s followed by the same number of $b$'s). Another
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	34	example for which there exists no regular expression is the language of
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	35	well-parenthesised expressions. In languages like Lisp, which use
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	36	parentheses rather extensively, it might be of interest to know whether
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	37	the following two expressions are well-parenthesised or not (the left
eb9343126625 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 722 diff changeset	38	one is, the right one is not):
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	39
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	40	\begin{center}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	41	$(((()()))())$ \hspace{10mm} $(((()()))()))$
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	42	\end{center}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	43
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	44	\noindent Not being able to solve such recognition problems is
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	45	a serious limitation. To overcome it, we need techniques
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	46	that are more powerful than regular
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	47	expressions. We will in particular look at \emph{context-free
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	48	languages}. They include the regular languages as the picture
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	49	below about language classes shows:
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	50
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	51
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	52	\begin{center}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	53	\begin{tikzpicture}
297 5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	54	[rect/.style={draw=black!50,
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	55	top color=white,bottom color=black!20,
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	56	rectangle, very thick, rounded corners}]
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	57
297 5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	58	\draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	59	\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	60	\draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	61	\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
5c51839c88fd updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 292 diff changeset	62	\draw (0,-1.05) node [rect] {\small regular languages};
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	63	\end{tikzpicture}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	64	\end{center}
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	65
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	66	\noindent Each ``bubble'' stands for a set of languages (remember
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	67	languages are sets of strings). As indicated the set of regular
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	68	languages is fully included inside the context-free languages,
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	69	meaning every regular language is also context-free, but not vice
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	70	versa. As an exercise, think about what a context-free
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	71	grammar looks like for the language corresponding to the regular expression
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	72	$(aaa)^*a$.
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	73
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	74	Because of their convenience, context-free languages play an important
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	75	role in `day-to-day' text processing and in programming
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	76	languages. Context-free in this setting means that ``words'' have one
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	77	meaning only and this meaning is independent of the context
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	78	the ``words'' appear in. For example ambiguity issues like
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	79
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	80	\begin{center}
682 553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	81	\tt Time flies like an arrow. Fruit flies like bananas.
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	82	\end{center}
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	83
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	84	\noindent
941 66adcae6c762 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 937 diff changeset	85	from natural languages where the meaning of \emph{flies} depends on the
686 05cfce0fdef7 updated Christian Urban <urbanc@in.tum.de> parents: 682 diff changeset	86	surrounding \emph{context} are avoided as much as possible. Here is
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	87	an interesting video about C++ not being a context-free language:
686 05cfce0fdef7 updated Christian Urban <urbanc@in.tum.de> parents: 682 diff changeset	88
05cfce0fdef7 updated Christian Urban <urbanc@in.tum.de> parents: 682 diff changeset	89	\begin{center}
05cfce0fdef7 updated Christian Urban <urbanc@in.tum.de> parents: 682 diff changeset	90	\url{https://www.youtube.com/watch?v=OzK8pUu4UfM}
05cfce0fdef7 updated Christian Urban <urbanc@in.tum.de> parents: 682 diff changeset	91	\end{center}
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	92
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	93	Context-free languages are usually specified by grammars. For example
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	94	a grammar for well-parenthesised expressions can be given as follows:
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	95
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	96	\begin{plstx}[margin=3cm]
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	97	: \meta{P} ::= ( \cdot \meta{P} \cdot ) \cdot \meta{P}
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	98	\| \epsilon\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	99	\end{plstx}
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	100
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	101	\noindent
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	102	or a grammar for recognising non-empty strings consisting only of ones is
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	103
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	104	\begin{plstx}[margin=3cm]
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	105	: \meta{O} ::= 1 \cdot \meta{O}
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	106	\| 1\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	107	\end{plstx}
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	108
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	109	In general grammars consist of finitely many rules built up
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	110	from \emph{terminal symbols} (usually lower-case letters) and
582 d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	111	\emph{non-terminal symbols} (upper-case letters written in
d236e75e1d55 updated Christian Urban <urbanc@in.tum.de> parents: 545 diff changeset	112	bold like \meta{A}, \meta{N} and so on). Rules have
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	113	the shape
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	114
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	115	\begin{plstx}[margin=3cm]
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	116	: \meta{NT} ::= rhs\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	117	\end{plstx}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	118
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	119	\noindent where on the left-hand side is a single non-terminal
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	120	and on the right a string consisting of both terminals and
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	121	non-terminals including the $\epsilon$-symbol for indicating
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	122	the empty string. We use the convention to separate components
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	123	on the right-hand side by using the $\cdot$ symbol, as in the
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	124	grammar for well-parenthesised expressions. We also use the
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	125	convention to use $\|$ as a shorthand notation for several
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	126	rules. For example
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	127
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	128	\begin{plstx}[margin=3cm]
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	129	: \meta{NT} ::= rhs_1
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	130	\| rhs_2\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	131	\end{plstx}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	132
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	133	\noindent means that the non-terminal \meta{NT} can be replaced by
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	134	either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there is more
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	135	than one non-terminal on the left-hand side of the rules, then
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	136	we need to indicate which one is the \emph{starting} symbol of the
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	137	grammar. For example the grammar for arithmetic expressions
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	138	can be given as follows
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	139
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	140	\begin{plstx}[margin=3cm,one per line]
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	141	\mbox{\rm (1)}: \meta{E} ::= \meta{N}\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	142	\mbox{\rm (2)}: \meta{E} ::= \meta{E} \cdot + \cdot \meta{E}\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	143	\mbox{\rm (3)}: \meta{E} ::= \meta{E} \cdot - \cdot \meta{E}\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	144	\mbox{\rm (4)}: \meta{E} ::= \meta{E} \cdot * \cdot \meta{E}\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	145	\mbox{\rm (5)}: \meta{E} ::= ( \cdot \meta{E} \cdot )\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	146	\mbox{\rm (6\ldots)}: \meta{N} ::= \meta{N} \cdot \meta{N}
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	147	\mid 0 \mid 1 \mid \ldots \mid 9\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	148	\end{plstx}
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	149
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	150	\noindent where \meta{E} is the starting symbol. A
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	151	\emph{derivation} for a grammar starts with the starting
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	152	symbol of the grammar and in each step replaces one
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	153	non-terminal by a right-hand side of a rule. A derivation ends
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	154	with a string in which only terminal symbols are left. For
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	155	example a derivation for the string $(1 + 2) + 3$ is as
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	156	follows:
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	157
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	158	\begin{center}
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	159	\begin{tabular}{lll@{\hspace{2cm}}l}
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	160	\meta{E} & $\rightarrow$ & $\meta{E}+\meta{E}$ & by (2)\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	161	& $\rightarrow$ & $(\meta{E})+\meta{E}$ & by (5)\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	162	& $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{E}$ & by (2)\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	163	& $\rightarrow$ & $(\meta{E}+\meta{E})+\meta{N}$ & by (1)\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	164	& $\rightarrow$ & $(\meta{E}+\meta{E})+3$ & by (6\dots)\\
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	165	& $\rightarrow$ & $(\meta{N}+\meta{E})+3$ & by (1)\\
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	166	& $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	167	\end{tabular}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	168	\end{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	169
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	170	\noindent where on the right it is indicated which
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	171	grammar rule has been applied. In the last step we
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	172	merged several steps into one.
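
As a further illustration, the grammar for well-parenthesised expressions
given earlier derives the string $(())()$ as follows. The labels (P1) for
the rule $\meta{P} \rightarrow ( \cdot \meta{P} \cdot ) \cdot \meta{P}$ and
(P2) for the rule $\meta{P} \rightarrow \epsilon$ are not part of that
grammar; they are only introduced here for the purpose of this example:

\begin{center}
\begin{tabular}{lll@{\hspace{2cm}}l}
\meta{P} & $\rightarrow$ & $(\meta{P})\meta{P}$ & by (P1)\\
         & $\rightarrow$ & $((\meta{P})\meta{P})\meta{P}$ & by (P1)\\
         & $\rightarrow^+$ & $(())\meta{P}$ & by (P2), twice\\
         & $\rightarrow$ & $(())(\meta{P})\meta{P}$ & by (P1)\\
         & $\rightarrow^+$ & $(())()$ & by (P2), twice\\
\end{tabular}
\end{center}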
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	173
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	174	The \emph{language} of a context-free grammar $G$
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	175	with start symbol $S$ is defined as the set of strings
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	176	derivable by a derivation, that is
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	177
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	178	\begin{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	179	$\{c_1\ldots c_n \;\|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	180	\end{center}
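\noindent For example, for the grammar for strings of ones given
earlier, with starting symbol \meta{O}, this set works out to be

\begin{center}
$\{1, 11, 111, \ldots\} = \{1^n \mid n \geq 1\}$
\end{center}

\noindent since every derivation applies the rule $\meta{O} \rightarrow
1 \cdot \meta{O}$ some number of times (possibly not at all) and then
finishes with the rule $\meta{O} \rightarrow 1$.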
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	181
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	182	\noindent
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	183	A \emph{parse-tree} encodes how a string is derived with the starting
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	184	symbol on top and each non-terminal containing a subtree for how it is
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	185	replaced in a derivation. The parse tree for the string $(1 + 23)+4$ is
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	186	as follows:
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	187
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	188	\begin{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	189	\begin{tikzpicture}[level distance=8mm, black]
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	190	\node {\meta{E}}
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	191	child {node {\meta{E} }
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	192	child {node {$($}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	193	child {node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	194	child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	195	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	196	child {node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	197	child {node {\meta{N} } child {node {$2$}}}
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	198	child {node {\meta{N} } child {node {$3$}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	199	}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	200	}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	201	child {node {$)$}}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	202	}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	203	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	204	child {node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	205	child {node {\meta{N} } child {node {$4$}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	206	};
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	207	\end{tikzpicture}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	208	\end{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	209
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	210	\noindent We are often interested in these parse-trees since
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	211	they encode the structure of how a string is derived by a
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	212	grammar.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	213
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	214	Before we come to the problem of constructing such parse-trees, we need
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	215	to consider the following two properties of grammars. A grammar is
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	216	\emph{left-recursive} if there is a derivation starting from a
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	217	non-terminal, say \meta{NT}, which leads to a string which again starts
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	218	with \meta{NT}. This means a derivation of the form
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	219
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	220	\begin{center}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	221	$\meta{NT} \rightarrow \ldots \rightarrow \meta{NT} \cdot \ldots$
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	222	\end{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	223
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	224	\noindent The grammar above for arithmetic expressions is
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	225	left-recursive, as can be seen, for example, from the rules $\meta{E}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	226	\rightarrow \meta{E}\cdot + \cdot \meta{E}$ and $\meta{N} \rightarrow
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	227	\meta{N}\cdot \meta{N}$. But
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	228	note that left-recursiveness can involve more than one step in the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	229	derivation. The problem with left-recursive grammars is that some
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	230	parsing algorithms cannot cope with them: given such a grammar they will
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	231	fall into an infinite loop. Fortunately every left-recursive grammar can be
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	232	transformed into one that is not left-recursive, although this
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	233	transformation might make the grammar less ``human-readable''. For
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	234	example if we want to give a non-left-recursive grammar for numbers we
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	235	might specify
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	236
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	237	\begin{center}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	238	$\meta{N} \;\;\rightarrow\;\; 0\;\|\;\ldots\;\|\;9\;\|\;
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	239	1\cdot \meta{N}\;\|\;2\cdot \meta{N}\;\|\;\ldots\;\|\;9\cdot \meta{N}$
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	240	\end{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	241
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	242	\noindent Using this grammar we can still derive every number
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	243	string, but we will never be able to derive a string of the
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	244	form $\meta{N} \to \ldots \to \meta{N} \cdot \ldots$.
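
To illustrate the looping problem, here is a minimal sketch in Python. It is not the parser combinators used in this module: the names \texttt{left\_rec}, \texttt{right\_rec} and \texttt{loops} are made up for this illustration, and it assumes a naive top-down recogniser that tries the alternatives of a rule in the order they are written.

```python
DIGITS = "0123456789"

def left_rec(s, i=0):
    """N ::= N . N | 0 | ... | 9, tried top-down in this order.
    Returns the set of positions where a number starting at i can end."""
    ends = set()
    # First alternative N . N: calls itself at the SAME position i
    # before consuming any input, so the recursion never bottoms out.
    for j in left_rec(s, i):
        ends |= left_rec(s, j)
    if i < len(s) and s[i] in DIGITS:
        ends.add(i + 1)
    return ends

def right_rec(s, i=0):
    """N ::= 0 | ... | 9 | 1 . N | ... | 9 . N"""
    ends = set()
    if i < len(s) and s[i] in DIGITS:
        ends.add(i + 1)                      # single digit
        if s[i] != "0":
            ends |= right_rec(s, i + 1)      # digit consumed, then recurse
    return ends

def loops(recogniser, s):
    """True if the recogniser exhausts Python's recursion limit on s."""
    try:
        recogniser(s)
    except RecursionError:
        return True
    return False
```

In this sketch a string \texttt{s} counts as a number when \texttt{len(s)} is in \texttt{right\_rec(s)}. The right-recursive version terminates because every recursive call happens only after one digit has been consumed.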
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	245
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	246	The other property we have to watch out for is when a grammar
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	247	is \emph{ambiguous}. A grammar is said to be ambiguous if
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	248	some string has more than one parse-tree. Again the grammar
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	249	for arithmetic expressions shown above is ambiguous. While the
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	250	shown parse tree for the string $(1 + 23) + 4$ is unique, this
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	251	is not the case in general. For example there are two parse
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	252	trees for the string $1 + 2 + 3$, namely
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	253
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	254	\begin{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	255	\begin{tabular}{c@{\hspace{10mm}}c}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	256	\begin{tikzpicture}[level distance=8mm, black]
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	257	\node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	258	child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	259	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	260	child {node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	261	child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	262	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	263	child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	264	}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	265	;
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	266	\end{tikzpicture}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	267	&
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	268	\begin{tikzpicture}[level distance=8mm, black]
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	269	\node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	270	child {node {\meta{E} }
6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	271	child {node {\meta{E} } child {node {\meta{N} } child {node {$1$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	272	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	273	child {node {\meta{E} } child {node {\meta{N} } child {node {$2$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	274	}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	275	child {node {$+$}}
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	276	child {node {\meta{E} } child {node {\meta{N} } child {node {$3$}}}}
175 5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	277	;
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	278	\end{tikzpicture}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	279	\end{tabular}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	280	\end{center}
5801e8c0e528 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 173 diff changeset	281
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	282	\noindent In particular in programming languages we will try to avoid
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	283	ambiguous grammars because two different parse-trees for a string mean a
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	284	program can be interpreted in two different ways. In such cases we have
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	285	to somehow make sure the two different ways do not matter, or
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	286	disambiguate the grammar in some other way (for example making the $+$
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	287	left-associative). Unfortunately already the problem of deciding whether
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	288	a grammar is ambiguous or not is in general undecidable. But in simple
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	289	instances (the ones we deal with in this module) one can usually see when
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	290	a grammar is ambiguous.
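
For small examples the number of parse-trees can also be counted mechanically. The following Python sketch does this for the fragment $\meta{E} ::= \meta{E}\cdot + \cdot \meta{E}$ and $\meta{E} ::= \meta{N}$ only. As simplifying assumptions, each number is treated as a single token, and parentheses and the other operators are left out. The name \texttt{count\_trees} is made up for this illustration.

```python
from functools import lru_cache

def count_trees(tokens):
    """Count the parse-trees deriving `tokens` from E, where
    E ::= E + E | N and every number is one token."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(i, j):
        # number of parse-trees deriving tokens[i:j] from E
        total = 0
        if j - i == 1 and tokens[i].isdigit():
            total += 1                        # E ::= N
        for k in range(i + 1, j - 1):
            if tokens[k] == "+":              # E ::= E + E, split at this +
                total += trees(i, k) * trees(k + 1, j)
        return total

    return trees(0, len(tokens))
```

For the token list \texttt{["1", "+", "2", "+", "3"]} the function returns 2, matching the two trees shown above. With four operands it already returns 5.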
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	291
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	292	\subsection*{Removing Left-Recursion}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	293
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	294	Let us come back to the problem of left-recursion and consider the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	295	following grammar for binary numbers:
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	296
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	297	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	298	: \meta{B} ::= \meta{B} \cdot \meta{B} \| 0 \| 1\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	299	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	300
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	301	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	302	It is clear that this grammar can create all binary numbers, but
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	303	it is also clear that this grammar is left-recursive. Giving this
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	304	grammar as is to parser combinators will result in an infinite
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	305	loop. Fortunately, every left-recursive grammar can be translated
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	306	into one that is not left-recursive with the help of some
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	307	transformation rules. Suppose we identified the ``offensive''
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	308	rule, then we can separate the grammar into this offensive rule
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	309	and the ``rest'':
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	310
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	311	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	312	: \meta{B} ::= \underbrace{\meta{B} \cdot \meta{B}}_{\textit{lft-rec}}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	313	\| \underbrace{0 \;\;\|\;\; 1}_{\textit{rest}}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	314	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	315
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	316	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	317	To make the idea of the transformation clearer, suppose the left-recursive
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	318	rule is of the form $\meta{B}\alpha$ (the left-recursive non-terminal
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	319	followed by something called $\alpha$) and the ``rest'' is called $\beta$.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	320	That means our grammar looks schematically as follows
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	321
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	322	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	323	: \meta{B} ::= \meta{B} \cdot \alpha \| \beta\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	324	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	325
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	326	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	327	To get rid of the left-recursion, we are required to introduce
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	328	a new non-terminal, say $\meta{B'}$ and transform the rule
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	329	as follows:
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	330
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	331	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	332	: \meta{B} ::= \beta \cdot \meta{B'}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	333	: \meta{B'} ::= \alpha \cdot \meta{B'} \| \epsilon\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	334	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	335
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	336	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	337	In our example of binary numbers we would after the transformation
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	338	end up with the rules
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	339
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	340	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	341	: \meta{B} ::= 0 \cdot \meta{B'} \| 1 \cdot \meta{B'}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	342	: \meta{B'} ::= \meta{B} \cdot \meta{B'} \| \epsilon\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	343	\end{plstx}
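
\noindent
As a small illustration (a derivation spelled out here, using only the two rules above), the string $10$ can be derived by

\begin{center}
$\meta{B} \rightarrow 1\cdot\meta{B'} \rightarrow 1\cdot\meta{B}\cdot\meta{B'}
\rightarrow 1\cdot 0\cdot\meta{B'}\cdot\meta{B'} \rightarrow 1\cdot 0\cdot\meta{B'}
\rightarrow 1\cdot 0$
\end{center}

\noindent
where the last two steps each replace one $\meta{B'}$ by $\epsilon$.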
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	344
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	345	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	346	A little thought should convince you that this grammar still derives
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	347	all the binary numbers (for example 0 and 1 are derivable because $\meta{B'}$
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	348	can be $\epsilon$). Less clear might be why this grammar is non-left-recursive.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	349	For $\meta{B'}$ it is relatively clear because we will never be
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	350	able to derive things like
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	351
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	352	\begin{center}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	353	$\meta{B'} \rightarrow\ldots\rightarrow \meta{B'}\cdot\ldots$
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	354	\end{center}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	355
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	356	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	357	because there will always be a $\meta{B}$ in front of a $\meta{B'}$, and
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	358	$\meta{B}$ now has always a $0$ or $1$ in front, so a $\meta{B'}$ can
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	359	never be in the first place. The reasoning is similar for $\meta{B}$:
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	360	the $0$ and $1$ in the rule for $\meta{B}$ ``protect'' it from becoming
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	361	left-recursive. This transformation does not mean the grammar is the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	362	simplest non-left-recursive grammar for binary numbers. For example the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	363	following grammar would do as well
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	364
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	365	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	366	: \meta{B} ::= 0 \cdot \meta{B} \| 1 \cdot \meta{B} \| 0 \| 1\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	367	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	368
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	369	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	370	The point is that we can in principle transform every left-recursive
941 66adcae6c762 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 937 diff changeset	371	grammar into one that is non-left-recursive. This explains why often
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	372	the following grammar is used for arithmetic expressions:
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	373
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	374	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	375	: \meta{E} ::= \meta{T} \| \meta{T} \cdot + \cdot \meta{E} \| \meta{T} \cdot - \cdot \meta{E}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	376	: \meta{T} ::= \meta{F} \| \meta{F} \cdot * \cdot \meta{T}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	377	: \meta{F} ::= num\_token \| ( \cdot \meta{E} \cdot )\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	378	\end{plstx}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	379
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	380	\noindent
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	381	In this grammar all $\meta{E}$xpressions, $\meta{T}$erms and
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	382	$\meta{F}$actors are in some way protected from being
941 66adcae6c762 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 937 diff changeset	383	left-recursive. For example if you start with $\meta{E}$ you can derive
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	384	another one by going through $\meta{T}$, then $\meta{F}$, but then
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	385	$\meta{E}$ is protected by the open-parenthesis in the last rule.
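
The schematic transformation from this section, which turns a rule with the alternatives $\meta{B}\cdot\alpha$ and $\beta$ into rules for $\meta{B}$ and $\meta{B'}$, can be sketched in Python as follows. This is only a sketch under some assumptions: alternatives are represented as tuples of symbols, the empty tuple stands for $\epsilon$, only immediate left-recursion of a single non-terminal is handled (not left-recursion that takes several derivation steps), and every $\alpha$ is assumed to be non-empty. The name \texttt{remove\_left\_recursion} is made up for this illustration.

```python
def remove_left_recursion(nt, alternatives):
    """Apply   B ::= B.alpha | beta
       ==>     B ::= beta.B'    B' ::= alpha.B' | epsilon
    to the alternatives of the single non-terminal `nt`."""
    new_nt = nt + "'"
    # alternatives starting with nt itself give the alphas ...
    alphas = [alt[1:] for alt in alternatives if alt[:1] == (nt,)]
    # ... and all remaining alternatives are the "rest" (the betas)
    betas = [alt for alt in alternatives if alt[:1] != (nt,)]
    if not alphas:                     # nothing to do: not left-recursive
        return {nt: list(alternatives)}
    return {
        nt: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],
    }
```

Calling it with \texttt{"B"} and the alternatives \texttt{[("B", "B"), ("0",), ("1",)]} gives the two rules for $\meta{B}$ and $\meta{B'}$ shown earlier for binary numbers.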
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	386
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	387	\subsection*{Removing $\epsilon$-Rules and CYK-Algorithm}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	388
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	389	I showed above that the non-left-recursive grammar for binary numbers is
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	390
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	391	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	392	: \meta{B} ::= 0 \cdot \meta{B'} \| 1 \cdot \meta{B'}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	393	: \meta{B'} ::= \meta{B} \cdot \meta{B'} \| \epsilon\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	394	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	395
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	396	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	397	The transformation made the original grammar non-left-recursive, but at
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	398	the expense of introducing an $\epsilon$ in the second rule. Having an
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	399	explicit $\epsilon$-rule is annoying, not in terms of looping, but in
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	400	terms of efficiency. The reason is that the $\epsilon$-rule always
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	401	applies but since it recognises the empty string, it does not make any
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	402	progress with recognising a string. Better are rules like $( \cdot
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	403	\meta{E} \cdot )$ where something of the input is consumed. Getting
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	404	rid of $\epsilon$-rules is also important for the CYK parsing algorithm,
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	405	which can give us an insight into the complexity class of parsing.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	406
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	407	It turns out we can also eliminate $\epsilon$-rules from grammars by
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	408	some generic transformations. Consider again the grammar above for
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	409	binary numbers where we have a rule $\meta{B'} ::= \epsilon$. In this case
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	410	we look for rules of the (generic) form \mbox{$\meta{A} ::=
941 66adcae6c762 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 937 diff changeset	411	\alpha\cdot\meta{B'}\cdot\beta$}. That is, there are rules that use
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	412	$\meta{B'}$ and something ($\alpha$) is in front of $\meta{B'}$ and
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	413	something follows ($\beta$). For each such rule we need to add an
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	414	additional rule of the form \mbox{$\meta{A} ::= \alpha\cdot\beta$}.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	415	In our running example there are the two rules for $\meta{B}$ which
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	416	fall into this category
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	417
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	418	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	419	: \meta{B} ::= 0 \cdot \meta{B'} \| 1 \cdot \meta{B'}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	420	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	421
941 66adcae6c762 updated Christian Urban <christian.urban@kcl.ac.uk> parents: 937 diff changeset	422	\noindent To follow the general scheme of the transformation,
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	423	the $\alpha$ is either $0$ or $1$, and the $\beta$ happens
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	424	to be empty. So we need to generate new rules of the form
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	425	\mbox{$\meta{A} ::= \alpha\cdot\beta$}, which in our particular
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	426	example means we obtain
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	427
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	428	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	429	: \meta{B} ::= 0 \cdot \meta{B'} \| 1 \cdot \meta{B'} \| 0 \| 1\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	430	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	431
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	432	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	433	Unfortunately $\meta{B'}$ is also used in the rule
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	434
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	435	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	436	: \meta{B'} ::= \meta{B} \cdot \meta{B'}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	437	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	438
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	439	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	440	For this we repeat the transformation, giving
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	441
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	442	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	443	: \meta{B'} ::= \meta{B} \cdot \meta{B'} \| \meta{B}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	444	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	445
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	446	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	447	In this case $\alpha$ was substituted with $\meta{B}$ and $\beta$
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	448	was again empty. Once no such rule is left over, we can simply throw
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	449	away the $\epsilon$ rule. This gives the grammar
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	450
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	451	\begin{plstx}[margin=1cm]
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	452	: \meta{B} ::= 0 \cdot \meta{B'} \| 1 \cdot \meta{B'} \| 0 \| 1\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	453	: \meta{B'} ::= \meta{B} \cdot \meta{B'} \| \meta{B}\\
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	454	\end{plstx}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	455
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	456	\noindent
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	457	I leave it to you to think about whether this grammar can still recognise all
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	458	binary numbers and whether this grammar is non-left-recursive. The
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	459	precise statement for the transformation of removing $\epsilon$-rules
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	460	is that if the original grammar was able to recognise only non-empty
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	461	strings, then the transformed grammar will be equivalent (matching the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	462	same set of strings); if the original grammar was able to match the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	463	empty string, then the transformed grammar will be able to match the
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	464	same strings, \emph{except} the empty string. So the
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	465	$\epsilon$-removal does not preserve equivalence of grammars in
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	466	general, but the small defect with the empty string is not important
dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	467	for practical purposes.
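
The $\epsilon$-removal steps described above can be sketched in Python as follows. Again this is only a sketch under assumptions: a grammar is a dictionary from non-terminals to lists of alternatives, each alternative is a tuple of symbols, and the empty tuple stands for $\epsilon$. As a generalisation beyond the single rule $\meta{B'} ::= \epsilon$ in the running example, the sketch first collects every non-terminal that can derive the empty string. The name \texttt{remove\_epsilon} is made up for this illustration.

```python
from itertools import product

def remove_epsilon(grammar):
    """Add, for every use of a nullable non-terminal, the variant of the
    rule without it, and then drop all epsilon alternatives."""
    # 1. find all non-terminals that can derive the empty string
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nt, alts in grammar.items():
            if nt not in nullable and any(
                all(sym in nullable for sym in alt) for alt in alts
            ):
                nullable.add(nt)
                changed = True
    # 2. for A ::= alpha.X.beta with X nullable also add A ::= alpha.beta
    result = {}
    for nt, alts in grammar.items():
        new_alts = []
        for alt in alts:
            # a nullable symbol can be kept or left out; others are kept
            choices = [((s,), ()) if s in nullable else ((s,),) for s in alt]
            for pick in product(*choices):
                variant = tuple(sym for part in pick for sym in part)
                if variant and variant not in new_alts:   # skip epsilon
                    new_alts.append(variant)
        result[nt] = new_alts
    return result
```

Applied to the non-left-recursive grammar for binary numbers, the sketch produces the same alternatives as the grammar obtained by hand above, although possibly listed in a different order.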
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	468
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	469	So why are these transformations all useful? Well apart from making the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	470	parser combinators work (remember they cannot deal with left-recursion and
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	471	are inefficient with $\epsilon$-rules), a second reason is that they help
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	472	with getting some insight into the complexity of the parsing problem.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	473	The parser combinators are very easy to implement, but are far from the
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	474	most efficient way of processing input (they can blow up exponentially
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	475	with ambiguous grammars). The question remains what is the best possible
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	476	complexity for parsing? It turns out that this is $O(n^3)$ for context-free
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	477	languages.
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	478
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	479	To answer the question about complexity, let me describe next the CYK
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	480	algorithm (named after the authors Cocke--Younger--Kasami). This algorithm
681 7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	481	works with grammars that are in \emph{Chomsky normal form}. In Chomsky
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	482	normal form all rules must be of the form $\meta{A} ::= a$, where $a$ is
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	483	a terminal, or $\meta{A} ::= \meta{B}\cdot \meta{C}$, where $\meta{B}$ and
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	484	$\meta{C}$ need to be non-terminals. And no rule can contain $\epsilon$.
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	485	The following grammar is in Chomsky normal form:
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	486
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	487	\begin{plstx}[margin=1cm]
682 553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	488	: \meta{S} ::= \meta{N}\cdot \meta{P}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	489	: \meta{P} ::= \meta{V}\cdot \meta{N}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	490	: \meta{N} ::= \meta{N}\cdot \meta{N}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	491	: \meta{N} ::= \meta{A}\cdot \meta{N}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	492	: \meta{N} ::= \texttt{student} \| \texttt{trainer} \| \texttt{team}
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	493	\| \texttt{trains}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	494	: \meta{V} ::= \texttt{trains} \| \texttt{team}\\
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	495	: \meta{A} ::= \texttt{The} \| \texttt{the}\\
681 7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	496	\end{plstx}
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	497
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	498	\noindent
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	499	where $\meta{S}$ is the start symbol and $\meta{S}$, $\meta{P}$,
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	500	$\meta{N}$, $\meta{V}$ and $\meta{A}$ are non-terminals. The ``words''
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	501	are terminals. The rough idea behind this grammar is that $\meta{S}$
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	502	stands for a sentence, $\meta{P}$ is a predicate, $\meta{N}$ is a noun
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	503	and so on. For example the rule \mbox{$\meta{P} ::= \meta{V}\cdot
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	504	\meta{N}$} states that a predicate can be a verb followed by a noun.
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	505	Now the question is whether the string
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	506
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	507	\begin{center}
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	508	\texttt{The trainer trains the student team}
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	509	\end{center}
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	510
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	511	\noindent
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	512	is recognised by the grammar. The CYK algorithm starts with the
7b7736bea3ca updated Christian Urban <urbanc@in.tum.de> parents: 680 diff changeset	513	following triangular data structure.
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	514
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	515	%%\begin{figure}[t]
682 553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	516	\begin{center}
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	517	\begin{tikzpicture}[scale=0.7,line width=0.8mm]
682 553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	518	\draw (-2,0) -- (4,0);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	519	\draw (-2,1) -- (4,1);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	520	\draw (-2,2) -- (3,2);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	521	\draw (-2,3) -- (2,3);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	522	\draw (-2,4) -- (1,4);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	523	\draw (-2,5) -- (0,5);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	524	\draw (-2,6) -- (-1,6);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	525
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	526	\draw (0,0) -- (0, 5);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	527	\draw (1,0) -- (1, 4);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	528	\draw (2,0) -- (2, 3);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	529	\draw (3,0) -- (3, 2);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	530	\draw (4,0) -- (4, 1);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	531	\draw (-1,0) -- (-1, 6);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	532	\draw (-2,0) -- (-2, 6);
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	533
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	534	\draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	535	\draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	536	\draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	537	\draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	538	\draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	539	\draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	540
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	541	\draw (-1.5,0.5) node {$A$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	542	\draw (-0.5,0.5) node {$N$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	543	\draw ( 0.5,0.5) node {$N,V$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	544	\draw ( 1.5,0.5) node {$A$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	545	\draw ( 2.5,0.5) node {$N$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	546	\draw ( 3.5,0.5) node {$N,V$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	547
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	548	% \draw (-1.5,1.5) node {\small{}a};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	549	% \draw (-0.5,1.5) node {\small{}b};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	550	% \draw ( 0.5,1.5) node {\small{}c};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	551	% \draw ( 1.5,1.5) node {\small{}d};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	552	% \draw ( 2.5,1.5) node {\small{}e};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	553
682 553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	554	\draw (-2.4, 5.5) node {$1$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	555	\draw (-2.4, 4.5) node {$2$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	556	\draw (-2.4, 3.5) node {$3$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	557	\draw (-2.4, 2.5) node {$4$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	558	\draw (-2.4, 1.5) node {$5$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	559	\draw (-2.4, 0.5) node {$6$};
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	560	\end{tikzpicture}
553b4d4e3719 updated Christian Urban <urbanc@in.tum.de> parents: 681 diff changeset	561	\end{center}
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	562	%%\end{figure}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	563
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	564
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	565	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	566	The bottom row (row 6) contains, for each word, the non-terminals
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	567	that can produce it. For example the field for \texttt{trains}
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	568	contains $\meta{N}$ and $\meta{V}$ because \texttt{trains} can be a
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	569	``verb'' or a ``noun'' according to the grammar. The row above,
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	570	let's call the corresponding fields 5a to 5e, contains information
937 dc5ab66b11cc updated Christian Urban <christian.urban@kcl.ac.uk> parents: 798 diff changeset	571	about \underline{2-word} parts of the sentence, namely
798 aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	572
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	573	\begin{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	574	\begin{tabular}{llll}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	575	5a) & $\underbrace{\texttt{The}}_{A}$ $\mid$ $\underbrace{\texttt{trainer}}_{N}$ \\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	576	5b) & $\underbrace{\texttt{trainer}}_{N}$ $\mid$ $\underbrace{\texttt{trains}}_{N,V}$\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	577	5c) & \texttt{trains} $\mid$ \texttt{the} \\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	578	5d) & \texttt{the} $\mid$ \texttt{student} \\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	579	5e) & \texttt{student} $\mid$ \texttt{team} \\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	580	\end{tabular}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	581	\end{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	582
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	583	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	584	For each of them, we look up in row 6 which non-terminals its two words belong to
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	585	(indicated above for 5a and 5b). For 5a, with the non-terminals
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	586	\meta{A} and \meta{N}, we find the grammar rule
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	587
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	588	\begin{plstx}[margin=1cm]
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	589	: \meta{N} ::= \meta{A}\cdot \meta{N}\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	590	\end{plstx}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	591
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	592	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	593	which means into field 5a we put the left-hand side of this rule,
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	594	which in this case is the non-terminal \meta{N}. For 5b we have to check
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	595	for both $\meta{N}\cdot\meta{N}$ and $\meta{N}\cdot\meta{V}$ whether there
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	596	is a right-hand side of this form in the grammar. But only the
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	597	grammar rule
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	598
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	599	\begin{plstx}[margin=1cm]
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	600	: \meta{N} ::= \meta{N}\cdot \meta{N}\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	601	\end{plstx}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	602
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	603	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	604	matches, which means 5b also gets an \meta{N}. Continuing for all
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	605	fields in row 5 gives:
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	606
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	607	\begin{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	608	\begin{tikzpicture}[scale=0.7,line width=0.8mm]
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	609	\draw (-2,0) -- (4,0);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	610	\draw (-2,1) -- (4,1);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	611	\draw (-2,2) -- (3,2);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	612	\draw (-2,3) -- (2,3);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	613	\draw (-2,4) -- (1,4);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	614	\draw (-2,5) -- (0,5);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	615	\draw (-2,6) -- (-1,6);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	616
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	617	\draw (0,0) -- (0, 5);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	618	\draw (1,0) -- (1, 4);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	619	\draw (2,0) -- (2, 3);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	620	\draw (3,0) -- (3, 2);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	621	\draw (4,0) -- (4, 1);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	622	\draw (-1,0) -- (-1, 6);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	623	\draw (-2,0) -- (-2, 6);
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	624
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	625	\draw (-1.5,-0.5) node {\footnotesize{}\texttt{The}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	626	\draw (-0.5,-1.0) node {\footnotesize{}\texttt{trainer}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	627	\draw ( 0.5,-0.5) node {\footnotesize{}\texttt{trains}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	628	\draw ( 1.5,-1.0) node {\footnotesize{}\texttt{the}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	629	\draw ( 2.5,-0.5) node {\footnotesize{}\texttt{student}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	630	\draw ( 3.5,-1.0) node {\footnotesize{}\texttt{team}};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	631
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	632	\draw (-1.5,0.5) node {$A$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	633	\draw (-0.5,0.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	634	\draw ( 0.5,0.5) node {$N,V$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	635	\draw ( 1.5,0.5) node {$A$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	636	\draw ( 2.5,0.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	637	\draw ( 3.5,0.5) node {$N,V$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	638
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	639	\draw (-1.5,1.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	640	\draw (-0.5,1.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	641	\draw ( 0.5,1.5) node {$$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	642	\draw ( 1.5,1.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	643	\draw ( 2.5,1.5) node {$N$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	644
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	645
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	646	% \draw (-1.5,1.5) node {\small{}a};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	647	% \draw (-0.5,1.5) node {\small{}b};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	648	% \draw ( 0.5,1.5) node {\small{}c};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	649	% \draw ( 1.5,1.5) node {\small{}d};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	650	% \draw ( 2.5,1.5) node {\small{}e};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	651
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	652	\draw (-2.4, 5.5) node {$1$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	653	\draw (-2.4, 4.5) node {$2$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	654	\draw (-2.4, 3.5) node {$3$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	655	\draw (-2.4, 2.5) node {$4$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	656	\draw (-2.4, 1.5) node {$5$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	657	\draw (-2.4, 0.5) node {$6$};
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	658	\end{tikzpicture}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	659	\end{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	660
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	661	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	662	Now row 4 is in charge of all 3-word parts of the sentence, namely
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	663
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	664	\begin{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	665	\begin{tabular}{llll}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	666	4a) & The $\mid$ trainer trains\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	667	& The trainer $\mid$ trains\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	668	4b) & trainer $\mid$ trains the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	669	& trainer trains $\mid$ the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	670	4c) & trains $\mid$ the student\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	671	& trains the $\mid$ student\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	672	4d) & the $\mid$ student team\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	673	& the student $\mid$ team\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	674	\end{tabular}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	675	\end{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	676
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	677	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	678	Note that in case of 3-word parts we have two splits. For example for
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	679	4a: $\underbrace{\texttt{The}}_{A}$ and
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	680	$\underbrace{\texttt{trainer trains}}_{N}$; and also
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	681	$\underbrace{\texttt{The trainer}}_{N}$ and
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	682	$\underbrace{\texttt{trains}}_{N,V}$. For each of these splits we
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	683	look up in the rows below which non-terminals we have already
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	684	computed. This tells us which right-hand sides to look for in our grammar:
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	685	$\meta{A}\cdot\meta{N}$, $\meta{N}\cdot\meta{N}$ and
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	686	$\meta{N}\cdot\meta{V}$. Only the first two occur in the grammar, and both have the left-hand side
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	687	\meta{N}. This is what we fill in for 4a. And so on for row 4.
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	688
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	689	Row 3 is about all 4-word parts in the sentence, namely
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	690
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	691	\begin{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	692	\begin{tabular}{llll}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	693	3a) & The trainer trains the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	694	3b) & trainer trains the student\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	695	3c) & trains the student team\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	696	\end{tabular}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	697	\end{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	698
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	699	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	700	Each of them can be split up in 3 ways, for example
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	701
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	702	\begin{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	703	\begin{tabular}{llll}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	704	3a) & The $\mid$ trainer trains the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	705	& The trainer $\mid$ trains the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	706	& The trainer trains $\mid$ the\\
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	707	\end{tabular}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	708	\end{center}
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	709
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	710	\noindent
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	711	and we have to consider them all in turn to fill in the non-terminals for
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	712	3a. You can guess how it continues: row 2 is for all 5-word parts, and finally
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	713	the field on the top is for the whole sentence.
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	714	The idea of the CYK algorithm is that if in the top-field the starting
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	715	symbol \meta{S} appears (possibly together with other non-terminals),
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	716	then the sentence is accepted by the grammar. If it does not, then the
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	717	sentence is not accepted.
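
The table-filling procedure described above can be sketched in a few lines of code. The following is only a minimal illustration, not a reference implementation: Python is used here purely for brevity, and the names \texttt{TERMINAL\_RULES}, \texttt{BINARY\_RULES}, \texttt{cyk\_table} and \texttt{recognises} are hypothetical choices made for this sketch. The table is indexed by the length of a part and its start position, so the entry for length 1 corresponds to row 6 of the triangle and the entry for length 6 to the top field.

```python
from itertools import product

# The example grammar in Chomsky normal form (encoding is an assumption
# made for this sketch). TERMINAL_RULES maps a word to the non-terminals
# that can produce it.
TERMINAL_RULES = {
    "student": {"N"},
    "trainer": {"N"},
    "team": {"N", "V"},
    "trains": {"N", "V"},
    "The": {"A"},
    "the": {"A"},
}

# BINARY_RULES maps a right-hand side (B, C) to the left-hand sides A
# of all rules A ::= B . C.
BINARY_RULES = {
    ("N", "P"): {"S"},
    ("V", "N"): {"P"},
    ("N", "N"): {"N"},
    ("A", "N"): {"N"},
}


def cyk_table(words):
    """Return table[length][start]: the set of non-terminals that can
    derive the part words[start : start + length]."""
    n = len(words)
    # Parts of length 1: look up each word in the terminal rules.
    table = {1: [set(TERMINAL_RULES.get(w, ())) for w in words]}
    for length in range(2, n + 1):
        row = []
        for start in range(n - length + 1):
            cell = set()
            # Try every way of splitting the part into two non-empty pieces.
            for split in range(1, length):
                left = table[split][start]
                right = table[length - split][start + split]
                # Look for right-hand sides B . C in the grammar.
                for pair in product(left, right):
                    cell |= BINARY_RULES.get(pair, set())
            row.append(cell)
        table[length] = row
    return table


def recognises(words, start_symbol="S"):
    """Accept iff the start symbol appears in the top field."""
    if not words:
        return False  # a CNF grammar cannot derive the empty string
    return start_symbol in cyk_table(words)[len(words)][0]
```

For the example sentence, \texttt{cyk\_table} computes for the 2-word parts exactly the entries shown in row 5 of the triangle ($N$, $N$, empty, $N$, $N$), and \texttt{recognises} accepts the sentence because \meta{S} ends up in the top field.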
aaf0bd0a211d updated Christian Urban <christian.urban@kcl.ac.uk> parents: 727 diff changeset	718
Let us very quickly calculate the complexity of the CYK
algorithm. Lookup operations inside the triangle and in the grammar
are assumed to take constant time, $O(1)$, so they do not affect the
asymptotic result. How many fields are in the triangle? There are
$\frac{n}{2} \cdot (n+1)$ of them, where $n$ is the size of the input. That means
roughly $O(n^2)$ fields. How much work do we have to do for each
field? Well, for the top-most field we have to consider $n-1$ splits,
and fewer for the fields below it, which
means at most $O(n)$ work for each field. The overall result is an $O(n^3)$
time-complexity for CYK. Asymptotically faster algorithms based on
fast matrix multiplication are known, but cubic worst-case time is
what the practical general parsing algorithms for context-free
grammars achieve.
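
\noindent
As a rough cross-check of this estimate (a sketch that counts only
split points and still treats every lookup as $O(1)$): a part of
length $\ell$ can start at $n-\ell+1$ positions and has $\ell-1$ split
points, so the total number of splits is
\[
\sum_{\ell=1}^{n} (n-\ell+1)\,(\ell-1)
\;=\; \frac{n\,(n-1)\,(n+1)}{6}
\;=\; \frac{n^3-n}{6}
\]
which is indeed $O(n^3)$. For example, for $n=3$ there are two parts
of length $2$ with one split each and one part of length $3$ with two
splits, giving $4 = \frac{27-3}{6}$ splits in total.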
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	729
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	730	\end{document}
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	731
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	732
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	733	%%% Parser combinators are now part of handout 6
459 780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	734
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	735	\subsection*{Parser Combinators}
780486571e38 updated Christian Urban <urbanc@in.tum.de> parents: 385 diff changeset	736
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	737	Let us now turn to the problem of generating a parse-tree for
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	738	a grammar and string. In what follows we explain \emph{parser
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	739	combinators}, because they are easy to implement and closely
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	740	resemble grammar rules. Imagine that a grammar describes the
665 6d74d2a0a4b0 updated Christian Urban <urbanc@in.tum.de> parents: 618 diff changeset	741	strings of natural numbers, such as the grammar \meta{N} shown
above. For all such strings we want to generate the
parse-trees or, later on, to extract the
meaning of these strings, that is the concrete integers
``behind'' these strings. In Scala the parser combinators will
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	746	be functions of type
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	747
3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	748	\begin{center}
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	749	\texttt{I $\Rightarrow$ Set[(T, I)]}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	750	\end{center}
3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	751
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	752	\noindent that is they take as input something of type
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	753	\texttt{I}, typically a list of tokens or a string, and return
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	754	a set of pairs. The first component of these pairs corresponds
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	755	to what the parser combinator was able to process from the
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	756	input and the second is the unprocessed part of the input. As
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	757	we shall see shortly, a parser combinator might return more
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	758	than one such pair, with the idea that there are potentially
several ways of interpreting the input. As a concrete
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	760	example, consider the case where the input is of type string,
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	761	say the string
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	762
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	763	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	764	\tt\Grid{iffoo\VS testbar}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	765	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	766
362 57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	767	\noindent We might have a parser combinator which tries to
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	768	interpret this string as a keyword (\texttt{if}) or an
57ea439feaff updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 360 diff changeset	769	identifier (\texttt{iffoo}). Then the output will be the set
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	770
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	771	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	772	$\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	773	\left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	774	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	775
\noindent where the first pair means the parser could
recognise \texttt{if} from the input and leaves the rest
as `unprocessed' in the second component of the pair; in the
other case it could recognise \texttt{iffoo} and leaves
\texttt{\VS testbar} as unprocessed. If the parser cannot
recognise anything from the input, then it just
returns the empty set $\{\}$. This indicates that
something ``went wrong''.
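
\noindent
To make this concrete, here is a minimal sketch of a function with the
shape \texttt{String $\Rightarrow$ Set[(String, String)]} that returns
both readings of the example input. The name \texttt{keywordOrIdent}
and the rule that an identifier is a non-empty run of letters are
assumptions made only for this illustration.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Hypothetical helper: each pair is (recognised part, unprocessed rest)
def keywordOrIdent(in: String): Set[(String, String)] = {
  // reading 1: the keyword "if"
  val keyword: Set[(String, String)] =
    if (in.startsWith("if")) Set(("if", in.drop(2))) else Set()
  // reading 2: an identifier, assumed to be a run of letters
  val id = in.takeWhile(_.isLetter)
  val ident: Set[(String, String)] =
    if (id.nonEmpty) Set((id, in.drop(id.length))) else Set()
  keyword ++ ident
}

keywordOrIdent("iffoo testbar")
// Set(("if", "foo testbar"), ("iffoo", " testbar"))
keywordOrIdent("42")
// Set()
\end{lstlisting}
\end{center}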
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	784
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	785	The main attraction is that we can easily build parser combinators out of smaller components
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	786	following very closely the structure of a grammar. In order to implement this in an object
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	787	oriented programming language, like Scala, we need to specify an abstract class for parser
combinators. This abstract class requires the implementation of the function
\texttt{parse}, which takes an argument of type \texttt{I} and returns a set of type
\mbox{\texttt{Set[(T, I)]}}.
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	791
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	792	\begin{center}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	793	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]

  // assumes the input type I provides isEmpty (as String and List do)
  def parse_all(ts: I): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty))
      yield head
}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	801	\end{lstlisting}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	802	\end{center}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	803
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	804	\noindent
From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
which just filters out all pairs whose second component is not empty (that is, pairs that still have some
unprocessed part). The reason is that at the end of parsing we are only interested in the
results where all the input has been consumed and no unprocessed part is left.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	809
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	810	One of the simplest parser combinators recognises just a character, say $c$,
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	811	from the beginning of strings. Its behaviour is as follows:
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	812
177 53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	813	\begin{itemize}
\item if the input string $s$ starts with the character $c$, it returns
the set $\{(c, \textit{tail of}\; s)\}$
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	816	\item otherwise it returns the empty set $\varnothing$
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	817	\end{itemize}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	818
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	819	\noindent
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	820	The input type of this simple parser combinator for characters is
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	821	\texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}.
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	822	The code in Scala is as follows:
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	823
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	824	\begin{center}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	825	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
case class CharParser(c: Char) extends Parser[String, Char] {
  // the nonEmpty test guards against calling head on the empty string
  def parse(sb: String): Set[(Char, String)] =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	830	\end{lstlisting}
53def1fbf472 updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 176 diff changeset	831	\end{center}
176 3c2653fc8b5a updated Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 175 diff changeset	832
183 b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	833	\noindent
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	834	The \texttt{parse} function tests whether the first character of the
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	835	input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	836	string into the recognised part \texttt{c} and the unprocessed part
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	837	\texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	838	the parser returns the empty set (in Scala \texttt{Set()}).
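
\noindent
The difference between \texttt{parse} and \texttt{parse\_all} can be
tried out with the following self-contained sketch. It makes two
assumptions that are not in the handout: the input type is fixed to
\texttt{String}, so that \texttt{isEmpty} is certainly available (hence
the hypothetical names \texttt{StringParser} and \texttt{CharP}), and
the character parser checks for the empty string before looking at the
first character.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Sketch only: a parser class specialised to String input
abstract class StringParser[T] {
  def parse(in: String): Set[(T, String)]

  // keep only the results that consumed the whole input
  def parse_all(in: String): Set[T] =
    for ((head, tail) <- parse(in); if tail.isEmpty) yield head
}

case class CharP(c: Char) extends StringParser[Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

val p = CharP('a')
p.parse("abc")       // Set(('a', "bc"))
p.parse("")          // Set()
p.parse_all("a")     // Set('a')
p.parse_all("abc")   // Set(), since "bc" is left unprocessed
\end{lstlisting}
\end{center}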
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	839
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	840	More interesting are the parser combinators that build larger parsers
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	841	out of smaller component parsers. For example the alternative
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	842	parser combinator is as follows.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	843
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	844	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	845	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class AltParser[I, T]
      (p: => Parser[I, T],
       q: => Parser[I, T]) extends Parser[I, T] {
  def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	851	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	852	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	853
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	854	\noindent
The types of this parser combinator are polymorphic (we just have \texttt{I}
for the input type, and \texttt{T} for the output type). The alternative parser
builds a new parser out of two existing parser combinators \texttt{p} and \texttt{q}.
Both need to be able to process input of type \texttt{I} and return the same
output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the
\texttt{=>} in front of the types of \texttt{p} and \texttt{q}. It makes them
\emph{call-by-name} parameters: the arguments are not evaluated when the
parser is constructed, but only when they are used. This is a form of
\emph{lazy evaluation} of the arguments.) The alternative parser runs
the input with the first parser \texttt{p} (producing a set of outputs) and then
runs the same input with \texttt{q}. The result is then just the union
of both sets, which is the operation \texttt{++} in Scala.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	866
This parser combinator already allows us to construct a parser that recognises either
the character \texttt{a} or \texttt{b}, as
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	869
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	870	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	871	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	872	new AltParser(CharParser('a'), CharParser('b'))
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	873	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	874	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	875
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	876	\noindent
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	877	Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' \|\| 'b'}.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	878	We can call this parser combinator with the strings
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	879
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	880	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	881	\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	882	input string & & output\medskip\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	883	\texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	884	\texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	885	\texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	886	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	887	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	888
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	889	\noindent
In the first two cases we receive a successful output (that is, a non-empty set);
in the third case the parser fails and returns the empty set.
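
\noindent
The three results in the table can be reproduced with the following
self-contained sketch. Two things in it are assumptions rather than
the handout's code: the abstract class is cut down to the
\texttt{parse} function only, and the character parser checks for the
empty string before looking at the first character.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// cut-down parser class: only parse is needed here
abstract class Parser[I, T] {
  def parse(ts: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(sb: String): Set[(Char, String)] =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[I, T]
      (p: => Parser[I, T],
       q: => Parser[I, T]) extends Parser[I, T] {
  // union of the results of both alternatives
  def parse(sb: I): Set[(T, I)] = p.parse(sb) ++ q.parse(sb)
}

val aOrB = new AltParser[String, Char](CharParser('a'), CharParser('b'))
aOrB.parse("ac")   // Set(('a', "c"))
aOrB.parse("bc")   // Set(('b', "c"))
aOrB.parse("cc")   // Set()
\end{lstlisting}
\end{center}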
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	891
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	892	A bit more interesting is the \emph{sequence parser combinator} implemented in
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	893	Scala as follows:
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	894
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	895	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	896	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
class SeqParser[I, T, S]
      (p: => Parser[I, T],
       q: => Parser[I, S]) extends Parser[I, (T, S)] {
  def parse(sb: I) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1))
      yield ((head1, head2), tail2)
}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	905	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	906	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	907
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	908	\noindent
This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse}
as follows: first run the parser \texttt{p} on the input, producing a set of pairs (\texttt{head1}, \texttt{tail1}).
The \texttt{tail1} stands for the unprocessed parts left over by \texttt{p}.
Then let \texttt{q} run on each of these unprocessed parts,
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	913	producing again a set of pairs. The output of the sequence parser combinator is then a set
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	914	containing pairs where the first components are again pairs, namely what the first parser could parse
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	915	together with what the second parser could parse; the second component is the unprocessed
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	916	part left over after running the second parser \texttt{q}. Therefore the input type of
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	917	the sequence parser combinator is as usual \texttt{I}, but the output type is
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	918
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	919	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	920	\texttt{Set[((T, S), I)]}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	921	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	922
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	923	Scala allows us to provide some
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	924	shorthand notation for the sequence parser combinator. So we can write for
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	925	example \texttt{'a' $\sim$ 'b'}, which is the
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	926	parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	927	Calling this parser combinator with the strings
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	928
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	929	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	930	\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	931	input string & & output\medskip\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	932	\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	933	\texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	934	\texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	935	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	936	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	937
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	938	\noindent
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	939	A slightly more complicated parser is \texttt{('a' \|\| 'b') $\sim$ 'b'} which parses as first character either
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	940	an \texttt{a} or \texttt{b} followed by a \texttt{b}. This parser produces the following results.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	941
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	942	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	943	\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	944	input string & & output\medskip\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	945	\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	946	\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	947	\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	948	\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	949	\end{center}
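
\noindent
The two tables above can be reproduced with a small sketch. It is only an
illustration under some assumptions: the input type is fixed to
\texttt{String} instead of the generic \texttt{I}, the class names
\texttt{SeqParser} and \texttt{AltParser} are hypothetical, and the
combinators are written out with \texttt{new} rather than with the
shorthand notation. It is meant to be run as a script or in the REPL.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Assumption: input type fixed to String for brevity.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
}

// Consumes the character c from the front of the input.
class CharParser(c: Char) extends Parser[Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

// Runs p, then q on what p left unparsed; pairs the outputs.
class SeqParser[T, S](p: => Parser[T], q: => Parser[S])
    extends Parser[(T, S)] {
  def parse(in: String): Set[((T, S), String)] =
    for ((out1, rest1) <- p.parse(in);
         (out2, rest2) <- q.parse(rest1)) yield ((out1, out2), rest2)
}

// Collects the results of p and of q on the same input.
class AltParser[T](p: => Parser[T], q: => Parser[T])
    extends Parser[T] {
  def parse(in: String): Set[(T, String)] =
    p.parse(in) ++ q.parse(in)
}

// 'a' ~ 'b'
val ab = new SeqParser(new CharParser('a'), new CharParser('b'))
// ('a' || 'b') ~ 'b'
val aOrbThenb = new SeqParser(
  new AltParser(new CharParser('a'), new CharParser('b')),
  new CharParser('b'))

println(ab.parse("abc"))         // Set(((a,b),c))
println(ab.parse("bac"))         // Set()
println(aOrbThenb.parse("bbc"))  // Set(((b,b),c))
println(aOrbThenb.parse("aac"))  // Set()
\end{lstlisting}
\end{center}

\noindent
In this sketch the nested pair in the output mirrors the type
\texttt{Set[((T, S), I)]}: the inner pair holds what the two component
parsers produced, and the second component is the unparsed rest of the input.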
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	950
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	951	Note carefully that constructing the parser \texttt{'a' \|\| ('a' $\sim$ 'b')} will result in a typing error.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	952	The first parser has as output type a single character (recall the type of \texttt{CharParser}),
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	953	but the second parser produces a pair of characters as output. The alternative parser, however,
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	954	requires both component parsers to have the same output type. We will see later how we can
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	955	build this parser without the typing error.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	956
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	957	The next parser combinator does not actually combine smaller parsers, but applies
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	958	a function to the result of the parser. It is implemented in Scala as follows:
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	959
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	960	\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	961	\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	962	class FunParser[I, T, S]
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	963	         (p: => Parser[I, T],
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	964	          f: T => S) extends Parser[I, S] {
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	965	  def parse(sb: I) =
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	966	    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	967	}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	968	\end{lstlisting}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	969	\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	970
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	971
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	972	\noindent
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	973	This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	974	input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	975	produces sets of pairs of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	976	applies the function \texttt{f} to all the parser outputs. Since this function
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	977	is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	978	Again Scala lets us introduce some shorthand notation for this parser combinator.
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	979	Therefore we will write \texttt{p ==> f} for it.
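
\noindent
To see the effect of applying a function to parser outputs, here is a
minimal sketch. It rests on some assumptions: the input type is fixed to
\texttt{String} instead of the generic \texttt{I}, and the shorthand
\texttt{==>} is defined as a method of \texttt{Parser}, which is only
one possible way of introducing it. The names \texttt{upperA} and
\texttt{codeA} are made up for illustration.

\begin{center}
\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
// Assumption: input type fixed to String for brevity.
abstract class Parser[T] {
  def parse(in: String): Set[(T, String)]
  // Assumed shorthand: p ==> f wraps p in a FunParser.
  def ==>[S](f: T => S): Parser[S] = new FunParser(this, f)
}

// Consumes the character c from the front of the input.
class CharParser(c: Char) extends Parser[Char] {
  def parse(in: String): Set[(Char, String)] =
    if (in.nonEmpty && in.head == c) Set((c, in.tail)) else Set()
}

// Applies f to every output of p; the unparsed rest is unchanged.
class FunParser[T, S](p: => Parser[T], f: T => S)
    extends Parser[S] {
  def parse(in: String): Set[(S, String)] =
    for ((head, tail) <- p.parse(in)) yield (f(head), tail)
}

val a = new CharParser('a')
val upperA = a ==> ((c: Char) => c.toUpper)  // output type Char
val codeA  = a ==> ((c: Char) => c.toInt)    // output type Int

println(upperA.parse("abc"))  // Set((A,bc))
println(codeA.parse("abc"))   // Set((97,bc))
println(upperA.parse("xyz"))  // Set()
\end{lstlisting}
\end{center}

\noindent
Note that in this sketch the function only changes the output component of
each pair. If \texttt{p} fails and returns the empty set, there is nothing
to apply \texttt{f} to and the result is again the empty set.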
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	980
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	981	%\bigskip
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	982	%takes advantage of the full generality---have a look
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	983	%what it produces if we call it with the string \texttt{abc}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	984	%
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	985	%\begin{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	986	%\begin{tabular}{rcl}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	987	%input string & & output\medskip\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	988	%\texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	989	%\texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	990	%\texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	991	%\end{tabular}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	992	%\end{center}
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	993
b17eff695c7f added new stuff Christian Urban <christian dot urban at kcl dot ac dot uk> parents: 177 diff changeset	994
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	995
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	996
eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	997
173 7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	998
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	999	%%% Local Variables:
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	1000	%%% mode: latex
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	1001	%%% TeX-master: t
7cfb7a6f7c99 added slides Christian Urban <christian dot urban at kcl dot ac dot uk> parents: diff changeset	1002	%%% End:
680 eecc4d5a2172 updated Christian Urban <urbanc@in.tum.de> parents: 665 diff changeset	1003

author	Christian Urban <christian.urban@kcl.ac.uk>
	Mon, 03 Feb 2025 13:25:59 +0000 (2 weeks ago)
changeset 980	0c491eff5b01
parent 941	66adcae6c762
permissions	-rw-r--r--