afl-material: comparison handouts/ho05.tex


-:9c7eb266594c
+:57ea439feaff
 \begin{document}
 \section*{Handout 6 (Grammars \& Parser)}
-While regular expressions are very useful for lexing and for recognising
+While regular expressions are very useful for lexing and for
-many patterns in strings (like email addresses), they have their limitations. For
+recognising many patterns in strings (like email addresses),
-example there is no regular expression that can recognise the language
+they have their limitations. For example there is no regular
-$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised
+expression that can recognise the language $a^nb^n$. Another
-expressions.  In languages like Lisp, which use parentheses rather
+example for which there exists no regular expression is the
-extensively, it might be of interest whether the following two expressions
+language of well-parenthesised expressions. In languages like
-are well-parenthesised (the left one is, the right one is not):
+Lisp, which use parentheses rather extensively, it might be of
+interest whether the following two expressions are
+well-parenthesised (the left one is, the right one is not):
 \begin{center}
 $(((()()))())$  \hspace{10mm} $(((()()))()))$
 \end{center}
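\noindent A regular expression cannot do this check, because it would have to keep track of an arbitrarily deep nesting of parentheses. A program with a counter can. The following is only a minimal sketch in Scala; the names \texttt{Parens} and \texttt{balanced} are hypothetical and not part of the handout.

```scala
object Parens {
  // Hypothetical helper: checks whether the parentheses in s are well nested.
  // depth counts the currently open parentheses; once it drops below zero
  // (a ')' without a matching '(') it stays negative for the rest of the fold.
  def balanced(s: String): Boolean = {
    val finalDepth = s.foldLeft(0) { (depth, c) =>
      if (depth < 0) depth
      else if (c == '(') depth + 1
      else if (c == ')') depth - 1
      else depth
    }
    finalDepth == 0
  }
}
```

With this sketch \texttt{Parens.balanced("(((()()))())")} holds for the left expression, while the right expression \texttt{"(((()()))()))"} is rejected because of its extra closing parenthesis.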
-\noindent
+\noindent Not being able to solve such recognition problems is
-Not being able to solve such recognition problems is a serious limitation.
+a serious limitation. In order to solve such recognition
-In order to solve such recognition problems, we need more powerful
+problems, we need more powerful techniques than regular
-techniques than regular expressions. We will in particular look at \emph{context-free
+expressions. We will in particular look at \emph{context-free
-languages}. They include the regular languages as the picture below shows:
+languages}. They include the regular languages as the picture
+below shows:
 \begin{center}
 \begin{tikzpicture}
 [rect/.style={draw=black!50,
 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
 \draw (0,-1.05) node [rect] {\small regular languages};
 \end{tikzpicture}
 \end{center}
-\noindent
+\noindent Context-free languages play an important role in
-Context-free languages play an important role in `day-to-day' text processing and in
+`day-to-day' text processing and in programming languages.
-programming languages. Context-free languages are usually specified by grammars.
+Context-free languages are usually specified by grammars. For
-For example a grammar for well-parenthesised  expressions is
+example a grammar for well-parenthesised expressions is
 \begin{center}
 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
 \end{center}
 \noindent
-In general grammars consist of finitely many rules built up from \emph{terminal symbols}
+or a grammar for recognising strings consisting of ones is
-(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules
-have the shape
+\begin{center}
+$O \;\;\rightarrow\;\; 1 \cdot  O \;|\; 1$
+\end{center}
+In general grammars consist of finitely many rules built up
+from \emph{terminal symbols} (usually lower-case letters) and
+\emph{non-terminal symbols} (upper-case letters). Rules have
+the shape
 \begin{center}
 $NT \;\;\rightarrow\;\; \textit{rhs}$
 \end{center}
-\noindent
+\noindent where on the left-hand side is a single non-terminal
-where on the left-hand side is a single non-terminal and on the right a string consisting
+and on the right a string consisting of both terminals and
-of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
+non-terminals including the $\epsilon$-symbol for indicating
-empty string. We use the convention  to separate components on
+the empty string. We use the convention to separate components
-the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised  expressions.
+on the right-hand side by using the $\cdot$ symbol, as in the
-We also use the convention to use $|$ as a shorthand notation for several rules. For example
+grammar for well-parenthesised expressions. We also use the
+convention to use $|$ as a shorthand notation for several
+rules. For example
 \begin{center}
 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
 \end{center}
-\noindent
+\noindent means that the non-terminal $NT$ can be replaced by
-means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
+either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there is more
-If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
+than one non-terminal on the left-hand side of the rules, then
-what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
+we need to indicate what is the \emph{starting} symbol of the
+grammar. For example the grammar for arithmetic expressions
 can be given as follows
 \begin{center}
-\begin{tabular}{lcl}
+\begin{tabular}{lcl@{\hspace{2cm}}l}
-$E$ & $\rightarrow$ &  $N$ \\
+$E$ & $\rightarrow$ &  $N$                 & (1)\\
-$E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
+$E$ & $\rightarrow$ &  $E \cdot + \cdot E$ & (2)\\
-$E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
+$E$ & $\rightarrow$ &  $E \cdot - \cdot E$ & (3)\\
-$E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
+$E$ & $\rightarrow$ &  $E \cdot * \cdot E$ & (4)\\
-$E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
+$E$ & $\rightarrow$ &  $( \cdot E \cdot )$ & (5)\\
-$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$
+$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\; \ldots \;|\; 9$ & (6\ldots)
 \end{tabular}
 \end{center}
-\noindent
+\noindent where $E$ is the starting symbol. A
-where $E$ is the starting symbol. A \emph{derivation} for a grammar
+\emph{derivation} for a grammar starts with the starting
-starts with the staring symbol of the grammar and in each step replaces one
+symbol of the grammar and in each step replaces one
-non-terminal by a right-hand side of a rule. A derivation ends with a string
+non-terminal by a right-hand side of a rule. A derivation ends
-in which only terminal symbols are left. For example a derivation for the
+with a string in which only terminal symbols are left. For
-string $(1 + 2) + 3$ is as follows:
+example a derivation for the string $(1 + 2) + 3$ is as
+follows:
-\begin{center}
-\begin{tabular}{lll}
+\begin{center}
-$E$ & $\rightarrow$ & $E+E$\\
+\begin{tabular}{lll@{\hspace{2cm}}l}
-& $\rightarrow$ & $(E)+E$\\
+$E$ & $\rightarrow$ & $E+E$          & by (2)\\
-& $\rightarrow$ & $(E+E)+E$\\
+& $\rightarrow$ & $(E)+E$     & by (5)\\
-& $\rightarrow$ & $(E+E)+N$\\
+& $\rightarrow$ & $(E+E)+E$   & by (2)\\
-& $\rightarrow$ & $(E+E)+3$\\
+& $\rightarrow$ & $(E+E)+N$   & by (1)\\
-& $\rightarrow$ & $(N+E)+3$\\
+& $\rightarrow$ & $(E+E)+3$   & by (6\dots)\\
-& $\rightarrow^+$ & $(1+2)+3$\\
+& $\rightarrow$ & $(N+E)+3$   & by (1)\\
+& $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
 \end{tabular}
 \end{center}
-\noindent
+\noindent where on the right it is indicated which
-The \emph{language} of a context-free grammar $G$ with start symbol $S$
+grammar rule has been applied. In the last step we
-is defined as the set of strings derivable by a derivation, that is
+merged several steps into one.
+The \emph{language} of a context-free grammar $G$
+with start symbol $S$ is defined as the set of strings
+derivable by a derivation, that is
 \begin{center}
 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
 \end{center}
 child {node {$N$} child {node {$4$}}}
 };
 \end{tikzpicture}
 \end{center}
-\noindent
+\noindent We are often interested in these parse-trees since
-We are often interested in these parse-trees since they encode the structure of
+they encode the structure of how a string is derived by a
-how a string is derived by a grammar. Before we come to the problem of constructing
+grammar. Before we come to the problem of constructing such
-such parse-trees, we need to consider the following two properties of grammars.
+parse-trees, we need to consider the following two properties
-A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
+of grammars. A grammar is \emph{left-recursive} if there is a
-$NT$ which leads to a string which again starts with $NT$. This means a derivation of the
+derivation starting from a non-terminal, say $NT$ which leads
-form.
+to a string which again starts with $NT$. This means a
+derivation of the form
 \begin{center}
 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
 \end{center}
-\noindent
+\noindent It can be easily seen that the grammar above for
-It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
+arithmetic expressions is left-recursive: for example the
-for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$
+rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow
-show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive
+N\cdot N$ show that this grammar is left-recursive. But note
-grammars. Fortunately every left-recursive grammar can be transformed into one that is
+that left-recursiveness can involve more than one step in the
-not left-recursive, although this transformation might make the grammar less human-readable.
+derivation. The problem with left-recursive grammars is that
-For example if we want to give a non-left-recursive grammar for numbers we might
+some algorithms cannot cope with them: they fall into a loop.
-specify
+Fortunately every left-recursive grammar can be transformed
+into one that is not left-recursive, although this
+transformation might make the grammar less ``human-readable''.
+For example if we want to give a non-left-recursive grammar
+for numbers we might specify
 \begin{center}
 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
 \end{center}
-\noindent
+\noindent Using this grammar we can still derive every number
-Using this grammar we can still derive every number string, but we will never be able
+string, but we will never be able to derive a string of the
-to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
+form $N \to \ldots \to N \cdot \ldots$.
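\noindent To see why the absence of left-recursion matters for an algorithm, here is a minimal sketch in Scala of a recursive recogniser that follows the rules of the grammar for $N$ literally. The names \texttt{NumberGrammar} and \texttt{isN} are hypothetical. Every recursive call happens only after one character has been consumed, so the function terminates; a direct translation of the rule $N \rightarrow N\cdot N$ would instead call itself on the same input and loop.

```scala
object NumberGrammar {
  // Hypothetical recogniser for  N -> 0 | ... | 9 | 1.N | ... | 9.N
  def isN(cs: List[Char]): Boolean = cs match {
    case d :: Nil  => '0' <= d && d <= '9'               // N -> 0 | ... | 9
    case d :: rest => '1' <= d && d <= '9' && isN(rest)  // N -> 1.N | ... | 9.N
    case Nil       => false                              // no rule derives the empty string
  }
}
```

For example \texttt{NumberGrammar.isN("123".toList)} succeeds. Read literally, the rule has no alternative $0\cdot N$, so this sketch rejects a string such as \texttt{100}.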
-The other property we have to watch out for is when a grammar is
+The other property we have to watch out for is when a grammar
-\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
+is \emph{ambiguous}. A grammar is said to be ambiguous if
-for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
+there are two parse-trees for one string. Again the grammar
-While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
+for arithmetic expressions shown above is ambiguous. While the
-general. For example there are two parse
+shown parse tree for the string $(1 + 23) + 4$ is unique, this
+is not the case in general. For example there are two parse
 trees for the string $1 + 2 + 3$, namely
 \begin{center}
 \begin{tabular}{c@{\hspace{10mm}}c}
 \begin{tikzpicture}[level distance=8mm, black]
 ;
 \end{tikzpicture}
 \end{tabular}
 \end{center}
-\noindent
+\noindent In particular in programming languages we will try
-In particular in programming languages we will try to avoid ambiguous
+to avoid ambiguous grammars because two different parse-trees
-grammars because two different parse-trees for a string mean a program can
+for a string mean a program can be interpreted in two
-be interpreted in two different ways. In such cases we have to somehow make sure
+different ways. In such cases we have to somehow make sure the
-the two different ways do not matter, or disambiguate the grammar in
+two different ways do not matter, or disambiguate the grammar
-some other way (for example making the $+$ left-associative). Unfortunately already
+in some other way (for example making the $+$
-the problem of deciding whether a grammar
+left-associative). Unfortunately already the problem of
-is ambiguous or not is in general undecidable.
+deciding whether a grammar is ambiguous or not is in general
+undecidable. But in simple instances (the ones we deal with in this
-Let us now turn to the problem of generating a parse-tree for a grammar and string.
+module) one can usually see when a grammar is ambiguous.
-In what follows we explain \emph{parser combinators}, because they are easy
-to implement and closely resemble grammar rules. Imagine that a grammar
+Let us now turn to the problem of generating a parse-tree for
-describes the strings of natural numbers, such as the grammar $N$ shown above.
+a grammar and string. In what follows we explain \emph{parser
-For all such strings we want to generate the parse-trees or later on we actually
+combinators}, because they are easy to implement and closely
-want to extract the meaning of these strings, that is the concrete integers ``behind''
+resemble grammar rules. Imagine that a grammar describes the
-these strings. In Scala the parser combinators will be functions of type
+strings of natural numbers, such as the grammar $N$ shown
+above. For all such strings we want to generate the
+parse-trees or later on we actually want to extract the
+meaning of these strings, that is the concrete integers
+``behind'' these strings. In Scala the parser combinators will
+be functions of type
 \begin{center}
 \texttt{I $\Rightarrow$ Set[(T, I)]}
 \end{center}
-\noindent
+\noindent that is they take as input something of type
-that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
+\texttt{I}, typically a list of tokens or a string, and return
-and return a set of pairs. The first component of these pairs corresponds to what the
+a set of pairs. The first component of these pairs corresponds
-parser combinator was able to process from the input and the second is the unprocessed
+to what the parser combinator was able to process from the
-part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
+input and the second is the unprocessed part of the input. As
-with the idea that there are potentially several ways how to interpret the input. As a concrete
+we shall see shortly, a parser combinator might return more
-example, consider the case where the input is of type string, say the string
+than one such pair, with the idea that there are potentially
+several ways to interpret the input. As a concrete
+example, consider the case where the input is of type string,
+say the string
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
-\noindent
+\noindent We might have a parser combinator which tries to
-We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
+interpret this string as a keyword (\texttt{if}) or an
-an identifier (\texttt{iffoo}). Then the output will be the set
+identifier (\texttt{iffoo}). Then the output will be the set
 \begin{center}
 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right),
 \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 \end{center}
-\noindent
+\noindent where the first pair means the parser could
-where the first pair means the parser could recognise \texttt{if} from the input and leaves
+recognise \texttt{if} from the input and leaves the rest as
-the rest as `unprocessed' as the second component of the pair; in the other case
+`unprocessed' as the second component of the pair; in the
-it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
+other case it could recognise \texttt{iffoo} and leaves
-cannot recognise anything from the input then parser combinators just return the empty
+\texttt{\VS testbar} as unprocessed. If the parser cannot
-set $\varnothing$. This will indicate something ``went wrong''.
+recognise anything from the input then parser combinators just
+return the empty set $\{\}$. This will indicate
+something ``went wrong''.
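\noindent The keyword/identifier example can be sketched in Scala roughly as follows. This is only an illustration that uses plain functions of type \texttt{I $\Rightarrow$ Set[(T, I)]}; the names \texttt{ParserSketch}, \texttt{keywordIf}, \texttt{identifier} and \texttt{alt} are assumptions made for this sketch and are not the definitions used in the handout.

```scala
object ParserSketch {
  // A parser maps an input to a set of (processed part, unprocessed rest) pairs.
  type Parser[I, T] = I => Set[(T, I)]

  // Hypothetical parser: recognises the keyword "if" at the start of the input.
  val keywordIf: Parser[String, String] =
    in => if (in.startsWith("if")) Set(("if", in.drop(2))) else Set()

  // Hypothetical parser: recognises a non-empty leading run of letters.
  val identifier: Parser[String, String] = in => {
    val (id, rest) = in.span(_.isLetter)
    if (id.nonEmpty) Set((id, rest)) else Set()
  }

  // Alternative: collects the results of both parsers on the same input.
  def alt[I, T](p: Parser[I, T], q: Parser[I, T]): Parser[I, T] =
    in => p(in) ++ q(in)
}
```

Applying \texttt{alt(keywordIf, identifier)} to the string \texttt{"iffoo testbar"} yields a set with the two pairs \texttt{("if", "foo testbar")} and \texttt{("iffoo", " testbar")}. On an input starting with a digit, both parsers return the empty set.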
 The main attraction is that we can easily build parser combinators out of smaller components
 following very closely the structure of a grammar. In order to implement this in an
 object-oriented programming language, like Scala, we need to specify an abstract class for parser
 combinators. This abstract class requires the implementation of the function

changeset 362	57ea439feaff
parent 360	c6c574d2ca0c
child 385	7f8516ff408d