updated
authorChristian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 23 Oct 2015 14:45:57 +0100
changeset 362 57ea439feaff
parent 361 9c7eb266594c
child 363 0d6deecdb2eb
updated
handouts/ho05.pdf
handouts/ho05.tex
progs/Matcher2.thy
progs/comb1.scala
Binary file handouts/ho05.pdf has changed
--- a/handouts/ho05.tex	Fri Oct 23 08:35:17 2015 +0100
+++ b/handouts/ho05.tex	Fri Oct 23 14:45:57 2015 +0100
@@ -7,23 +7,26 @@
 
 \section*{Handout 6 (Grammars \& Parser)}
 
-While regular expressions are very useful for lexing and for recognising
-many patterns in strings (like email addresses), they have their limitations. For
-example there is no regular expression that can recognise the language 
-$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
-expressions.  In languages like Lisp, which use parentheses rather
-extensively, it might be of interest whether the following two expressions
-are well-parenthesised (the left one is, the right one is not):
+While regular expressions are very useful for lexing and for
+recognising many patterns in strings (like email addresses),
+they have their limitations. For example there is no regular
+expression that can recognise the language $a^nb^n$. Another
+example for which there exists no regular expression is the
+language of well-parenthesised expressions. In languages like
+Lisp, which use parentheses rather extensively, it might be of
+interest whether the following two expressions are
+well-parenthesised (the left one is, the right one is not):
 
 \begin{center}
 $(((()()))())$  \hspace{10mm} $(((()()))()))$
 \end{center}
 
-\noindent
-Not being able to solve such recognition problems is a serious limitation.
-In order to solve such recognition problems, we need more powerful 
-techniques than regular expressions. We will in particular look at \emph{context-free
-languages}. They include the regular languages as the picture below shows:
+\noindent Not being able to solve such recognition problems is
+a serious limitation. In order to solve such recognition
+problems, we need more powerful techniques than regular
+expressions. We will in particular look at \emph{context-free
+languages}. They include the regular languages as the picture
+below shows:
 
 
 \begin{center}
@@ -40,74 +43,89 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent
-Context-free languages play an important role in `day-to-day' text processing and in
-programming languages. Context-free languages are usually specified by grammars.
-For example a grammar for well-parenthesised  expressions is
+\noindent Context-free languages play an important role in
+`day-to-day' text processing and in programming languages.
+Context-free languages are usually specified by grammars. For
+example a grammar for well-parenthesised expressions is
 
 \begin{center}
 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
 \end{center}
  
-\noindent
-In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
-(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
-have the shape
+\noindent 
+or a grammar for recognising strings consisting of ones is
+
+\begin{center}
+$O \;\;\rightarrow\;\; 1 \cdot  O \;|\; 1$
+\end{center}
+
+In general grammars consist of finitely many rules built up
+from \emph{terminal symbols} (usually lower-case letters) and
+\emph{non-terminal symbols} (upper-case letters). Rules have
+the shape
 
 \begin{center}
 $NT \;\;\rightarrow\;\; \textit{rhs}$
 \end{center}
  
-\noindent
-where on the left-hand side is a single non-terminal and on the right a string consisting
-of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
-empty string. We use the convention  to separate components on
-the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised  expressions.
-We also use the convention to use $|$ as a shorthand notation for several rules. For example
+\noindent where on the left-hand side is a single non-terminal
+and on the right a string consisting of both terminals and
+non-terminals including the $\epsilon$-symbol for indicating
+the empty string. We use the convention to separate components
+on the right hand-side by using the $\cdot$ symbol, as in the
+grammar for well-parenthesised expressions. We also use the
+convention to use $|$ as a shorthand notation for several
+rules. For example
 
 \begin{center}
 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
 \end{center}
 
-\noindent
-means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
-If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
-what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
+\noindent means that the non-terminal $NT$ can be replaced by
+either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
+than one non-terminal on the left-hand side of the rules, then
+we need to indicate what is the \emph{starting} symbol of the
+grammar. For example the grammar for arithmetic expressions
 can be given as follows
 
 \begin{center}
-\begin{tabular}{lcl}
-$E$ & $\rightarrow$ &  $N$ \\
-$E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
-$E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
-$E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
-$E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
-$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ 
+\begin{tabular}{lcl@{\hspace{2cm}}l}
+$E$ & $\rightarrow$ &  $N$                 & (1)\\
+$E$ & $\rightarrow$ &  $E \cdot + \cdot E$ & (2)\\
+$E$ & $\rightarrow$ &  $E \cdot - \cdot E$ & (3)\\
+$E$ & $\rightarrow$ &  $E \cdot * \cdot E$ & (4)\\
+$E$ & $\rightarrow$ &  $( \cdot E \cdot )$ & (5)\\
+$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ & (6\ldots)
 \end{tabular}
 \end{center}
 
-\noindent
-where $E$ is the starting symbol. A \emph{derivation} for a grammar
-starts with the staring symbol of the grammar and in each step replaces one
-non-terminal by a right-hand side of a rule. A derivation ends with a string
-in which only terminal symbols are left. For example a derivation for the
-string $(1 + 2) + 3$ is as follows:
+\noindent where $E$ is the starting symbol. A
+\emph{derivation} for a grammar starts with the starting
+symbol of the grammar and in each step replaces one
+non-terminal by a right-hand side of a rule. A derivation ends
+with a string in which only terminal symbols are left. For
+example a derivation for the string $(1 + 2) + 3$ is as
+follows:
 
 \begin{center}
-\begin{tabular}{lll}
-$E$ & $\rightarrow$ & $E+E$\\
-       & $\rightarrow$ & $(E)+E$\\
-       & $\rightarrow$ & $(E+E)+E$\\
-       & $\rightarrow$ & $(E+E)+N$\\
-       & $\rightarrow$ & $(E+E)+3$\\
-       & $\rightarrow$ & $(N+E)+3$\\	
-       & $\rightarrow^+$ & $(1+2)+3$\\
+\begin{tabular}{lll@{\hspace{2cm}}l}
+$E$ & $\rightarrow$ & $E+E$          & by (2)\\
+       & $\rightarrow$ & $(E)+E$     & by (5)\\
+       & $\rightarrow$ & $(E+E)+E$   & by (2)\\
+       & $\rightarrow$ & $(E+E)+N$   & by (1)\\
+       & $\rightarrow$ & $(E+E)+3$   & by (6\dots)\\
+       & $\rightarrow$ & $(N+E)+3$   & by (1)\\
+       & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
 \end{tabular} 
 \end{center}
 
-\noindent
-The \emph{language} of a context-free grammar $G$ with start symbol $S$ 
-is defined as the set of strings derivable by a derivation, that is
+\noindent where on the right it is indicated which 
+grammar rule has been applied. In the last step we
+merged several steps into one.
+
+The \emph{language} of a context-free grammar $G$
+with start symbol $S$ is defined as the set of strings
+derivable by a derivation, that is
 
 \begin{center}
 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
@@ -140,40 +158,46 @@
 \end{tikzpicture}
 \end{center}
 
-\noindent
-We are often interested in these parse-trees since they encode the structure of
-how a string is derived by a grammar. Before we come to the problem of constructing
-such parse-trees, we need to consider the following two properties of grammars.
-A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
-$NT$ which leads to a string which again starts with $NT$. This means a derivation of the
-form.
+\noindent We are often interested in these parse-trees since
+they encode the structure of how a string is derived by a
+grammar. Before we come to the problem of constructing such
+parse-trees, we need to consider the following two properties
+of grammars. A grammar is \emph{left-recursive} if there is a
+derivation starting from a non-terminal, say $NT$ which leads
+to a string which again starts with $NT$. This means a
+derivation of the form.
 
 \begin{center}
 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
 \end{center}
 
-\noindent
-It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
-for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
-show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
-grammars. Fortunately every left-recursive grammar can be transformed into one that is
-not left-recursive, although this transformation might make the grammar less human-readable.
-For example if we want to give a non-left-recursive grammar for numbers we might
-specify
+\noindent It can be easily seen that the grammar above for
+arithmetic expressions is left-recursive: for example the
+rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow
+N\cdot N$ show that this grammar is left-recursive. But note
+that left-recursiveness can involve more than one step in the
+derivation. The problem with left-recursive grammars is that
+some algorithms cannot cope with them: they fall into a loop.
+Fortunately every left-recursive grammar can be transformed
+into one that is not left-recursive, although this
+transformation might make the grammar less ``human-readable''.
+For example if we want to give a non-left-recursive grammar
+for numbers we might specify
 
 \begin{center}
 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
 \end{center}
 
-\noindent
-Using this grammar we can still derive every number string, but we will never be able 
-to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
+\noindent Using this grammar we can still derive every number
+string, but we will never be able to derive a string of the
+form $N \to \ldots \to N \cdot \ldots$.
 
-The other property we have to watch out for is when a grammar is
-\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
-for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
-While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
-general. For example there are two parse
+The other property we have to watch out for is when a grammar
+is \emph{ambiguous}. A grammar is said to be ambiguous if
+there are two parse-trees for one string. Again the grammar
+for arithmetic expressions shown above is ambiguous. While the
+shown parse tree for the string $(1 + 23) + 4$ is unique, this
+is not the case in general. For example there are two parse
 trees for the string $1 + 2 + 3$, namely
 
 \begin{center}
@@ -204,54 +228,64 @@
 \end{tabular} 
 \end{center}
 
-\noindent
-In particular in programming languages we will try to avoid ambiguous
-grammars because two different parse-trees for a string mean a program can
-be interpreted in two different ways. In such cases we have to somehow make sure
-the two different ways do not matter, or disambiguate the grammar in
-some other way (for example making the $+$ left-associative). Unfortunately already 
-the problem of deciding whether a grammar
-is ambiguous or not is in general undecidable. 
+\noindent In particular in programming languages we will try
+to avoid ambiguous grammars because two different parse-trees
+for a string mean a program can be interpreted in two
+different ways. In such cases we have to somehow make sure the
+two different ways do not matter, or disambiguate the grammar
+in some other way (for example making the $+$
+left-associative). Unfortunately already the problem of
+deciding whether a grammar is ambiguous or not is in general
+undecidable. But in simple instance (the ones we deal in this
+module) one can usually see when a grammar is ambiguous.
 
-Let us now turn to the problem of generating a parse-tree for a grammar and string.
-In what follows we explain \emph{parser combinators}, because they are easy
-to implement and closely resemble grammar rules. Imagine that a grammar
-describes the strings of natural numbers, such as the grammar $N$ shown above.
-For all such strings we want to generate the parse-trees or later on we actually 
-want to extract the meaning of these strings, that is the concrete integers ``behind'' 
-these strings. In Scala the parser combinators will be functions of type
+Let us now turn to the problem of generating a parse-tree for
+a grammar and string. In what follows we explain \emph{parser
+combinators}, because they are easy to implement and closely
+resemble grammar rules. Imagine that a grammar describes the
+strings of natural numbers, such as the grammar $N$ shown
+above. For all such strings we want to generate the
+parse-trees or later on we actually want to extract the
+meaning of these strings, that is the concrete integers
+``behind'' these strings. In Scala the parser combinators will
+be functions of type
 
 \begin{center}
 \texttt{I $\Rightarrow$ Set[(T, I)]}
 \end{center}
 
-\noindent 
-that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
-and return a set of pairs. The first component of these pairs corresponds to what the
-parser combinator was able to process from the input and the second is the unprocessed 
-part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
-with the idea that there are potentially several ways how to interpret the input. As a concrete
-example, consider the case where the input is of type string, say the string
+\noindent that is they take as input something of type
+\texttt{I}, typically a list of tokens or a string, and return
+a set of pairs. The first component of these pairs corresponds
+to what the parser combinator was able to process from the
+input and the second is the unprocessed part of the input. As
+we shall see shortly, a parser combinator might return more
+than one such pair, with the idea that there are potentially
+several ways how to interpret the input. As a concrete
+example, consider the case where the input is of type string,
+say the string
 
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
 
-\noindent
-We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
-an identifier (\texttt{iffoo}). Then the output will be the set
+\noindent We might have a parser combinator which tries to
+interpret this string as a keyword (\texttt{if}) or an
+identifier (\texttt{iffoo}). Then the output will be the set
 
 \begin{center}
 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 \end{center}
 
-\noindent
-where the first pair means the parser could recognise \texttt{if} from the input and leaves 
-the rest as `unprocessed' as the second component of the pair; in the other case
-it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
-cannot recognise anything from the input then parser combinators just return the empty 
-set $\varnothing$. This will indicate something ``went wrong''.
+\noindent where the first pair means the parser could
+recognise \texttt{if} from the input and leaves the rest as
+`unprocessed' as the second component of the pair; in the
+other case it could recognise \texttt{iffoo} and leaves
+\texttt{\VS testbar} as unprocessed. If the parser cannot
+recognise anything from the input then parser combinators just
+return the empty set $\{\}$. This will indicate
+something ``went wrong''.
 
 The main attraction is that we can easily build parser combinators out of smaller components
 following very closely the structure of a grammar. In order to implement this in an object
--- a/progs/Matcher2.thy	Fri Oct 23 08:35:17 2015 +0100
+++ b/progs/Matcher2.thy	Fri Oct 23 14:45:57 2015 +0100
@@ -24,8 +24,8 @@
 | PLUS rexp
 | OPT rexp
 | NTIMES rexp nat
-| NMTIMES2 rexp nat nat
-(*
+| NMTIMES rexp nat nat
+
 fun M :: "rexp \<Rightarrow> nat"
 where
   "M (NULL) = 0"
@@ -38,8 +38,8 @@
 | "M (PLUS r) = Suc (M r)"
 | "M (OPT r) = Suc (M r)"
 | "M (NTIMES r n) = Suc (M r) * 2 * (Suc n)"
-| "M (NMTIMES2 r n m) = Suc (M r) * 2 * (Suc m - Suc n)"
-*)
+| "M (NMTIMES r n m) = Suc (M r) * 2 * (Suc m - Suc n)"
+
 section {* Sequential Composition of Sets *}
 
 definition
@@ -159,7 +159,7 @@
 | "L (PLUS r) = (L r) ;; ((L r)\<star>)"
 | "L (OPT r) = (L r) \<union> {[]}"
 | "L (NTIMES r n) = (L r) \<up> n"
-| "L (NMTIMES2 r n m) = (\<Union>i\<in> {n..m} . ((L r) \<up> i))" 
+| "L (NMTIMES r n m) = (\<Union>i\<in> {n..m} . ((L r) \<up> i))" 
 
 
 lemma "L (NOT NULL) = UNIV"
@@ -181,21 +181,7 @@
 | "nullable (PLUS r) = (nullable r)"
 | "nullable (OPT r) = True"
 | "nullable (NTIMES r n) = (if n = 0 then True else nullable r)"
-| "nullable (NMTIMES2 r n m) = (if m < n then False else (if n = 0 then True else nullable r))"
-
-fun M :: "rexp \<Rightarrow> nat"
-where
-  "M (NULL) = 0"
-| "M (EMPTY) = 0"
-| "M (CHAR char) = 0"
-| "M (SEQ r1 r2) = Suc ((M r1) + (M r2))"
-| "M (ALT r1 r2) = Suc ((M r1) + (M r2))"
-| "M (STAR r) = Suc (M r)"
-| "M (NOT r) = Suc (M r)"
-| "M (PLUS r) = Suc (M r)"
-| "M (OPT r) = Suc (M r)"
-| "M (NTIMES r n) = Suc (M r) * 2 * (Suc n)"
-| "M (NMTIMES2 r n m) = 3 * (Suc (M r) + n) * 3 * (Suc n) * (Suc (Suc m) - (Suc n))"
+| "nullable (NMTIMES r n m) = (if m < n then False else (if n = 0 then True else nullable r))"
 
 function der :: "char \<Rightarrow> rexp \<Rightarrow> rexp"
 where
@@ -210,9 +196,9 @@
 | "der c (OPT r) = der c r"
 | "der c (NTIMES r 0) = NULL"
 | "der c (NTIMES r (Suc n)) = der c (SEQ r (NTIMES r n))"
-| "der c (NMTIMES2 r n m) = (if m < n then NULL else 
+| "der c (NMTIMES r n m) = (if m < n then NULL else 
                               (if n = m then der c (NTIMES r n) else
-                                ALT (der c (NTIMES r n)) (der c (NMTIMES2 r (Suc n) m))))"
+                                ALT (der c (NTIMES r n)) (der c (NMTIMES r (Suc n) m))))"
 by pat_completeness auto
 
 termination der 
--- a/progs/comb1.scala	Fri Oct 23 08:35:17 2015 +0100
+++ b/progs/comb1.scala	Fri Oct 23 14:45:57 2015 +0100
@@ -1,6 +1,10 @@
 import scala.language.implicitConversions
 import scala.language.reflectiveCalls
 
+/* Note, in the lectures I did not show the type consraint
+ * I <% Seq[_] , which means that the input type I can be
+ * treated, or seen, as a sequence. */
+
 abstract class Parser[I <% Seq[_], T] {
   def parse(ts: I): Set[(T, I)]
 
@@ -67,20 +71,21 @@
     new SeqParser[String, String, String](s, r)
 }
 
-// palindromes
+// a parse palindromes
 lazy val Pal : Parser[String, String] = 
   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } ||
    ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } || "")
 
 println("Palindrom: " + Pal.parse_all("ababbaba"))
 
-// well-nested parenthesis
+// well-nested parenthesis parser
 lazy val P : Parser[String, String] = 
   "(" ~ P ~ ")" ~ P ==> { case (((u, x), y), z) => "{" + x + "}" + z } || ""
 
 P.parse_all("(((()()))())")
 P.parse_all("(((()()))()))")
 P.parse_all(")(")
+P.parse_all("()")
 
 // arithmetic expressions
 lazy val E: Parser[String, String] = 
@@ -93,10 +98,11 @@
 
 println(E.parse_all("1*2+3"))
 println(E.parse_all("1+2*3"))
-println(E.parse_all("1+2+3"))
+println(E.parse_all("(1+2)+3"))
+println(E.parse_all("1+2+3"))  // this is not parsed, because of 
+                               // how the grammar is set up
 
-
-// non-ambiguous vs ambiguous
+// non-ambiguous vs ambiguous grammars
 lazy val U : Parser[String, String] =
   ("1" ~ U) ==> { case (x, y) => x + y } || ""
 
@@ -106,3 +112,32 @@
   ("1" ~ S ~ S) ==> { case ((x, y), z) => x + y + z } || ""
 
 S.parse_all("1" * 15)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+