# HG changeset patch # User Christian Urban # Date 1445607957 -3600 # Node ID 57ea439feaffda778755a58bd9fad38d12aed409 # Parent 9c7eb266594cfc4933f4a02eb3ecfe178c33c40d updated diff -r 9c7eb266594c -r 57ea439feaff handouts/ho05.pdf Binary file handouts/ho05.pdf has changed diff -r 9c7eb266594c -r 57ea439feaff handouts/ho05.tex --- a/handouts/ho05.tex Fri Oct 23 08:35:17 2015 +0100 +++ b/handouts/ho05.tex Fri Oct 23 14:45:57 2015 +0100 @@ -7,23 +7,26 @@ \section*{Handout 6 (Grammars \& Parser)} -While regular expressions are very useful for lexing and for recognising -many patterns in strings (like email addresses), they have their limitations. For -example there is no regular expression that can recognise the language -$a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised -expressions. In languages like Lisp, which use parentheses rather -extensively, it might be of interest whether the following two expressions -are well-parenthesised (the left one is, the right one is not): +While regular expressions are very useful for lexing and for +recognising many patterns in strings (like email addresses), +they have their limitations. For example there is no regular +expression that can recognise the language $a^nb^n$. Another +example for which there exists no regular expression is the +language of well-parenthesised expressions. In languages like +Lisp, which use parentheses rather extensively, it might be of +interest whether the following two expressions are +well-parenthesised (the left one is, the right one is not): \begin{center} $(((()()))())$ \hspace{10mm} $(((()()))()))$ \end{center} -\noindent -Not being able to solve such recognition problems is a serious limitation. -In order to solve such recognition problems, we need more powerful -techniques than regular expressions. We will in particular look at \emph{context-free -languages}. They include the regular languages as the picture below shows: +\noindent Not being able to solve such recognition problems is +a serious limitation. In order to solve such recognition +problems, we need more powerful techniques than regular +expressions. We will in particular look at \emph{context-free +languages}. They include the regular languages as the picture +below shows: \begin{center} @@ -40,74 +43,89 @@ \end{tikzpicture} \end{center} -\noindent -Context-free languages play an important role in `day-to-day' text processing and in -programming languages. Context-free languages are usually specified by grammars. -For example a grammar for well-parenthesised expressions is +\noindent Context-free languages play an important role in +`day-to-day' text processing and in programming languages. +Context-free languages are usually specified by grammars. For +example a grammar for well-parenthesised expressions is \begin{center} $P \;\;\rightarrow\;\; ( \cdot P \cdot ) \cdot P \;|\; \epsilon$ \end{center} -\noindent -In general grammars consist of finitely many rules built up from \emph{terminal symbols} -(usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters). Rules -have the shape +\noindent +or a grammar for recognising strings consisting of ones is + +\begin{center} +$O \;\;\rightarrow\;\; 1 \cdot O \;|\; 1$ +\end{center} + +In general grammars consist of finitely many rules built up +from \emph{terminal symbols} (usually lower-case letters) and +\emph{non-terminal symbols} (upper-case letters). 
Rules have +the shape \begin{center} $NT \;\;\rightarrow\;\; \textit{rhs}$ \end{center} -\noindent -where on the left-hand side is a single non-terminal and on the right a string consisting -of both terminals and non-terminals including the $\epsilon$-symbol for indicating the -empty string. We use the convention to separate components on -the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised expressions. -We also use the convention to use $|$ as a shorthand notation for several rules. For example +\noindent where on the left-hand side is a single non-terminal +and on the right a string consisting of both terminals and +non-terminals including the $\epsilon$-symbol for indicating +the empty string. We use the convention to separate components +on the right hand-side by using the $\cdot$ symbol, as in the +grammar for well-parenthesised expressions. We also use the +convention to use $|$ as a shorthand notation for several +rules. For example \begin{center} $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$ \end{center} -\noindent -means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$. -If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate -what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions +\noindent means that the non-terminal $NT$ can be replaced by +either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more +than one non-terminal on the left-hand side of the rules, then +we need to indicate what is the \emph{starting} symbol of the +grammar. For example the grammar for arithmetic expressions can be given as follows \begin{center} -\begin{tabular}{lcl} -$E$ & $\rightarrow$ & $N$ \\ -$E$ & $\rightarrow$ & $E \cdot + \cdot E$ \\ -$E$ & $\rightarrow$ & $E \cdot - \cdot E$ \\ -$E$ & $\rightarrow$ & $E \cdot * \cdot E$ \\ -$E$ & $\rightarrow$ & $( \cdot E \cdot )$\\ -$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ +\begin{tabular}{lcl@{\hspace{2cm}}l} +$E$ & $\rightarrow$ & $N$ & (1)\\ +$E$ & $\rightarrow$ & $E \cdot + \cdot E$ & (2)\\ +$E$ & $\rightarrow$ & $E \cdot - \cdot E$ & (3)\\ +$E$ & $\rightarrow$ & $E \cdot * \cdot E$ & (4)\\ +$E$ & $\rightarrow$ & $( \cdot E \cdot )$ & (5)\\ +$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ & (6\ldots) \end{tabular} \end{center} -\noindent -where $E$ is the starting symbol. A \emph{derivation} for a grammar -starts with the staring symbol of the grammar and in each step replaces one -non-terminal by a right-hand side of a rule. A derivation ends with a string -in which only terminal symbols are left. For example a derivation for the -string $(1 + 2) + 3$ is as follows: +\noindent where $E$ is the starting symbol. A +\emph{derivation} for a grammar starts with the starting +symbol of the grammar and in each step replaces one +non-terminal by a right-hand side of a rule. A derivation ends +with a string in which only terminal symbols are left. 
For +example a derivation for the string $(1 + 2) + 3$ is as +follows: \begin{center} -\begin{tabular}{lll} -$E$ & $\rightarrow$ & $E+E$\\ - & $\rightarrow$ & $(E)+E$\\ - & $\rightarrow$ & $(E+E)+E$\\ - & $\rightarrow$ & $(E+E)+N$\\ - & $\rightarrow$ & $(E+E)+3$\\ - & $\rightarrow$ & $(N+E)+3$\\ - & $\rightarrow^+$ & $(1+2)+3$\\ +\begin{tabular}{lll@{\hspace{2cm}}l} +$E$ & $\rightarrow$ & $E+E$ & by (2)\\ + & $\rightarrow$ & $(E)+E$ & by (5)\\ + & $\rightarrow$ & $(E+E)+E$ & by (2)\\ + & $\rightarrow$ & $(E+E)+N$ & by (1)\\ + & $\rightarrow$ & $(E+E)+3$ & by (6\dots)\\ + & $\rightarrow$ & $(N+E)+3$ & by (1)\\ + & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\ \end{tabular} \end{center} -\noindent -The \emph{language} of a context-free grammar $G$ with start symbol $S$ -is defined as the set of strings derivable by a derivation, that is +\noindent where on the right it is indicated which +grammar rule has been applied. In the last step we +merged several steps into one. + +The \emph{language} of a context-free grammar $G$ +with start symbol $S$ is defined as the set of strings +derivable by a derivation, that is \begin{center} $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$ @@ -140,40 +158,46 @@ \end{tikzpicture} \end{center} -\noindent -We are often interested in these parse-trees since they encode the structure of -how a string is derived by a grammar. Before we come to the problem of constructing -such parse-trees, we need to consider the following two properties of grammars. -A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say -$NT$ which leads to a string which again starts with $NT$. This means a derivation of the -form. +\noindent We are often interested in these parse-trees since +they encode the structure of how a string is derived by a +grammar. Before we come to the problem of constructing such +parse-trees, we need to consider the following two properties +of grammars. A grammar is \emph{left-recursive} if there is a +derivation starting from a non-terminal, say $NT$ which leads +to a string which again starts with $NT$. This means a +derivation of the form. \begin{center} $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$ \end{center} -\noindent -It can be easily seen that the grammar above for arithmetic expressions is left-recursive: -for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ -show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive -grammars. Fortunately every left-recursive grammar can be transformed into one that is -not left-recursive, although this transformation might make the grammar less human-readable. -For example if we want to give a non-left-recursive grammar for numbers we might -specify +\noindent It can be easily seen that the grammar above for +arithmetic expressions is left-recursive: for example the +rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow +N\cdot N$ show that this grammar is left-recursive. But note +that left-recursiveness can involve more than one step in the +derivation. The problem with left-recursive grammars is that +some algorithms cannot cope with them: they fall into a loop. +Fortunately every left-recursive grammar can be transformed +into one that is not left-recursive, although this +transformation might make the grammar less ``human-readable''. 
+For example, if we want to give a non-left-recursive grammar
+for numbers, we might specify
 \begin{center}
 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
 \end{center}
-\noindent
-Using this grammar we can still derive every number string, but we will never be able
-to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
+\noindent Using this grammar we can still derive every number
+string, but we will never be able to derive a string of the
+form $N \rightarrow \ldots \rightarrow N \cdot \ldots$.
-The other property we have to watch out for is when a grammar is
-\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
-for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
-While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
-general. For example there are two parse
+The other property we have to watch out for is when a grammar
+is \emph{ambiguous}. A grammar is said to be ambiguous if
+there are two parse-trees for one string. Again the grammar
+for arithmetic expressions shown above is ambiguous. While the
+shown parse tree for the string $(1 + 23) + 4$ is unique, this
+is not the case in general. For example there are two parse
 trees for the string $1 + 2 + 3$, namely
 \begin{center}
@@ -204,54 +228,64 @@
 \end{tabular}
 \end{center}
-\noindent
-In particular in programming languages we will try to avoid ambiguous
-grammars because two different parse-trees for a string mean a program can
-be interpreted in two different ways. In such cases we have to somehow make sure
-the two different ways do not matter, or disambiguate the grammar in
-some other way (for example making the $+$ left-associative). Unfortunately already
-the problem of deciding whether a grammar
-is ambiguous or not is in general undecidable.
+\noindent In particular, in programming languages we will try
+to avoid ambiguous grammars because two different parse-trees
+for a string mean a program can be interpreted in two
+different ways. In such cases we have to somehow make sure the
+two different ways do not matter, or disambiguate the grammar
+in some other way (for example by making the $+$
+left-associative). Unfortunately, the problem of deciding
+whether a grammar is ambiguous or not is already undecidable
+in general. But in simple instances (the ones we deal with in
+this module) one can usually see whether a grammar is ambiguous.
-Let us now turn to the problem of generating a parse-tree for a grammar and string.
-In what follows we explain \emph{parser combinators}, because they are easy
-to implement and closely resemble grammar rules. Imagine that a grammar
-describes the strings of natural numbers, such as the grammar $N$ shown above.
-For all such strings we want to generate the parse-trees or later on we actually
-want to extract the meaning of these strings, that is the concrete integers ``behind''
-these strings. In Scala the parser combinators will be functions of type
+Let us now turn to the problem of generating a parse-tree for
+a grammar and a string. In what follows we explain \emph{parser
+combinators}, because they are easy to implement and closely
+resemble grammar rules. Imagine that a grammar describes the
+strings of natural numbers, such as the grammar $N$ shown
+above. For all such strings we want to generate the
+parse-trees, or later on we actually want to extract the
+meaning of these strings, that is the concrete integers
+``behind'' these strings. In Scala the parser combinators will
+be functions of type
 \begin{center}
 \texttt{I $\Rightarrow$ Set[(T, I)]}
 \end{center}
-\noindent
-that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
-and return a set of pairs. The first component of these pairs corresponds to what the
-parser combinator was able to process from the input and the second is the unprocessed
-part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
-with the idea that there are potentially several ways how to interpret the input. As a concrete
-example, consider the case where the input is of type string, say the string
+\noindent that is, they take as input something of type
+\texttt{I}, typically a list of tokens or a string, and return
+a set of pairs. The first component of these pairs corresponds
+to what the parser combinator was able to process from the
+input and the second is the unprocessed part of the input. As
+we shall see shortly, a parser combinator might return more
+than one such pair, with the idea that there are potentially
+several ways to interpret the input. As a concrete
+example, consider the case where the input is of type string,
+say the string
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
 \end{center}
-\noindent
-We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
-an identifier (\texttt{iffoo}). Then the output will be the set
+\noindent We might have a parser combinator which tries to
+interpret this string as a keyword (\texttt{if}) or an
+identifier (\texttt{iffoo}). Then the output will be the set
 \begin{center}
 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
 \end{center}
-\noindent
-where the first pair means the parser could recognise \texttt{if} from the input and leaves
-the rest as `unprocessed' as the second component of the pair; in the other case
-it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
-cannot recognise anything from the input then parser combinators just return the empty
-set $\varnothing$. This will indicate something ``went wrong''.
+\noindent where the first pair means the parser could
+recognise \texttt{if} from the input and leave the rest as the
+`unprocessed' second component of the pair; in the other case
+it could recognise \texttt{iffoo} and leave
+\texttt{\VS testbar} as unprocessed. If a parser cannot
+recognise anything from the input, then it just returns the
+empty set $\{\}$. This will indicate that
+something ``went wrong''.
 The main attraction is that we can easily build parser combinators out of smaller components following very closely the structure of a grammar.
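To make this concrete, here is a small, self-contained Scala sketch of the idea. It is only an illustration, not the comb1.scala implementation contained further below in this changeset: the names ParserSketch, lit, alt and seq are invented for this sketch, and the input type is fixed to String for simplicity.

  object ParserSketch {
    // a parser is a function from the input to a set of
    // (processed part, unprocessed rest) pairs
    type Parser[T] = String => Set[(T, String)]

    // atomic parser: recognises the literal string s at the
    // beginning of the input, otherwise returns the empty set
    def lit(s: String): Parser[String] =
      in => if (in.startsWith(s)) Set((s, in.drop(s.length))) else Set()

    // alternative: collect the results of both parsers
    def alt[T](p: Parser[T], q: Parser[T]): Parser[T] =
      in => p(in) ++ q(in)

    // sequence: run p first, then run q on the leftover input of p
    def seq[S, T](p: Parser[S], q: Parser[T]): Parser[(S, T)] =
      in => for ((o1, r1) <- p(in); (o2, r2) <- q(r1)) yield ((o1, o2), r2)

    def main(args: Array[String]): Unit = {
      // keyword "if" versus identifier "iffoo" on the input "iffoo testbar"
      println(alt(lit("if"), lit("iffoo"))("iffoo testbar"))
      // the set contains both ("if", "foo testbar") and ("iffoo", " testbar")
    }
  }

A parser that matches nothing simply contributes the empty set, so failure propagates through alt and seq exactly as described above.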
In order to implement this in an object diff -r 9c7eb266594c -r 57ea439feaff progs/Matcher2.thy --- a/progs/Matcher2.thy Fri Oct 23 08:35:17 2015 +0100 +++ b/progs/Matcher2.thy Fri Oct 23 14:45:57 2015 +0100 @@ -24,8 +24,8 @@ | PLUS rexp | OPT rexp | NTIMES rexp nat -| NMTIMES2 rexp nat nat -(* +| NMTIMES rexp nat nat + fun M :: "rexp \ nat" where "M (NULL) = 0" @@ -38,8 +38,8 @@ | "M (PLUS r) = Suc (M r)" | "M (OPT r) = Suc (M r)" | "M (NTIMES r n) = Suc (M r) * 2 * (Suc n)" -| "M (NMTIMES2 r n m) = Suc (M r) * 2 * (Suc m - Suc n)" -*) +| "M (NMTIMES r n m) = Suc (M r) * 2 * (Suc m - Suc n)" + section {* Sequential Composition of Sets *} definition @@ -159,7 +159,7 @@ | "L (PLUS r) = (L r) ;; ((L r)\)" | "L (OPT r) = (L r) \ {[]}" | "L (NTIMES r n) = (L r) \ n" -| "L (NMTIMES2 r n m) = (\i\ {n..m} . ((L r) \ i))" +| "L (NMTIMES r n m) = (\i\ {n..m} . ((L r) \ i))" lemma "L (NOT NULL) = UNIV" @@ -181,21 +181,7 @@ | "nullable (PLUS r) = (nullable r)" | "nullable (OPT r) = True" | "nullable (NTIMES r n) = (if n = 0 then True else nullable r)" -| "nullable (NMTIMES2 r n m) = (if m < n then False else (if n = 0 then True else nullable r))" - -fun M :: "rexp \ nat" -where - "M (NULL) = 0" -| "M (EMPTY) = 0" -| "M (CHAR char) = 0" -| "M (SEQ r1 r2) = Suc ((M r1) + (M r2))" -| "M (ALT r1 r2) = Suc ((M r1) + (M r2))" -| "M (STAR r) = Suc (M r)" -| "M (NOT r) = Suc (M r)" -| "M (PLUS r) = Suc (M r)" -| "M (OPT r) = Suc (M r)" -| "M (NTIMES r n) = Suc (M r) * 2 * (Suc n)" -| "M (NMTIMES2 r n m) = 3 * (Suc (M r) + n) * 3 * (Suc n) * (Suc (Suc m) - (Suc n))" +| "nullable (NMTIMES r n m) = (if m < n then False else (if n = 0 then True else nullable r))" function der :: "char \ rexp \ rexp" where @@ -210,9 +196,9 @@ | "der c (OPT r) = der c r" | "der c (NTIMES r 0) = NULL" | "der c (NTIMES r (Suc n)) = der c (SEQ r (NTIMES r n))" -| "der c (NMTIMES2 r n m) = (if m < n then NULL else +| "der c (NMTIMES r n m) = (if m < n then NULL else (if n = m then der c (NTIMES r n) else - ALT (der c (NTIMES r n)) (der c (NMTIMES2 r (Suc n) m))))" + ALT (der c (NTIMES r n)) (der c (NMTIMES r (Suc n) m))))" by pat_completeness auto termination der diff -r 9c7eb266594c -r 57ea439feaff progs/comb1.scala --- a/progs/comb1.scala Fri Oct 23 08:35:17 2015 +0100 +++ b/progs/comb1.scala Fri Oct 23 14:45:57 2015 +0100 @@ -1,6 +1,10 @@ import scala.language.implicitConversions import scala.language.reflectiveCalls +/* Note, in the lectures I did not show the type consraint + * I <% Seq[_] , which means that the input type I can be + * treated, or seen, as a sequence. 
*/
+
 abstract class Parser[I <% Seq[_], T] {
   def parse(ts: I): Set[(T, I)]
@@ -67,20 +71,21 @@
     new SeqParser[String, String, String](s, r)
 }
-// palindromes
+// a parser for palindromes
 lazy val Pal : Parser[String, String] =
   (("a" ~ Pal ~ "a") ==> { case ((x, y), z) => x + y + z } ||
    ("b" ~ Pal ~ "b") ==> { case ((x, y), z) => x + y + z } || "")
 println("Palindrom: " + Pal.parse_all("ababbaba"))
-// well-nested parenthesis
+// well-nested parentheses parser
 lazy val P : Parser[String, String] =
   "(" ~ P ~ ")" ~ P ==> { case (((u, x), y), z) => "{" + x + "}" + z } || ""
 P.parse_all("(((()()))())")
 P.parse_all("(((()()))()))")
 P.parse_all(")(")
+P.parse_all("()")
 // arithmetic expressions
 lazy val E: Parser[String, String] =
@@ -93,10 +98,11 @@
 println(E.parse_all("1*2+3"))
 println(E.parse_all("1+2*3"))
-println(E.parse_all("1+2+3"))
+println(E.parse_all("(1+2)+3"))
+println(E.parse_all("1+2+3"))   // this is not parsed, because of
+                                // how the grammar is set up
-
-// non-ambiguous vs ambiguous
+// non-ambiguous vs ambiguous grammars
 lazy val U : Parser[String, String] =
   ("1" ~ U) ==> { case (x, y) => x + y } || ""
@@ -106,3 +112,32 @@
  ("1" ~ S ~ S) ==> { case ((x, y), z) => x + y + z } || ""
 S.parse_all("1" * 15)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
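As a follow-up to the handout's remark about extracting the ``meaning'' of number strings, here is a sketch of how semantic actions can turn a right-recursive number grammar in the style of N -> digit . N | digit (a simplification of the handout's non-left-recursive grammar for numbers) into a concrete Int. This is again only an illustration in the plain-function style of the earlier sketch; the names NumberSketch, act, digit and number are invented here and are not part of comb1.scala.

  object NumberSketch {
    // parsers as plain functions, as in the earlier sketch
    type Parser[T] = String => Set[(T, String)]

    def alt[T](p: Parser[T], q: Parser[T]): Parser[T] =
      in => p(in) ++ q(in)

    def seq[S, T](p: Parser[S], q: Parser[T]): Parser[(S, T)] =
      in => for ((o1, r1) <- p(in); (o2, r2) <- q(r1)) yield ((o1, o2), r2)

    // semantic action: rewrite the result of a parser
    // (this plays the role of ==> in comb1.scala)
    def act[S, T](p: Parser[S])(f: S => T): Parser[T] =
      in => for ((o, rest) <- p(in)) yield (f(o), rest)

    // a single digit, kept as a string
    val digit: Parser[String] =
      in => if (in.nonEmpty && in.head.isDigit) Set((in.head.toString, in.tail)) else Set()

    // N -> digit . N | digit, with actions that glue the digits
    // back together and convert the final string into an Int
    lazy val number: Parser[Int] =
      in => alt(act(seq(digit, number)) { case (d, n) => (d + n.toString).toInt },
                act(digit)(_.toInt))(in)

    def main(args: Array[String]): Unit = {
      println(number("123abc"))
      // three results: reading 123, 12 or just 1 off the front of the input,
      // each paired with the remaining unprocessed string
    }
  }

The several pairs in the result set correspond to the several ways the parser can split the input, which is exactly the multiple-results behaviour of parser combinators described in the handout.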