handouts/ho06.tex
changeset 591 863e502f6a5c
parent 590 c6a1e19e9801
child 592 0d309fafa9f0
--- a/handouts/ho06.tex	Thu Oct 25 18:36:04 2018 +0100
+++ b/handouts/ho06.tex	Fri Oct 26 13:28:33 2018 +0100
@@ -20,7 +20,7 @@
 drawbacks. For example they require that the grammar to be parsed is
 \emph{not} left-recursive and they are efficient only when the grammar
 is unambiguous. It is the responsibility of the grammar designer to
-ensure these two properties.
+ensure these two properties hold.
 
 The general idea behind parser combinators is to transform the input
 into sets of pairs, like so
@@ -28,17 +28,17 @@
 \begin{center}
 $\underbrace{\text{list of tokens}}_{\text{input}}$ 
 $\Rightarrow$
-$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$
+$\underbrace{\text{set of (parsed part, unprocessed part)}}_{\text{output}}$
 \end{center} 
 
 \noindent
 Given the extended effort we have spent implementing a lexer in order
-to generate list of tokens, it might be surprising that in what
-follows we shall often use strings as input, rather than list of
-tokens. This is for making the explanation more lucid. It does not
-make our previous work on lexers obsolete (remember they transform a
-string into a list of tokens). Lexers will still be needed for
-building a somewhat realistic compiler.
+to generate lists of tokens, it might be surprising that in what
+follows we shall often use strings as input, rather than lists of
+tokens. This is to make the explanation more lucid and to allow for quick
+examples. It does not make our previous work on lexers obsolete
+(remember they transform a string into a list of tokens). Lexers will
+still be needed for building a somewhat realistic compiler.
 
 As mentioned above, parser combinators are relatively agnostic about what
 kind of input they process. In my Scala code I use the following
@@ -51,14 +51,14 @@
 \noindent That is they take as input something of type \texttt{I} and
 return a set of pairs of type \texttt{Set[(T, I)]}. Since the input
 needs to be of ``sequence-kind'', I actually have to often write
-\texttt{I <\% Seq[\_]} for the input type. This ensures the input is a
-subtype of Scala sequences. The first component of the generated pairs
-corresponds to what the parser combinator was able to process from the
-input and the second is the unprocessed part, or leftover, of the
-input (therefore the type of this unprocessed part is the same as the
-input). A parser combinator might return more than one such pair; the
-idea is that there are potentially several ways of how to parse the
-input.  As a concrete example, consider the string
+\texttt{I <\% Seq[\_]} for the input type. This ensures the
+input is a subtype of Scala sequences. The first component of the
+generated pairs corresponds to what the parser combinator was able to
+parse from the input and the second is the unprocessed, or
+leftover, part of the input (therefore the type of this unprocessed part is
+the same as the input). A parser combinator might return more than one
+such pair; the idea is that there are potentially several ways
+to parse the input.  As a concrete example, consider the string
 
 \begin{center}
 \tt\Grid{iffoo\VS testbar}
@@ -75,7 +75,7 @@
 
 \noindent where the first pair means the parser could recognise
 \texttt{if} from the input and leaves the \texttt{foo\VS testbar} as
-`unprocessed' part; in the other case it could recognise
+unprocessed part; in the other case it could recognise
 \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the
 parser cannot recognise anything from the input at all, then parser
 combinators just return the empty set $\{\}$. This will indicate
@@ -85,7 +85,7 @@
 Also important to note is that the type \texttt{T} for the processed
 part is different from the input type \texttt{I} in the parse. In the
 example above it just happens to be the same. The reason for the
-potential difference is that in general we are interested in
+difference is that in general we are interested in
 transforming our input into something ``different''\ldots for example
 into a tree; or if we implement the grammar for arithmetic
 expressions, we might be interested in the actual integer number the
@@ -95,11 +95,13 @@
 
 The main idea of parser combinators is that we can easily build parser
 combinators out of smaller components following very closely the
-structure of a grammar. In order to implement this in an
-object-oriented programming language, like Scala, we need to specify
-an abstract class for parser combinators. This abstract class states
-that the function \texttt{parse} takes an argument of type \texttt{I}
-and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
+structure of a grammar. In order to implement this in a
+functional/object-oriented programming language, like Scala, we need
+to specify an abstract class for parser combinators. In the abstract
+class we specify that \texttt{I} is the \emph{input type} of the
+parser combinator and that \texttt{T} is the \emph{output type}.  This
+implies that the function \texttt{parse} takes an argument of type
+\texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
 
 \begin{center}
 \begin{lstlisting}[language=Scala]
@@ -113,7 +115,7 @@
 \end{lstlisting}
 \end{center}
 
-\noindent It is the obligation in each instance (parser combinator) to
+\noindent It is the obligation of each instance of this class to
 supply an implementation for \texttt{parse}.  From this function we
 can then ``centrally'' derive the function \texttt{parse\_all}, which
 just filters out all pairs whose second component is not empty (that
@@ -255,7 +257,7 @@
 
 \noindent We receive in the first two cases a successful
 output (that is a non-empty set). In each case, either
-\pcode{a} or \pcode{b} is in the processed part, and
+\pcode{a} or \pcode{b} is in the parsed part, and
 \pcode{cde} in the unprocessed part. Clearly this parser cannot
 parse anything with \pcode{ccde}, therefore the empty
 set is returned.
@@ -264,7 +266,7 @@
 two parsers, say again, $p$ and $q$, we want to apply first the input
 to $p$ producing a set of pairs; then apply $q$ to all the unparsed  
 parts in the pairs; and then combine the results. Mathematically we would
-write something like this for the result set of pairs:
+write something like this for the set of pairs:
 
 \begin{center}
 \begin{tabular}{lcl}
@@ -277,16 +279,17 @@
 
 \noindent Notice that the $p$ will first be run on the input,
 producing pairs of the form $(\textit{output}_1, u_1)$ where the $u_1$
-stands for the unprocessed, or leftover, parts pf $p$. We want that
+stands for the unprocessed, or leftover, parts of $p$. We want
 $q$ to run on all these unprocessed parts $u_1$. Therefore these
 unprocessed parts are fed into the second parser $q$. The overall
 result of the sequence parser combinator is pairs of the form
 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
-unprocessed part of the sequqnce p[arser combinator is the unprocessed
-part the second parser $q$ leaves as leftover. The processed parts of
-of the component parsers is a pair consisting of the outputs of $p$
-and $q$, namely $(\textit{output}_1, \textit{output}_2)$. This
-behaviour can be implemented in Scala as follows:
+unprocessed part of the sequence parser combinator is the unprocessed
+part that the second parser $q$ leaves as leftover. The parsed parts of
+the component parsers are combined in a pair, namely
+$(\textit{output}_1, \textit{output}_2)$. The reason is that we want to
+know what $p$ and $q$ were able to parse. This behaviour can be
+implemented in Scala as follows:
 
 \begin{center}
 \begin{lstlisting}[language=Scala,numbers=none]
@@ -302,37 +305,39 @@
 \end{center}
 
 \noindent This parser takes again as arguments two parsers, \texttt{p}
-and \texttt{q}. It implements \texttt{parse} as follows: let first run
-the parser \texttt{p} on the input producing a set of pairs
+and \texttt{q}. It implements \texttt{parse} as follows: first run the
+parser \texttt{p} on the input producing a set of pairs
 (\texttt{output1}, \texttt{u1}). The \texttt{u1} stands for the
-unprocessed parts left over by \texttt{p}. Let then \texttt{q} run on these
-unprocessed parts producing again a set of pairs. The output of the
-sequence parser combinator is then a set containing pairs where the
-first components are again pairs, namely what the first parser could
-parse together with what the second parser could parse; the second
-component is the unprocessed part left over after running the second
-parser \texttt{q}. Therefore the input type of the sequence parser
-combinator is as usual \texttt{I}, but the output type is
+unprocessed parts left over by \texttt{p} (recall that there can be
+several such pairs). Then let \texttt{q} run on these unprocessed
+parts producing again a set of pairs. The output of the sequence
+parser combinator is then a set containing pairs where the first
+components are again pairs, namely what the first parser could parse
+together with what the second parser could parse; the second component
+is the unprocessed part left over after running the second parser
+\texttt{q}. Note that the input type of the sequence parser combinator
+is as usual \texttt{I}, but the output type is
 
 \begin{center}
 \texttt{(T, S)}
 \end{center}
 
 \noindent
-This means \texttt{parse} in the sequence parser combinator returns
-sets of type \texttt{Set[((T, S), I)]}.  Notice that we have
-essentially two output types for the sequence parser combinator
-(packaged in a pair), because in general \textit{p} and \textit{q}
-might produce different things (for example first we recognise a
-number and then a string corresponding to an operator).  If any of the
-runs of \textit{p} and \textit{q} fail, that is produce the empty set,
-then \texttt{parse} will also produce the empty set.
+Consequently, the function \texttt{parse} in the sequence parser
+combinator returns sets of type \texttt{Set[((T, S), I)]}.  That means
+we have essentially two output types for the sequence parser
+combinator (packaged in a pair), because in general \textit{p} and
+\textit{q} might produce different things (for example we recognise a
+number with \texttt{p} and then with \texttt{q} a string corresponding
+to an operator).  If any of the runs of \textit{p} and \textit{q}
+fail, that is produce the empty set, then \texttt{parse} will also
+produce the empty set.
 
 With the shorthand notation we shall introduce later for the sequence
 parser combinator, we can write for example \pcode{"a" ~ "b"}, which
 is the parser combinator that first recognises the character
 \texttt{a} from a string and then \texttt{b}. Let us look again at
-three examples of how this parser combinator processes some strings:
+some examples of how this parser combinator processes strings:
 
 \begin{center}
 \begin{tabular}{rcl}
@@ -385,7 +390,7 @@
 ``missing'' in the sequence we are looking for. The first succeeds but
 notice how the results nest with sequences: the parsed part is a
 nested pair of the form \pcode{((a, b), c)}. If we nest the sequence
-parser differently, for example \pcode{"a" ~ ("b" ~ "c")}, then also
+parser differently, say \pcode{"a" ~ ("b" ~ "c")}, then also
 our output pairs nest differently
 
 \begin{center}
@@ -407,7 +412,7 @@
 \end{tabular}
 \end{center}
 
-\noindent Notice how again the results nest deeper and deeper as pairs (the
+\noindent Notice again how the results nest deeper and deeper as pairs (the
 last \pcode{a} is in the unprocessed part). To consume everything of
 this string we can use the parser \pcode{(("a" ~ "a") ~ "a") ~
   "a"}. Then the output is as follows:
@@ -439,18 +444,18 @@
 
 Consider also carefully that constructing a parser such as \pcode{"a" |
   ("a" ~ "b")} will result in a typing error. The intention with this
-parser is that we want to parse an \texttt{a}, or an \texttt{a}
+parser is that we want to parse either an \texttt{a}, or an \texttt{a}
 followed by a \texttt{b}. However, the first parser has as output type
 a single character (recall the type of \texttt{CharParser}), but the
 second parser produces a pair of characters as output. The alternative
 parser is required to have both component parsers to have the same
-type---we need to be able to build the union of two sets, which means
-in Scala they need to be of the same type.  Since ther are not, there
-is a typing error in this example.  We will see later how we can build
-this parser without the typing error.
+type---the reason is that we need to be able to build the union of two
+sets, which requires in Scala that the sets have the same type.  Since
+the types do not agree in this case, there is a typing error.  We will see later
+how we can build this parser without the typing error.
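+
+To preview one possible fix (a sketch only, using the semantic action
+combinator \texttt{==>} introduced below): we can apply a semantic
+action to the second alternative so that both alternatives have the
+same output type, for example by keeping only the part corresponding
+to the first \texttt{a}:
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+"a" | (("a" ~ "b") ==> { case (x, _) => x })
+\end{lstlisting}
+\end{center}
+
+\noindent
+This type-checks because the pair produced by the sequence parser is
+collapsed to the type of its first component (it discards, of course,
+what was parsed for \texttt{b}). A more informative fix is to convert
+both components into a common type, such as \texttt{String}, as we
+shall do for the palindrome parser below.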
 
 The next parser combinator, called \emph{semantic action}, does not
-actually combine smaller parsers, but applies a function to the result
+actually combine two smaller parsers, but applies a function to the result
 of a parser.  It is implemented in Scala as follows
 
 \begin{center}
@@ -473,15 +478,16 @@
 parser outputs. Since this function is of type \texttt{T => S}, we
 obtain a parser with output type \texttt{S}. Again Scala lets us
 introduce some shorthand notation for this parser
-combinator. Therefore we will write \texttt{p ==> f} for it.
+combinator. Therefore we will simply write \texttt{p ==> f} for it.
 
 What are semantic actions good for? Well, they allow you to transform
 the parsed input into datastructures you can use for further
-processing. A simple example would be to transform parsed characters
-into ASCII numbers. Suppose we define a function \texttt{f} (from
-characters to ints) and use a \texttt{CharParser} for parsing
+processing. A simple (contrived) example would be to transform parsed
+characters into ASCII numbers. Suppose we define a function \texttt{f}
+(from characters to \texttt{Int}s) and use a \texttt{CharParser} for parsing
 the character \texttt{c}.
 
+
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
 val f = (c: Char) => c.toInt
@@ -505,8 +511,9 @@
 character has been transformed into an ASCII number.
 
 A slightly less contrived example is about parsing numbers (recall
-\texttt{NumParser} above). However, we want to do this here for strings.
-For this assume we have the following \texttt{RegexParser}.
+\texttt{NumParser} above). However, we want to do this here for
+strings, not for tokens.  For this assume we have the following
+(atomic) \texttt{RegexParser}.
 
 \begin{center}
   \begin{lstlisting}[language=Scala,xleftmargin=0mm,
@@ -527,8 +534,10 @@
 prefix and the rest according to this regex
 (\texttt{reg.findPrefixMatchOf} generates a match---in the successful
 case---and the corresponding strings can be extracted with
-\texttt{matched} and \texttt{after}). Using this parser we can define a
-\texttt{NumParser} for strings as follows:
+\texttt{matched} and \texttt{after}). Both the input and the output type
+of this parser are \texttt{String}. Using \texttt{RegexParser} we can
+define a \texttt{NumParser} from \texttt{String}s to \texttt{Int}s as
+follows:
 
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -537,7 +546,7 @@
 \end{center}
 
 \noindent
-This parser will recognise a number at the beginning of a string, for
+This parser will recognise a number at the beginning of a string. For
 example
 
 \begin{center}
@@ -560,24 +569,35 @@
 
 \noindent
 The function in the semantic action converts a string into an
-\texttt{Int}. Let us come back to semantic actions when we are going
-to implement actual context-free gammars.
+\texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
+but this time \texttt{123} is an \texttt{Int}. Let us come back to
+semantic actions when we implement actual context-free
+grammars.
 
 \subsubsection*{Shorthand notation for parser combinators}
 
 Before we proceed, let us just explain the shorthand notation for
 parser combinators. Like for regular expressions, the shorthand notation
 will make our life much easier when writing actual parsers. We can define
-some implicits which allow us to write \pcode{p | q}, \pcode{p ~ q} and
-\pcode{p ==> f} as well as to use plain strings for specifying simple
-string parsers.
+some implicits which allow us to write
+
+\begin{center}
+\begin{tabular}{ll}  
+  \pcode{p | q} & alternative parser\\
+  \pcode{p ~ q} & sequence parser\\ 
+  \pcode{p ==> f} & semantic action parser
+\end{tabular}
+\end{center}
+
+\noindent
+as well as to use plain strings for specifying simple string parsers.
 
 The idea is that this shorthand notation allows us to easily translate
 context-free grammars into code. For example recall our context-free
 grammar for palindromes:
 
 \begin{plstx}[margin=3cm]
-: \meta{P} ::=  a\cdot \meta{P}\cdot a | b\cdot \meta{P}\cdot b | a | b | \epsilon\\
+: \meta{Pal} ::=  a\cdot \meta{Pal}\cdot a | b\cdot \meta{Pal}\cdot b | a | b | \epsilon\\
 \end{plstx}
 
 \noindent
@@ -592,7 +612,7 @@
 The beauty of parser combinators is the ease with which they can be
 implemented and how easy it is to translate context-free grammars into
 code (though the grammars need to be non-left-recursive). To
-demonstrate this recall the grammar for palindromes from above.
+demonstrate this consider again the grammar for palindromes from above.
 The first idea would be to translate it into the following code
 
 \begin{center}
@@ -607,11 +627,17 @@
 error. The reason is that the parsers \texttt{"a"}, \texttt{"b"} and
 \texttt{""} all produce strings as output type and therefore can be
 put into an alternative \texttt{...| "a" | "b" | ""}. But both
-\pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"} produce pairs of
-the form $(((\_, \_), \_), \_)$---that is how the sequence parser
-combinator nests results when \pcode{\~} is used between two
-components. The solution is to use a semantic action that ``flattens''
-these pairs and appends the corresponding strings, like
+sequence parsers \pcode{"a" ~ Pal ~ "a"} and \pcode{"b" ~ Pal ~ "b"}
+produce pairs of the form
+
+\begin{center}
+(((\texttt{a}-part, \texttt{Pal}-part), \texttt{a}-part), unprocessed part)
+\end{center}
+
+\noindent That is how the
+sequence parser combinator nests results when \pcode{\~} is used
+between two components. The solution is to use a semantic action that
+``flattens'' these pairs and appends the corresponding strings, like
 
 \begin{center}
 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
@@ -623,13 +649,139 @@
 \end{center}
 
 \noindent
-Now in all cases we have strings as output type for the parser variants.
-The semantic action
+How does this work? Well, recall again what the pairs look like for
+the parser \pcode{"a" ~ Pal ~ "a"}.  The pattern in the semantic
+action matches the nested pairs (the \texttt{x} with the
+\texttt{a}-part and so on).  Unfortunately when we have such nested
+pairs, Scala requires us to define the function using the
+\pcode{case}-syntax
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+{ case ((x, y), z) => ... }
+\end{lstlisting}
+\end{center}
+
+\noindent
+If we have more sequence parser combinators or have them differently
+nested, then the pattern in the semantic action needs to be adjusted
+accordingly (see the example below). The action we implement above
+concatenates all three strings, which means that after the semantic
+action is applied the output type of the parser is \texttt{String}.
+Therefore it fits with the alternative parsers
+\texttt{...| "a" | "b" | ""}.
+
+If we run the palindrome parser with \pcode{Pal.parse_all("abaaaba")} we
+obtain as result the set \pcode{Set(abaaaba)}, which indicates that the
+string is a palindrome (an empty set would mean something is wrong). But
+notice also which intermediate results are generated by
+\pcode{Pal.parse("abaaaba")}
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+Set((abaaaba,""), (aba,aaba), (a,baaaba), ("",abaaaba))
+\end{lstlisting}
+\end{center}
+
+\noindent
+That there is more than one output might be slightly unexpected, but
+it can be explained as follows: the pairs represent all possible
+(partial) parses of the string \pcode{"abaaaba"}. The first pair above
+corresponds to a complete parse (all input is consumed) and this is
+what \pcode{Pal.parse_all} returns. The second pair is a small
+``sub-palindrome'' that can also be parsed, but the parse fails on
+the rest \pcode{aaba}, which is therefore left as unprocessed. The
+third one is an attempt to parse the whole string with the
+single-character parser \pcode{a}. That of course only partially
+succeeds, leaving \pcode{"baaaba"} as the unprocessed
+part. Finally, since we allow the empty string to be a palindrome, we
+also obtain the last pair, where actually nothing is consumed from the
+input string. While all this works as intended, we need to be careful
+(especially when including the \pcode{""} parser in our
+grammar): if during parsing the set of parsing attempts gets too big,
+then the parsing process can become very slow as the potential
+candidates for applying rules can snowball.
 
 
+Also important to note is that we must define the
+\texttt{Pal}-parser as a \emph{lazy} value in Scala. Look again at the
+code: \texttt{Pal} occurs on the right-hand side of the definition. If we had
+just written
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+val Pal : Parser[String, String] =  ...rhs...
+\end{lstlisting}
+\end{center}
+
 \noindent
-Important to note is that we must define \texttt{Pal}-parser as a
-\emph{lazy} value.
+then Scala, before making this assignment to \texttt{Pal}, attempts to
+find out what the expression on the right-hand side evaluates to. This
+is straightforward in the case of simple expressions like \texttt{2 + 3}, but
+the expression above contains \texttt{Pal} on the right-hand
+side. Without \pcode{lazy} Scala would try to evaluate what \texttt{Pal}
+evaluates to and start a new recursion, which means it falls into an
+infinite loop. The definition of \texttt{Pal} is recursive and the
+\pcode{lazy} keyword prevents it from being fully evaluated. Therefore
+whenever we want to define a recursive parser we have to write
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+lazy val SomeParser : Parser[...,...] =  ...rhs...
+\end{lstlisting}
+\end{center}
+
+\noindent That was not necessary for our atomic parsers, like
+\texttt{RegexParser} or \texttt{CharParser}, because they are not recursive.
+Note that this is also the reason why we had to write
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+class AltParser[I, T]
+       (p: => Parser[I, T], 
+        q: => Parser[I, T]) extends Parser[I, T] {...}
+
+class SeqParser[I, T, S]
+       (p: => Parser[I, T], 
+        q: => Parser[I, S]) extends Parser[I, (T, S)] {...}
+\end{lstlisting}
+\end{center}
+
+\noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
+the argument types for \texttt{p} and \texttt{q} prevents Scala from
+evaluating the arguments eagerly. Normally, Scala would first evaluate
+what kind of parsers \texttt{p} and \texttt{q} are, and only then
+generate the alternative or sequence parser combinator, respectively.
+Since the arguments can be recursive parsers, such as \texttt{Pal},
+this would again lead to an infinite loop.
+
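+As a tiny, self-contained illustration of by-name arguments (a sketch
+that is independent of the parser code): an argument marked with
+\texttt{=>} is only evaluated when it is actually used inside a
+function, not at the call site.
+
+\begin{center}
+\begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
+def loop() : Int = loop()            // would never terminate if evaluated
+def ignore(arg: => Int) : Int = 42   // never uses arg
+
+ignore(loop())   // returns 42; loop() is never evaluated
+\end{lstlisting}
+\end{center}
+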
+As a final example in this section, let us consider the grammar for
+well-nested parentheses:
+
+\begin{plstx}[margin=3cm]
+: \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
+\end{plstx}
+
+\noindent
+Let us assume we want to not just recognise strings of
+well-nested parentheses but also transform round parentheses
+into curly braces. We can do this by using a semantic
+action:
+
+\begin{center}
+  \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
+    xleftmargin=0mm, numbers=none]
+lazy val P : Parser[String, String] = 
+  "(" ~ P ~ ")" ~ P ==> { case (((_,x),_),y) => "{" + x + "}" + y } | ""
+\end{lstlisting}
+\end{center}
+
+\noindent
+Here we define a semantic action that ignores the parentheses in the
+pairs, but inserts curly braces in the right places when
+assembling the new string on the right-hand side. If we run
+\pcode{P.parse_all("(((()()))())")} we obtain
+\texttt{Set(\{\{\{\{\}\{\}\}\}\{\}\})} as expected.
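+
+Conversely, for a string that is not well-nested, such as \pcode{"(()"},
+the call \pcode{P.parse_all("(()")} returns the empty set, because no
+complete parse of the input exists.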
+
+
 
 \subsubsection*{Implementing an Interpreter}