handouts/ho06.tex
changeset 593 bb24d4e207b6
parent 592 0d309fafa9f0
child 594 d40d7d7b85bc
equal deleted inserted replaced
592:0d309fafa9f0 593:bb24d4e207b6
    97 combinators out of smaller components following very closely the
    97 combinators out of smaller components following very closely the
    98 structure of a grammar. In order to implement this in a
    98 structure of a grammar. In order to implement this in a
    99 functional/object-oriented programming language, like Scala, we need
    99 functional/object-oriented programming language, like Scala, we need
   100 to specify an abstract class for parser combinators. In the abstract
   100 to specify an abstract class for parser combinators. In the abstract
   101 class we specify that \texttt{I} is the \emph{input type} of the
   101 class we specify that \texttt{I} is the \emph{input type} of the
   102 parser combinator and that \texttt{T} is the \emph{ouput type}.  This
   102 parser combinator and that \texttt{T} is the \emph{output type}.  This
   103 implies that the function \texttt{parse} takes an argument of type
   103 implies that the function \texttt{parse} takes an argument of type
   104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}.
   105 
   105 
   106 \begin{center}
   106 \begin{center}
   107 \begin{lstlisting}[language=Scala]
   107 \begin{lstlisting}[language=Scala]
   282 stands for the unprocessed, or leftover, parts of $p$. We want that
   282 stands for the unprocessed, or leftover, parts of $p$. We want that
   283 $q$ runs on all these unprocessed parts $u_1$. Therefore these
   283 $q$ runs on all these unprocessed parts $u_1$. Therefore these
   284 unprocessed parts are fed into the second parser $q$. The overall
   284 unprocessed parts are fed into the second parser $q$. The overall
   285 result of the sequence parser combinator is pairs of the form
   285 result of the sequence parser combinator is pairs of the form
   286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the
   287 unprocessed part of the sequqnce parser combinator is the unprocessed
   287 unprocessed part of the sequence parser combinator is the unprocessed
   288 part the second parser $q$ leaves as leftover. The parsed parts of the
   288 part the second parser $q$ leaves as leftover. The parsed parts of the
   289 component parsers are combined in a pair, namely
   289 component parsers are combined in a pair, namely
   290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to
   290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to
   291 know what $p$ and $q$ were able to parse. This behaviour can be
   291 know what $p$ and $q$ were able to parse. This behaviour can be
   292 implemented in Scala as follows:
   292 implemented in Scala as follows:
   570 \noindent
   570 \noindent
   571 The function in the semantic action converts a string into an
   571 The function in the semantic action converts a string into an
   572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
   572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))},
   573 but this time \texttt{123} is an \texttt{Int}. Let us come back to
   573 but this time \texttt{123} is an \texttt{Int}. Let us come back to
   574 semantic actions when we are going to implement actual context-free
   574 semantic actions when we are going to implement actual context-free
   575 gammars.
   575 grammars.
   576 
   576 
   577 \subsubsection*{Shorthand notation for parser combinators}
   577 \subsubsection*{Shorthand notation for parser combinators}
   578 
   578 
   579 Before we proceed, let us just explain the shorthand notation for
   579 Before we proceed, let us just explain the shorthand notation for
   580 parser combinators. Like for regular expressions, the shorthand notation
   580 parser combinators. Like for regular expressions, the shorthand notation
   669 means after the semantic action is applied the output type of the parser 
   669 means after the semantic action is applied the output type of the parser 
   670 is \texttt{String}, which means it fits with the alternative parsers
   670 is \texttt{String}, which means it fits with the alternative parsers
   671 \texttt{...| "a" | "b" | ""}.
   671 \texttt{...| "a" | "b" | ""}.
   672 
   672 
   673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
   673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain
   674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrom
   674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome
   675 (an empty set would mean something is wrong). But also notice what the
   675 (an empty set would mean something is wrong). But also notice what the
   676 intermediate results are generated by \pcode{Pal.parse("abaaaba")}
   676 intermediate results are generated by \pcode{Pal.parse("abaaaba")}
   677 
   677 
   678 \begin{center}
   678 \begin{center}
   679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   683 
   683 
   684 \noindent
   684 \noindent
   685 That there are more than one output might be slightly unexpected, but
   685 That there are more than one output might be slightly unexpected, but
   686 can be explained as follows: the pairs represent all possible
   686 can be explained as follows: the pairs represent all possible
   687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above
   687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above
   688 correesponds to a complete parse (all output is consumed) and this is
   688 corresponds to a complete parse (all output is consumed) and this is
   689 what \pcode{Pal.parse_all} returns. The second pair is a small
   689 what \pcode{Pal.parse_all} returns. The second pair is a small
   690 ``sub-palindrome'' that can also be parsed, but the parse fails with
   690 ``sub-palindrome'' that can also be parsed, but the parse fails with
   691 the rest \pcode{aaba}, which is therefore left as unprocessed. The
   691 the rest \pcode{aaba}, which is therefore left as unprocessed. The
   692 third one is an attempt to parse the whole string with the
   692 third one is an attempt to parse the whole string with the
   693 single-character parser \pcode{a}. That of course only partially
   693 single-character parser \pcode{a}. That of course only partially
   694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed
   694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed
   695 part. Finally, since we allow the empty string to be a palindrom we
   695 part. Finally, since we allow the empty string to be a palindrome we
   696 also obtain the last pair, where actually nothing is consumed from the
   696 also obtain the last pair, where actually nothing is consumed from the
   697 input string. While all this works as intended, we need to be careful
   697 input string. While all this works as intended, we need to be careful
   698 with this (especially with including the \pcode{""} parser in our
   698 with this (especially with including the \pcode{""} parser in our
   699 grammar): if during parsing the set of parsing attempts gets too big,
   699 grammar): if during parsing the set of parsing attempts gets too big,
   700 then the parsing process can become very slow as the potential
   700 then the parsing process can become very slow as the potential
   711 val Pal : Parser[String, String] =  ...rhs...
   711 val Pal : Parser[String, String] =  ...rhs...
   712 \end{lstlisting}
   712 \end{lstlisting}
   713 \end{center}
   713 \end{center}
   714 
   714 
   715 \noindent
   715 \noindent
   716 then Scala before making this assignemnt to \texttt{Pal} attempts to
   716 then Scala before making this assignment to \texttt{Pal} attempts to
   717 find out what the expression on the right-hand side evaluates to. This
   717 find out what the expression on the right-hand side evaluates to. This
   718 is straightforward in case of simple expressions \texttt{2 + 3}, but
   718 is straightforward in case of simple expressions \texttt{2 + 3}, but
   719 the expression above contains \texttt{Pal} in the right-hand
   719 the expression above contains \texttt{Pal} in the right-hand
   720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
   720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal}
   721 evaluates to and start a new recursion, which means it falls into an
   721 evaluates to and start a new recursion, which means it falls into an
   747 
   747 
   748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
   748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of
   749 the argument types for \texttt{p} and \texttt{q} prevent Scala from
   749 the argument types for \texttt{p} and \texttt{q} prevent Scala from
   750 evaluating the arguments. Normally, Scala would first evaluate what
   750 evaluating the arguments. Normally, Scala would first evaluate what
   751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate
   751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate
   752 the alternative parser combinator, repsectively sequence parser
   752 the alternative parser combinator, respectively sequence parser
   753 combinator. Since the argumants can be recursive parsers, such as
   753 combinator. Since the arguments can be recursive parsers, such as
   754 \texttt{Pal}, this would lead again to an infinite loop.
   754 \texttt{Pal}, this would lead again to an infinite loop.
   755 
   755 
   756 As a final example in this section, let us consider the grammar for
   756 As a final example in this section, let us consider the grammar for
   757 well-nested parentheses:
   757 well-nested parentheses:
   758 
   758 
   760 : \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
   760 : \meta{P} ::=  (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\
   761 \end{plstx}
   761 \end{plstx}
   762 
   762 
   763 \noindent
   763 \noindent
   764 Let us assume we want to not just recognise strings of
   764 Let us assume we want to not just recognise strings of
   765 well-nested parentheses but also transfrom round parentheses
   765 well-nested parentheses but also transform round parentheses
   766 into curly braces. We can do this by using a semantic
   766 into curly braces. We can do this by using a semantic
   767 action:
   767 action:
   768 
   768 
   769 \begin{center}
   769 \begin{center}
   770   \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
   770   \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily,
   783 
   783 
   784 
   784 
   785 
   785 
   786 \subsubsection*{Implementing an Interpreter}
   786 \subsubsection*{Implementing an Interpreter}
   787 
   787 
   788 The first step before implementing an interpreter for fullblown
   788 The first step before implementing an interpreter for a full-blown
   789 language is to implement a simple calculator for arithmetic
   789 language is to implement a simple calculator for arithmetic
   790 expressions. Suppose our arithmetic expressions are given by the
   790 expressions. Suppose our arithmetic expressions are given by the
   791 grammar:
   791 grammar:
   792 
   792 
   793 \begin{plstx}[margin=3cm,one per line]
   793 \begin{plstx}[margin=3cm,one per line]
   794 : \meta{E} ::= 
   794 : \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} 
   795    | \meta{E} \cdot + \cdot \meta{E} 
       
   796    | \meta{E} \cdot - \cdot \meta{E} 
   795    | \meta{E} \cdot - \cdot \meta{E} 
   797    | \meta{E} \cdot * \cdot \meta{E} 
   796    | \meta{E} \cdot * \cdot \meta{E} 
   798    | ( \cdot \meta{E} \cdot )
   797    | ( \cdot \meta{E} \cdot )
   799    | Number \\
   798    | Number \\
   800 \end{plstx}
   799 \end{plstx}
   801 
   800 
   802 \noindent
   801 \noindent
   803 Naturally we want to implement the grammar in such a way that we can
   802 Naturally we want to implement the grammar in such a way that we can
   804 calculate what the result of \texttt{4*2+3} is---we are interested in
   803 calculate what the result of, for example, \texttt{4*2+3} is---we are
   805 an \texttt{Int} rather than a string. This means every component
   804 interested in an \texttt{Int} rather than a string. This means every
   806 parser needs to have as output type \texttt{Int} and when we assemble
   805 component parser needs to have as output type \texttt{Int} and when we
   807 the intermediate results, strings like \texttt{"+"}, \texttt{"*"} and
   806 assemble the intermediate results, strings like \texttt{"+"},
   808 so on, need to be translated into the appropriate Scala operation.
   807 \texttt{"*"} and so on, need to be translated into the appropriate
   809 Being inspired by the parser for well-nested parentheses and ignoring
   808 Scala operation of adding, multiplying and so on.  Being inspired by
   810 the fact that we want $*$ to take precedence, we might write something
   809 the parser for well-nested parentheses above and ignoring the fact
   811 like
   810 that we want $*$ to take precedence over $+$ and $-$, we might want to
       
   811 write something like
   812 
   812 
   813 \begin{center}
   813 \begin{center}
   814 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   814 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   815 lazy val E: Parser[String, Int] = 
   815 lazy val E: Parser[String, Int] = 
   816   (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} |
   816   (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} |
   820    NumParserInt)
   820    NumParserInt)
   821 \end{lstlisting}
   821 \end{lstlisting}
   822 \end{center}
   822 \end{center}
   823 
   823 
   824 \noindent
   824 \noindent
   825 Consider again carfully how the semantic actions pick out the correct
   825 Consider again carefully how the semantic actions pick out the correct
   826 arguments. In case of plus, we need \texttt{x} and \texttt{z}, because
   826 arguments for the calculation. In case of plus, we need \texttt{x} and
   827 they correspond to the results of the parser \texttt{E}. We can just
   827 \texttt{z}, because they correspond to the results of the component
   828 add \texttt{x + z} in order to obtain \texttt{Int} because the output
   828 parser \texttt{E}. We can just add \texttt{x + z} in order to obtain
   829 type of \texttt{E} is \texttt{Int}.  Similarly with subtraction and
   829 an \texttt{Int} because the output type of \texttt{E} is
   830 multiplication. In contrast in the fourth clause we need to return
   830 \texttt{Int}.  Similarly with subtraction and multiplication. In
   831 \texttt{y}, because it is the result enclosed inside the parentheses.
   831 contrast in the fourth clause we need to return \texttt{y}, because it
       
   832 is the result enclosed inside the parentheses. The information about
       
   833 parentheses, roughly speaking, we just throw away.
   832 
   834 
   833 So far so good. The problem arises when we try to call \pcode{parse_all} with the
   835 So far so good. The problem arises when we try to call \pcode{parse_all} with the
   834 expression \texttt{"1+2+3"}. Lets try it
   836 expression \texttt{"1+2+3"}. Lets try it
   835 
   837 
   836 \begin{center}
   838 \begin{center}
   838 E.parse_all("1+2+3")
   840 E.parse_all("1+2+3")
   839 \end{lstlisting}
   841 \end{lstlisting}
   840 \end{center}
   842 \end{center}
   841 
   843 
   842 \noindent
   844 \noindent
   843 \ldots and we wait and wait and \ldots wait. What is the problem? Actually,
   845 \ldots and we wait and wait and \ldots still wait. What is the
   844 the parser just fell into an infinite loop. The reason is that the above
   846 problem? Actually, the parser just fell into an infinite loop! The
   845 grammar is left-recursive and recall that parser combinator cannot deal with
   847 reason is that the above grammar is left-recursive and recall that our
   846 such grammars. Luckily every left-recursive context-free grammar can be
   848 parser combinators cannot deal with such left-recursive
   847 transformed into a non-left-recursive grammars that still recognise the
   849 grammars. Fortunately, every left-recursive context-free grammar can be
   848 same strings. This allows us to design the following grammar
   850 transformed into a non-left-recursive grammars that still recognises
   849 
   851 the same strings. This allows us to design the following grammar
   850 
   852 
   851 
   853 \begin{plstx}[margin=3cm]
   852 
   854   : \meta{E} ::=  \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\
   853 
   855 : \meta{T} ::=  \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\
   854 
   856 : \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\
   855 
   857 \end{plstx}
       
   858 
       
   859 \noindent
       
   860 Recall what left-recursive means from Handout 5 and make sure you see
       
   861 why this grammar is \emph{non} left-recursive. This version of the grammar
       
   862 also deals with the fact that $*$ should have a higher precedence. This does not
       
   863 affect which strings this grammar can recognise, but in which order we are going
       
   864 to evaluate any arithmetic expression. We can translate this grammar into
       
   865 parsing combinators as follows:
       
   866 
       
   867 
       
   868 \begin{center}
       
   869 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   870 lazy val E: Parser[String, Int] = 
       
   871   (T ~ "+" ~ E) ==> { case ((x, y), z) => x + z } |
       
   872   (T ~ "-" ~ E) ==> { case ((x, y), z) => x - z } | T 
       
   873 lazy val T: Parser[String, Int] = 
       
   874   (F ~ "*" ~ T) ==> { case ((x, y), z) => x * z } | F
       
   875 lazy val F: Parser[String, Int] = 
       
   876   ("(" ~ E ~ ")") ==> { case ((x, y), z) => y } | NumParserInt
       
   877 \end{lstlisting}
       
   878 \end{center}
       
   879 
       
   880 \noindent
       
   881 Let us try out on some examples:
       
   882 
       
   883 \begin{center}
       
   884 \begin{tabular}{rcl}
       
   885   input strings & & output of \pcode{parse_all}\medskip\\
       
   886   \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\
       
   887   \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\
       
   888   \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\
       
   889   \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\
       
   890   \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\                      
       
   891 \end{tabular}
       
   892 \end{center}
       
   893 
       
   894 \noindent
       
   895 All examples should be quite self-explanatory: the last two do not
       
   896 produce any result because our parser did not define what to do in
       
   897 case of division (could be easily added) but also has no idea what to
       
   898 do with whitescpaces. To deal with them is the task of the lexer. We
       
   899 can deal with them inside the grammar, but that would render many
       
   900 grammars becoming unintelligible.
   856 
   901 
   857 \end{document}
   902 \end{document}
   858 
   903 
   859 %%% Local Variables: 
   904 %%% Local Variables: 
   860 %%% mode: latex  
   905 %%% mode: latex