97 combinators out of smaller components following very closely the |
97 combinators out of smaller components following very closely the |
98 structure of a grammar. In order to implement this in a |
98 structure of a grammar. In order to implement this in a |
99 functional/object-oriented programming language, like Scala, we need |
99 functional/object-oriented programming language, like Scala, we need |
100 to specify an abstract class for parser combinators. In the abstract |
100 to specify an abstract class for parser combinators. In the abstract |
101 class we specify that \texttt{I} is the \emph{input type} of the |
101 class we specify that \texttt{I} is the \emph{input type} of the |
102 parser combinator and that \texttt{T} is the \emph{ouput type}. This |
102 parser combinator and that \texttt{T} is the \emph{output type}. This |
103 implies that the function \texttt{parse} takes an argument of type |
103 implies that the function \texttt{parse} takes an argument of type |
104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
104 \texttt{I} and returns a set of type \mbox{\texttt{Set[(T, I)]}}. |
105 |
105 |
106 \begin{center} |
106 \begin{center} |
107 \begin{lstlisting}[language=Scala] |
107 \begin{lstlisting}[language=Scala] |
282 stands for the unprocessed, or leftover, parts of $p$. We want that |
282 stands for the unprocessed, or leftover, parts of $p$. We want that |
283 $q$ runs on all these unprocessed parts $u_1$. Therefore these |
283 $q$ runs on all these unprocessed parts $u_1$. Therefore these |
284 unprocessed parts are fed into the second parser $q$. The overall |
284 unprocessed parts are fed into the second parser $q$. The overall |
285 result of the sequence parser combinator is pairs of the form |
285 result of the sequence parser combinator is pairs of the form |
286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
286 $((\textit{output}_1, \textit{output}_2), u_2)$. This means the |
287 unprocessed part of the sequqnce parser combinator is the unprocessed |
287 unprocessed part of the sequence parser combinator is the unprocessed |
288 part the second parser $q$ leaves as leftover. The parsed parts of the |
288 part the second parser $q$ leaves as leftover. The parsed parts of the |
289 component parsers are combined in a pair, namely |
289 component parsers are combined in a pair, namely |
290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to |
290 $(\textit{output}_1, \textit{output}_2)$. The reason is we want to |
291 know what $p$ and $q$ were able to parse. This behaviour can be |
291 know what $p$ and $q$ were able to parse. This behaviour can be |
292 implemented in Scala as follows: |
292 implemented in Scala as follows: |
570 \noindent |
570 \noindent |
571 The function in the semantic action converts a string into an |
571 The function in the semantic action converts a string into an |
572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))}, |
572 \texttt{Int}. Now \texttt{parse} generates \texttt{Set((123,abc))}, |
573 but this time \texttt{123} is an \texttt{Int}. Let us come back to |
573 but this time \texttt{123} is an \texttt{Int}. Let us come back to |
574 semantic actions when we are going to implement actual context-free |
574 semantic actions when we are going to implement actual context-free |
575 gammars. |
575 grammars. |
576 |
576 |
577 \subsubsection*{Shorthand notation for parser combinators} |
577 \subsubsection*{Shorthand notation for parser combinators} |
578 |
578 |
579 Before we proceed, let us just explain the shorthand notation for |
579 Before we proceed, let us just explain the shorthand notation for |
580 parser combinators. Like for regular expressions, the shorthand notation |
580 parser combinators. Like for regular expressions, the shorthand notation |
669 means after the semantic action is applied the output type of the parser |
669 means after the semantic action is applied the output type of the parser |
670 is \texttt{String}, which means it fits with the alternative parsers |
670 is \texttt{String}, which means it fits with the alternative parsers |
671 \texttt{...| "a" | "b" | ""}. |
671 \texttt{...| "a" | "b" | ""}. |
672 |
672 |
673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain |
673 If we run the parser above with \pcode{Pal.parse_all("abaaaba")} we obtain |
674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrom |
674 as result the \pcode{Set(abaaaba)}, which indicates that the string is a palindrome |
675 (an empty set would mean something is wrong). But also notice what the |
675 (an empty set would mean something is wrong). But also notice what the |
676 intermediate results are generated by \pcode{Pal.parse("abaaaba")} |
676 intermediate results are generated by \pcode{Pal.parse("abaaaba")} |
677 |
677 |
678 \begin{center} |
678 \begin{center} |
679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
679 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
683 |
683 |
684 \noindent |
684 \noindent |
685 That there are more than one output might be slightly unexpected, but |
685 That there are more than one output might be slightly unexpected, but |
686 can be explained as follows: the pairs represent all possible |
686 can be explained as follows: the pairs represent all possible |
687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above |
687 (partial) parses of the string \pcode{"abaaaba"}. The first pair above |
688 correesponds to a complete parse (all output is consumed) and this is |
688 corresponds to a complete parse (all output is consumed) and this is |
689 what \pcode{Pal.parse_all} returns. The second pair is a small |
689 what \pcode{Pal.parse_all} returns. The second pair is a small |
690 ``sub-palindrome'' that can also be parsed, but the parse fails with |
690 ``sub-palindrome'' that can also be parsed, but the parse fails with |
691 the rest \pcode{aaba}, which is therefore left as unprocessed. The |
691 the rest \pcode{aaba}, which is therefore left as unprocessed. The |
692 third one is an attempt to parse the whole string with the |
692 third one is an attempt to parse the whole string with the |
693 single-character parser \pcode{a}. That of course only partially |
693 single-character parser \pcode{a}. That of course only partially |
694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed |
694 succeeds, by leaving \pcode{"baaaba"} as the unprocessed |
695 part. Finally, since we allow the empty string to be a palindrom we |
695 part. Finally, since we allow the empty string to be a palindrome we |
696 also obtain the last pair, where actually nothing is consumed from the |
696 also obtain the last pair, where actually nothing is consumed from the |
697 input string. While all this works as intended, we need to be careful |
697 input string. While all this works as intended, we need to be careful |
698 with this (especially with including the \pcode{""} parser in our |
698 with this (especially with including the \pcode{""} parser in our |
699 grammar): if during parsing the set of parsing attempts gets too big, |
699 grammar): if during parsing the set of parsing attempts gets too big, |
700 then the parsing process can become very slow as the potential |
700 then the parsing process can become very slow as the potential |
711 val Pal : Parser[String, String] = ...rhs... |
711 val Pal : Parser[String, String] = ...rhs... |
712 \end{lstlisting} |
712 \end{lstlisting} |
713 \end{center} |
713 \end{center} |
714 |
714 |
715 \noindent |
715 \noindent |
716 then Scala before making this assignemnt to \texttt{Pal} attempts to |
716 then Scala before making this assignment to \texttt{Pal} attempts to |
717 find out what the expression on the right-hand side evaluates to. This |
717 find out what the expression on the right-hand side evaluates to. This |
718 is straightforward in case of simple expressions \texttt{2 + 3}, but |
718 is straightforward in case of simple expressions \texttt{2 + 3}, but |
719 the expression above contains \texttt{Pal} in the right-hand |
719 the expression above contains \texttt{Pal} in the right-hand |
720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal} |
720 side. Without \pcode{lazy} it would try to evaluate what \texttt{Pal} |
721 evaluates to and start a new recursion, which means it falls into an |
721 evaluates to and start a new recursion, which means it falls into an |
747 |
747 |
748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of |
748 \noindent where the \texttt{\textbf{\textcolor{codepurple}{=>}}} in front of |
749 the argument types for \texttt{p} and \texttt{q} prevent Scala from |
749 the argument types for \texttt{p} and \texttt{q} prevent Scala from |
750 evaluating the arguments. Normally, Scala would first evaluate what |
750 evaluating the arguments. Normally, Scala would first evaluate what |
751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate |
751 kind of parsers \texttt{p} and \texttt{q} are, and only then generate |
752 the alternative parser combinator, repsectively sequence parser |
752 the alternative parser combinator, respectively sequence parser |
753 combinator. Since the argumants can be recursive parsers, such as |
753 combinator. Since the arguments can be recursive parsers, such as |
754 \texttt{Pal}, this would lead again to an infinite loop. |
754 \texttt{Pal}, this would lead again to an infinite loop. |
755 |
755 |
756 As a final example in this section, let us consider the grammar for |
756 As a final example in this section, let us consider the grammar for |
757 well-nested parentheses: |
757 well-nested parentheses: |
758 |
758 |
760 : \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\ |
760 : \meta{P} ::= (\cdot \meta{P}\cdot ) \cdot \meta{P} | \epsilon\\ |
761 \end{plstx} |
761 \end{plstx} |
762 |
762 |
763 \noindent |
763 \noindent |
764 Let us assume we want to not just recognise strings of |
764 Let us assume we want to not just recognise strings of |
765 well-nested parentheses but also transfrom round parentheses |
765 well-nested parentheses but also transform round parentheses |
766 into curly braces. We can do this by using a semantic |
766 into curly braces. We can do this by using a semantic |
767 action: |
767 action: |
768 |
768 |
769 \begin{center} |
769 \begin{center} |
770 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, |
770 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, |
783 |
783 |
784 |
784 |
785 |
785 |
786 \subsubsection*{Implementing an Interpreter} |
786 \subsubsection*{Implementing an Interpreter} |
787 |
787 |
788 The first step before implementing an interpreter for fullblown |
788 The first step before implementing an interpreter for a full-blown |
789 language is to implement a simple calculator for arithmetic |
789 language is to implement a simple calculator for arithmetic |
790 expressions. Suppose our arithmetic expressions are given by the |
790 expressions. Suppose our arithmetic expressions are given by the |
791 grammar: |
791 grammar: |
792 |
792 |
793 \begin{plstx}[margin=3cm,one per line] |
793 \begin{plstx}[margin=3cm,one per line] |
794 : \meta{E} ::= |
794 : \meta{E} ::= \meta{E} \cdot + \cdot \meta{E} |
795 | \meta{E} \cdot + \cdot \meta{E} |
|
796 | \meta{E} \cdot - \cdot \meta{E} |
795 | \meta{E} \cdot - \cdot \meta{E} |
797 | \meta{E} \cdot * \cdot \meta{E} |
796 | \meta{E} \cdot * \cdot \meta{E} |
798 | ( \cdot \meta{E} \cdot ) |
797 | ( \cdot \meta{E} \cdot ) |
799 | Number \\ |
798 | Number \\ |
800 \end{plstx} |
799 \end{plstx} |
801 |
800 |
802 \noindent |
801 \noindent |
803 Naturally we want to implement the grammar in such a way that we can |
802 Naturally we want to implement the grammar in such a way that we can |
804 calculate what the result of \texttt{4*2+3} is---we are interested in |
803 calculate what the result of, for example, \texttt{4*2+3} is---we are |
805 an \texttt{Int} rather than a string. This means every component |
804 interested in an \texttt{Int} rather than a string. This means every |
806 parser needs to have as output type \texttt{Int} and when we assemble |
805 component parser needs to have as output type \texttt{Int} and when we |
807 the intermediate results, strings like \texttt{"+"}, \texttt{"*"} and |
806 assemble the intermediate results, strings like \texttt{"+"}, |
808 so on, need to be translated into the appropriate Scala operation. |
807 \texttt{"*"} and so on, need to be translated into the appropriate |
809 Being inspired by the parser for well-nested parentheses and ignoring |
808 Scala operation of adding, multiplying and so on. Being inspired by |
810 the fact that we want $*$ to take precedence, we might write something |
809 the parser for well-nested parentheses above and ignoring the fact |
811 like |
810 that we want $*$ to take precedence over $+$ and $-$, we might want to |
|
811 write something like |
812 |
812 |
813 \begin{center} |
813 \begin{center} |
814 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
814 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
815 lazy val E: Parser[String, Int] = |
815 lazy val E: Parser[String, Int] = |
816 (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} | |
816 (E ~ "+" ~ E ==> { case ((x, y), z) => x + z} | |
820 NumParserInt) |
820 NumParserInt) |
821 \end{lstlisting} |
821 \end{lstlisting} |
822 \end{center} |
822 \end{center} |
823 |
823 |
824 \noindent |
824 \noindent |
825 Consider again carfully how the semantic actions pick out the correct |
825 Consider again carefully how the semantic actions pick out the correct |
826 arguments. In case of plus, we need \texttt{x} and \texttt{z}, because |
826 arguments for the calculation. In case of plus, we need \texttt{x} and |
827 they correspond to the results of the parser \texttt{E}. We can just |
827 \texttt{z}, because they correspond to the results of the component |
828 add \texttt{x + z} in order to obtain \texttt{Int} because the output |
828 parser \texttt{E}. We can just add \texttt{x + z} in order to obtain |
829 type of \texttt{E} is \texttt{Int}. Similarly with subtraction and |
829 an \texttt{Int} because the output type of \texttt{E} is |
830 multiplication. In contrast in the fourth clause we need to return |
830 \texttt{Int}. Similarly with subtraction and multiplication. In |
831 \texttt{y}, because it is the result enclosed inside the parentheses. |
831 contrast in the fourth clause we need to return \texttt{y}, because it |
|
832 is the result enclosed inside the parentheses. The information about |
|
833 parentheses, roughly speaking, we just throw away. |
832 |
834 |
833 So far so good. The problem arises when we try to call \pcode{parse_all} with the |
835 So far so good. The problem arises when we try to call \pcode{parse_all} with the |
834 expression \texttt{"1+2+3"}. Lets try it |
836 expression \texttt{"1+2+3"}. Lets try it |
835 |
837 |
836 \begin{center} |
838 \begin{center} |
838 E.parse_all("1+2+3") |
840 E.parse_all("1+2+3") |
839 \end{lstlisting} |
841 \end{lstlisting} |
840 \end{center} |
842 \end{center} |
841 |
843 |
842 \noindent |
844 \noindent |
843 \ldots and we wait and wait and \ldots wait. What is the problem? Actually, |
845 \ldots and we wait and wait and \ldots still wait. What is the |
844 the parser just fell into an infinite loop. The reason is that the above |
846 problem? Actually, the parser just fell into an infinite loop! The |
845 grammar is left-recursive and recall that parser combinator cannot deal with |
847 reason is that the above grammar is left-recursive and recall that our |
846 such grammars. Luckily every left-recursive context-free grammar can be |
848 parser combinators cannot deal with such left-recursive |
847 transformed into a non-left-recursive grammars that still recognise the |
849 grammars. Fortunately, every left-recursive context-free grammar can be |
848 same strings. This allows us to design the following grammar |
850 transformed into a non-left-recursive grammars that still recognises |
849 |
851 the same strings. This allows us to design the following grammar |
850 |
852 |
851 |
853 \begin{plstx}[margin=3cm] |
852 |
854 : \meta{E} ::= \meta{T} \cdot + \cdot \meta{E} | \meta{T} \cdot - \cdot \meta{E} | \meta{T}\\ |
853 |
855 : \meta{T} ::= \meta{F} \cdot * \cdot \meta{T} | \meta{F}\\ |
854 |
856 : \meta{F} ::= ( \cdot \meta{E} \cdot ) | Number\\ |
855 |
857 \end{plstx} |
|
858 |
|
859 \noindent |
|
860 Recall what left-recursive means from Handout 5 and make sure you see |
|
861 why this grammar is \emph{non} left-recursive. This version of the grammar |
|
862 also deals with the fact that $*$ should have a higher precedence. This does not |
|
863 affect which strings this grammar can recognise, but in which order we are going |
|
864 to evaluate any arithmetic expression. We can translate this grammar into |
|
865 parsing combinators as follows: |
|
866 |
|
867 |
|
868 \begin{center} |
|
869 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none] |
|
870 lazy val E: Parser[String, Int] = |
|
871 (T ~ "+" ~ E) ==> { case ((x, y), z) => x + z } | |
|
872 (T ~ "-" ~ E) ==> { case ((x, y), z) => x - z } | T |
|
873 lazy val T: Parser[String, Int] = |
|
874 (F ~ "*" ~ T) ==> { case ((x, y), z) => x * z } | F |
|
875 lazy val F: Parser[String, Int] = |
|
876 ("(" ~ E ~ ")") ==> { case ((x, y), z) => y } | NumParserInt |
|
877 \end{lstlisting} |
|
878 \end{center} |
|
879 |
|
880 \noindent |
|
881 Let us try out on some examples: |
|
882 |
|
883 \begin{center} |
|
884 \begin{tabular}{rcl} |
|
885 input strings & & output of \pcode{parse_all}\medskip\\ |
|
886 \texttt{\Grid{1+2+3}} & $\rightarrow$ & \texttt{Set(6)}\\ |
|
887 \texttt{\Grid{4*2+3}} & $\rightarrow$ & \texttt{Set(11)}\\ |
|
888 \texttt{\Grid{4*(2+3)}} & $\rightarrow$ & \texttt{Set(20)}\\ |
|
889 \texttt{\Grid{4/2+3}} & $\rightarrow$ & \texttt{Set()}\\ |
|
890 \texttt{\Grid{1\VS +\VS 2\VS +\VS 3}} & $\rightarrow$ & \texttt{Set()}\\ |
|
891 \end{tabular} |
|
892 \end{center} |
|
893 |
|
894 \noindent |
|
895 All examples should be quite self-explanatory: the last two do not |
|
896 produce any result because our parser did not define what to do in |
|
897 case of division (could be easily added) but also has no idea what to |
|
898 do with whitescpaces. To deal with them is the task of the lexer. We |
|
899 can deal with them inside the grammar, but that would render many |
|
900 grammars becoming unintelligible. |
856 |
901 |
857 \end{document} |
902 \end{document} |
858 |
903 |
859 %%% Local Variables: |
904 %%% Local Variables: |
860 %%% mode: latex |
905 %%% mode: latex |