handouts/ho05.tex
changeset 362 57ea439feaff
parent 360 c6c574d2ca0c
child 385 7f8516ff408d
equal deleted inserted replaced
361:9c7eb266594c 362:57ea439feaff
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 \section*{Handout 6 (Grammars \& Parser)}
     8 \section*{Handout 6 (Grammars \& Parser)}
     9 
     9 
    10 While regular expressions are very useful for lexing and for recognising
    10 While regular expressions are very useful for lexing and for
    11 many patterns in strings (like email addresses), they have their limitations. For
    11 recognising many patterns in strings (like email addresses),
    12 example there is no regular expression that can recognise the language 
    12 they have their limitations. For example there is no regular
    13 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
    13 expression that can recognise the language $a^nb^n$. Another
    14 expressions.  In languages like Lisp, which use parentheses rather
    14 example for which there exists no regular expression is the
    15 extensively, it might be of interest whether the following two expressions
    15 language of well-parenthesised expressions. In languages like
    16 are well-parenthesised (the left one is, the right one is not):
    16 Lisp, which use parentheses rather extensively, it might be of
       
    17 interest whether the following two expressions are
       
    18 well-parenthesised (the left one is, the right one is not):
    17 
    19 
    18 \begin{center}
    20 \begin{center}
    19 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    21 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    20 \end{center}
    22 \end{center}
    21 
    23 
    22 \noindent
    24 \noindent Not being able to solve such recognition problems is
    23 Not being able to solve such recognition problems is a serious limitation.
    25 a serious limitation. In order to solve such recognition
    24 In order to solve such recognition problems, we need more powerful 
    26 problems, we need more powerful techniques than regular
    25 techniques than regular expressions. We will in particular look at \emph{context-free
    27 expressions. We will in particular look at \emph{context-free
    26 languages}. They include the regular languages as the picture below shows:
    28 languages}. They include the regular languages as the picture
       
    29 below shows:
    27 
    30 
    28 
    31 
    29 \begin{center}
    32 \begin{center}
    30 \begin{tikzpicture}
    33 \begin{tikzpicture}
    31 [rect/.style={draw=black!50, 
    34 [rect/.style={draw=black!50, 
    38 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
    41 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
    39 \draw (0,-1.05) node [rect] {\small regular languages};
    42 \draw (0,-1.05) node [rect] {\small regular languages};
    40 \end{tikzpicture}
    43 \end{tikzpicture}
    41 \end{center}
    44 \end{center}
    42 
    45 
    43 \noindent
    46 \noindent Context-free languages play an important role in
    44 Context-free languages play an important role in `day-to-day' text processing and in
    47 `day-to-day' text processing and in programming languages.
    45 programming languages. Context-free languages are usually specified by grammars.
    48 Context-free languages are usually specified by grammars. For
    46 For example a grammar for well-parenthesised  expressions is
    49 example a grammar for well-parenthesised expressions is
    47 
    50 
    48 \begin{center}
    51 \begin{center}
    49 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
    52 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
    50 \end{center}
    53 \end{center}
    51  
    54  
    52 \noindent
    55 \noindent 
    53 In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
    56 or a grammar for recognising strings consisting of ones is
    54 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
    57 
    55 have the shape
    58 \begin{center}
       
    59 $O \;\;\rightarrow\;\; 1 \cdot  O \;|\; 1$
       
    60 \end{center}
       
    61 
       
    62 In general grammars consist of finitely many rules built up
       
    63 from \emph{terminal symbols} (usually lower-case letters) and
       
    64 \emph{non-terminal symbols} (upper-case letters). Rules have
       
    65 the shape
    56 
    66 
    57 \begin{center}
    67 \begin{center}
    58 $NT \;\;\rightarrow\;\; \textit{rhs}$
    68 $NT \;\;\rightarrow\;\; \textit{rhs}$
    59 \end{center}
    69 \end{center}
    60  
    70  
    61 \noindent
    71 \noindent where on the left-hand side is a single non-terminal
    62 where on the left-hand side is a single non-terminal and on the right a string consisting
    72 and on the right a string consisting of both terminals and
    63 of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
    73 non-terminals including the $\epsilon$-symbol for indicating
    64 empty string. We use the convention  to separate components on
    74 the empty string. We use the convention to separate components
    65 the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised  expressions.
    75 on the right hand-side by using the $\cdot$ symbol, as in the
    66 We also use the convention to use $|$ as a shorthand notation for several rules. For example
    76 grammar for well-parenthesised expressions. We also use the
       
    77 convention to use $|$ as a shorthand notation for several
       
    78 rules. For example
    67 
    79 
    68 \begin{center}
    80 \begin{center}
    69 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
    81 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
    70 \end{center}
    82 \end{center}
    71 
    83 
    72 \noindent
    84 \noindent means that the non-terminal $NT$ can be replaced by
    73 means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
    85 either $\textit{rhs}_1$ or $\textit{rhs}_2$. If there are more
    74 If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
    86 than one non-terminal on the left-hand side of the rules, then
    75 what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
    87 we need to indicate what is the \emph{starting} symbol of the
       
    88 grammar. For example the grammar for arithmetic expressions
    76 can be given as follows
    89 can be given as follows
    77 
    90 
    78 \begin{center}
    91 \begin{center}
    79 \begin{tabular}{lcl}
    92 \begin{tabular}{lcl@{\hspace{2cm}}l}
    80 $E$ & $\rightarrow$ &  $N$ \\
    93 $E$ & $\rightarrow$ &  $N$                 & (1)\\
    81 $E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
    94 $E$ & $\rightarrow$ &  $E \cdot + \cdot E$ & (2)\\
    82 $E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
    95 $E$ & $\rightarrow$ &  $E \cdot - \cdot E$ & (3)\\
    83 $E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
    96 $E$ & $\rightarrow$ &  $E \cdot * \cdot E$ & (4)\\
    84 $E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
    97 $E$ & $\rightarrow$ &  $( \cdot E \cdot )$ & (5)\\
    85 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ 
    98 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ & (6\ldots)
    86 \end{tabular}
    99 \end{tabular}
    87 \end{center}
   100 \end{center}
    88 
   101 
    89 \noindent
   102 \noindent where $E$ is the starting symbol. A
    90 where $E$ is the starting symbol. A \emph{derivation} for a grammar
   103 \emph{derivation} for a grammar starts with the starting
    91 starts with the staring symbol of the grammar and in each step replaces one
   104 symbol of the grammar and in each step replaces one
    92 non-terminal by a right-hand side of a rule. A derivation ends with a string
   105 non-terminal by a right-hand side of a rule. A derivation ends
    93 in which only terminal symbols are left. For example a derivation for the
   106 with a string in which only terminal symbols are left. For
    94 string $(1 + 2) + 3$ is as follows:
   107 example a derivation for the string $(1 + 2) + 3$ is as
    95 
   108 follows:
    96 \begin{center}
   109 
    97 \begin{tabular}{lll}
   110 \begin{center}
    98 $E$ & $\rightarrow$ & $E+E$\\
   111 \begin{tabular}{lll@{\hspace{2cm}}l}
    99        & $\rightarrow$ & $(E)+E$\\
   112 $E$ & $\rightarrow$ & $E+E$          & by (2)\\
   100        & $\rightarrow$ & $(E+E)+E$\\
   113        & $\rightarrow$ & $(E)+E$     & by (5)\\
   101        & $\rightarrow$ & $(E+E)+N$\\
   114        & $\rightarrow$ & $(E+E)+E$   & by (2)\\
   102        & $\rightarrow$ & $(E+E)+3$\\
   115        & $\rightarrow$ & $(E+E)+N$   & by (1)\\
   103        & $\rightarrow$ & $(N+E)+3$\\	
   116        & $\rightarrow$ & $(E+E)+3$   & by (6\dots)\\
   104        & $\rightarrow^+$ & $(1+2)+3$\\
   117        & $\rightarrow$ & $(N+E)+3$   & by (1)\\
       
   118        & $\rightarrow^+$ & $(1+2)+3$ & by (1, 6\ldots)\\
   105 \end{tabular} 
   119 \end{tabular} 
   106 \end{center}
   120 \end{center}
   107 
   121 
   108 \noindent
   122 \noindent where on the right it is indicated which 
   109 The \emph{language} of a context-free grammar $G$ with start symbol $S$ 
   123 grammar rule has been applied. In the last step we
   110 is defined as the set of strings derivable by a derivation, that is
   124 merged several steps into one.
       
   125 
       
   126 The \emph{language} of a context-free grammar $G$
       
   127 with start symbol $S$ is defined as the set of strings
       
   128 derivable by a derivation, that is
   111 
   129 
   112 \begin{center}
   130 \begin{center}
   113 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
   131 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
   114 \end{center}
   132 \end{center}
   115 
   133 
   138         child {node {$N$} child {node {$4$}}}
   156         child {node {$N$} child {node {$4$}}}
   139      };
   157      };
   140 \end{tikzpicture}
   158 \end{tikzpicture}
   141 \end{center}
   159 \end{center}
   142 
   160 
   143 \noindent
   161 \noindent We are often interested in these parse-trees since
   144 We are often interested in these parse-trees since they encode the structure of
   162 they encode the structure of how a string is derived by a
   145 how a string is derived by a grammar. Before we come to the problem of constructing
   163 grammar. Before we come to the problem of constructing such
   146 such parse-trees, we need to consider the following two properties of grammars.
   164 parse-trees, we need to consider the following two properties
   147 A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
   165 of grammars. A grammar is \emph{left-recursive} if there is a
   148 $NT$ which leads to a string which again starts with $NT$. This means a derivation of the
   166 derivation starting from a non-terminal, say $NT$ which leads
   149 form.
   167 to a string which again starts with $NT$. This means a
       
   168 derivation of the form.
   150 
   169 
   151 \begin{center}
   170 \begin{center}
   152 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
   171 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
   153 \end{center}
   172 \end{center}
   154 
   173 
   155 \noindent
   174 \noindent It can be easily seen that the grammar above for
   156 It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
   175 arithmetic expressions is left-recursive: for example the
   157 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
   176 rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow
   158 show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
   177 N\cdot N$ show that this grammar is left-recursive. But note
   159 grammars. Fortunately every left-recursive grammar can be transformed into one that is
   178 that left-recursiveness can involve more than one step in the
   160 not left-recursive, although this transformation might make the grammar less human-readable.
   179 derivation. The problem with left-recursive grammars is that
   161 For example if we want to give a non-left-recursive grammar for numbers we might
   180 some algorithms cannot cope with them: they fall into a loop.
   162 specify
   181 Fortunately every left-recursive grammar can be transformed
       
   182 into one that is not left-recursive, although this
       
   183 transformation might make the grammar less ``human-readable''.
       
   184 For example if we want to give a non-left-recursive grammar
       
   185 for numbers we might specify
   163 
   186 
   164 \begin{center}
   187 \begin{center}
   165 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
   188 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
   166 \end{center}
   189 \end{center}
   167 
   190 
   168 \noindent
   191 \noindent Using this grammar we can still derive every number
   169 Using this grammar we can still derive every number string, but we will never be able 
   192 string, but we will never be able to derive a string of the
   170 to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.
   193 form $N \to \ldots \to N \cdot \ldots$.
   171 
   194 
   172 The other property we have to watch out for is when a grammar is
   195 The other property we have to watch out for is when a grammar
   173 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
   196 is \emph{ambiguous}. A grammar is said to be ambiguous if
   174 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
   197 there are two parse-trees for one string. Again the grammar
   175 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
   198 for arithmetic expressions shown above is ambiguous. While the
   176 general. For example there are two parse
   199 shown parse tree for the string $(1 + 23) + 4$ is unique, this
       
   200 is not the case in general. For example there are two parse
   177 trees for the string $1 + 2 + 3$, namely
   201 trees for the string $1 + 2 + 3$, namely
   178 
   202 
   179 \begin{center}
   203 \begin{center}
   180 \begin{tabular}{c@{\hspace{10mm}}c}
   204 \begin{tabular}{c@{\hspace{10mm}}c}
   181 \begin{tikzpicture}[level distance=8mm, black]
   205 \begin{tikzpicture}[level distance=8mm, black]
   202     ;
   226     ;
   203 \end{tikzpicture}
   227 \end{tikzpicture}
   204 \end{tabular} 
   228 \end{tabular} 
   205 \end{center}
   229 \end{center}
   206 
   230 
   207 \noindent
   231 \noindent In particular in programming languages we will try
   208 In particular in programming languages we will try to avoid ambiguous
   232 to avoid ambiguous grammars because two different parse-trees
   209 grammars because two different parse-trees for a string mean a program can
   233 for a string mean a program can be interpreted in two
   210 be interpreted in two different ways. In such cases we have to somehow make sure
   234 different ways. In such cases we have to somehow make sure the
   211 the two different ways do not matter, or disambiguate the grammar in
   235 two different ways do not matter, or disambiguate the grammar
   212 some other way (for example making the $+$ left-associative). Unfortunately already 
   236 in some other way (for example making the $+$
   213 the problem of deciding whether a grammar
   237 left-associative). Unfortunately already the problem of
   214 is ambiguous or not is in general undecidable. 
   238 deciding whether a grammar is ambiguous or not is in general
   215 
   239 undecidable. But in simple instance (the ones we deal in this
   216 Let us now turn to the problem of generating a parse-tree for a grammar and string.
   240 module) one can usually see when a grammar is ambiguous.
   217 In what follows we explain \emph{parser combinators}, because they are easy
   241 
   218 to implement and closely resemble grammar rules. Imagine that a grammar
   242 Let us now turn to the problem of generating a parse-tree for
   219 describes the strings of natural numbers, such as the grammar $N$ shown above.
   243 a grammar and string. In what follows we explain \emph{parser
   220 For all such strings we want to generate the parse-trees or later on we actually 
   244 combinators}, because they are easy to implement and closely
   221 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
   245 resemble grammar rules. Imagine that a grammar describes the
   222 these strings. In Scala the parser combinators will be functions of type
   246 strings of natural numbers, such as the grammar $N$ shown
       
   247 above. For all such strings we want to generate the
       
   248 parse-trees or later on we actually want to extract the
       
   249 meaning of these strings, that is the concrete integers
       
   250 ``behind'' these strings. In Scala the parser combinators will
       
   251 be functions of type
   223 
   252 
   224 \begin{center}
   253 \begin{center}
   225 \texttt{I $\Rightarrow$ Set[(T, I)]}
   254 \texttt{I $\Rightarrow$ Set[(T, I)]}
   226 \end{center}
   255 \end{center}
   227 
   256 
   228 \noindent 
   257 \noindent that is they take as input something of type
   229 that is they take as input something of type \texttt{I}, typically a list of tokens or a string,
   258 \texttt{I}, typically a list of tokens or a string, and return
   230 and return a set of pairs. The first component of these pairs corresponds to what the
   259 a set of pairs. The first component of these pairs corresponds
   231 parser combinator was able to process from the input and the second is the unprocessed 
   260 to what the parser combinator was able to process from the
   232 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
   261 input and the second is the unprocessed part of the input. As
   233 with the idea that there are potentially several ways how to interpret the input. As a concrete
   262 we shall see shortly, a parser combinator might return more
   234 example, consider the case where the input is of type string, say the string
   263 than one such pair, with the idea that there are potentially
       
   264 several ways how to interpret the input. As a concrete
       
   265 example, consider the case where the input is of type string,
       
   266 say the string
   235 
   267 
   236 \begin{center}
   268 \begin{center}
   237 \tt\Grid{iffoo\VS testbar}
   269 \tt\Grid{iffoo\VS testbar}
   238 \end{center}
   270 \end{center}
   239 
   271 
   240 \noindent
   272 \noindent We might have a parser combinator which tries to
   241 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
   273 interpret this string as a keyword (\texttt{if}) or an
   242 an identifier (\texttt{iffoo}). Then the output will be the set
   274 identifier (\texttt{iffoo}). Then the output will be the set
   243 
   275 
   244 \begin{center}
   276 \begin{center}
   245 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
   277 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
   246            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
   278            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
   247 \end{center}
   279 \end{center}
   248 
   280 
   249 \noindent
   281 \noindent where the first pair means the parser could
   250 where the first pair means the parser could recognise \texttt{if} from the input and leaves 
   282 recognise \texttt{if} from the input and leaves the rest as
   251 the rest as `unprocessed' as the second component of the pair; in the other case
   283 `unprocessed' as the second component of the pair; in the
   252 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
   284 other case it could recognise \texttt{iffoo} and leaves
   253 cannot recognise anything from the input then parser combinators just return the empty 
   285 \texttt{\VS testbar} as unprocessed. If the parser cannot
   254 set $\varnothing$. This will indicate something ``went wrong''.
   286 recognise anything from the input then parser combinators just
       
   287 return the empty set $\{\}$. This will indicate
       
   288 something ``went wrong''.
   255 
   289 
   256 The main attraction is that we can easily build parser combinators out of smaller components
   290 The main attraction is that we can easily build parser combinators out of smaller components
   257 following very closely the structure of a grammar. In order to implement this in an object
   291 following very closely the structure of a grammar. In order to implement this in an object
   258 oriented programming language, like Scala, we need to specify an abstract class for parser 
   292 oriented programming language, like Scala, we need to specify an abstract class for parser 
   259 combinators. This abstract class requires the implementation of the function
   293 combinators. This abstract class requires the implementation of the function