handouts/ho05.tex
changeset 358 b3129cff41e9
parent 297 5c51839c88fd
child 360 c6c574d2ca0c
equal deleted inserted replaced
357:603e171a7b48 358:b3129cff41e9
     1 \documentclass{article}
     1 \documentclass{article}
     2 \usepackage{../style}
     2 \usepackage{../style}
     3 \usepackage{../langs}
     3 \usepackage{../langs}
     4 \usepackage{../graphics}
     4 
     5 
     5 
     6 \begin{document}
     6 \begin{document}
     7 
     7 
     8 \section*{Handout 5 (Lexing)}
     8 \section*{Handout 6 (Parser Combinators)}
     9 
     9 
    10 Whenever you want to design a new programming language or
    10 While regular expressions are very useful for lexing and for recognising
    11 implement a compiler for an existing language, the first task
    11 many patterns in strings (like email addresses), they have their limitations. For
    12 is to fix the basic ``words'' of the language. For example
    12 example there is no regular expression that can recognise the language 
    13 what are the keywords, or reserved words, of the language,
    13 $a^nb^n$. Another example for which there exists no regular expression is the language of well-parenthesised 
    14 what are permitted identifiers, numbers, expressions and so
    14 expressions.  In languages like Lisp, which use parentheses rather
    15 on. One convenient way to do this is, of course, by using
    15 extensively, it might be of interest whether the following two expressions
    16 regular expressions. 
    16 are well-parenthesised (the left one is, the right one is not):
    17 
    17 
    18 In this course we want to take a closer look at the WHILE
    18 \begin{center}
    19 programming language. This is a simple imperative programming
    19 $(((()()))())$  \hspace{10mm} $(((()()))()))$
    20 language consisting of arithmetic expressions, assignments,
    20 \end{center}
    21 if-statements and loops. For example the Fibonacci program can
    21 
    22 be written in this language as follows:
    22 \noindent
    23 
    23 Not being able to solve such recognition problems is a serious limitation.
    24 \begin{center}
    24 In order to solve such recognition problems, we need more powerful 
    25 \mbox{\lstinputlisting[language=while]{../progs/fib.while}}
    25 techniques than regular expressions. We will in particular look at \emph{context-free
    26 \end{center}
    26 languages}. They include the regular languages as the picture below shows:
    27 
    27 
    28 \noindent
    28 
    29 The keywords in this language will be
    29 \begin{center}
    30 
    30 \begin{tikzpicture}
    31 \begin{center}
    31 [rect/.style={draw=black!50, 
    32 \texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
    32  top color=white,bottom color=black!20, 
    33 \end{center}
    33  rectangle, very thick, rounded corners}]
    34 
    34 
    35 \noindent
    35 \draw (0,0) node [rect, text depth=30mm, text width=46mm] {\small all languages};
    36 In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
    36 \draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {\small decidable languages};
    37 and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
    37 \draw (0,-0.65) node [rect, text depth=13mm] {\small context sensitive languages};
    38 is in our programming language and what comments should look like.
    38 \draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {\small context-free languages};
    39 As a first try, we might specify the regular expressions for our language roughly as follows
    39 \draw (0,-1.05) node [rect] {\small regular languages};
       
    40 \end{tikzpicture}
       
    41 \end{center}
       
    42 
       
    43 \noindent
       
    44 Context-free languages play an important role in `day-to-day' text processing and in
       
    45 programming languages. Context-free languages are usually specified by grammars.
       
    46 For example a grammar for well-parenthesised  expressions is
       
    47 
       
    48 \begin{center}
       
    49 $P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
       
    50 \end{center}
       
    51  
       
    52 \noindent
       
    53 In general grammars consist of finitely many rules built up from \emph{terminal symbols} 
       
    54 (usually lower-case letters) and \emph{non-terminal symbols} (upper-case letters).  Rules 
       
    55 have the shape
       
    56 
       
    57 \begin{center}
       
    58 $NT \;\;\rightarrow\;\; \textit{rhs}$
       
    59 \end{center}
       
    60  
       
    61 \noindent
       
    62 where on the left-hand side is a single non-terminal and on the right a string consisting
       
    63 of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
       
    64 empty string. We use the convention  to separate components on
       
     65 the right-hand side by the $\cdot$ symbol, as in the grammar for well-parenthesised expressions.
       
     66 We also use the convention of using $|$ as a shorthand notation for several rules. For example
       
    67 
       
    68 \begin{center}
       
    69 $NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
       
    70 \end{center}
       
    71 
       
    72 \noindent
       
    73 means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
       
     74 If there is more than one non-terminal on the left-hand sides of rules, then we need to indicate
       
     75 what the \emph{starting} symbol of the grammar is. For example the grammar for arithmetic expressions
       
    76 can be given as follows
    40 
    77 
    41 \begin{center}
    78 \begin{center}
    42 \begin{tabular}{lcl}
    79 \begin{tabular}{lcl}
    43 $\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
    80 $E$ & $\rightarrow$ &  $N$ \\
    44 $\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
    81 $E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
    45 $\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
    82 $E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
    46 $\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
    83 $E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
    47 $\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
    84 $E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
    48 $\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
     85 $N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\; \ldots \;|\; 9$ 
    49 $\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
       
    50 \end{tabular}
    86 \end{tabular}
    51 \end{center}
    87 \end{center}
    52 
    88 
    53 Having the regular expressions in place, the problem we have to solve is: 
    89 \noindent
    54 given a string of our programming language, which regular expression 
    90 where $E$ is the starting symbol. A \emph{derivation} for a grammar
    55 matches which part of the string. By solving this problem, we can split up a string 
     91 starts with the starting symbol of the grammar and in each step replaces one
    56 of our language into components. 
    92 non-terminal by a right-hand side of a rule. A derivation ends with a string
    57 For example given the input string
    93 in which only terminal symbols are left. For example a derivation for the
    58 
    94 string $(1 + 2) + 3$ is as follows:
    59 \begin{center}
    95 
    60 \texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
    96 \begin{center}
    61 \end{center}
    97 \begin{tabular}{lll}
    62 
    98 $E$ & $\rightarrow$ & $E+E$\\
    63 \noindent
    99        & $\rightarrow$ & $(E)+E$\\
    64 we expect it is split up as follows
   100        & $\rightarrow$ & $(E+E)+E$\\
    65 
   101        & $\rightarrow$ & $(E+E)+N$\\
    66 \begin{center}
   102        & $\rightarrow$ & $(E+E)+3$\\
    67 \tt
   103        & $\rightarrow$ & $(N+E)+3$\\	
    68 \Grid{if}\,
   104        & $\rightarrow^+$ & $(1+2)+3$\\
    69 \Grid{\VS}\, 
   105 \end{tabular} 
    70 \Grid{true}\, 
   106 \end{center}
    71 \Grid{\VS}\, 
   107 
    72 \Grid{then}\, 
   108 \noindent
    73 \Grid{\VS}\, 
   109 The \emph{language} of a context-free grammar $G$ with start symbol $S$ 
    74 \Grid{x}\, 
   110 is defined as the set of strings derivable by a derivation, that is
    75 \Grid{+}\, 
   111 
    76 \Grid{2}\,
   112 \begin{center}
    77 \Grid{\VS}\, 
    113 $\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being terminals}\}$
    78 \Grid{else}\,
   114 \end{center}
    79 \Grid{\VS}\,
   115 
    80 \Grid{x}\,
   116 \noindent
    81 \Grid{+}\,
   117 A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
    82 \Grid{3}
   118 top and each non-terminal containing a subtree for how it is replaced in a derivation.
    83 \end{center}
   119 The parse tree for the string $(1 + 23)+4$ is as follows:
    84 
   120 
    85 \noindent
   121 \begin{center}
    86 This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
   122 \begin{tikzpicture}[level distance=8mm, black]
    87 It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
   123   \node {$E$}
    88 be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
   124     child {node {$E$} 
    89 in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are 
   125        child {node {$($}}
    90 not separated by any whitespace. Another reason for recognising whitespaces explicitly is
   126        child {node {$E$}       
     91 that in some languages, for example Python, whitespace matters, that is, carries meaning. However in 
   127          child {node {$E$} child {node {$N$} child {node {$1$}}}}
    92 our small language we will eventually just filter out all whitespaces and also all comments.
   128          child {node {$+$}}
    93 
   129          child {node {$E$} 
     94 Lexing not only separates a string into its components, but also classifies them, meaning it explicitly records that \texttt{if} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
   130             child {node {$N$} child {node {$2$}}}
    95 For the moment, though, we will only focus on the simpler problem of just splitting up 
   131             child {node {$N$} child {node {$3$}}}
    96 a string into components.
   132             } 
    97 
   133         }
    98 There are a few subtleties  we need to consider first. For example, say the string is
   134        child {node {$)$}}
    99 
   135      }
   100 \begin{center}
   136      child {node {$+$}}
   101 \texttt{\Grid{iffoo\VS}}\;\ldots
   137      child {node {$E$}
   102 \end{center}
   138         child {node {$N$} child {node {$4$}}}
   103 
   139      };
   104 \noindent
   140 \end{tikzpicture}
   105 then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
   141 \end{center}
   106 by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
   142 
   107 single identifier. The choice that is often made in lexers is to look for the longest possible match.
   143 \noindent
   108 This leaves  \texttt{iffoo} as the only match in this case (since it is longer than \texttt{if}).
   144 We are often interested in these parse-trees since they encode the structure of
   109 
   145 how a string is derived by a grammar. Before we come to the problem of constructing
   110 Unfortunately, the convention about the longest match does not yet make the process 
   146 such parse-trees, we need to consider the following two properties of grammars.
   111 of lexing completely deterministic. Consider the string
   147 A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
   112 
   148 $NT$ which leads to a string which again starts with $NT$. This means a derivation of the
   113 \begin{center}
    149 form
   114 \texttt{\Grid{then\VS}}\;\dots
   150 
   115 \end{center}
   151 \begin{center}
   116 
   152 $NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
   117 \noindent
   153 \end{center}
    118 Clearly, this string should be identified as a keyword. The problem is that the regular expression \textit{IDENT} for identifiers also matches this string. To overcome this ambiguity we need to rank our 
   154 
   119 regular expressions. In our running example we just use the ranking
   155 \noindent
   120 
   156 It can be easily seen that the grammar above for arithmetic expressions is left-recursive:
   121 \[
   157 for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
   122 \textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
    158 already show this. Some algorithms cannot cope with left-recursive 
   123 \]
   159 grammars. Fortunately every left-recursive grammar can be transformed into one that is
   124 
   160 not left-recursive, although this transformation might make the grammar less human-readable.
   125 \noindent
   161 For example if we want to give a non-left-recursive grammar for numbers we might
   126 So even if both regular expressions match in the example above,
   162 specify
   127 we give preference to the regular expression for keywords.
   163 
   128 
   164 \begin{center}
   129 Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
    165 $N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;0\cdot N\;|\;1\cdot N\;|\;\ldots\;|\;9\cdot N$
   130 and $der$, it will  use the function \emph{zeroable} defined as follows:
   166 \end{center}
   131 
   167 
   132 \begin{center}
   168 \noindent
   133 \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
   169 Using this grammar we can still derive every number string, but we will never be able 
   134 $zeroable(\varnothing)$      & $\dn$ & $true$\\
    170 to produce a derivation of the form $N \rightarrow \ldots \rightarrow N \cdot \ldots$.
   135 $zeroable(\epsilon)$           & $\dn$ &  $f\!alse$\\
   171 
   136 $zeroable (c)$                    & $\dn$ &  $f\!alse$\\
   172 The other property we have to watch out for is when a grammar is
   137 $zeroable (r_1 + r_2)$        & $\dn$ &  $zeroable(r_1) \wedge zeroable(r_2)$ \\ 
   173 \emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
   138 $zeroable (r_1 \cdot r_2)$  & $\dn$ &  $zeroable(r_1) \vee zeroable(r_2)$ \\
   174 for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
   139 $zeroable (r^*)$                  & $\dn$ & $f\!alse$\\
   175 While the shown parse tree for the string $(1 + 23) + 4$ is unique, this is not the case in
       
   176 general. For example there are two parse
       
   177 trees for the string $1 + 2 + 3$, namely
       
   178 
       
   179 \begin{center}
       
   180 \begin{tabular}{c@{\hspace{10mm}}c}
       
   181 \begin{tikzpicture}[level distance=8mm, black]
       
   182   \node {$E$}
       
   183     child {node {$E$} child {node {$N$} child {node {$1$}}}}
       
   184     child {node {$+$}}
       
   185     child {node {$E$}
       
   186        child {node {$E$} child {node {$N$} child {node {$2$}}}}
       
   187        child {node {$+$}}
       
   188        child {node {$E$} child {node {$N$} child {node {$3$}}}}
       
   189     }
       
   190     ;
       
   191 \end{tikzpicture} 
       
   192 &
       
   193 \begin{tikzpicture}[level distance=8mm, black]
       
   194   \node {$E$}
       
   195     child {node {$E$}
       
   196        child {node {$E$} child {node {$N$} child {node {$1$}}}}
       
   197        child {node {$+$}}
       
   198        child {node {$E$} child {node {$N$} child {node {$2$}}}} 
       
   199     }
       
   200     child {node {$+$}}
       
   201     child {node {$E$} child {node {$N$} child {node {$3$}}}}
       
   202     ;
       
   203 \end{tikzpicture}
       
   204 \end{tabular} 
       
   205 \end{center}
       
   206 
       
   207 \noindent
       
   208 In particular in programming languages we will try to avoid ambiguous
       
   209 grammars because two different parse-trees for a string mean a program can
       
   210 be interpreted in two different ways. In such cases we have to somehow make sure
       
   211 the two different ways do not matter, or disambiguate the grammar in
       
    212 some other way (for example by making $+$ left-associative). Unfortunately, even 
       
   213 the problem of deciding whether a grammar
       
   214 is ambiguous or not is in general undecidable. 
       
   215 
       
   216 Let us now turn to the problem of generating a parse-tree for a grammar and string.
       
   217 In what follows we explain \emph{parser combinators}, because they are easy
       
   218 to implement and closely resemble grammar rules. Imagine that a grammar
       
   219 describes the strings of natural numbers, such as the grammar $N$ shown above.
       
   220 For all such strings we want to generate the parse-trees or later on we actually 
       
   221 want to extract the meaning of these strings, that is the concrete integers ``behind'' 
       
   222 these strings. In Scala the parser combinators will be functions of type
       
   223 
       
   224 \begin{center}
       
   225 \texttt{I $\Rightarrow$ Set[(T, I)]}
       
   226 \end{center}
       
   227 
       
   228 \noindent 
       
    229 that is, they take as input something of type \texttt{I}, typically a list of tokens or a string,
       
   230 and return a set of pairs. The first component of these pairs corresponds to what the
       
   231 parser combinator was able to process from the input and the second is the unprocessed 
       
   232 part of the input. As we shall see shortly, a parser combinator might return more than one such pair,
       
    233 with the idea that there are potentially several ways to interpret the input. As a concrete
       
   234 example, consider the case where the input is of type string, say the string
       
   235 
       
   236 \begin{center}
       
   237 \tt\Grid{iffoo\VS testbar}
       
   238 \end{center}
       
   239 
       
   240 \noindent
       
   241 We might have a parser combinator which tries to interpret this string as a keyword (\texttt{if}) or
       
   242 an identifier (\texttt{iffoo}). Then the output will be the set
       
   243 
       
   244 \begin{center}
       
   245 $\left\{ \left(\texttt{\Grid{if}}\,,\, \texttt{\Grid{foo\VS testbar}}\right), 
       
   246            \left(\texttt{\Grid{iffoo}}\,,\, \texttt{\Grid{\VS testbar}}\right) \right\}$
       
   247 \end{center}
       
   248 
       
   249 \noindent
       
   250 where the first pair means the parser could recognise \texttt{if} from the input and leaves 
       
   251 the rest as `unprocessed' as the second component of the pair; in the other case
       
   252 it could recognise \texttt{iffoo} and leaves \texttt{\VS testbar} as unprocessed. If the parser
       
   253 cannot recognise anything from the input then parser combinators just return the empty 
       
   254 set $\varnothing$. This will indicate something ``went wrong''.
       
   255 
       
   256 The main attraction is that we can easily build parser combinators out of smaller components
       
    257 following very closely the structure of a grammar. In order to implement this in an

    258 object-oriented programming language, like Scala, we need to specify an abstract class for parser 
       
   259 combinators. This abstract class requires the implementation of the function
       
    260 \texttt{parse}, which takes an argument of type \texttt{I} and returns a set of type  
       
   261 \mbox{\texttt{Set[(T, I)]}}.
       
   262 
       
   263 \begin{center}
       
   264 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   265 abstract class Parser[I, T] {
       
   266   def parse(ts: I): Set[(T, I)]
       
   267 
       
   268   def parse_all(ts: I): Set[T] =
       
   269     for ((head, tail) <- parse(ts); if (tail.isEmpty)) 
       
   270       yield head
       
   271 }
       
   272 \end{lstlisting}
       
   273 \end{center}
       
   274 
       
   275 \noindent
       
   276 From the function \texttt{parse} we can then ``centrally'' derive the function \texttt{parse\_all},
       
   277 which just filters out all pairs whose second component is not empty (that is has still some
       
   278 unprocessed part). The reason is that at the end of parsing we are only interested in the
       
   279 results where all the input has been consumed and no unprocessed part is left.
       
   280 
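To illustrate the difference between \texttt{parse} and \texttt{parse\_all}, here is a small self-contained sketch with a toy parser (a hypothetical example, not from the handout) recognising the fixed prefix \texttt{ab}; to keep the sketch compilable on its own, the class is specialised to \texttt{String} input, whereas the handout's version is generic in \texttt{I}:

```scala
// Toy illustration of parse vs parse_all; AbParser is hypothetical and only
// serves this example. The class is specialised to String input.
abstract class Parser[T] {
  def parse(ts: String): Set[(T, String)]

  // keep only the results where the whole input has been consumed
  def parse_all(ts: String): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) yield head
}

// recognises the fixed string "ab" at the front of the input
object AbParser extends Parser[String] {
  def parse(ts: String) =
    if (ts.startsWith("ab")) Set(("ab", ts.drop(2))) else Set()
}

val r1 = AbParser.parse("abc")      // Set(("ab", "c")) -- "c" is unprocessed
val r2 = AbParser.parse_all("abc")  // Set() -- leftover input is filtered out
val r3 = AbParser.parse_all("ab")   // Set("ab") -- all input consumed
```

The sketch shows why \texttt{parse\_all} is the function called at the top level of a parser: only complete parses of the input survive the filter.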
       
   281 One of the simplest parser combinators recognises just a character, say $c$, 
       
   282 from the beginning of strings. Its behaviour is as follows:
       
   283 
       
   284 \begin{itemize}
       
    285 \item if the input string, say $s$, starts with $c$, it returns 
       
   286 	the set $\{(c, \textit{tail of}\; s)\}$
       
   287 \item otherwise it returns the empty set $\varnothing$	
       
   288 \end{itemize}
       
   289 
       
   290 \noindent
       
   291 The input type of this simple parser combinator for characters is
       
   292 \texttt{String} and the output type \mbox{\texttt{Set[(Char, String)]}}. 
       
   293 The code in Scala is as follows:
       
   294 
       
   295 \begin{center}
       
   296 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   297 case class CharParser(c: Char) extends Parser[String, Char] {
       
   298   def parse(sb: String) = 
       
    299     if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
       
   300 }
       
   301 \end{lstlisting}
       
   302 \end{center}
       
   303 
       
   304 \noindent
       
   305 The \texttt{parse} function tests whether the first character of the 
       
   306 input string \texttt{sb} is equal to \texttt{c}. If yes, then it splits the
       
   307 string into the recognised part \texttt{c} and the unprocessed part
       
   308 \texttt{sb.tail}. In case \texttt{sb} does not start with \texttt{c} then
       
   309 the parser returns the empty set (in Scala \texttt{Set()}).
       
   310 
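The behaviour just described can be checked with a small self-contained sketch. To keep it compilable on its own, the parser class is specialised here to \texttt{String} input (the handout's version is generic in \texttt{I}), and a \texttt{nonEmpty} test guards against the empty string:

```scala
// Sketch of CharParser, specialised to String input; the nonEmpty guard is
// an addition so the empty string yields Set() rather than an exception.
abstract class Parser[T] {
  def parse(ts: String): Set[(T, String)]

  def parse_all(ts: String): Set[T] =
    for ((head, tail) <- parse(ts); if (tail.isEmpty)) yield head
}

case class CharParser(c: Char) extends Parser[Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

val p = CharParser('a')
val r1 = p.parse("abc")      // Set(('a', "bc")) -- 'a' recognised, "bc" left over
val r2 = p.parse("bc")       // Set()            -- input does not start with 'a'
val r3 = p.parse_all("a")    // Set('a')         -- all input consumed
val r4 = p.parse_all("abc")  // Set()            -- leftover input is filtered out
```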
       
   311 More interesting are the parser combinators that build larger parsers
       
   312 out of smaller component parsers. For example the alternative 
       
   313 parser combinator is as follows.
       
   314 
       
   315 \begin{center}
       
   316 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   317 class AltParser[I, T]
       
   318        (p: => Parser[I, T], 
       
   319         q: => Parser[I, T]) extends Parser[I, T] {
       
   320   def parse(sb: I) = p.parse(sb) ++ q.parse(sb)
       
   321 }
       
   322 \end{lstlisting}
       
   323 \end{center}
       
   324 
       
   325 \noindent
       
   326 The types of this parser combinator are polymorphic (we just have \texttt{I}
       
   327 for the input type, and \texttt{T} for the output type). The alternative parser
       
    328 builds a new parser out of two existing parser combinators \texttt{p} and \texttt{q}.
       
   329 Both need to be able to process input of type \texttt{I} and return the same
       
   330 output type \texttt{Set[(T, I)]}. (There is an interesting detail of Scala, namely the 
       
   331 \texttt{=>} in front of the types of \texttt{p} and \texttt{q}. They will prevent the
       
   332 evaluation of the arguments before they are used. This is often called 
       
   333 \emph{lazy evaluation} of the arguments.) The alternative parser should run
       
   334 the input with the first parser \texttt{p} (producing a set of outputs) and then
       
    335 run the same input with \texttt{q}. The result is then just the union
       
   336 of both sets, which is the operation \texttt{++} in Scala.
       
   337 
       
    338 This parser combinator already allows us to construct a parser that recognises either 
       
   339 a character \texttt{a} or \texttt{b}, as
       
   340 
       
   341 \begin{center}
       
   342 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
       
   343 new AltParser(CharParser('a'), CharParser('b'))
       
   344 \end{lstlisting}
       
   345 \end{center}
       
   346 
       
   347 \noindent
       
   348 Scala allows us to introduce some more readable shorthand notation for this, like \texttt{'a' || 'b'}. 
       
   349 We can call this parser combinator with the strings
       
   350 
       
   351 \begin{center}
       
   352 \begin{tabular}{rcl}
       
   353 input string & & output\medskip\\
       
   354 \texttt{\Grid{ac}} & $\rightarrow$ & $\left\{(\texttt{\Grid{a}}, \texttt{\Grid{c}})\right\}$\\
       
   355 \texttt{\Grid{bc}} & $\rightarrow$ & $\left\{(\texttt{\Grid{b}}, \texttt{\Grid{c}})\right\}$\\
       
   356 \texttt{\Grid{cc}} & $\rightarrow$ & $\varnothing$
   140 \end{tabular}
   357 \end{tabular}
   141 \end{center}
   358 \end{center}
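The results in the table above can be reproduced with a small self-contained sketch. Since the \texttt{||} shorthand is not defined in this excerpt, \texttt{new AltParser(...)} is called directly, and the classes are specialised to \texttt{String} input so the sketch compiles on its own:

```scala
// Sketch reproducing the alternative-parser results above; specialised to
// String input, with a nonEmpty guard added in CharParser.
abstract class Parser[T] {
  def parse(ts: String): Set[(T, String)]
}

case class CharParser(c: Char) extends Parser[Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class AltParser[T](p: => Parser[T], q: => Parser[T]) extends Parser[T] {
  // run both alternatives on the same input and union the results
  def parse(sb: String) = p.parse(sb) ++ q.parse(sb)
}

val ab = new AltParser(CharParser('a'), CharParser('b'))
val r1 = ab.parse("ac")  // Set(('a', "c"))
val r2 = ab.parse("bc")  // Set(('b', "c"))
val r3 = ab.parse("cc")  // Set() -- neither alternative matches
```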
   142 
   359 
   143 \noindent
   360 \noindent
   144 Recall that the function $nullable(r)$ tests whether a regular expression
   361 We receive in the first two cases a successful output (that is a non-empty set).
   145 can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
   362 
   146 expression cannot match anything at all. The mathematical way of stating this
   363 A bit more interesting is the \emph{sequence parser combinator} implemented in
   147 property is
   364 Scala as follows:
   148 
   365 
   149 \begin{center}
   366 \begin{center}
   150 $zeroable(r)$ if and only if $L(r) = \varnothing$
   367 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   151 \end{center}
   368 class SeqParser[I, T, S]
   152 
   369        (p: => Parser[I, T], 
   153 \noindent
   370         q: => Parser[I, S]) extends Parser[I, (T, S)] {
   154 For what follows let us fix a set of regular expressions $rs$ as being 
   371   def parse(sb: I) = 
   155 
   372     for ((head1, tail1) <- p.parse(sb); 
   156 \begin{center}
   373          (head2, tail2) <- q.parse(tail1)) 
   157 \textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
   374             yield ((head1, head2), tail2)
   158 \end{center}
   375 }
   159 
   376 \end{lstlisting}
   160 \noindent
   377 \end{center}
   161 specifying the ``words'' 
   378 
   162 of our programming language. The algorithm takes as input the $rs$ and a string, say
   379 \noindent
   163 
   380 This parser takes as input two parsers, \texttt{p} and \texttt{q}. It implements \texttt{parse} 
   164 \begin{center}
    381 as follows: it first runs the parser \texttt{p} on the input, producing a set of pairs (\texttt{head1}, \texttt{tail1}).
   165 \texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
    382 The \texttt{tail1} components stand for the unprocessed parts left over by \texttt{p}. 
   166 \end{center}
    383 Then \texttt{q} is run on these unprocessed parts,
   167 
   384 producing again a set of pairs. The output of the sequence parser combinator is then a set
   168 \noindent
   385 containing pairs where the first components are again pairs, namely what the first parser could parse
   169 and tries to chop off one word from the beginning of the string. If none of the
   386 together with what the second parser could parse; the second component is the unprocessed
    170 regular expressions in $rs$ matches, we will just return
   387 part left over after running the second parser \texttt{q}. Therefore the input type of
   171 the empty string.
   388 the sequence parser combinator is as usual \texttt{I}, but the output type is
   172 
   389 
   173 The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
   390 \begin{center}
   174 to the first character $c_1$. Then we take the results and continue with 
   391 \texttt{Set[((T, S), I)]}
   175 building the derivatives with respect to $c_2$ until we have either exhausted our 
   392 \end{center}
   176 input string or all of the regular expressions are ``zeroable''.  Suppose the input string is 
   393 
   177 
   394 Scala allows us to provide some
   178 \begin{center}
   395 shorthand notation for the sequence parser combinator. So we can write for 
   179 \texttt{\Grid{if2\VS}}\;\dots
   396 example \texttt{'a'  $\sim$ 'b'}, which is the
   180 \end{center}
   397 parser combinator that first consumes the character \texttt{a} from a string and then \texttt{b}.
   181 
   398 Calling this parser combinator with the strings
   182 \noindent
   399 
   183 then building the derivatives with respect to \texttt{i} gives
   400 \begin{center}
   184 
   401 \begin{tabular}{rcl}
   185 \begin{center}
   402 input string & & output\medskip\\
   186 \begin{tabular}{l|c}
   403 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   187  & $zeroable$\\\hline
   404 \texttt{\Grid{bac}} & $\rightarrow$ & $\varnothing$\\
   188  $der\;\texttt{i}\;(\textit{KEYWORD})$      & no\\
   405 \texttt{\Grid{ccc}} & $\rightarrow$ & $\varnothing$
   189  $der\;\texttt{i}\;(\textit{IDENT})$              & no\\
       
   190  $der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
       
   191 \end{tabular}
   406 \end{tabular}
   192 \end{center}
   407 \end{center}
   193 
   408 
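The sequence-parser results above can likewise be checked with a self-contained sketch. The \texttt{$\sim$} shorthand is only notation, so \texttt{new SeqParser(...)} is called directly; as before, the classes are specialised to \texttt{String} input so the sketch compiles on its own:

```scala
// Sketch reproducing the sequence-parser results above; specialised to
// String input, with a nonEmpty guard added in CharParser.
abstract class Parser[T] {
  def parse(ts: String): Set[(T, String)]
}

case class CharParser(c: Char) extends Parser[Char] {
  def parse(sb: String) =
    if (sb.nonEmpty && sb.head == c) Set((c, sb.tail)) else Set()
}

class SeqParser[T, S](p: => Parser[T], q: => Parser[S]) extends Parser[(T, S)] {
  // run p first, then q on p's leftover input, pairing up the results
  def parse(sb: String) =
    for ((head1, tail1) <- p.parse(sb);
         (head2, tail2) <- q.parse(tail1))
      yield ((head1, head2), tail2)
}

val ab = new SeqParser(CharParser('a'), CharParser('b'))
val r1 = ab.parse("abc")  // Set((('a', 'b'), "c"))
val r2 = ab.parse("bac")  // Set() -- p fails, so no pair is produced
val r3 = ab.parse("ab")   // Set((('a', 'b'), ""))
```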
   194 \noindent
   409 \noindent
   195 We can eliminate \textit{WHITESPACE} as a potential candidate, because
    410 A slightly more complicated parser is \texttt{('a'  || 'b') $\sim$ 'b'}, which parses as first character either
   196 no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
    411 an \texttt{a} or a \texttt{b}, followed by a \texttt{b}. This parser produces the following results:
    197 two regular expressions as potential candidates and we have to consider the
   412 
   198 next character, \texttt{f}, from the input string
   413 \begin{center}
   199 
   414 \begin{tabular}{rcl}
   200 \begin{center}
   415 input string & & output\medskip\\
   201 \begin{tabular}{l|c}
   416 \texttt{\Grid{abc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{a}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   202  & $zeroable$\\\hline
   417 \texttt{\Grid{bbc}} & $\rightarrow$ & $\left\{((\texttt{\Grid{b}}, \texttt{\Grid{b}}), \texttt{\Grid{c}})\right\}$\\
   203  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$      & no\\
   418 \texttt{\Grid{aac}} & $\rightarrow$ & $\varnothing$
   204  $der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$              & no\\
       
   205 \end{tabular}
   419 \end{tabular}
   206 \end{center}
   420 \end{center}
   207 
   421 
   208 \noindent
    422 Note carefully that constructing the parser \texttt{'a' || ('a' $\sim$ 'b')} will result in a typing error.
   209 Since both are `no', we have to continue with \texttt{2} from the input string
   423 The first parser has as output type a single character (recall the type of \texttt{CharParser}),
   210 
   424 but the second parser produces a pair of characters as output. The alternative parser is however
   211 \begin{center}
    425 required to have both component parsers of the same type. We will see later how we can 
   212 \begin{tabular}{l|c}
   426 build this parser without the typing error.
   213  & $zeroable$\\\hline
   427 
   214  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$      & yes\\
   428 The next parser combinator does not actually combine smaller parsers, but applies
   215  $der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$              & no\\
   429 a function to the result of the parser. It is implemented in Scala as follows
   216 \end{tabular}
   430 
   217 \end{center}
   431 \begin{center}
   218 
   432 \begin{lstlisting}[language=Scala,basicstyle=\small\ttfamily, numbers=none]
   219 \noindent
   433 class FunParser[I, T, S]
   220 Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
   434          (p: => Parser[I, T], 
   221 know how much of the input string should be considered as an \textit{IDENT}. So we
   435           f: T => S) extends Parser[I, S] {
   222 still have to continue and consider the next derivative.
   436   def parse(sb: I) = 
   223 
   437     for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
   224 \begin{center}
   438 }
   225 \begin{tabular}{l|c}
   439 \end{lstlisting}
   226  & $zeroable$\\\hline
   440 \end{center}
   227  $der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$              & yes\\
   441 
   228 \end{tabular}
   442 
   229 \end{center}
   443 \noindent
   230 
   444 This parser combinator takes a parser \texttt{p} with output type \texttt{T} as
   231 \noindent
   445 input as well as a function \texttt{f} with type \texttt{T => S}. The parser \texttt{p}
   232 Since the answer is now `yes' also in this case, we can stop: once all derivatives are
   446 produces sets of type \texttt{(T, I)}. The \texttt{FunParser} combinator then
   233 zeroable, we know the regular expressions cannot match any more letters from
   447 applies the function \texttt{f} to all the parer outputs. Since this function
   234 the input. In this case we only have to go back to the derivative that is nullable.  In this case it
   448 is of type \texttt{T => S}, we obtain a parser with output type \texttt{S}.
   235 is
   449 Again Scala lets us introduce some shorthand notation for this parser combinator. 
   236 
   450 Therefore we will write \texttt{p ==> f} for it.
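The following runnable sketch shows \texttt{FunParser} in action. Only \texttt{FunParser} is taken verbatim from the text; the surrounding definitions of \texttt{Parser}, \texttt{CharParser}, \texttt{AltParser} and \texttt{SeqParser} are reconstructions for illustration. It also shows one possible way of repairing the ill-typed combination \texttt{'a' || ('a' $\sim$ 'b')} from above, namely by first mapping both alternatives to the same output type:

```scala
abstract class Parser[I, T] {
  def parse(in: I): Set[(T, I)]
}

case class CharParser(c: Char) extends Parser[String, Char] {
  def parse(in: String) =
    if (in.nonEmpty && in.head == c) Set((c, in.tail))
    else Set[(Char, String)]()
}

class AltParser[I, T](p: => Parser[I, T], q: => Parser[I, T])
    extends Parser[I, T] {
  def parse(in: I) = p.parse(in) ++ q.parse(in)
}

class SeqParser[I, T, S](p: => Parser[I, T], q: => Parser[I, S])
    extends Parser[I, (T, S)] {
  def parse(in: I) =
    for ((o1, r1) <- p.parse(in); (o2, r2) <- q.parse(r1))
      yield ((o1, o2), r2)
}

// FunParser, as given in the text
class FunParser[I, T, S](p: => Parser[I, T], f: T => S)
    extends Parser[I, S] {
  def parse(sb: I) =
    for ((head, tail) <- p.parse(sb)) yield (f(head), tail)
}

// 'a' ~ 'b' produces pairs of characters ...
val ab = new SeqParser(CharParser('a'), CharParser('b'))

// ... which FunParser (p ==> f in the shorthand) can turn into strings
val abStr = new FunParser(ab, (p: (Char, Char)) => p._1.toString + p._2.toString)

// mapping both alternatives to output type String makes the previously
// ill-typed 'a' || ('a' ~ 'b') acceptable to AltParser
val aOrAb = new AltParser(
  new FunParser(CharParser('a'), (c: Char) => c.toString),
  abStr)

println(abStr.parse("abc"))  // the string "ab" with rest "c"
println(aOrAb.parse("abc"))  // two results: "a" with rest "bc", "ab" with rest "c"
```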



\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: 