\documentclass{article}
\usepackage{charter}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{fontspec}
\setmonofont{Consolas}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc
\lstdefinelanguage{scala}{
morekeywords={abstract,case,catch,class,def,%
do,else,extends,false,final,finally,%
for,if,implicit,import,match,mixin,%
new,null,object,override,package,%
private,protected,requires,return,sealed,%
super,this,throw,trait,true,try,%
type,val,var,while,with,yield},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstdefinelanguage{while}{
morekeywords={while, if, then. else, read, write},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstset{language=Scala,
basicstyle=\ttfamily,
keywordstyle=\color{javapurple}\bfseries,
stringstyle=\color{javagreen},
commentstyle=\color{javagreen},
morecomment=[s][\color{javadocblue}]{/**}{*/},
numbers=left,
numberstyle=\tiny\color{black},
stepnumber=1,
numbersep=10pt,
tabsize=2,
showspaces=false,
showstringspaces=false}
\newcommand\grid[1]{%
\begin{tikzpicture}[baseline=(char.base)]
\path[use as bounding box]
(0,0) rectangle (1em,1em);
\draw[red!50, fill=red!20]
(0,0) rectangle (1em,1em);
\node[inner sep=1pt,anchor=base west]
(char) at (0em,\gridraiseamount) {#1};
\end{tikzpicture}}
\newcommand\gridraiseamount{0.12em}
\makeatletter
\newcommand\Grid[1]{%
\@tfor\z:=#1\do{\grid{\z}}}
\makeatother
\newcommand\Vspace[1][.3em]{%
\mbox{\kern.06em\vrule height.3ex}%
\vbox{\hrule width#1}%
\hbox{\vrule height.3ex}}
\def\VS{\Vspace[0.6em]}
\begin{document}
\section*{Handout 5}
Whenever you want to design a new programming language or implement a compiler for an
existing language, the first task is to fix the basic ``words'' of the language. For example what are the
keywords, or reserved words, of the language, what are permitted identifiers, numbers, expressions and so
on. One convenient way to do this is, of
course, by using regular expressions.
In this course we want to take a closer look at the
WHILE programming language. This is a simple imperative programming language consisting of arithmetic
expressions, assignments, if-statements and loops. For example the Fibonacci program can be
written in this language as follows:
\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
\end{center}
\noindent
The keywords in this language will be
\begin{center}
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
\end{center}
\noindent
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
and strings (which we however ignore for the moment). We also need to specify what the ``whitespace''
is in our programming language and what comments should look like.
As a first try, we might specify the regular expressions for our language roughly as follows
\begin{center}
\begin{tabular}{lcl}
$\textit{LETTER}$ & $:=$ & $\texttt{a} + \texttt{A} + \texttt{b} + \texttt{B} + \ldots$\\
$\textit{DIGIT}$ & $:=$ & $\texttt{0} + \texttt{1} + \texttt{2} + \ldots$\\
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\
$\textit{WHITESPACE}$ & $:=$ & $("\hspace{2mm}" + \backslash\texttt{n})^+$
\end{tabular}
\end{center}
Having the regular expressions in place, the problem we have to solve is:
given a string of our programming language, which regular expression
matches which part of the string. By solving this problem, we can split up a string
of our language into components.
For example given the input string
\begin{center}
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
\end{center}
\noindent
we expect it is split up as follows
\begin{center}
\tt
\Grid{if}\,
\Grid{\VS}\,
\Grid{true}\,
\Grid{\VS}\,
\Grid{then}\,
\Grid{\VS}\,
\Grid{x}\,
\Grid{+}\,
\Grid{2}\,
\Grid{\VS}\,
\Grid{else}\,
\Grid{\VS}\,
\Grid{x}\,
\Grid{+}\,
\Grid{3}
\end{center}
\noindent
This process of splitting up an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
be done by just looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace
in the example above, this is not always the case. As can be seen the three components in \texttt{x+2} are
not separated by any whitespace. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespaces matters, that is carry meaning. However in
our small language we will eventually just filter out all whitespaces and also all comments.
Lexing not just separates a string into its components, but also classify the components, meaning explicitly record that \texttt{if} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
For the moment, though, we will only focus on the simpler problem of just splitting up
a string into components.
There are a few subtleties we need to consider first. For example, say the string is
\begin{center}
\texttt{\Grid{iffoo\VS}}\;\ldots
\end{center}
\noindent
then there are two possibilities for how it can be split up: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers is to look for the longest possible match.
This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
Unfortunately, the convention about the longest match does not yet make the process
of lexing completely deterministic. Consider the string
\begin{center}
\texttt{\Grid{then\VS}}\;\dots
\end{center}
\noindent
Clearly, this string should be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers matches this string. To overcome this ambiguity we need to rank our
regular expressions. In our running example we just use the ranking
\[
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
\]
\noindent
So even if both regular expressions match in the example above,
we give preference to the regular expression for keywords.
Let us see how our algorithm for lexing works in detail. In addition to the functions $nullable$
and $der$, it will use the function \emph{zeroable} defined as follows:
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$zeroable(\varnothing)$ & $\dn$ & $true$\\
$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\
$zeroable (c)$ & $\dn$ & $f\!alse$\\
$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\
$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\
$zeroable (r^*)$ & $\dn$ & $f\!alse$\\
\end{tabular}
\end{center}
\noindent
Recall that the function $nullable(r)$ tests whether a regular expression
can match the empty string. The function $zeroable$, on the other hand, tests whether a regular
expression cannot match anything at all. The mathematical way of stating this
property is
\begin{center}
$zeroable(r)$ if and only if $L(r) = \varnothing$
\end{center}
\noindent
For what follows let us fix a set of regular expressions $rs$ as being
\begin{center}
\textit{KEYWORD}, \textit{IDENT}, \textit{WHITESPACE}
\end{center}
\noindent
specifying the ``words''
of our programming language. The algorithm takes as input the $rs$ and a string, say
\begin{center}
\texttt{\grid{$c_1$}\grid{$c_2$}\grid{$c_3$}\grid{$c_4$}}\;\ldots
\end{center}
\noindent
and tries to chop off one word from the beginning of the string. If none of the
regular expression in $rs$ matches, we will just return
the empty string.
The crucial idea in the algorithm is to build the derivatives of all regular expressions in $rs$ with respect
to the first character $c_1$. Then we take the results and continue with
building the derivatives with respect to $c_2$ until we have either exhausted our
input string or all of the regular expressions are ``zeroable''. If the input string is
\begin{center}
\texttt{\Grid{if2\VS}}\;\dots
\end{center}
\noindent
then building the derivatives with respect to \texttt{i} gives
\begin{center}
\begin{tabular}{l|c}
& $zeroable$\\\hline
$der\;\texttt{i}\;(\textit{KEYWORD})$ & no\\
$der\;\texttt{i}\;(\textit{IDENT})$ & no\\
$der\;\texttt{i}\;(\textit{WHITESPACE})$ & yes\\
\end{tabular}
\end{center}
\noindent
We can eliminate then \textit{WHITESPACE} as a potential candidate, because
no derivative can go from $zeroable = \text{yes}$ to no. That leaves the other
two regular expressions as potential candidate and we have to consider the
next character from the input string
\begin{center}
\begin{tabular}{l|c}
& $zeroable$\\\hline
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD}))$ & no\\
$der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))$ & no\\
\end{tabular}
\end{center}
\noindent
Since both are `no', we have to go on with \texttt{2} from the input string
\begin{center}
\begin{tabular}{l|c}
& $zeroable$\\\hline
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{KEYWORD})))$ & yes\\
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$ & no\\
\end{tabular}
\end{center}
\noindent
Although we now know that the beginning is definitely an \textit{IDENT}, we do not yet
know how much of the input string should be considered as an \textit{IDENT}. So we
also have to go on and consider the next derivative.
\begin{center}
\begin{tabular}{l|c}
& $zeroable$\\\hline
$der\;\texttt{\VS}\;(der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT}))))$ & yes\\
\end{tabular}
\end{center}
\noindent
Since the answer is now `yes' we can stop: once all derivatives are
zeroable, we know the regular expressions cannot match any more letters from
the input. In this case we only have to go back to the derivative that is nullable. In this case it
is
\begin{center}
$der\;\texttt{2}\;(der\;\texttt{f}\;(der\;\texttt{i}\;(\textit{IDENT})))$
\end{center}
\noindent
which means we recognised an identifier. In case there is a choice of more than one
regular expressions that are nullable, then we choose the one with the highest precedence.
You can try out this case
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: