\documentclass{article}
\usepackage{charter}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{fontspec}
\setmonofont{Consolas}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc
\lstdefinelanguage{scala}{
morekeywords={abstract,case,catch,class,def,%
do,else,extends,false,final,finally,%
for,if,implicit,import,match,mixin,%
new,null,object,override,package,%
private,protected,requires,return,sealed,%
super,this,throw,trait,true,try,%
type,val,var,while,with,yield},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstdefinelanguage{while}{
morekeywords={while, if, then. else, read, write},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstset{language=Scala,
basicstyle=\ttfamily,
keywordstyle=\color{javapurple}\bfseries,
stringstyle=\color{javagreen},
commentstyle=\color{javagreen},
morecomment=[s][\color{javadocblue}]{/**}{*/},
numbers=left,
numberstyle=\tiny\color{black},
stepnumber=1,
numbersep=10pt,
tabsize=2,
showspaces=false,
showstringspaces=false}
\newcommand\grid[1]{%
\begin{tikzpicture}[baseline=(char.base)]
\path[use as bounding box]
(0,0) rectangle (1em,1em);
\draw[red!50, fill=red!20]
(0,0) rectangle (1em,1em);
\node[inner sep=1pt,anchor=base west]
(char) at (0em,\gridraiseamount) {#1};
\end{tikzpicture}}
\newcommand\gridraiseamount{0.12em}
\makeatletter
\newcommand\Grid[1]{%
\@tfor\z:=#1\do{\grid{\z}}}
\makeatother
\newcommand\Vspace[1][.3em]{%
\mbox{\kern.06em\vrule height.3ex}%
\vbox{\hrule width#1}%
\hbox{\vrule height.3ex}}
\def\VS{\Vspace[0.6em]}
\begin{document}
\section*{Handout 5}
Whenever you want to design a new programming language or implement a compiler for an
existing language, the first task is to fix the basic ``words'' of the language. For example what are the
keywords, or reserved words, of the language, what are permitted identifiers, numbers and so
on. One convenient way to do this is, of
course, by using regular expressions.
In this course we want to take a closer look at the
WHILE-language. This is a simple imperative programming language consisting of arithmetic
expressions, assignments, if-statements and loops. For example the Fibonacci program can be
written in this language as follows:
\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
\end{center}
\noindent
The keywords in this language will be
\begin{center}
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
\end{center}
\noindent
In addition we will have some common operators, such as \texttt{<}, \texttt{>}, \texttt{:=} and so on, as well as numbers
and strings (which we however ignore for the moment). We also need to specify what the ``white space''
is in our programming language and what comments should look like.
As a first try, we specify the regular expressions for our language roughly as follows
\begin{center}
\begin{tabular}{lcl}
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\
$\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$
\end{tabular}
\end{center}
\noindent
with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
Having the regular expressions in place, the problem we have to solve is:
given a string of our programming language, which regular expression
matches which part of the string. In this way we can split up a string into components.
For example we expect that the input string
\begin{center}
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
\end{center}
\noindent
is split up as follows
\begin{center}
\tt
\Grid{if}\,
\Grid{\VS}\,
\Grid{true}\,
\Grid{\VS}\,
\Grid{then}\,
\Grid{\VS}\,
\Grid{x}\,
\Grid{+}\,
\Grid{2}\,
\Grid{\VS}\,
\Grid{else}\,
\Grid{\VS}\,
\Grid{x}\,
\Grid{+}\,
\Grid{3}
\end{center}
\noindent
This process of splitting an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespace matters. However in our small language we will eventually just filter out all whitespaces and also comments.
Lexing will not just separate the input into its components, but also classify the components, that
is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
For the moment, though, we will only focus on the simpler problem of splitting a string into components.
There are a few subtleties we need to consider first. For example, say the input string is
\begin{center}
\texttt{\Grid{iffoo\VS\ldots}}
\end{center}
\noindent
then there are two possibilities how it can be split up: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers is to look for the longest possible match.
This leaves \texttt{iffoo} as the only match (since it is longer than \texttt{if}).
Unfortuantely, the convention of the longs match does not yet make the whole process
completely deterministic. Consider the input string
\begin{center}
\texttt{\Grid{then\VS\dots}}
\end{center}
\noindent
Clearly, this string should clearly be identified as a keyword. The problem is that also the regular expression \textit{IDENT} for identifiers would also match this string. To overcome this ambiguity we need to rank our
regular expressions. In our running example we just use the ranking
\[
\textit{KEYWORD} < \textit{IDENT} < \textit{OP} < \ldots
\]
\noindent
and so on. So even if both regular expressions match in the example above,
we can give the regular expression for \ref{Page ??} as follows
Let us see how our algorithm for lexing works in detail. The regular
expressions and their ranking are shown above. For our algorithm it will
be helpful to have a look at the function \emph{zeroable}
defined as follows:
\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$zeroable(\varnothing)$ & $\dn$ & $true$\\
$zeroable(\epsilon)$ & $\dn$ & $f\!alse$\\
$zeroable (c)$ & $\dn$ & $f\!alse$\\
$zeroable (r_1 + r_2)$ & $\dn$ & $zeroable(r_1) \wedge zeroable(r_2)$ \\
$zeroable (r_1 \cdot r_2)$ & $\dn$ & $zeroable(r_1) \vee zeroable(r_2)$ \\
$zeroable (r^*)$ & $\dn$ & $f\!alse$\\
\end{tabular}
\end{center}
\noindent
In contrast to the function $nullable(r)$, which test whether a regular expression
can match the empty string, the $zeroable$ function identifies whether a regular
expression cannot match anything at all.
\begin{center}
\texttt{\Grid{$c_1$\VS{} true\VS{} then\VS{} x\VS{} else\VS{} y \ldots }}
\end{center}
\noindent
The crucial idea of the algorithm is to build the derivative of all
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: