\documentclass{article}
\usepackage{charter}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{fontspec}
\setmonofont{Consolas}
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%
\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc
\lstdefinelanguage{scala}{
morekeywords={abstract,case,catch,class,def,%
do,else,extends,false,final,finally,%
for,if,implicit,import,match,mixin,%
new,null,object,override,package,%
private,protected,requires,return,sealed,%
super,this,throw,trait,true,try,%
type,val,var,while,with,yield},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstdefinelanguage{while}{
morekeywords={while, if, then. else, read, write},
otherkeywords={=>,<-,<\%,<:,>:,\#,@},
sensitive=true,
morecomment=[l]{//},
morecomment=[n]{/*}{*/},
morestring=[b]",
morestring=[b]',
morestring=[b]"""
}
\lstset{language=Scala,
basicstyle=\ttfamily,
keywordstyle=\color{javapurple}\bfseries,
stringstyle=\color{javagreen},
commentstyle=\color{javagreen},
morecomment=[s][\color{javadocblue}]{/**}{*/},
numbers=left,
numberstyle=\tiny\color{black},
stepnumber=1,
numbersep=10pt,
tabsize=2,
showspaces=false,
showstringspaces=false}
\newcommand\grid[1]{%
\begin{tikzpicture}[baseline=(char.base)]
\path[use as bounding box]
(0,0) rectangle (1em,1em);
\draw[red!50, fill=red!20]
(0,0) rectangle (1em,1em);
\node[inner sep=1pt,anchor=base west]
(char) at (0em,\gridraiseamount) {#1};
\end{tikzpicture}}
\newcommand\gridraiseamount{0.12em}
\makeatletter
\newcommand\Grid[1]{%
\@tfor\z:=#1\do{\grid{\z}}}
\makeatother
\newcommand\Vspace[1][.3em]{%
\mbox{\kern.06em\vrule height.3ex}%
\vbox{\hrule width#1}%
\hbox{\vrule height.3ex}}
\def\VS{\Vspace[0.6em]}
\begin{document}
\section*{Handout 5}
Whenever you want to design a programming language or implement a compiler for an
existing language, the first task is to fix the basic ``words'' of the language, like what are the
keywords or reserved words of the language, what are permitted identifiers, numbers and so
on. One convenient way to do this is, of
course, to use regular expressions. In this course we want to take a closer look at the
WHILE-language. This is a simple imperative language consisting of arithmetic
expressions, assignments and loops only. For example the Fibonacci program can be
written in this language as follows
\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
\end{center}
\noindent
The keywords in this language will be
\begin{center}
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
\end{center}
\noindent
In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers
and strings (which we however ignore for the moment). We also need to specify what the ``white space''
is in our programming language as well as what comments should look like.
We might specify the regular expressions for our language roughly as follows
\begin{center}
\begin{tabular}{lcl}
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\
$\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$
\end{tabular}
\end{center}
\noindent
with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.
The problem we have to solve is given a string of our programming language, which regular expression
matches which part of the string. For example the input string
\begin{center}
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
\end{center}
\noindent
needs to be recognised as
\begin{center}
\tt
\Grid{if}
\Grid{\VS}
\Grid{true}
\Grid{\VS}
\Grid{then}
\Grid{\VS}
\Grid{x}
\Grid{+}
\Grid{2}
\Grid{\VS}
\Grid{else}
\Grid{\VS}
\Grid{x}
\Grid{+}
\Grid{3}
\end{center}
\noindent
Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general,
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.
Lexing will not just separate the input into its components, but also classify the components, that
is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.
But for the moment we will only focus on the simpler problem of separating a string into components.
There are a few subtleties we need to consider first. For example if the input string is
\begin{center}
\texttt{\Grid{iffoo\VS\ldots}}
\end{center}
\noindent
then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a
single identifier. The choice that is often made in lexers to look for the longest possible match,
that is regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}).
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: