\documentclass{article}+ −
\usepackage{charter}+ −
\usepackage{hyperref}+ −
\usepackage{amssymb}+ −
\usepackage{amsmath}+ −
\usepackage[T1]{fontenc}+ −
\usepackage{listings}+ −
\usepackage{xcolor}+ −
\usepackage{tikz}+ −
\usetikzlibrary{arrows}+ −
\usetikzlibrary{automata}+ −
\usetikzlibrary{shapes}+ −
\usetikzlibrary{shadows}+ −
\usetikzlibrary{positioning}+ −
\usetikzlibrary{calc}+ −
\usetikzlibrary{fit}+ −
\usetikzlibrary{backgrounds}+ −
\usepackage{fontspec}+ −
\setmonofont{Consolas}+ −
+ −
\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%+ −
+ −
\definecolor{javared}{rgb}{0.6,0,0} % for strings+ −
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments+ −
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords+ −
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc+ −
+ −
\lstdefinelanguage{scala}{+ −
morekeywords={abstract,case,catch,class,def,%+ −
do,else,extends,false,final,finally,%+ −
for,if,implicit,import,match,mixin,%+ −
new,null,object,override,package,%+ −
private,protected,requires,return,sealed,%+ −
super,this,throw,trait,true,try,%+ −
type,val,var,while,with,yield},+ −
otherkeywords={=>,<-,<\%,<:,>:,\#,@},+ −
sensitive=true,+ −
morecomment=[l]{//},+ −
morecomment=[n]{/*}{*/},+ −
morestring=[b]",+ −
morestring=[b]',+ −
morestring=[b]"""+ −
}+ −
+ −
\lstdefinelanguage{while}{+ −
morekeywords={while, if, then. else, read, write},+ −
otherkeywords={=>,<-,<\%,<:,>:,\#,@},+ −
sensitive=true,+ −
morecomment=[l]{//},+ −
morecomment=[n]{/*}{*/},+ −
morestring=[b]",+ −
morestring=[b]',+ −
morestring=[b]"""+ −
}+ −
+ −
+ −
\lstset{language=Scala,+ −
basicstyle=\ttfamily,+ −
keywordstyle=\color{javapurple}\bfseries,+ −
stringstyle=\color{javagreen},+ −
commentstyle=\color{javagreen},+ −
morecomment=[s][\color{javadocblue}]{/**}{*/},+ −
numbers=left,+ −
numberstyle=\tiny\color{black},+ −
stepnumber=1,+ −
numbersep=10pt,+ −
tabsize=2,+ −
showspaces=false,+ −
showstringspaces=false}+ −
+ −
\newcommand\grid[1]{%+ −
\begin{tikzpicture}[baseline=(char.base)]+ −
\path[use as bounding box]+ −
(0,0) rectangle (1em,1em);+ −
\draw[red!50, fill=red!20]+ −
(0,0) rectangle (1em,1em);+ −
\node[inner sep=1pt,anchor=base west]+ −
(char) at (0em,\gridraiseamount) {#1};+ −
\end{tikzpicture}}+ −
\newcommand\gridraiseamount{0.12em}+ −
+ −
\makeatletter+ −
\newcommand\Grid[1]{%+ −
\@tfor\z:=#1\do{\grid{\z}}}+ −
\makeatother + −
+ −
\newcommand\Vspace[1][.3em]{%+ −
\mbox{\kern.06em\vrule height.3ex}%+ −
\vbox{\hrule width#1}%+ −
\hbox{\vrule height.3ex}}+ −
+ −
\def\VS{\Vspace[0.6em]}+ −
+ −
\begin{document}+ −
+ −
\section*{Handout 5}+ −
+ −
Whenever you want to design a programming language or implement a compiler for an+ −
existing language, the first task is to fix the basic ``words'' of the language, like what are the + −
keywords or reserved words of the language, what are permitted identifiers, numbers and so + −
on. One convenient way to do this is, of + −
course, to use regular expressions. In this course we want to take a closer look at the + −
WHILE-language. This is a simple imperative language consisting of arithmetic+ −
expressions, assignments and loops only. For example the Fibonacci program can be+ −
written in this language as follows+ −
+ −
\begin{center}+ −
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}+ −
\end{center}+ −
+ −
\noindent+ −
The keywords in this language will be+ −
+ −
\begin{center}+ −
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}+ −
\end{center}+ −
+ −
\noindent+ −
In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers+ −
and strings (which we however ignore for the moment). We also need to specify what the ``white space''+ −
is in our programming language as well as what comments should look like.+ −
We might specify the regular expressions for our language roughly as follows+ −
+ −
\begin{center}+ −
\begin{tabular}{lcl}+ −
$\textit{KEYWORD}$ & $:=$ & $\texttt{while} + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\+ −
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ + −
$\textit{OP}$ & $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\+ −
$\textit{NUM}$ & $:=$ & $\textit{DIGIT}^+$\\+ −
$\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$+ −
\end{tabular}+ −
\end{center}+ −
+ −
\noindent+ −
with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.+ −
+ −
The problem we have to solve is given a string of our programming language, which regular expression + −
matches which part of the string. For example the input string+ −
+ −
\begin{center}+ −
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}+ −
\end{center}+ −
+ −
\noindent+ −
needs to be recognised as+ −
+ −
\begin{center}+ −
\tt+ −
\Grid{if} + −
\Grid{\VS} + −
\Grid{true} + −
\Grid{\VS} + −
\Grid{then} + −
\Grid{\VS} + −
\Grid{x} + −
\Grid{+} + −
\Grid{2}+ −
\Grid{\VS} + −
\Grid{else}+ −
\Grid{\VS}+ −
\Grid{x}+ −
\Grid{+}+ −
\Grid{3}+ −
\end{center}+ −
+ −
\noindent+ −
Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{} is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.+ −
It is usually the first phase of a compiler. Note that the separation into words cannot, in general, + −
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,+ −
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is+ −
that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.+ −
+ −
Lexing will not just separate the input into its components, but also classify the components, that+ −
is explicitly record that \texttt{it} is a keyword, \VS{} a whitespace, \texttt{true} an identifier and so on.+ −
But for the moment we will only focus on the simpler problem of separating a string into components.+ −
There are a few subtleties we need to consider first. For example if the input string is+ −
+ −
\begin{center}+ −
\texttt{\Grid{iffoo\VS\ldots}}+ −
\end{center}+ −
+ −
\noindent+ −
then there are two possibilities: either we regard the input as the keyword \texttt{if} followed+ −
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a + −
single identifier. The choice that is often made in lexers to look for the longest possible match,+ −
that is regard the input as a single identifier \texttt{iffoo} (since it is longer than \texttt{if}).+ −
+ −
\end{document}+ −
+ −
%%% Local Variables: + −
%%% mode: latex+ −
%%% TeX-master: t+ −
%%% End: + −