afl-material: handouts/ho05.tex@51d6b8b828c4


\documentclass{article}
\usepackage{charter}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{fontspec}
\setmonofont{Consolas}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%

\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc

\lstdefinelanguage{scala}{
  morekeywords={abstract,case,catch,class,def,%
    do,else,extends,false,final,finally,%
    for,if,implicit,import,match,mixin,%
    new,null,object,override,package,%
    private,protected,requires,return,sealed,%
    super,this,throw,trait,true,try,%
    type,val,var,while,with,yield},
  otherkeywords={=>,<-,<\%,<:,>:,\#,@},
  sensitive=true,
  morecomment=[l]{//},
  morecomment=[n]{/*}{*/},
  morestring=[b]",
  morestring=[b]',
  morestring=[b]"""
}

\lstdefinelanguage{while}{
  morekeywords={while, if, then. else, read, write},
  otherkeywords={=>,<-,<\%,<:,>:,\#,@},
  sensitive=true,
  morecomment=[l]{//},
  morecomment=[n]{/*}{*/},
  morestring=[b]",
  morestring=[b]',
  morestring=[b]"""
}


\lstset{language=Scala,
	basicstyle=\ttfamily,
	keywordstyle=\color{javapurple}\bfseries,
	stringstyle=\color{javagreen},
	commentstyle=\color{javagreen},
	morecomment=[s][\color{javadocblue}]{/**}{*/},
	numbers=left,
	numberstyle=\tiny\color{black},
	stepnumber=1,
	numbersep=10pt,
	tabsize=2,
	showspaces=false,
	showstringspaces=false}
	
\newcommand\grid[1]{%
\begin{tikzpicture}[baseline=(char.base)]
  \path[use as bounding box]
    (0,0) rectangle (1em,1em);
  \draw[red!50, fill=red!20]
    (0,0) rectangle (1em,1em);
  \node[inner sep=1pt,anchor=base west]
    (char) at (0em,\gridraiseamount) {#1};
\end{tikzpicture}}
\newcommand\gridraiseamount{0.12em}

\makeatletter
\newcommand\Grid[1]{%
  \@tfor\z:=#1\do{\grid{\z}}}
\makeatother	

\newcommand\Vspace[1][.3em]{%
  \mbox{\kern.06em\vrule height.3ex}%
  \vbox{\hrule width#1}%
  \hbox{\vrule height.3ex}}

\def\VS{\Vspace[0.6em]}
	
\begin{document}

\section*{Handout 5}

Whenever you want to design a programming language or implement a compiler for an
existing language, the first task is to fix the basic ``words'' of the language, like what are the 
keywords or reserved words of the language, what are permitted identifiers, numbers and so 
on. One convenient way to do this is, of 
course, to use regular expressions. In this course we want to take a closer look at the 
WHILE-language. This is a simple imperative language consisting of arithmetic
expressions, assignments and loops only. For example the Fibonacci program can be
written in this language as follows

\begin{center}
\mbox{\lstinputlisting[language=while]{../progs/fib.while}}
\end{center}

\noindent
The keywords in this language will be

\begin{center}
\texttt{while}, \texttt{if}, \texttt{then}, \texttt{else}, \texttt{write}, \texttt{read}
\end{center}

\noindent
In addition we will have some typical operators, like \texttt{<}, \texttt{>}, \texttt{:=} and so on; numbers
and strings (which we however ignore for the moment). We also need to specify what the ``white space''
is in our programming language as well as what comments should look like.
We might specify the regular expressions for our language roughly as follows

\begin{center}
\begin{tabular}{lcl}
$\textit{KEYWORD}$ & $:=$ & $\texttt{while}  + \texttt{if} + \texttt{then} + \texttt{else} + \dots$\\
$\textit{IDENT}$ & $:=$ & $\textit{LETTER} \cdot (\textit{LETTER} + \textit{DIGIT} + {\_})^*$\\ 
$\textit{OP}$      &  $:=$ & $\texttt{:=} + \texttt{<} + \ldots$\\
$\textit{NUM}$   &  $:=$ & $\textit{DIGIT}^+$\\
$\textit{WHITESPACE}$ & $:=$ & $"\hspace{2mm}" + \backslash\texttt{n}$
\end{tabular}
\end{center}

\noindent
with the usual meanings for the regular expressions $\textit{LETTER}$ and $\textit{DIGIT}$.

The problem we have to solve is given a string of our programming language, which regular expression 
matches which part of the string. For example the input string

\begin{center}
\texttt{\Grid{if\VS true\VS then\VS x+2\VS else\VS x+3}}
\end{center}

\noindent
needs to be recognised as

\begin{center}
\tt
\Grid{if} 
\Grid{\VS} 
\Grid{true} 
\Grid{\VS} 
\Grid{then} 
\Grid{\VS} 
\Grid{x} 
\Grid{+} 
\Grid{2}
\Grid{\VS} 
\Grid{else}
\Grid{\VS}
\Grid{x}
\Grid{+}
\Grid{3}
\end{center}

\noindent
Since \texttt{if} matches the \textit{KEYWORD} regular expression, \VS{}  is a whitespace and so on. This process of separating an input string into components is often called \emph{lexing} or \emph{scanning}.
It is usually the first phase of a compiler. Note that the separation into words cannot, in general, 
be done by looking at whitespaces: while \texttt{if} and \texttt{true} are separated by a whitespace,
the components in \texttt{x+2} are not. Another reason for recognising whitespaces explicitly is
that in some languages, for example Python, whitespace matters. However in our small language we will eventually filter out all whitespaces and also comments.

Lexing will not just separate the input into its components, but also classify the components, that
is explicitly record that \texttt{it} is a keyword,  \VS{} a whitespace, \texttt{true} an identifier and so on.
But for the moment we will only focus on the simpler problem of separating a string into components.
There are a few subtleties  we need to consider first. For example if the input string is

\begin{center}
\texttt{\Grid{iffoo\VS\ldots}}
\end{center}

\noindent
then there are two possibilities: either we regard the input as the keyword \texttt{if} followed
by the identifier \texttt{foo} (both regular expressions match) or we regard \texttt{iffoo} as a 
single identifier. The choice that is often made in lexers to look for the longest possible match,
that is regard the input as a single identifier  \texttt{iffoo} (since it is longer than \texttt{if}).

\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End:
author	Christian Urban <christian dot urban at kcl dot ac dot uk>
	Fri, 25 Oct 2013 17:55:35 +0100
changeset 154	51d6b8b828c4
parent 153	70ab41cb610e
child 155	9b2d128765e1
permissions	-rw-r--r--