cws/cw02.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Wed, 16 Oct 2024 13:14:13 +0100
changeset 968 d8d8911a3d6f
parent 946 bee7c57c18c3
child 969 0dfa2923a7c6
permissions -rw-r--r--
updated

% !TEX program = xelatex
\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage[normalem]{ulem}

\begin{document}

\section*{Coursework 2}

\noindent This coursework is worth 10\% and is due on \cwTWO{} at
16:00. You are asked to implement the Sulzmann \& Lu lexer for the
WHILE language. You can do the implementation in any programming
language you like, but you need to submit the source code with which
you answered the questions, otherwise a mark of 0\% will be
awarded. %You need to submit your written answers as pdf---see attached
% questionaire.  Code send as code.
If you use Scala in your code, a
good place to start is the file \texttt{lexer.sc} and
\texttt{token.sc} uploaded to KEATS. The template file on Github is
called \texttt{cw02.sc}. The example files are in the subdirectory
\texttt{examples}. The main function that will be tested is
called \texttt{tokenise}. The marks will be distributed such that
3 marks are given for the correct \texttt{WHILE\_REGS} regular
expression; 5 marks for the correct \texttt{inj} and \texttt{mkeps}
definitions; and two marks when \texttt{tokenise} produces the correct
results for the example files. 



\subsection*{Disclaimer\alert}

It should be understood that the work you submit represents
your own effort. You have not copied from anyone else
including CoPilot, ChatGPT \& Co. An
exception is the Scala code from KEATS and the code I showed
during the lectures, which you can both freely use. You can
also use your own code from the CW~1.
%But do not
%be tempted to ask Github Copilot for help or do any other
%shenanigans like this!

\subsection*{Question 1}

To implement a lexer for the WHILE language, you first
need to design the appropriate regular expressions for the
following eleven syntactic entities:

\begin{enumerate}
\item keywords are

\begin{center}
\texttt{while}, 
\texttt{if}, 
\texttt{then}, 
\texttt{else}, 
\texttt{do}, 
\texttt{for}, 
\texttt{to}, 
\texttt{true}, 
\texttt{false}, 
\texttt{read}, 
\texttt{write},
\texttt{skip}
\end{center} 

\item operators are:
\texttt{+}, 
\texttt{-}, 
\texttt{*}, 
\texttt{\%},
\texttt{/},
\texttt{==}, 
\texttt{!=}, 
\texttt{>}, 
\texttt{<},
\texttt{<=}, 
\texttt{>=},
\texttt{:=},
\texttt{\&\&},
\texttt{||}

\item letters are uppercase and lowercase

\item symbols are letters plus the characters
  \texttt{.},
  \texttt{\_},
  \texttt{>},
  \texttt{<},
  \texttt{=},
  \texttt{;},
  \texttt{,} (comma),
  \texttt{$\backslash$} and
  \texttt{:}

\item parentheses are \texttt{(}, \texttt{\{}, \texttt{)} and \texttt{\}}
\item digits are \pcode{0} to \pcode{9}
\item there are semicolons \texttt{;}
\item whitespaces are either \texttt{" "} (one or more) or \texttt{$\backslash$n} or
  \texttt{$\backslash$t} or \texttt{$\backslash$r}
\item identifiers are letters followed by underscores \texttt{\_\!\_}, letters
  or digits  
\item numbers for numbers give 
a regular expression that can recognise \pcode{0}, but not numbers 
with leading zeroes, such as \pcode{001}
\item strings are enclosed by double quotes, like \texttt{"\ldots"}, and consisting of
  symbols, digits, parentheses, whitespaces and \texttt{$\backslash$n} (note the latter is not the escaped version but \texttt{$\backslash$} followed by \texttt{n}, otherwise we would not be able to indicate in our strings when to write a newline).
\item comments start with \texttt{//} and contain symbols, spaces, parentheses and digits until the end-of-the-line markers
\item endo-of-line-markers are \texttt{$\backslash$n} and \texttt{$\backslash$r$\backslash$n}  
\end{enumerate}

\noindent
You can use the basic regular expressions 

\[
\ZERO,\; \ONE,\; c,\; r_1 + r_2,\; r_1 \cdot r_2,\; r^*
\]

\noindent
but also the following extended regular expressions

\begin{center}
\begin{tabular}{ll}
$[c_1,c_2,\ldots,c_n]$ & a set of characters\\
$r^+$ & one or more times $r$\\
$r^?$ & optional $r$\\
$r^{\{n\}}$ & n-times $r$\\
\end{tabular}
\end{center}

\noindent
Later on you will also need the record regular expression:

\begin{center}
\begin{tabular}{ll}
$REC(x:r)$ & record regular expression\\
\end{tabular}
\end{center}

\noindent Try to design your regular expressions to be as
small as possible. For example you should use character sets
for identifiers and numbers. Feel free to use the general
character constructor \textit{CFUN} introduced in CW 1.

\subsection*{Question 2}

Implement the Sulzmann \& Lu lexer from the lectures. For
this you need to implement the functions $nullable$ and $der$
(you can use your code from CW~1), as well as $mkeps$ and
$inj$. These functions need to be appropriately extended for
the extended regular expressions from Q1. The definitions
you need to create are:


\begin{center}
\begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
$mkeps([c_1,c_2,\ldots,c_n])$  & $\dn$ & $?$\\
$mkeps(r^+)$                   & $\dn$ & $?$\\
$mkeps(r^?)$                   & $\dn$ & $?$\\
$mkeps(r^{\{n\}})$             & $\dn$ & $?$\medskip\\
$inj\, ([c_1,c_2,\ldots,c_n])\,c\,\ldots$  & $\dn$ & $?$\\
$inj\, (r^+)\,c\,\ldots$                   & $\dn$ & $?$\\
$inj\, (r^?)\,c\,\ldots$                   & $\dn$ & $?$\\
$inj\, (r^{\{n\}})\,c\,\ldots$             & $\dn$ & $?$\\
\end{tabular}
\end{center}

\noindent where $inj$ takes three arguments: a regular
expression, a character and a value. Test your lexer code
with at least the two small examples below:

\begin{center}
\begin{tabular}{ll}
regex: & string:\smallskip\\
$a^{\{3\}}$ & $aaa$\\
$(a + \ONE)^{\{3\}}$ & $aa$
\end{tabular}
\end{center}


\noindent Both strings should be successfully lexed by the
respective regular expression, that means the lexer returns 
in both examples a value.


Also add the record regular expression from the
lectures to your lexer and complete the function
\pcode{env} so that it returns all assignments from a value (this then 
allows you to extract easily the tokens from a value in the next
question).\medskip 

\noindent
Finally make that the function \texttt{lexing\_simp} generates
with the regular expression from Q1 for the string

\begin{center}
\code{"read n;"}
\end{center} 

\noindent
the following pairs:

\begin{center}
\texttt{List((k,read), (w, ), (i,n), (s,;))}
\end{center} 





\subsection*{Question 3}

Make sure your lexer from Q2 also simplifies regular expressions after
each derivation step and rectifies the computed values after each
injection. Use this lexer to tokenise the six WHILE programs
in the \texttt{examples} directory. Make sure that the \texttt{tokenise}
function filters out whitespaces and comments.\bigskip


% \begin{figure}[h]
% \mbox{\lstinputlisting[language=While,xleftmargin=10mm]{../cwtests/cw02/fib.while}}
% \caption{Fibonacci program in the WHILE language.\label{fib}}
% \end{figure}

% \begin{figure}[h]
% \mbox{\lstinputlisting[language=While,xleftmargin=10mm]{../cwtests/cw02/loops.while}}
% \caption{The three-nested-loops program in the WHILE language. 
% (Usually used for timing measurements.)\label{loop}}
% \end{figure}

% \begin{figure}[h]
% \mbox{\lstinputlisting[language=While,xleftmargin=10mm]{../cwtests/cw02/factors.while}}
% \caption{A program that calculates factors for numbers in the WHILE
%   language.\label{factors}}
% \end{figure}

% \begin{figure}[h]
% \mbox{\lstinputlisting[language=While,xleftmargin=10mm]{../cwtests/cw02/collatz2.while}}
% \caption{A program that calculates the Collatz series for numbers
%   between 1 and 100.\label{collatz}}
% \end{figure}

% \clearpage
% \newpage
% \section*{Answers}

% \mbox{}

% \noindent
% \textbf{Question 2:}\\ (Use mathematical notation, such as $r^+$, rather than code, such as \code{PLUS(r)})

% \begin{center}
%   \def\arraystretch{1.6}  
% \begin{tabular}{@ {}l@ {\hspace{2mm}}c@ {\hspace{2mm}}l@ {}}
% $mkeps([c_1,c_2,\ldots,c_n])$  & $\dn$ & \uline{\hspace{8cm}}\\
% $mkeps(r^+)$                   & $\dn$ & \uline{\hspace{8cm}}\\
% $mkeps(r^?)$                   & $\dn$ & \uline{\hspace{8cm}}\\
% $mkeps(r^{\{n\}})$             & $\dn$ & \uline{\hspace{8cm}}\bigskip\\
% $inj\, ([c_1,c_2,\ldots,c_n])\,c\,\ldots$  & $\dn$ & \uline{\hspace{8cm}}\\
% $inj\, (r^+)\,c\,\ldots$                   & $\dn$ & \uline{\hspace{8cm}}\\
% $inj\, (r^?)\,c\,\ldots$                   & $\dn$ & \uline{\hspace{8cm}}\\
% $inj\, (r^{\{n\}})\,c\,\ldots$             & $\dn$ & \uline{\hspace{8cm}}\\
% \end{tabular}
% \end{center}\bigskip

% \noindent
% Tokens for \code{"read n;"}\\

% \noindent
% \uline{\hfill}\medskip

% \noindent
% \uline{\hfill}\medskip

% \noindent
% \uline{\hfill}\medskip

% \noindent
% \uline{\hfill}\medskip


\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: