  \begin{tabular}{@ {}c@ {}}
  \LARGE Compilers and \\[-2mm] 
  \LARGE Formal Languages (5)\\[3mm] 

    Email:  & christian.urban at\\
    Office Hours: & Thursdays 12 -- 14\\
    Location: & N7.07 (North Wing, Bush House)\\
    Slides \& Progs: & KEATS (also homework is there)\\  

\frametitle{\begin{tabular}{c}Last Week\\[-2mm] 
            Regexes and Values\end{tabular}}

Regular expressions and their corresponding values:

  \bl{$r$} & \bl{$::=$}  & \bl{$\ZERO$}\\
           & \bl{$\mid$} & \bl{$\ONE$}   \\
           & \bl{$\mid$} & \bl{$c$}          \\
           & \bl{$\mid$} & \bl{$r_1 \cdot r_2$}\\
           & \bl{$\mid$} & \bl{$r_1 + r_2$}   \\
           & \bl{$\mid$} & \bl{$r^*$}         \\
  \bl{$v$} & \bl{$::=$}  & \\
           &             & \bl{$Empty$}   \\
           & \bl{$\mid$} & \bl{$Char(c)$}          \\
           & \bl{$\mid$} & \bl{$Seq(v_1,v_2)$}\\
           & \bl{$\mid$} & \bl{$Left(v)$}   \\
           & \bl{$\mid$} & \bl{$Right(v)$}  \\
           & \bl{$\mid$} & \bl{$Stars\,[v_1,\ldots,v_n]$} \\



\begin{tikzpicture}[scale=2,node distance=1.3cm,every node/.style={minimum size=8mm}]
\node (r1)  {\bl{$r_1$}};
\node (r2) [right=of r1] {\bl{$r_2$}};
\draw[->,line width=1mm]  (r1) -- (r2) node[above,midway] {\bl{$der\,a$}};
\node (r3) [right=of r2] {\bl{$r_3$}};
\draw[->,line width=1mm]  (r2) -- (r3) node[above,midway] {\bl{$der\,b$}};
\node (r4) [right=of r3] {\bl{$r_4$}};
\draw[->,line width=1mm]  (r3) -- (r4) node[above,midway] {\bl{$der\,c$}};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{\bl{$nullable$}}};
\node (v4) [below=of r4] {\bl{$v_4$}};
\draw[->,line width=1mm]  (r4) -- (v4);
\node (v3) [left=of v4] {\bl{$v_3$}};
\draw[->,line width=1mm]  (v4) -- (v3) node[below,midway] {\bl{$inj\,c$}};
\node (v2) [left=of v3] {\bl{$v_2$}};
\draw[->,line width=1mm]  (v3) -- (v2) node[below,midway] {\bl{$inj\,b$}};
\node (v1) [left=of v2] {\bl{$v_1$}};
\draw[->,line width=1mm]  (v2) -- (v1) node[below,midway] {\bl{$inj\,a$}};
\draw[->,line width=0.5mm]  (r3) -- (v3);
\draw[->,line width=0.5mm]  (r2) -- (v2);
\draw[->,line width=0.5mm]  (r1) -- (v1);
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{\bl{$mkeps$}}};

\bl{$r_1$}: & \bl{$a \cdot (b \cdot c)$}\\
\bl{$r_2$}: & \bl{$\ONE \cdot (b \cdot c)$}\\
\bl{$r_3$}: & \bl{$(\ZERO \cdot (b \cdot c)) + (\ONE \cdot c)$}\\
\bl{$r_4$}: & \bl{$(\ZERO \cdot (b \cdot c)) + ((\ZERO \cdot c) + \ONE)$}\\

\bl{$v_1$}: & \bl{$Seq(Char(a), Seq(Char(b), Char(c)))$}\\
\bl{$v_2$}: & \bl{$Seq(Empty, Seq(Char(b), Char(c)))$}\\
\bl{$v_3$}: & \bl{$Right(Seq(Empty, Char(c)))$}\\
\bl{$v_4$}: & \bl{$Right(Right(Empty))$}\\

\bl{$|v_1|$}: & \bl{$abc$}\\
\bl{$|v_2|$}: & \bl{$bc$}\\
\bl{$|v_3|$}: & \bl{$c$}\\
\bl{$|v_4|$}: & \bl{$[]$}
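
The flattening \bl{$|v|$} used in the table above can be sketched in a few lines of Python. The encoding of values as tagged tuples and the name \texttt{flatten} are assumptions made for this illustration only; they are not the representation used in the lecture code.

```python
# Values are encoded as tagged tuples (a hypothetical encoding):
#   ("Empty",), ("Char", c), ("Seq", v1, v2),
#   ("Left", v), ("Right", v), ("Stars", [v1, ..., vn])

def flatten(v):
    """Return the string |v| underlying a value v."""
    tag = v[0]
    if tag == "Empty":
        return ""
    if tag == "Char":
        return v[1]
    if tag == "Seq":
        return flatten(v[1]) + flatten(v[2])
    if tag in ("Left", "Right"):
        return flatten(v[1])
    if tag == "Stars":
        return "".join(flatten(x) for x in v[1])
    raise ValueError(f"unknown value: {v!r}")

# The four values from the example above
v1 = ("Seq", ("Char", "a"), ("Seq", ("Char", "b"), ("Char", "c")))
v2 = ("Seq", ("Empty",), ("Seq", ("Char", "b"), ("Char", "c")))
v3 = ("Right", ("Seq", ("Empty",), ("Char", "c")))
v4 = ("Right", ("Right", ("Empty",)))

for v in (v1, v2, v3, v4):
    print(repr(flatten(v)))
```

Each injection step adds one character back at the front of the flattened string, which is why the strings grow from the empty string for \bl{$v_4$} to \bl{$abc$} for \bl{$v_1$}.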



\item If we simplify after the derivative, then we are building the
value for the simplified regular expression, but \emph{not} for the original
regular expression.

\begin{tikzpicture}[scale=2,node distance=1.3cm,every node/.style={minimum size=8mm}]
\node (r1)  {\bl{$r_1$}};
\node (r2) [right=of r1] {\bl{$r_2$}};
\draw[->,line width=1mm]  (r1) -- (r2) node[above,midway] {\bl{$der\,a$}};
\node (r3) [right=of r2] {\bl{$r_3$}};
\draw[->,line width=1mm]  (r2) -- (r3) node[above,midway] {\bl{$der\,b$}};
\node (r4) [right=of r3] {\bl{$r_4$}};
\draw[->,line width=1mm]  (r3) -- (r4) node[above,midway] {\bl{$der\,c$}};
\draw (r4) node[anchor=west] {\;\raisebox{3mm}{\bl{$nullable$}}};
\node (v4) [below=of r4] {\bl{$v_4$}};
\draw[->,line width=1mm]  (r4) -- (v4);
\node (v3) [left=of v4] {\bl{$v_3$}};
\draw[->,line width=1mm]  (v4) -- (v3) node[below,midway] {\bl{$inj\,c$}};
\node (v2) [left=of v3] {\bl{$v_2$}};
\draw[->,line width=1mm]  (v3) -- (v2) node[below,midway] {\bl{$inj\,b$}};
\node (v1) [left=of v2] {\bl{$v_1$}};
\draw[->,line width=1mm]  (v2) -- (v1) node[below,midway] {\bl{$inj\,a$}};
\draw[->,line width=0.5mm]  (r3) -- (v3);
\draw[->,line width=0.5mm]  (r2) -- (v2);
\draw[->,line width=0.5mm]  (r1) -- (v1);
\draw (r4) node[anchor=north west] {\;\raisebox{-8mm}{\bl{$mkeps$}}};

\hspace{4.5cm}\bl{$(b \cdot c) + (\ZERO + \ONE)$}
\bl{$(b \cdot c) + \ONE$}


Please get in contact if you intend to do CW Strand 2. No zips please.
\frametitle{Lexer, Parser}

                      rectangle,rounded corners=3mm,
                      very thick,draw=black!50,
                      minimum height=18mm, minimum width=20mm,
                      top color=white,bottom color=black!20}]
  \node (0) at (-2.3,0) {}; 
  \node (A) at (0,0)  [node] {};
  \node [below right] at (A.north west) {lexer};

  \node (B) at (3,0)  [node] {};
  \node [below right=1mm] at (B.north west) {parser};

  \node (C) at (6,0)  [node] {};
  \node [below right] at (C.north west) 
    {\mbox{}\hspace{-1mm}code gen};

  \node (1) at (8.4,0) {}; 

  \draw [->,line width=4mm] (0) -- (A); 
  \draw [->,line width=4mm] (A) -- (B); 
  \draw [->,line width=4mm] (B) -- (C); 
  \draw [->,line width=4mm] (C) -- (1); 
Today a parser.  

\frametitle{What Parsing is Not}

Usually parsing does not check semantic correctness, e.g.

\item  whether a function is not used before it
  is defined
\item whether a function is called with the correct number of arguments
  and whether these have the correct types
\item whether a variable can be declared twice in a scope  


\frametitle{Regular Languages}

While regular expressions are very useful for lexing, there is
no regular expression that can recognise the language of
well-matched parentheses, for example

\bl{$(((()()))())$} \;\;vs.\;\; \bl{$(((()()))()))$}

\noindent So we cannot find out with regular expressions
whether parentheses are matched or unmatched. Also regular
expressions are not recursive, e.g.~\bl{$(1 + 2) + 3$}.
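
Checking matched parentheses is easy once we are allowed to count. The Python sketch below (the name \texttt{balanced} is chosen for this illustration) keeps a nesting depth that can grow without bound, which is exactly what a regular expression cannot do.

```python
def balanced(s):
    """Check whether the parentheses in s are well matched."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a closing parenthesis without an opening one
                return False
    return depth == 0          # every opening parenthesis was closed

print(balanced("(((()()))())"))   # the first string from the slide
print(balanced("(((()()))()))"))  # the second string, one ')' too many
```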


\frametitle{Hierarchy of Languages}

              top color=white,
              bottom color=black!20, 
              very thick, 
              rounded corners}, scale=1.2]

\draw (0,0) node [rect, text depth=39mm, text width=68mm] {all languages};
\draw (0,-0.4) node [rect, text depth=28.5mm, text width=64mm] {decidable languages};
\draw (0,-0.85) node [rect, text depth=17mm] {context sensitive languages};
\draw (0,-1.14) node [rect, text depth=9mm, text width=50mm] {context-free languages};
\draw (0,-1.4) node [rect] {regular languages};



\frametitle{CF Grammars}

A \alert{\bf context-free grammar} \bl{$G$} consists of

\item a finite set of nonterminal symbols (e.g.~$\meta{A}$ upper case)
\item a finite set of terminal symbols or tokens (lower case)
\item a start symbol (which must be a nonterminal)
\item a set of rules
\bl{$\meta{A} ::= \textit{rhs}$}

where \bl{\textit{rhs}} are sequences involving terminals and nonterminals,
including the empty sequence \bl{$\epsilon$}.\medskip\pause

We also allow rules
\bl{$\meta{A} ::= \textit{rhs}_1 | \textit{rhs}_2 | \ldots$}



A grammar for palindromes over the alphabet~\bl{$\{a,b\}$}:

: \meta{S} ::= a\cdot\meta{S}\cdot a\\
: \meta{S} ::= b\cdot\meta{S}\cdot b\\
: \meta{S} ::= a\\
: \meta{S} ::= b\\
: \meta{S} ::= \epsilon\\


: \meta{S} ::=  a\cdot \meta{S}\cdot a | b\cdot \meta{S}\cdot b | a | b | \epsilon\\

Can you find the grammar rules for matched parentheses?


\frametitle{Arithmetic Expressions}

\bl{\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::=  num\_token 
   | \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot ) \\

\bl{\texttt{1 + 2 * 3 + 4}}


\frametitle{A CFG Derivation}

\item Begin with a string containing only the start symbol, say \bl{\meta{S}}\bigskip
\item Replace any nonterminal \bl{\meta{X}} in the string by the
right-hand side of some production \bl{$\meta{X} ::= \textit{rhs}$}\bigskip
\item Repeat 2 until there are no nonterminals left

\bl{$\meta{S} \rightarrow \ldots \rightarrow \ldots  \rightarrow \ldots  \rightarrow \ldots $}

\frametitle{Example Derivation}

: \meta{S} ::=  \epsilon | a\cdot \meta{S}\cdot a | b\cdot \meta{S}\cdot b \\

\bl{\meta{S}} & \bl{$\rightarrow$} & \bl{$a\meta{S}a$}\\
              & \bl{$\rightarrow$} & \bl{$ab\meta{S}ba$}\\
              & \bl{$\rightarrow$} & \bl{$aba\meta{S}aba$}\\
              & \bl{$\rightarrow$} & \bl{$abaaba$}\\
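
The derivation above can be replayed mechanically. The following Python sketch assumes a grammar encoded as a dictionary from nonterminals to lists of right-hand sides; the names \texttt{RULES} and \texttt{derive} are made up for this example. In each step the leftmost occurrence of the given nonterminal is replaced by the chosen right-hand side.

```python
# S ::= epsilon | a S a | b S b   (right-hand sides indexed 0, 1, 2)
RULES = {"S": ["", "aSa", "bSb"]}

def derive(start, choices):
    """Apply the chosen rules one after the other, always replacing
    the leftmost occurrence of the given nonterminal."""
    form = start
    steps = [form]
    for nonterminal, index in choices:
        pos = form.index(nonterminal)
        form = form[:pos] + RULES[nonterminal][index] + form[pos + 1:]
        steps.append(form)
    return steps

steps = derive("S", [("S", 1), ("S", 2), ("S", 1), ("S", 0)])
print(" -> ".join(steps))
```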



\frametitle{Example Derivation}

\bl{\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::=  num\_token 
   | \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot ) \\

\bl{\meta{E}} & \bl{$\rightarrow$} & \bl{$\meta{E}*\meta{E}$}\\
              & \bl{$\rightarrow$} & \bl{$\meta{E}+\meta{E}*\meta{E}$}\\
              & \bl{$\rightarrow$} & \bl{$\meta{E}+\meta{E}*\meta{E}+\meta{E}$}\\
              & \bl{$\rightarrow^+$} & \bl{$1+2*3+4$}\\
\end{tabular} &\pause
\bl{$\meta{E}$} & \bl{$\rightarrow$} & \bl{$\meta{E}+\meta{E}$}\\
                & \bl{$\rightarrow$} & \bl{$\meta{E}+\meta{E}+\meta{E}$}\\
                & \bl{$\rightarrow$} & \bl{$\meta{E}+\meta{E}*\meta{E}+\meta{E}$}\\
                & \bl{$\rightarrow^+$} & \bl{$1+2*3+4$}\\


\frametitle{Context Sensitive Grammars}

It is much harder to find out whether a string is parsed
by a context sensitive grammar:

: \meta{S} ::= b\meta{S}\meta{A}\meta{A} | \epsilon\\
: \meta{A} ::= a\\
: b\meta{A} ::= \meta{A}b\\

\bl{$\meta{S} \rightarrow\ldots\rightarrow^? ababaa$}
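
One brute-force way to answer such a question is a breadth-first search over all strings that can be derived from \bl{\meta{S}}. The Python sketch below is only an illustration under the encoding stated in its comments (the names \texttt{RULES}, \texttt{successors} and \texttt{derivable} are hypothetical). The search terminates because the only shrinking rule removes the single \bl{\meta{S}}, so no useful intermediate string is longer than the target plus one.

```python
from collections import deque

# S ::= bSAA | epsilon,   A ::= a,   bA ::= Ab
RULES = [("S", "bSAA"), ("S", ""), ("A", "a"), ("bA", "Ab")]

def successors(form):
    """All strings reachable from form by one rule application."""
    for lhs, rhs in RULES:
        start = form.find(lhs)
        while start != -1:
            yield form[:start] + rhs + form[start + len(lhs):]
            start = form.find(lhs, start + 1)

def derivable(target):
    """Breadth-first search from S; strings longer than
    len(target) + 1 can never shrink back to the target."""
    limit = len(target) + 1
    seen = {"S"}
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for nxt in successors(form):
            if len(nxt) <= limit and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(derivable("ababaa"))
```

The search space grows quickly with the length of the target, which is one reason why such grammars are much less convenient for parsing than context-free ones.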

  \tt Time flies like an arrow;\\ 
  fruit flies like bananas.


\frametitle{Language of a CFG}

Let \bl{$G$} be a context-free grammar with start symbol \bl{\meta{S}}. 
Then the language \bl{$L(G)$} is:

\bl{$\{c_1\ldots c_n \;|\; \forall i.\; c_i \in T \wedge \meta{S} \rightarrow^* c_1\ldots c_n \}$}

\item Terminals, because there are no rules for replacing them.
\item Once generated, terminals are ``permanent''.
\item Terminals ought to be tokens of the language\\
(but can also be strings).


\frametitle{Parse Trees}

\bl{\begin{plstx}: \meta{E} ::= \meta{T} | \meta{T} \cdot + \cdot \meta{E} |  \meta{T} \cdot - \cdot \meta{E}\\
: \meta{T} ::= \meta{F} | \meta{F} \cdot * \cdot \meta{T}\\
: \meta{F} ::= num\_token | ( \cdot \meta{E} \cdot )\\

\begin{tikzpicture}[level distance=8mm, blue]
  \node {$\meta{E}$}
    child {node {$\meta{T}$} 
     child {node {$\meta{T}$} 
                 child {node {(\,$\meta{E}$\,)}
                            child {node{$\meta{F}$ *{} $\meta{F}$}
                                  child {node {$\meta{T}$} child {node {2}}}
                                  child {node {$\meta{T}$} child {node {3}}} 
     child {node {+}}
     child {node {$\meta{T}$}
       child {node {(\,$\meta{E}$\,)} 
       child {node {$\meta{F}$}
       child {node {$\meta{T}$ +{} $\meta{T}$}
                    child {node {3}}
                    child {node {4}} 

\begin{textblock}{5}(1, 6.5)


\frametitle{Arithmetic Expressions}

\bl{\begin{plstx}[margin=3cm,one per line]
: \meta{E} ::=  num\_token 
   | \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot ) \\

A CFG is \alert{\bf left-recursive} if it has a nonterminal \bl{$\meta{E}$} such
that \bl{$\meta{E} \rightarrow^+ \meta{E}\cdot \ldots$}


\frametitle{Ambiguous Grammars}

A grammar is \alert{\bf ambiguous} if there is a string that
has at least two different parse trees.

\bl{\begin{plstx}[margin=3cm,one per line]: \meta{E} ::=  num\_token 
   | \meta{E} \cdot + \cdot \meta{E} 
   | \meta{E} \cdot - \cdot \meta{E} 
   | \meta{E} \cdot * \cdot \meta{E} 
   | ( \cdot \meta{E} \cdot ) \\

\bl{\texttt{1 + 2 * 3 + 4}}
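
To see how ambiguous this grammar is, we can count the parse trees of a token sequence. The Python sketch below is an assumption-laden illustration: tokens are single-character strings, number tokens are single digits, and the name \texttt{count\_trees} is made up. It tries every operator as the top-most one and multiplies the counts for the two sides.

```python
from functools import lru_cache

OPS = {"+", "-", "*"}

@lru_cache(maxsize=None)
def count_trees(tokens):
    """Number of parse trees of a tuple of tokens for
    E ::= num | E + E | E - E | E * E | ( E )."""
    if len(tokens) == 1:
        return 1 if tokens[0].isdigit() else 0
    total = 0
    # rule E ::= ( E )
    if len(tokens) >= 3 and tokens[0] == "(" and tokens[-1] == ")":
        total += count_trees(tokens[1:-1])
    # rules E ::= E op E, for every possible top-level operator
    for i in range(1, len(tokens) - 1):
        if tokens[i] in OPS:
            total += count_trees(tokens[:i]) * count_trees(tokens[i + 1:])
    return total

print(count_trees(tuple("1+2*3+4")))
```

Any count larger than one witnesses the ambiguity; for a chain of operators without parentheses the counts are the Catalan numbers.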


\frametitle{`Dangling' Else}

Another ambiguous grammar:\bigskip

$E$ & $\rightarrow$ &  if $E$ then $E$\\
 & $|$ &  if $E$ then $E$ else $E$ \\
 & $|$ &  \ldots

\bl{\texttt{if a then if x then y else c}}


\frametitle{Parser Combinators}

One of the simplest ways to implement a parser, see

Parser combinators: \bigskip

\mbox{}\hspace{-12mm}\mbox{}$\underbrace{\text{list of tokens}}_{\text{input}}$ \bl{$\Rightarrow$} 
$\underbrace{\text{set of (parsed input, unparsed input)}}_{\text{output}}$

\item atomic parsers
\item sequencing
\item alternative
\item semantic action



Atomic parsers, for example, number tokens

\bl{$\texttt{Num(123)}::rest \;\Rightarrow\; \{(\texttt{Num(123)}, rest)\}$} 

\item you consume one or more tokens from the\\ 
  input (stream)
\item also works for characters and strings


Alternative parser (code \bl{$p\;||\;q$})\bigskip

\item apply \bl{$p$} and also \bl{$q$}; then combine 
  the outputs

\large \bl{$p(\text{input}) \cup q(\text{input})$}



Sequence parser (code \bl{$p\sim q$})\bigskip

\item apply first \bl{$p$} producing a set of pairs
\item then apply \bl{$q$} to the unparsed part
\item then combine the results:\medskip 
((output$_1$, output$_2$), unparsed part)

\large \bl{$\{((o_1, o_2), u_2) \;|\;$}\\[2mm] 
\large\mbox{}\hspace{15mm} \bl{$(o_1, u_1) \in p(\text{input}) \wedge$}\\[2mm]
\large\mbox{}\hspace{15mm} \bl{$(o_2, u_2) \in q(u_1)\}$}



Function parser (code \bl{$p \Rightarrow f\;$})\bigskip

\item apply \bl{$p$} producing a set of pairs
\item then apply the function \bl{$f$} to each first component

\large \bl{$\{(f(o_1), u_1) \;|\; (o_1, u_1) \in p(\text{input})\}$}

\bl{$f$} is the semantic action (``what to do with the parsed input'')
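
The four kinds of parsers can be written down almost literally from the set definitions above. The following Python sketch is not the implementation used in the lecture: parsers are plain functions from an input string to a set of (output, unparsed input) pairs, and the names \texttt{char}, \texttt{alt}, \texttt{seq} and \texttt{fmap} are assumptions made for this illustration.

```python
def char(c):
    """Atomic parser: consumes the character c at the front of the input."""
    def parse(inp):
        return {(c, inp[1:])} if inp[:1] == c else set()
    return parse

def alt(p, q):
    """Alternative p || q: union of both result sets."""
    return lambda inp: p(inp) | q(inp)

def seq(p, q):
    """Sequence p ~ q: run q on every unparsed rest left over by p."""
    return lambda inp: {((o1, o2), u2)
                        for (o1, u1) in p(inp)
                        for (o2, u2) in q(u1)}

def fmap(p, f):
    """Semantic action p => f: apply f to every first component."""
    return lambda inp: {(f(o), u) for (o, u) in p(inp)}

ab = seq(char("a"), char("b"))
print(ab("abc"))
print(alt(char("a"), char("b"))("bcd"))
print(fmap(ab, lambda pair: pair[0] + pair[1])("abc"))
```

Returning a set rather than a single result is what lets the alternative parser keep several possible parses alive at the same time.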


\frametitle{\begin{tabular}{c}Semantic Actions\end{tabular}}


\bl{$\meta{T} \sim + \sim \meta{E} \Rightarrow \underbrace{f\,((x,y), z) \Rightarrow x + z}_{\text{semantic action}}$}


\bl{$\meta{F} \sim * \sim \meta{T} \Rightarrow f\,((x,y), z) \Rightarrow x * z$}


\bl{$\text{(} \sim \meta{E} \sim \text{)} \Rightarrow f\,((x,y), z) \Rightarrow y$}
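
With these three semantic actions a parser can evaluate arithmetic expressions while parsing. The Python sketch below restates a minimal set of combinators so that it is self-contained; it assumes single-digit number tokens, works on strings instead of token lists, and all function names are hypothetical.

```python
def char(c):
    return lambda inp: {(c, inp[1:])} if inp[:1] == c else set()

def alt(p, q):
    return lambda inp: p(inp) | q(inp)

def seq(p, q):
    return lambda inp: {((o1, o2), u2)
                        for (o1, u1) in p(inp) for (o2, u2) in q(u1)}

def fmap(p, f):
    return lambda inp: {(f(o), u) for (o, u) in p(inp)}

def num(inp):
    """Atomic parser for a single-digit number (a simplification)."""
    return {(int(inp[0]), inp[1:])} if inp[:1].isdigit() else set()

# E ::= T + E | T,   T ::= F * T | F,   F ::= ( E ) | num
# The results have the shape ((x, y), z), as in the semantic actions above.
def E(inp):
    plus = fmap(seq(seq(T, char("+")), E), lambda r: r[0][0] + r[1])
    return alt(plus, T)(inp)

def T(inp):
    times = fmap(seq(seq(F, char("*")), T), lambda r: r[0][0] * r[1])
    return alt(times, F)(inp)

def F(inp):
    paren = fmap(seq(seq(char("("), E), char(")")), lambda r: r[0][1])
    return alt(paren, num)(inp)

def parse_all(inp):
    """Keep only the results where the whole input was consumed."""
    return {o for (o, u) in E(inp) if u == ""}

print(parse_all("1+2*3+4"))
print(parse_all("(1+2)*3"))
```

The grammar rules are written as ordinary functions so that the recursion between them is only unfolded when a parser is actually applied to some input.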


\frametitle{Types of Parsers}

\item {\bf Sequencing}: if \bl{$p$} returns results of type \bl{$T$}, and \bl{$q$} results of type \bl{$S$},
then \bl{$p \sim q$} returns results of type

\bl{$T \times S$}

\item {\bf Alternative}: if \bl{$p$} returns results of type \bl{$T$} then  \bl{$q$} \alert{must} also have results of type \bl{$T$},
and \bl{$p \;||\; q$} returns results of type

\bl{$T$}

\item {\bf Semantic Action}: if \bl{$p$} returns results of type \bl{$T$} and \bl{$f$} is a function from
\bl{$T$} to \bl{$S$}, then
\bl{$p \Rightarrow f$} returns results of type

\bl{$S$}



\frametitle{Input Types of Parsers}

\item input: \alert{token list}
\item output: set of (output\_type, \alert{token list})

actually it can be any input type as long as it is a kind of
sequence (for example a string)


\frametitle{Scannerless Parsers}

\item input: \alert{string}
\item output: set of (output\_type, \alert{string})

but using lexers is better because whitespace and comments can be
filtered out; the input is then a sequence of tokens


\frametitle{Successful Parses}

\item input: string
\item output: \alert{set of} (output\_type, string)

a parse is successful whenever the input has been fully
``consumed'' (that is, the second component is empty)


\frametitle{Abstract Parser Class}




\frametitle{Two Grammars}

Which languages are recognised by the following two grammars?

$\meta{S}$ & $\rightarrow$ &  $1 \cdot \meta{S} \cdot \meta{S}$\\
        & $|$ & $\epsilon$

$\meta{U}$ & $\rightarrow$ &  $1 \cdot \meta{U}$\\
        & $|$ & $\epsilon$


\frametitle{Ambiguous Grammars}

Plot (data not shown): parsing time in secs against the number of \pcode{1}s in the input, for the unambiguous and the ambiguous grammar.



\bl{\begin{plstx}[rhs style=,one per line]: \meta{Stmt} ::= skip
         | \meta{Id} := \meta{AExp}
         | if \meta{BExp} then \meta{Block} else \meta{Block}
         | while \meta{BExp} do \meta{Block}\\
: \meta{Stmts} ::= \meta{Stmt} ; \meta{Stmts}
          | \meta{Stmt}\\
: \meta{Block} ::= \{ \meta{Stmts} \}
          | \meta{Stmt}\\
: \meta{AExp} ::= \ldots\\
: \meta{BExp} ::= \ldots\\\end{plstx}}


\frametitle{An Interpreter}

\;\;$x := 5 \text{;}$\\
\;\;$y := x * 3\text{;}$\\
\;\;$y := x * 4\text{;}$\\
\;\;$x := u * 3$\\

\item the interpreter has to record the value of \bl{$x$} before assigning a value to \bl{$y$}\pause
\item \bl{\texttt{eval(stmt, env)}}



$\text{eval}(n, E)$ & $\dn$ & $n$\\
$\text{eval}(x, E)$ & $\dn$ & $E(x)$ \;\;\;\textcolor{black}{lookup \bl{$x$} in \bl{$E$}}\\
$\text{eval}(a_1 + a_2, E)$ & $\dn$ & $\text{eval}(a_1, E) + \text{eval}(a_2, E)$\\
$\text{eval}(a_1 - a_2, E)$ & $\dn$ & $\text{eval}(a_1, E) - \text{eval}(a_2, E)$\\
$\text{eval}(a_1 * a_2, E)$ & $\dn$ & $\text{eval}(a_1, E) * \text{eval}(a_2, E)$\bigskip\\
$\text{eval}(a_1 = a_2, E)$ & $\dn$ & $\text{eval}(a_1, E) = \text{eval}(a_2, E)$\\
$\text{eval}(a_1\,!\!= a_2, E)$ & $\dn$ & $\neg(\text{eval}(a_1, E) = \text{eval}(a_2, E))$\\
$\text{eval}(a_1 < a_2, E)$ & $\dn$ & $\text{eval}(a_1, E) < \text{eval}(a_2, E)$\\

\frametitle{\begin{tabular}{c}Interpreter (2)\end{tabular}}

$\text{eval}(\text{skip}, E)$ & $\dn$ & $E$\\
$\text{eval}(x:=a, E)$ & $\dn$ & \bl{$E(x \mapsto \text{eval}(a, E))$}\\
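\multicolumn{3}{@{}l@{}}{$\text{eval}(\text{if}\;b\;\text{then}\;cs_1\;\text{else}\;cs_2 , E) \dn$}\\
\multicolumn{3}{@{}l@{}}{\hspace{2cm}$\text{if}\;\text{eval}(b,E)\;\text{then}\;\text{eval}(cs_1,E)$}\\
\multicolumn{3}{@{}l@{}}{\hspace{2cm}$\text{else}\;\text{eval}(cs_2,E)$}\\
\multicolumn{3}{@{}l@{}}{$\text{eval}(\text{while}\;b\;\text{do}\;cs, E) \dn$}\\
\multicolumn{3}{@{}l@{}}{\hspace{2cm}$\text{if}\;\text{eval}(b,E)$}\\
\multicolumn{3}{@{}l@{}}{\hspace{2cm}$\text{then}\;\text{eval}(\text{while}\;b\;\text{do}\;cs, \text{eval}(cs,E))$}\\
\multicolumn{3}{@{}l@{}}{\hspace{2cm}$\text{else}\; E$}\\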
$\text{eval}(\text{write}\; x, E)$ & $\dn$ & $\{\;\text{println}(E(x))\; ;\;E\;\}$\\
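
The evaluation rules translate directly into code. The Python sketch below is an illustration under several assumptions: syntax trees are tagged tuples, environments are dictionaries, blocks are lists of statements, and all names (\texttt{eval\_aexp}, \texttt{eval\_bexp}, \texttt{eval\_stmt}, \texttt{eval\_block}) are made up. The while case uses a loop, which computes the same environment as the recursive rule but avoids Python's recursion limit.

```python
import operator

AOPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
BOPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt}

def eval_aexp(a, env):
    """a is ("num", n), ("var", x) or ("aop", op, a1, a2)."""
    kind = a[0]
    if kind == "num":
        return a[1]
    if kind == "var":
        return env[a[1]]                 # lookup x in E
    _, op, a1, a2 = a
    return AOPS[op](eval_aexp(a1, env), eval_aexp(a2, env))

def eval_bexp(b, env):
    """b is (op, a1, a2) with op one of =, != and <."""
    op, a1, a2 = b
    return BOPS[op](eval_aexp(a1, env), eval_aexp(a2, env))

def eval_stmt(s, env):
    """Evaluate one statement and return the updated environment."""
    kind = s[0]
    if kind == "skip":
        return env
    if kind == "assign":
        _, x, a = s
        return {**env, x: eval_aexp(a, env)}   # E(x |-> eval(a, E))
    if kind == "if":
        _, b, cs1, cs2 = s
        return eval_block(cs1 if eval_bexp(b, env) else cs2, env)
    if kind == "while":
        _, b, cs = s
        while eval_bexp(b, env):               # loop instead of recursion
            env = eval_block(cs, env)
        return env
    if kind == "write":
        print(env[s[1]])
        return env
    raise ValueError(f"unknown statement: {s!r}")

def eval_block(stmts, env):
    for s in stmts:
        env = eval_stmt(s, env)
    return env

# x := 5; y := x * 3; y := x * 4
prog = [("assign", "x", ("num", 5)),
        ("assign", "y", ("aop", "*", ("var", "x"), ("num", 3))),
        ("assign", "y", ("aop", "*", ("var", "x"), ("num", 4)))]
print(eval_block(prog, {}))
```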


\frametitle{Test Program}




\frametitle{\begin{tabular}{c}Interpreted Code\end{tabular}}

Plot (data not shown): running time in secs against \bl{$n$}.


\frametitle{\begin{tabular}{c}Java Virtual Machine\end{tabular}}

\item introduced in 1995
\item is a stack-based VM (like Postscript, CLR of .Net)
\item contains a JIT compiler
\item many languages take advantage of JVM's infrastructure (JRE)
\item is garbage collected and checks array bounds $\Rightarrow$ no buffer overflows
\item some languages compile to the JVM: Scala, Clojure\ldots



