handouts/ho06.tex
author Christian Urban <christian dot urban at kcl dot ac dot uk>
Fri, 01 Nov 2013 13:16:43 +0000
changeset 175 5801e8c0e528
parent 173 7cfb7a6f7c99
child 176 3c2653fc8b5a
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{charter}
\usepackage{hyperref}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{tikz}
\usetikzlibrary{arrows}
\usetikzlibrary{automata}
\usetikzlibrary{shapes}
\usetikzlibrary{shadows}
\usetikzlibrary{positioning}
\usetikzlibrary{calc}
\usetikzlibrary{fit}
\usetikzlibrary{backgrounds}
\usepackage{fontspec}
\setmonofont{Consolas}

\newcommand{\dn}{\stackrel{\mbox{\scriptsize def}}{=}}%

\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc

\lstdefinelanguage{scala}{
  morekeywords={abstract,case,catch,class,def,%
    do,else,extends,false,final,finally,%
    for,if,implicit,import,match,mixin,%
    new,null,object,override,package,%
    private,protected,requires,return,sealed,%
    super,this,throw,trait,true,try,%
    type,val,var,while,with,yield},
  otherkeywords={=>,<-,<\%,<:,>:,\#,@},
  sensitive=true,
  morecomment=[l]{//},
  morecomment=[n]{/*}{*/},
  morestring=[b]",
  morestring=[b]',
  morestring=[b]"""
}

\lstdefinelanguage{while}{
  morekeywords={while, if, then. else, read, write},
  otherkeywords={=>,<-,<\%,<:,>:,\#,@},
  sensitive=true,
  morecomment=[l]{//},
  morecomment=[n]{/*}{*/},
  morestring=[b]",
  morestring=[b]',
  morestring=[b]"""
}


\lstset{language=Scala,
	basicstyle=\ttfamily,
	keywordstyle=\color{javapurple}\bfseries,
	stringstyle=\color{javagreen},
	commentstyle=\color{javagreen},
	morecomment=[s][\color{javadocblue}]{/**}{*/},
	numbers=left,
	numberstyle=\tiny\color{black},
	stepnumber=1,
	numbersep=10pt,
	tabsize=2,
	showspaces=false,
	showstringspaces=false}
	
\newcommand\grid[1]{%
\begin{tikzpicture}[baseline=(char.base)]
  \path[use as bounding box]
    (0,0) rectangle (1em,1em);
  \draw[red!50, fill=red!20]
    (0,0) rectangle (1em,1em);
  \node[inner sep=1pt,anchor=base west]
    (char) at (0em,\gridraiseamount) {#1};
\end{tikzpicture}}
\newcommand\gridraiseamount{0.12em}

\makeatletter
\newcommand\Grid[1]{%
  \@tfor\z:=#1\do{\grid{\z}}}
\makeatother	

\newcommand\Vspace[1][.3em]{%
  \mbox{\kern.06em\vrle height.3ex}%
  \vbox{\hrule width#1}%
  \hbox{\vrule height.3ex}}

\def\VS{\Vspace[0.6em]}
	
\begin{document}

\section*{Handout 6}

While regular expressions are very useful for lexing and for recognising
many patterns (like email addresses), they have their limitations. For
example there is no regular expression that can recognise the language 
$a^nb^n$. Another example is the language of well-parenthesised 
expressions.  In languages like Lisp, which use parentheses rather
extensively, it might be of interest whether the following two expressions
are well-parenthesised (the left one is, the right one is not):

\begin{center}
$(((()()))())$  \hspace{10mm} $(((()()))()))$
\end{center}

In order to solve such recognition problems, we need more powerful 
techniques than regular expressions. We will in particular look at \emph{context-free
languages}. They include the regular languages as the picture below shows:


\begin{center}
\begin{tikzpicture}
[rect/.style={draw=black!50, top color=white,bottom color=black!20, rectangle, very thick, rounded corners}]

\draw (0,0) node [rect, text depth=30mm, text width=46mm] {all languages};
\draw (0,-0.4) node [rect, text depth=20mm, text width=44mm] {decidable languages};
\draw (0,-0.65) node [rect, text depth=13mm] {context sensitive languages};
\draw (0,-0.84) node [rect, text depth=7mm, text width=35mm] {context-free languages};
\draw (0,-1.05) node [rect] {regular languages};
\end{tikzpicture}
\end{center}

\noindent
Context-free languages play an important role in `day-to-day' text processing and in
programming languages. Context-free languages are usually specified by grammars.
For example a grammar for well-parenthesised  expressions is

\begin{center}
$P \;\;\rightarrow\;\; ( \cdot  P \cdot ) \cdot P \;|\; \epsilon$
\end{center}
 
\noindent
In general grammars consist of finitely many rules built up from terminal symbols (usually lower-case letters)
and non-terminal symbols (upper-case letters).  Rules have the shape

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}$
\end{center}
 
\noindent
where on the left-hand side is a single non-terminal and on the right a string consisting
of both terminals and non-terminals including the $\epsilon$-symbol for indicating the
empty string. We use the convention  to separate components on
the right hand-side by using the $\cdot$ symbol, as in the grammar for well-parenthesised  expressions.
We also use the convention to use $|$ as a shorthand notation for several rules. For example

\begin{center}
$NT \;\;\rightarrow\;\; \textit{rhs}_1 \;|\; \textit{rhs}_2$
\end{center}

\noindent
means that the non-terminal $NT$ can be replaced by either $\textit{rhs}_1$ or $\textit{rhs}_2$.
If there are more than one non-terminal on the left-hand side of the rules, then we need to indicate
what is the \emph{starting} symbol of the grammar. For example the grammar for arithmetic expressions
can be given as follows

\begin{center}
\begin{tabular}{lcl}
$E$ & $\rightarrow$ &  $N$ \\
$E$ & $\rightarrow$ &  $E \cdot + \cdot E$ \\
$E$ & $\rightarrow$ &  $E \cdot - \cdot E$ \\
$E$ & $\rightarrow$ &  $E \cdot * \cdot E$ \\
$E$ & $\rightarrow$ &  $( \cdot E \cdot )$\\
$N$ & $\rightarrow$ & $N \cdot N \;|\; 0 \;|\; 1 \;|\: \ldots \;|\; 9$ 
\end{tabular}
\end{center}

\noindent
where $E$ is the starting symbol. A \emph{derivation} for a grammar
starts with the staring symbol of the grammar and in each step replaces one
non-terminal by a right-hand side of a rule. A derivation ends with a string
in which only terminal symbols are left. For example a derivation for the
string $(1 + 2) + 3$ is as follows:

\begin{center}
\begin{tabular}{lll}
$E$ & $\rightarrow$ & $E+E$\\
       & $\rightarrow$ & $(E)+E$\\
       & $\rightarrow$ & $(E+E)+E$\\
       & $\rightarrow$ & $(E+E)+N$\\
       & $\rightarrow$ & $(E+E)+3$\\
       & $\rightarrow$ & $(N+E)+3$\\	
       & $\rightarrow^+$ & $(1+2)+3$\\
\end{tabular} 
\end{center}

\noindent
The \emph{language} of a context-free grammar $G$ with start symbol $S$ 
is defined as the set of strings derivable by a derivation, that is

\begin{center}
$\{c_1\ldots c_n \;|\; S \rightarrow^* c_1\ldots c_n \;\;\text{with all} \; c_i \;\text{being non-terminals}\}$
\end{center}

\noindent
A \emph{parse-tree} encodes how a string is derived with the starting symbol on 
top and each non-terminal containing a subtree for how it is replaced in a derivation.
The parse tree for the string $(1 + 23)+4$ is as follows:

\begin{center}
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$} 
       child {node {$($}}
       child {node {$E$}       
         child {node {$E$} child {node {$N$} child {node {$1$}}}}
         child {node {$+$}}
         child {node {$E$} 
            child {node {$N$} child {node {$2$}}}
            child {node {$N$} child {node {$3$}}}
            } 
        }
       child {node {$)$}}
     }
     child {node {$+$}}
     child {node {$E$}
        child {node {$N$} child {node {$4$}}}
     };
\end{tikzpicture}
\end{center}

\noindent
We are often interested in these parse-trees since they encode the structure of
how a string is derived by a grammar. Before we come to the problem of constructing
such parse-trees, we need to consider the following two properties of grammars.
A grammar is \emph{left-recursive} if there is a derivation starting from a non-terminal, say
$NT$ which leads to a string which again starts with $NT$. This means a derivation of the
form.

\begin{center}
$NT \rightarrow \ldots \rightarrow NT \cdot \ldots$
\end{center}

\noindent
It can be easily seems that the grammar above for arithmetic expressions is left-recursive:
for example the rules $E \rightarrow E\cdot + \cdot E$ and $N \rightarrow N\cdot N$ 
show that this grammar is left-recursive. Some algorithms cannot cope with left-recursive 
grammars. Fortunately every left-recursive grammar can be transformed into one that is
not left-recursive, although this transformation might make the grammar less human-readable.
For example if we want to give a non-left-recursive grammar for numbers we might
specify

\begin{center}
$N \;\;\rightarrow\;\; 0\;|\;\ldots\;|\;9\;|\;1\cdot N\;|\;2\cdot N\;|\;\ldots\;|\;9\cdot N$
\end{center}

\noindent
Using this grammar we can still derive every number string, but we will never be able 
to derive a string of the form $\ldots \rightarrow N \cdot \ldots$.

The other property we have to watch out is when a grammar is
\emph{ambiguous}. A grammar is said to be ambiguous if there are two parse-trees
for one string. Again the grammar for arithmetic expressions shown above is ambiguous.
While the shown parse tree for the string $(1 + 23) + 4$ is unique, there are two parse
trees for the string $1 + 2 + 3$, namely

\begin{center}
\begin{tabular}{c@{\hspace{10mm}}c}
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$} child {node {$N$} child {node {$1$}}}}
    child {node {$+$}}
    child {node {$E$}
       child {node {$E$} child {node {$N$} child {node {$2$}}}}
       child {node {$+$}}
       child {node {$E$} child {node {$N$} child {node {$3$}}}}
    }
    ;
\end{tikzpicture} 
&
\begin{tikzpicture}[level distance=8mm, black]
  \node {$E$}
    child {node {$E$}
       child {node {$E$} child {node {$N$} child {node {$1$}}}}
       child {node {$+$}}
       child {node {$E$} child {node {$N$} child {node {$2$}}}} 
    }
    child {node {$+$}}
    child {node {$E$} child {node {$N$} child {node {$3$}}}}
    ;
\end{tikzpicture}
\end{tabular} 
\end{center}

\noindent
In particular in programming languages we will try to avoid ambiguous
grammars as two different parse-trees for a string means a program can
be interpreted in two different ways. Then we have to somehow make sure
the two different ways do not matter, or disambiguate the grammar in
some way (for example making the $+$ left-associative). Unfortunately already 
the problem of deciding whether a grammar
is ambiguous or not is in general undecidable. But in concrete instances we can
often make a choice.

\end{document}

%%% Local Variables: 
%%% mode: latex  
%%% TeX-master: t
%%% End: