\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\begin{document}
\section*{A Crash-Course on Notation}
There are innumerable books available about compilers, automata theory
and formal languages. Unfortunately, they often use their own
notational conventions and their own symbols. This handout is meant to
clarify some of the notation I will use. I appologise in advance that
sometimes I will be a bit fuzzy\ldots the problem is that often we
want to have convenience in our mathematical definitions (to make them
readable and understandable), but other times we need pedantic
precision for actual programs.
\subsubsection*{Characters and Strings}
The most important concept in this module are strings. Strings
are composed of \defn{characters}. While characters are surely
a familiar concept, we will make one subtle distinction in
this module. If we want to refer to concrete characters, like
\code{a}, \code{b}, \code{c} and so on, we use a typewriter font.
Accordingly if we want to refer to the concrete characters of
my email address we shall write
\begin{center}
\pcode{christian.urban@kcl.ac.uk}
\end{center}
\noindent If we also need to explicitly indicate the ``space''
character, we write \VS{}\hspace{1mm}. For example
\begin{center}
\tt{}hello\VS\hspace{0.5mm}world
\end{center}
\noindent But often we do not care which particular characters
we use. In such cases we use the italic font and write $a$,
$b$, $c$ and so on for characters. Therefore if we need a
representative string, we might write
\[
abracadabra
\]
\noindent In this string, we do not really care what the
characters stand for, except we do care about the fact that
for example the character $a$ is not equal to $b$ and so on.
Why do I make this distinction? Because we often need to
define functions using variables ranging over characters. We
need to somehow say this is a variable, say $c$, ranging over
characters, while this is the atual character \pcode{c}.
An \defn{alphabet} is a (non-empty) finite set of characters.
Often the letter $\Sigma$ is used to refer to an alphabet. For
example the ASCII characters \pcode{a} to \pcode{z} form an
alphabet. The digits $0$ to $9$ are another alphabet. The
Greek letters $\alpha$ to $\omega$ also form an alphabet. If
nothing else is specified, we usually assume the alphabet
consists of just the lower-case letters $a$, $b$, \ldots, $z$.
Sometimes, however, we explicitly want to restrict strings to
contain only the letters $a$ and $b$, for example. In this
case we will state that the alphabet is the set $\{a, b\}$.
\defn{Strings} are lists of characters. Unfortunately, there
are many ways how we can write down strings. In programming
languages, they are usually written as \dq{\texttt{hello}} where the
double quotes indicate that we are dealing with a string. In
typed programming languages, such as Scala, strings have a special
type---namely \pcode{String} which is different from the type
for lists of chatacters. This is because strings can be
efficiently represented in memory, unlike general lists. Since
\code{String} and the type of lists of characters,
\code{List[Char]} are not the same, we need to explicitly
coerce elements between the two types, for example
\begin{lstlisting}[numbers=none]
scala> "abc".toList
res01: List[Char] = List(a, b, c)
\end{lstlisting}
\noindent Since in our (mathematical) definitions we regard
strings as lists of characters, we will also write
\dq{$hello$} as
\[
[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello}
\]
\noindent The important point is that we can always decompose
such strings. For example, we will often consider the first
character of a string, say $h$, and the ``rest'' of a string
say \dq{\textit{ello}} when making definitions about strings.
There are also some subtleties with the empty string,
sometimes written as \dq{} but also as the empty list of
characters $[\,]$.\footnote{In the literature you can also
often find that $\varepsilon$ or $\lambda$ is used to
represent the empty string.}
Two strings, say $s_1$ and $s_2$, can be \defn{concatenated},
which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as
lists of characters, then $@$ is the list-append function.
Suppose we are given two strings \dq{\textit{foo}} and
\dq{\textit{bar}}, then their concatenation, writen
\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives
\dq{\textit{foobar}}. But as said above, we will often
simplify our life and just drop the double quotes whenever it
is clear we are talking about strings, So we will often just
write \textit{foo}, \textit{bar}, \textit{foobar} or
\textit{foo $@$ bar}.
Occasionally we will use the notation $a^n$ for strings, which stands
for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that
has as many $a$s by as many $b$s. A simple property of string
concatenation is \emph{associativity}, meaning
\[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\]
\noindent are always equal strings. The empty string behaves
like a \emph{unit element}, therefore
\[s \,@\, [] = [] \,@\, s = s\]
\subsubsection*{Sets and Languages}
We will use the familiar operations $\cup$, $\cap$, $\subset$
and $\subseteq$ for sets. For the empty set we will either
write $\varnothing$ or $\{\,\}$. The set containing the
natural numbers $1$, $2$ and $3$, for example, we will write
with curly braces as
\[
\{1, 2, 3\}
\]
\noindent The notation $\in$ means \emph{element of}, so $1 \in \{1,
2, 3\}$ is true and $4 \in \{1, 2, 3\}$ is false. Not that the
\emph{list} $[1, 2, 3]$ is something different from the \emph{set}
$\{1, 2, 3\}$: in the former we care about the order and potentially
several occurrences of a number; while with the latter we do not.
Also sets can potentially have infinitely many elements. For example
the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This
set is often also abbreviated as $\mathbb{N}$. Lists can be very large, but they cannot contain infinitely many elements.
We can define sets by giving all elements, for example $\{0, 1\}$ for
the set containing just $0$ and $1$, but also by \defn{set
comprehensions}. For example the set of all even natural numbers can
be defined as
\[
\{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\}
\]
\noindent Though silly, but the set $\{0, 1, 2\}$ could also be
defined by the following set comprehension
\[
\{n\;|\; n^2 < 9 \wedge n \in \mathbb{N}\}
\]
\noindent Can you see why this defines the set $\{0, 1, 2\}$? Notice
that set comprehensions could be used to define set union,
intersection and difference:
\begin{eqnarray*}
A \cup B & \dn & \{x\;|\; x \in A \vee x \in B\}\\
A \cap B & \dn & \{x\;|\; x \in A \wedge x \in B\}\\
A \backslash B & \dn & \{x\;|\; x \in A \wedge x \not\in B\}
\end{eqnarray*}
\noindent In general set comprehensions are of the form
$\{a\;|\;P\}$ which stands for the set of all elements $a$
(from some set) for which some property $P$ holds.
For defining sets, we will also often use the notion of the
``big union''. An example is as follows:
\begin{equation}\label{bigunion}
\bigcup_{0\le n}\; \{n^2, n^2 + 1\}
\end{equation}
\noindent which is the set of all squares and their immediate
successors, so
\[
\{0, 1, 2, 4, 5, 9, 10, 16, 17, \ldots\}
\]
\noindent A big union is a sequence of unions which are
indexed typically by a natural number. So the big union in
\eqref{bigunion} could equally be written as
\[
\{0, 1\} \cup \{1, 2\} \cup \{4, 5\} \cup \{9, 10\} \cup
\ldots
\]
\noindent but using the big union notation is more concise.
While this stuff about sets might all look trivial or even needlessly
pedantic, \emph{Nature} is never simple. If you want to be amazed how
complicated sets can get, watch out for the last lecture just before
Christmas where I want to convince you of the fact that some sets are
more infinite than others. Yes, you read correctly, there can be sets
that are ``more infinite'' then others. If you think this is obvious:
say you have the infinite set $\{1, 2, 3, 4, \ldots\}$ which is all
the natural numbers except $0$, and then compare it to the set
$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$. Yes, the second must be more infinite\ldots{} well, then think again. Because the two
infinite sets
\begin{center}
$\{1, 2, 3, 4, \ldots\}$ and
$\{0, 1, 2, 3, 4, \ldots\}$
\end{center}
\noindent
contain actually the same number of elements. Does this make sense?
Though this might all look strange this about infinite sets will be a
topic that is very relevant to the material of this module. It tells
us what we can compute with a computer (actually algorithm) and what
we cannot.
Another important notion in this module are \defn{languages}, which
are sets of strings. One of the main goals for us will be how to
(formally) specify languages and to find out whether a string
is in a language or not.\footnote{You might wish to ponder
whether this is in general a hard or easy problem, where
hardness is meant in terms of Turing decidable, for example.}
Note that the language containing the empty string $\{\dq{}\}$
is not equal to $\varnothing$, the empty language (or empty
set): The former contains one element, namely \dq{} (also
written $[\,]$), but the latter does not contain any
element.
For languages we define the operation of \defn{language
concatenation}, written like in the string case as $A @ B$:
\begin{equation}\label{langconc}
A @ B \dn \{s_1 @ s_2\;|\; s_1\in A \wedge s_2\in B\}
\end{equation}
\noindent Be careful to understand the difference: the $@$
in $s_1 @ s_2$ is string concatenation, while $A @ B$ refers
to the concatenation of two languages (or sets of strings).
As an example suppose $A=\{ab, ac\}$ and $B=\{zzz, qq, r\}$,
then $A \,@\, B$ is the language
\[
\{abzzz, abqq, abr, aczzz, acqq, acr\}
\]
\noindent Recall the properties for string concatenation. For
language concatenation we have the following properties
\begin{center}
\begin{tabular}{ll}
associativity: & $(A @ B) @ C = A @ (B @ C)$\\
unit element: & $A \,@\, \{[]\} = \{[]\} \,@\, A = A$\\
zero element: & $A \,@\, \varnothing = \varnothing \,@\, A =
\varnothing$
\end{tabular}
\end{center}
\noindent Note the difference in the last two lines: the empty
set behaves like $0$ for multiplication and the set $\{[]\}$
like $1$ for multiplication ($n * 1 = n$ and $n * 0 = 0$).
Following the language concatenation, we can define a
\defn{language power} operation as follows:
\begin{eqnarray*}
A^0 & \dn & \{[]\}\\
A^{n+1} & \dn & A \,@\, A^n
\end{eqnarray*}
\noindent This definition is by recursion on natural numbers.
Note carefully that the zero-case is not defined as the empty
set, but the set containing the empty string. So no matter
what the set $A$ is, $A^0$ will always be $\{[]\}$. (There is
another hint about a connection between the $@$-operation and
multiplication: How is $x^n$ defined recursively and what is
$x^0$?)
Next we can define the \defn{star operation} for languages:
$A\star$ is the union of all powers of $A$, or short
\begin{equation}\label{star}
A\star \dn \bigcup_{0\le n}\; A^n
\end{equation}
\noindent This star operation is often also called
\emph{Kleene-star}. Unfolding the definition in \eqref{star}
gives
\[
A^0 \cup A^1 \cup A^2 \cup A^3 \cup \ldots
\]
\noindent
which is equal to
\[
\{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots
\]
\noindent We can see that the empty string is always in $A\star$,
no matter what $A$ is. This is because $[] \in A^0$. To make
sure you understand these definitions, I leave you to answer
what $\{[]\}\star$ and $\varnothing\star$ are?
Recall that an alphabet is often referred to by the letter
$\Sigma$. We can now write for the set of \emph{all} strings
over this alphabet $\Sigma\star$. In doing so we also include the
empty string as a possible string over $\Sigma$. So if $\Sigma
= \{a, b\}$, then $\Sigma\star$ is
\[
\{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\}
\]
\noindent or in other words all strings containing $a$s and
$b$s only, plus the empty string.
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: