\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018}
\section*{A Crash-Course on Notation}
There are innumerable books available about compilers, automata theory
and formal languages. Unfortunately, they often use their own
notational conventions and their own symbols. This handout is meant to
clarify some of the notation I will use. I apologise in advance that
sometimes I will be a bit fuzzy\ldots the problem is that often we
want to have convenience in our mathematical definitions (to make them
readable and understandable), but other times we need pedantic
precision for actual programs.
\subsubsection*{Characters and Strings}
The most important concept in this module are strings. Strings
are composed of \defn{characters}. While characters are surely
a familiar concept, we will make one subtle distinction in
this module. If we want to refer to concrete characters, like
\code{a}, \code{b}, \code{c} and so on, we use a typewriter font.
Accordingly if we want to refer to the concrete characters of
my email address we shall write
\begin{center}
\pcode{christian.urban@kcl.ac.uk}
\end{center}
\noindent If we also need to explicitly indicate the ``space''
character, we write \VS{}\hspace{1mm}. For example
\begin{center}
\tt{}hello\VS\hspace{0.5mm}world
\end{center}
\noindent But often we do not care which particular characters
we use. In such cases we use the italic font and write $a$,
$b$, $c$ and so on for characters. Therefore if we need a
representative string, we might write
\[
abracadabra
\]
\noindent In this string, we do not really care what the
characters stand for, except we do care about the fact that
for example the character $a$ is not equal to $b$ and so on.
Why do I make this distinction? Because we often need to
define functions using variables ranging over characters. We
need to somehow say this is a variable, say $c$, ranging over
characters, while this is the actual character \pcode{c}.
An \defn{alphabet} is a (non-empty) finite set of characters.
Often the letter $\Sigma$ is used to refer to an alphabet. For
example the ASCII characters \pcode{a} to \pcode{z} form an
alphabet. The digits $0$ to $9$ are another alphabet. The
Greek letters $\alpha$ to $\omega$ also form an alphabet. If
nothing else is specified, we usually assume the alphabet
consists of just the lower-case letters $a$, $b$, \ldots, $z$.
Sometimes, however, we explicitly want to restrict strings to
contain only the letters $a$ and $b$, for example. In this
case we will state that the alphabet is the set $\{a, b\}$.
\defn{Strings} are lists of characters. Unfortunately, there
are many ways how we can write down strings. In programming
languages, they are usually written as \dq{\texttt{hello}} where the
double quotes indicate that we are dealing with a string. In
typed programming languages, such as Scala, strings have a special
type---namely \pcode{String} which is different from the type
for lists of characters. This is because strings can be
efficiently represented in memory, unlike lists. Since
\code{String} and the type of lists of characters
(\code{List[Char]}) are not the same, we need to explicitly
coerce elements between the two types, for example
\begin{lstlisting}[numbers=none]
scala> "abc".toList
res01: List[Char] = List(a, b, c)
\end{lstlisting}
\noindent
However, we do not want to do this kind of explicit coercion in our
pencil-and-paper, everyday arguments. So in our (mathematical)
definitions we regard strings as lists of characters and we will also
write \dq{$hello$} as list
\[
[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello}
\]
\noindent The important point is that we can always decompose
such strings. For example, we will often consider the first
character of a string, say $h$, and the ``rest'' of a string
say \dq{\textit{ello}} when making definitions about strings.
There are also some subtleties with the empty string,
sometimes written as \dq{} but also as the empty list of
characters $[\,]$.\footnote{In the literature you can also
often find that $\varepsilon$ or $\lambda$ is used to
represent the empty string.}
Two strings, say $s_1$ and $s_2$, can be \defn{concatenated},
which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as
lists of characters, then $@$ is the list-append function.
Suppose we are given two strings \dq{\textit{foo}} and
\dq{\textit{bar}}, then their concatenation, written
\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives
\dq{\textit{foobar}}. But as said above, we will often
simplify our life and just drop the double quotes whenever it
is clear we are talking about strings. So we will often just
write \textit{foo}, \textit{bar}, \textit{foobar} or
\textit{foo $@$ bar}.
Occasionally we will use the notation $a^n$ for strings, which stands
for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that
has some number of $a$s followed by the same number of $b$s. A simple
property of string concatenation is \emph{associativity}, meaning
\[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\]
\noindent are always equal strings. The empty string behaves
like a \emph{unit element}, therefore
\[s \,@\, [] = [] \,@\, s = s\]
\subsubsection*{Sets and Languages}
We will use the familiar operations $\cup$, $\cap$, $\subset$
and $\subseteq$ for sets. For the empty set we will either
write $\varnothing$ or $\{\,\}$. The set containing the
natural numbers $1$, $2$ and $3$, for example, we will write
with curly braces as
\[
\{1, 2, 3\}
\]
\noindent The notation $\in$ means \emph{element of}, so $1 \in \{1,
2, 3\}$ is true and $4 \in \{1, 2, 3\}$ is false. Note that the
\emph{list} $[1, 2, 3]$ is something different from the \emph{set}
$\{1, 2, 3\}$: in the former we care about the order and potentially
several occurrences of a number; while with the latter we do not.
Also sets can potentially have infinitely many elements, whereas lists
cannot. For example
the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This
set is often also abbreviated as $\mathbb{N}$. Lists can be very large, but they cannot contain infinitely many elements.
We can define sets by giving all their elements, for example $\{0, 1\}$ for
the set containing just $0$ and $1$. But often we need to use \defn{set
comprehensions} to define sets. For example the set of all \emph{even}
natural numbers can be defined as
\[
\{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\}
\]
\noindent Set comprehensions consist of a ``base set'' (in this case
all the natural numbers) and a predicate (here eveness).
Though silly, but the set $\{0, 1, 2\}$ could also be
defined by the following set comprehension
\[
\{n\;|\; n \in \mathbb{N} \wedge n^2 < 9\}
\]
\noindent Can you see why this defines the set $\{0, 1, 2\}$? Notice
that set comprehensions are quite powerful constructions. For example they
could be used to define set union,
set intersection and set difference:
\begin{eqnarray*}
A \cup B & \dn & \{x\;|\; x \in A \vee x \in B\}\\
A \cap B & \dn & \{x\;|\; x \in A \wedge x \in B\}\\
A \backslash B & \dn & \{x\;|\; x \in A \wedge x \not\in B\}
\end{eqnarray*}
\noindent In general set comprehensions are of the form
$\{a\;|\;P\}$ which stands for the set of all elements $a$
(from some set) for which some property $P$ holds.
For defining sets, we will also often use the notion of the
``big union''. An example is as follows:
\begin{equation}\label{bigunion}
\bigcup_{0\le n}\; \{n^2, n^2 + 1\}
\end{equation}
\noindent which is the set of all squares and their immediate
successors, so
\[
\{0, 1, 2, 4, 5, 9, 10, 16, 17, \ldots\}
\]
\noindent A big union is a sequence of unions which are
indexed typically by a natural number. So the big union in
\eqref{bigunion} could equally be written as
\[
\{0, 1\} \cup \{1, 2\} \cup \{4, 5\} \cup \{9, 10\} \cup
\ldots
\]
\noindent but using the big union notation is more concise.
As an aside: While this stuff about sets might all look trivial or
even needlessly pedantic, \emph{Nature} is never simple. If you want
to be amazed how complicated sets can get, watch out for the last
lecture just before Christmas where I want to convince you of the fact
that some sets are more infinite than other sets. Yes, you read
correctly, there can be sets that are ``more infinite'' than
others. If you think this is obvious: say you have the infinite set
$\mathbb{N}\backslash\{0\} = \{1, 2, 3, 4, \ldots\}$ which is all the
natural numbers except $0$, and then compare it to the set
$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$ and all other
numbers. If you think, the second must be more infinite\ldots{} well,
then think again. Because the two infinite sets
\begin{center}
$\{1, 2, 3, 4, \ldots\}$ and
$\{0, 1, 2, 3, 4, \ldots\}$
\end{center}
\noindent
contain actually the same amount of elements. Does this make sense?
Though this might all look strange, infinite sets will be a
topic that is very relevant to the material of this module. It tells
us what we can compute with a computer (actually algorithm) and what
we cannot. But during the first 9 lectures we can go by without this
``weird'' stuff. End of aside.\smallskip
Another important notion in this module are \defn{languages}, which
are sets of strings. One of the main goals for us will be how to
(formally) specify languages and to find out whether a string
is in a language or not.\footnote{You might wish to ponder
whether this is in general a hard or easy problem, where
hardness is meant in terms of Turing decidable, for example.}
Note that the language containing the empty string $\{\dq{}\}$
is not equal to $\varnothing$, the empty language (or empty
set): The former contains one element, namely \dq{} (also
written $[\,]$), but the latter does not contain any
element.
For languages we define the operation of \defn{language
concatenation}, written like in the string case as $A @ B$:
\begin{equation}\label{langconc}
A @ B \dn \{s_1 @ s_2\;|\; s_1\in A \wedge s_2\in B\}
\end{equation}
\noindent Be careful to understand the difference: the $@$
in $s_1 @ s_2$ is string concatenation, while $A @ B$ refers
to the concatenation of two languages (or sets of strings).
As an example suppose $A=\{ab, ac\}$ and $B=\{zzz, qq, r\}$,
then $A \,@\, B$ is the language
\[
\{abzzz, abqq, abr, aczzz, acqq, acr\}
\]
\noindent Recall the properties for string concatenation. For
language concatenation we have the following properties
\begin{center}
\begin{tabular}{ll}
associativity: & $(A @ B) @ C = A @ (B @ C)$\\
unit element: & $A \,@\, \{[]\} = \{[]\} \,@\, A = A$\\
zero element: & $A \,@\, \varnothing = \varnothing \,@\, A =
\varnothing$
\end{tabular}
\end{center}
\noindent Note the difference in the last two lines: the empty
set behaves like $0$ for multiplication and the set $\{[]\}$
like $1$ for multiplication ($n * 1 = n$ and $n * 0 = 0$).
Using the operation of language concatenation, we can define a
\defn{language power} operation as follows:
\begin{eqnarray*}
A^0 & \dn & \{[]\}\\
A^{n+1} & \dn & A \,@\, A^n
\end{eqnarray*}
\noindent This definition is by recursion on natural numbers.
Note carefully that the zero-case is not defined as the empty
set, but the set containing the empty string. So no matter
what the set $A$ is, $A^0$ will always be $\{[]\}$. (There is
another hint about a connection between the $@$-operation and
multiplication: How is $x^n$ defined in mathematics and what is
$x^0$?)
Next we can define the \defn{star operation} for languages:
$A\star$ is the union of all powers of $A$, or short
\begin{equation}\label{star}
A\star \dn \bigcup_{0\le n}\; A^n
\end{equation}
\noindent This star operation is often also called
\emph{Kleene-star}. Unfolding the definition in~\eqref{star}
gives
\[
A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup \ldots
\]
\noindent
which is equal to
\[
A\star \dn \{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots
\]
\noindent We can see that the empty string is always in $A\star$,
no matter what $A$ is. This is because $[] \in A^0$. To make
sure you understand these definitions, I leave you to answer
what $\{[]\}\star$ and $\varnothing\star$ are?
Recall that an alphabet is often referred to by the letter
$\Sigma$. We can now write for the set of \emph{all} strings
over this alphabet as $\Sigma\star$. In doing so we also include the
empty string as a possible string over $\Sigma$. So if $\Sigma
= \{a, b\}$, then $\Sigma\star$ is
\[
\{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\}
\]
\noindent or in other words all strings containing $a$s and
$b$s only, plus the empty string.
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End: