handouts/notation.tex
author Christian Urban <christian.urban@kcl.ac.uk>
Fri, 24 Oct 2025 11:26:43 +0100
changeset 1019 43f64633a8a1
parent 994 35db8ee453c0
permissions -rw-r--r--
updated

\documentclass{article}
\usepackage{../style}
\usepackage{../langs}
\usepackage{../graphics}


\begin{document}
\fnote{\copyright{} Christian Urban, King's College London, 2014, 2015, 2016, 2017, 2018, 2020, 2025}


\section*{A Crash-Course on Notation}

There are innumerable books available on compilers, automata theory
and formal languages. Unfortunately, they often use their own
notational conventions and their own symbols. This handout is meant to
clarify some of the notation I will use. I apologise in advance that
sometimes I will be a bit fuzzy\ldots the problem is that often we
want to have convenience in our mathematical definitions (to make them
readable and understandable), but other times we need pedantic
precision for actual programs.

\subsubsection*{Characters and Strings}

The most basic concept in this module are strings. Strings
are composed of \defn{characters}. While characters are surely
a familiar concept, we will make one subtle distinction in
this module. If we want to refer to concrete characters, like
\code{a}, \code{b}, \code{c} and so on, we will use a typewriter font.
Accordingly if we want to refer to the concrete characters of
my email address we shall write

\begin{center}
\pcode{christian.urban@kcl.ac.uk}
\end{center}

\noindent If we also need to explicitly indicate the ``space''
character, we write \VS{}\hspace{1mm}. For example

\begin{center}
\tt{}hello\VS\hspace{0.5mm}world
\end{center}


\noindent But often we do not care which particular characters
we use. In such cases we use the italic font and write $a$,
$b$, $c$ and so on for characters. Therefore if we need a
representative string, we might write

\[
abracadabra
\]

\noindent In this string, we do not really care what the
characters stand for, except we do care about the fact that
for example the character $a$ is not equal to $b$ and so on.
Why do I make this distinction? Because we often need to
define functions using variables ranging over characters. We
need to somehow say ``this-is-a-variable'' and give it a name. 
In such cases we use the italic font.


An \defn{alphabet} is a (non-empty) finite set of characters.
Often the letter $\Sigma$ is used to refer to an alphabet. For
example the ASCII characters \pcode{a} to \pcode{z} form an
alphabet. The digits $0$ to $9$ are another alphabet. The
Greek letters $\alpha$ to $\omega$ also form an alphabet. If
nothing else is specified, we usually assume the alphabet
consists of just the lower-case letters $a$, $b$, \ldots, $z$.
Sometimes, however, we explicitly want to restrict strings to
contain only the letters $a$ and $b$, for example. In this
case we will state that the alphabet is the set $\{a, b\}$. 

\defn{Strings} are lists of characters. Unfortunately, there
are many ways how we can write down strings. In programming
languages, they are usually written as \dq{\texttt{hello}} where the
double quotes indicate that we are dealing with a string. In
typed programming languages, such as Scala, strings have a special
type---namely \pcode{String} which is different from the type
for lists of characters. This is because strings can be
efficiently represented in memory, unlike lists. Since
\code{String} and the type of lists of characters
(namely \code{List[Char]}) are not the same, we need to explicitly
coerce elements between the two types, for example

\begin{lstlisting}[numbers=none]
scala> "abc".toList
res01: List[Char] = List(a, b, c)
\end{lstlisting}

\noindent
However, we do not want to do this kind of explicit coercion in our
pencil-and-paper, everyday arguments.  So in our (mathematical)
definitions we regard strings as lists of characters and we will also
write \dq{$hello$} as list

\[
[\text{\it h, e, l, l, o}] \qquad\text{or simply}\qquad \textit{hello}
\]

\noindent The important point is that we can always decompose
such strings. For example, we will often consider the first
character of a string, say $h$, and the ``rest'' of a string
say \dq{\textit{ello}} when making definitions about strings.
There are also some subtleties with the empty string,
sometimes written as \dq{} but also as the empty list of
characters $[\,]$.\footnote{In the literature you can also
often find that $\varepsilon$ or $\lambda$ is used to
represent the empty string. But we are not going to use this notation.} 

Two strings, say $s_1$ and $s_2$, can be \defn{concatenated},
which we write as $s_1 @ s_2$. If we regard $s_1$ and $s_2$ as
lists of characters, then $@$ is the list-append function.
Suppose we are given two strings \dq{\textit{foo}} and
\dq{\textit{bar}}, then their concatenation, written
\dq{\textit{foo}} $@$ \dq{\textit{bar}}, gives
\dq{\textit{foobar}}. But as said above, we will often
simplify our life and just drop the double quotes whenever it
is clear we are talking about strings. So we will just
write \textit{foo}, \textit{bar}, \textit{foobar}, 
\textit{foo $@$ bar} and so on.

Occasionally we will use the notation $a^n$ for strings, which stands
for the string of $n$ repeated $a$s. So $a^{n}b^{n}$ is a string that
has some number of $a$s followed by the same number of $b$s.
Confusingly, in Scala the notation is ``times'' for this opration.
So you can write 

\begin{lstlisting}[numbers=none]
scala> "a" * 13
val res02: String = aaaaaaaaaaaaa
\end{lstlisting}

\noindent
A simple property of string concatenation is \emph{associativity}, meaning

\[(s_1 @ s_2) @ s_3 = s_1 @ (s_2 @ s_3)\]  

\noindent are always equal strings. The empty string behaves
like a \emph{unit element}, therefore

\[s \,@\, [] = [] \,@\, s = s\]


\subsubsection*{Sets and Languages}

We will use the familiar operations $\cup$, $\cap$, $\subset$
and $\subseteq$ for sets. For the empty set we will either
write $\varnothing$ or $\{\,\}$. The set containing the
natural numbers $1$, $2$ and $3$, for example, we will write
with curly braces as

\[
\{1, 2, 3\}
\]

\noindent The notation $\in$ means \emph{element of}, so $1 \in \{1,
2, 3\}$ is true and $4 \in \{1, 2, 3\}$ is false.  Note that the
\emph{list} $[1, 2, 3]$ is something different from the \emph{set}
$\{1, 2, 3\}$: in the former we care about the order and potentially
several occurrences of a number; while with the latter we do not.
Also sets can potentially have infinitely many elements, whereas lists
cannot. For example
the set of all natural numbers $\{0, 1, 2, \ldots\}$ is infinite. This
set is often also abbreviated as $\mathbb{N}$. Lists can be very large, but they cannot contain infinitely many elements.

We can define sets by giving all their elements, for example $\{0, 1\}$ for
the set containing just $0$ and $1$. But often we need to use \defn{set
  comprehensions} to define sets. For example the set of all \emph{even}
natural numbers can be defined as

\[
\{n\;|\;n\in\mathbb{N} \wedge n\;\text{is even}\}
\]
  
\noindent Set comprehensions consist of a ``base set'' (in this case
all the natural numbers) and a predicate (here eveness).

Though silly, but the set $\{0, 1, 2\}$ could also be
defined by the following set comprehension

\[
\{n\;|\;  n \in \mathbb{N} \wedge n^2 < 9\}
\]

\noindent Can you see why this defines the set $\{0, 1, 2\}$?  Notice
that set comprehensions are quite powerful constructions. For example they
could be used to define set union,
set intersection and set difference:

\begin{eqnarray*}
A \cup B & \dn & \{x\;|\; x \in A \vee x \in B\}\\
A \cap B & \dn & \{x\;|\; x \in A \wedge x \in B\}\\
A \backslash B & \dn & \{x\;|\; x \in A \wedge x \not\in B\} 
\end{eqnarray*}

\noindent In general set comprehensions are of the form
$\{a\;|\;P\}$ which stands for the set of all elements $a$
(from some set) for which some property $P$ holds. If programming
is more your-kind-of-thing, you might recognise the similarities
with for-comprehensions, for example for the silly set above you
could write

\begin{lstlisting}[numbers=none]
scala> for (n <- (0 to 10).toSet; if n * n < 9) yield n
val res03: Set[Int] = Set(0, 1, 2)
\end{lstlisting}

\noindent
This is pretty much the same as $\{n\;|\;  n \in \mathbb{N} \wedge n^2 < 9\}$
just in Scala syntax.
 
For defining sets, we will also often use the notion of the
``big union''. An example is as follows:

\begin{equation}\label{bigunion}
\bigcup_{0\le n}\; \{n^2, n^2 + 1\}
\end{equation}

\noindent which is the set of all squares and their immediate
successors, so

\[
\{0, 1, 2, 4, 5, 9, 10, 16, 17, \ldots\}
\]

\noindent A big union is a sequence of unions which are 
indexed typically by a natural number. So the big union in
\eqref{bigunion} could equally be written as

\[
\{0, 1\} \cup \{1, 2\} \cup \{4, 5\} \cup \{9, 10\} \cup 
\ldots
\]

\noindent but using the big union notation is more concise.\medskip

As an aside: While this stuff about sets might all look trivial or
even needlessly pedantic, \emph{Nature} is never simple. If you want
to be amazed how complicated sets can get, watch out for the last
lecture just before Christmas where I want to convince you of the fact
that some sets are more infinite than other sets. Yes, you read
correctly, there can be sets that are ``more infinite'' than
others. If you think this is obvious: say you have the infinite set
$\mathbb{N}\backslash\{0\} = \{1, 2, 3, 4, \ldots\}$ which is all the
natural numbers except $0$, and then compare it to the set
$\{0, 1, 2, 3, 4, \ldots\}$ which contains the $0$ and all other
numbers. If you think, the second must be more infinite\ldots{} well,
then think again. Because the two infinite sets

\begin{center}
  $\{1, 2, 3, 4, \ldots\}$ and
  $\{0, 1, 2, 3, 4, \ldots\}$
\end{center}

\noindent
contain actually the same amount of elements. Does this make sense to you?
If yes, good. If not, then something to learn about. 

Though this might all look strange, infinite sets will be a topic that
is very relevant to the material of this module. It tells us what we
can compute with a computer (actually an algorithm) and what we
cannot. But during the first 9 lectures we can go by without this
``weird'' stuff. \textbf{Update:} Unfortunately we have now so much
material about compilers in the module that I needed to drop this
lecture about infinite sets. This is really a pity because notions
like infinity (and decidability) play important roles in compilers as
soon as one goes beyond the basics. Fortunately you can get pretty much 
the same
material from very slick videos produced by the Youtube channel Veritasium
(\textit{How An Infinite Hotel Ran Out Of
  Room}).\video{https://www.youtube.com/watch?v=OxGsU8oIWjY} End of
aside.\smallskip

Another important notion in this module are \defn{languages}, which
are sets of strings. One of the main goals for us will be how to
(formally) specify languages and to find out whether a string
is in a language or not.\footnote{You might wish to ponder
whether this is in general a hard or easy problem, where
hardness is meant in terms of Turing decidable, for example.}
Note that the language containing the empty string $\{\dq{}\}$
is not equal to $\varnothing$, the empty language (or empty
set): The former contains one element, namely \dq{} (also
written $[\,]$), but the latter does not contain any
element at all! Make sure you see the difference.

For languages we define the operation of \defn{language
concatenation}, written like in the string case as $A @ B$:

\begin{equation}\label{langconc}
A @ B \dn \{s_1 @ s_2\;|\; s_1\in A \wedge s_2\in B\}
\end{equation}

\noindent Be careful to understand the difference: the $@$
in $s_1 @ s_2$ is string concatenation, while $A @ B$ refers 
to the concatenation of two languages (or sets of strings).
As an example suppose $A=\{ab, ac\}$ and $B=\{zzz, qq, r\}$,
then $A \,@\, B$ is the language

\[
\{abzzz, abqq, abr, aczzz, acqq, acr\}
\] 

\noindent The cool thing about Scala is that we can define language
concatenation very elegantly as

\begin{lstlisting}[numbers=none]
def concat(A: Set[String], B: Set[String]) =
  for (x <- A ; y <- B) yield x ++ y
\end{lstlisting}

\noindent
where \code{++} is string concatenation in Scala.

Recall the properties for string concatenation. For
language concatenation we have the following properties

\begin{center}
\begin{tabular}{ll}
associativity: & $(A @ B) @ C = A @ (B @ C)$\\
unit element:  & $A \,@\, \{[]\} = \{[]\} \,@\, A = A$\\
zero element:  & $A \,@\, \varnothing = \varnothing \,@\, A = 
\varnothing$
\end{tabular}
\end{center}

\noindent Note the difference in the last two lines: the empty
set behaves like $0$ for multiplication, whereas the set $\{[]\}$
behaves like $1$ for multiplication ($n * 1 = n$ and $n * 0 = 0$).
Again this is a subtletly you need to get compfortable with.

Using the operation of language concatenation, we can define a
\defn{language power} operation as follows:

\begin{eqnarray*}
A^0     & \dn & \{[]\}\\
A^{n+1} & \dn & A \,@\, A^n
\end{eqnarray*}

\noindent This definition is by recursion on natural numbers.
Note carefully that the zero-case is not defined as the empty
set, but the set containing the empty string. So no matter
what the set $A$ is, $A^0$ will always be $\{[]\}$. (There is
another hint about a connection between the $@$-operation and
multiplication: How is $x^n$ defined in mathematics and what is
$x^0$?)

Next we can define the \defn{star operation} for languages:
$A\star$ is the union of all powers of $A$, or short

\begin{equation}\label{star}
A\star \dn \bigcup_{0\le n}\; A^n
\end{equation}

\noindent This star operation is often also called
\emph{Kleene-star} after the mathematician/computer scientist
Stephen Cole Kleene. Unfolding the definition in~\eqref{star}
gives

\[
A\star \dn A^0 \cup A^1 \cup A^2 \cup A^3 \cup \ldots
\]

\noindent
which is equal to 

\[
A\star \dn \{[]\} \,\cup\, A \,\cup\, A @ A \,\cup\, A @ A @ A \,\cup\, \ldots
\]

\noindent We can see that the empty string is always in $A\star$,
no matter what $A$ is. This is because $[] \in A^0$. To make
sure you understand these definitions, I leave you to answer
what $\{[]\}\star$ and $\varnothing\star$ are?

Recall that an alphabet is often referred to by the letter
$\Sigma$. We can now write for the set of \emph{all} strings
over this alphabet as $\Sigma\star$. In doing so we also include the
empty string as a possible string (over $\Sigma$). Assuming $\Sigma
= \{a, b\}$, then $\Sigma\star$ is

\[
\{[], a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, \ldots\}
\]

\noindent or in words all strings containing $a$s and
$b$s only, plus the empty string.
\bigskip

\noindent
Thanks for making it until here! There are also some personal conventions
about regular expressions. But I will explain them in the handout for the
first week. An exercise you can do: Implement the power operation 
for languages in your preferred programming language and try out some examples.
\end{document}

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: t
%%% End: